T10 Tutorial Flashcards
(20 cards)
Data cleaning
The goal of data analytics is to support decision making by analysing data, so good decisions should be based on good data. In reality, data can be incomplete, inconsistent and noisy. Data may have missing or wrong attribute values, or may contain only aggregate data. Incomplete data is data for which values are not available. A dataset may also contain errors and outliers. For example, a data set might have a salary variable that is set to -1 when the salary was unknown. Data may contain outliers due to errors or due to not recording appropriate values.
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set. While noise is always an error, outliers are not always errors, so one must be careful when dealing with them. Let us consider an example: a person's age = 200, which deviates greatly from the rest of the data points.
Since it is not possible to have an age of 200, this value is noise. Now, what about the CEO's salary? It is very high compared with the rest of the employees. Can we consider it an error? No; it is an outlier but not noise, and some special treatment is essential to deal with it. Next, we will see how to correct noisy data. There exist multiple methods, and some of them are:
- Binning method
- Clustering
- Regression
- Combined computer and human inspection
Note that data modelling methods such as clustering and regression can be used in the data preparation phase to enhance the data quality.
Missing data imputation is the technique used to deal with the existence of missing values in the data. There are three possible solutions for treating missing values.
1. Ignore
2. Fill
3. Remove
Ignoring the missing record is easy but not an effective approach. If there are many missing values, it reduces the quantity of data that can be used in the analysis. The consequence is that a model built on less data may underfit, and its robustness becomes questionable. Moreover, not all machine learning algorithms can work with missing values. Filling the missing values is another approach. However, filling them manually is tedious and often infeasible. Depending on the variable's data type, a different automated filling can be applied: for example, using a global constant like "unknown" or a new class/value for missing values; or using the variable mean to fill in the missing value if the variable is numeric, or the majority value if the variable is categorical. Smarter ways of imputation include using the most probable value identified with advanced inference-based techniques such as the Bayesian formula or a decision tree. This is especially useful when there is a very high number of missing values. Each of these methods suits a certain situation, and no single method is universally applicable. Let us consider the following data.
We have two missing values: one in the income variable and the other in religion. These missing values can be imputed using aggregate functions (e.g. the average) or probabilistic estimates based on the global value distribution. So, the missing value in income can be filled with the average income, or with the most probable income given that the person is 39 years old. Religion, on the other hand, cannot be filled with an average because it is a categorical variable; it can instead be filled with the most frequent religion.
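As a quick illustration of these two fill strategies, here is a minimal sketch in pandas, assuming a small hypothetical table with income (numeric) and religion (categorical) columns; it is not the actual tutorial data.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial table (not the real data).
df = pd.DataFrame({
    "age":      [39, 45, 23, 61],
    "income":   [52000, None, 31000, 78000],   # numeric: fill with the mean
    "religion": ["A", "B", None, "B"],         # categorical: fill with the mode
})

# Mean imputation for the numeric variable.
df["income"] = df["income"].fillna(df["income"].mean())

# Most-frequent-value imputation for the categorical variable.
df["religion"] = df["religion"].fillna(df["religion"].mode()[0])

print(df)
```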
How can you identify outliers in noisy data?
IQR
Using boxplot
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['DemAge'])
plt.show()
```
```python
# Drop NaNs
data = df['DemAge'].dropna()

# Calculate Q1 and Q3
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = data[(data < lower_bound) | (data > upper_bound)]
print("Outliers:\n", outliers)
```
Calculate IQR
IQR stands for Interquartile Range, which is:
IQR = Q3 − Q1
Where:
Q1 = 25th percentile
Q3 = 75th percentile
Many samples may have no recorded value for several variables, e.g., customer income in sales data, or occupation in charity data. What are the reasons for the incomplete data?
My answer:
* no collection or filling
Solution:
* Data wasn’t captured due to equipment malfunction
* Data was inconsistent with other recorded data, and thus the application program might have deleted it
* Data not entered due to misunderstanding (I thought that you will do it)
* Certain data may not be considered important at the time of entry
* History or changes of the data were not registered
Identify the data quality issues in the following data
Solution:
Some of the issues are:
Noise: different values for the same class (Y and y) in the cking variable.
Null values and '.' values.
Error: Given this subset, the value 2 in the NSF variable may be an error; we need to know more about the variable to decide.
Outliers: ADB and bal have outlier values 89981.12 and 45662 respectively.
My answer:
+ alignment
+ missing values
+ noise ($89k, 2)
+ outlier: bal $45k
What are the various forms in which missing data can appear in the input data set above? How do you deal with missing data?
Forms of missing data:
<empty>
“0”
“.”
“999999” (in the form of a huge value)
“NA”, etc.,
Strategies to deal with it:
Ignore records with missing values.
Treat missing value as a separate value.
Imputation: fill in with the mean or median in the case of numeric data, or with the most common value in the case of nominal data.
Imputation: Use another data mining method to predict the values of the missing attributes.
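For the last strategy above, one possible sketch (an assumption, not part of the tutorial) uses scikit-learn's KNNImputer, which fills a missing numeric value from the k most similar records:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data: columns could be e.g. age and income; one income is missing.
X = np.array([[25, 40_000.0],
              [30, np.nan],
              [28, 42_000.0],
              [55, 90_000.0]])

# Fill the missing entry using the 2 nearest rows (based on the observed features).
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```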
---
mine:
+ format '.'
+ simply missing
+ better than just ignoring: most frequent value, some rules
What are the false predictors? Give an example.
- A leaky variable is a feature in your dataset that inadvertently includes information from the future or outside the training data. This can lead to a model that performs exceptionally well on training data because it has access to information it wouldn’t have in a real-world scenario. For example, if you’re trying to predict stock prices and your features include future prices as input, that would be a leaky variable because you’re using information that wouldn’t be available at the time of prediction.
- A false predictor, on the other hand, is a feature that appears to be predictive due to random chance or noise in the dataset rather than any true underlying relationship with the target variable. This can occur through overfitting, where a model learns patterns that do not generalize to unseen data. False predictors have no genuine impact on the outcome and can mislead the model's predictions if they are not identified and removed (see the sketch after the examples below).
More Examples:
* Service cancellation date is a leaker when predicting attrition rate;
* Student final grade is a leaker when predicting the success rate in a student course;
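To make the false-predictor idea concrete, here is a small illustration (my own sketch, not from the tutorial): with few samples and many purely random features, some feature will correlate strongly with the target by chance, even though none of them is genuinely predictive.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 500

# Random features and a target that is unrelated to every one of them.
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

# Correlation of each feature with the target; the maximum looks "predictive"
# purely by chance, i.e. it is a false predictor.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
print(f"max |correlation| with y across random features: {np.abs(corrs).max():.2f}")
```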
mine:
* bad source of info?
Three different data transformations
Three different data transformations are: discretization, normalization, and smoothing.
Data mining or machine learning models require a certain format for the input data to be processed. In its simplest form, data transformation can be considered a method for converting data into a form acceptable to the models. However, careful implementation of data transformation can improve the data mining solution considerably.
Discretization
Discretization is the process of converting quantitative variables into categorical variables. It turns the values of an input variable into ranges (bins). For example, consider the variable age, which is a quantitative variable. It can be converted into a categorical variable by creating labels that group the values into ranges, say '<10', '11 to 30', '31 to 50', '>50'. Sometimes, by meaningfully grouping values into bins, the distribution can be changed in a way that leads to better model performance.
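A minimal sketch of this age example using pandas (the bin edges and column name are my own assumptions):

```python
import pandas as pd

age = pd.Series([8, 15, 27, 34, 49, 62], name="age")

# Discretize the quantitative variable into the labelled ranges above.
bins = [0, 10, 30, 50, 120]
labels = ["<10", "11 to 30", "31 to 50", ">50"]
age_binned = pd.cut(age, bins=bins, labels=labels)

print(age_binned)
```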
Normalization
Normalization is the process of scaling the values of the input variables to a fixed range (such as 0 to 1, or 0 to 100), calculated in reference to a fixed point. There are many cases where normalization is important. For example, in relation to covid-19, 5000 cases in Italy are not the same thing as 5000 cases in China; one has to think about the population difference between these two countries. As another example, to compare data analysts' salaries in different cities or countries, it would make sense to take into account the local cost of living.
The common techniques are: Min-max transformation, Z-score normalization, Decimal scaling.
Min-max normalization normalizes an input value v into a range of [new_min, new_max], where new_min and new_max define the new range we want to constrain the original range into. If v' is the normalized value, it is defined as:
v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min
Consider a scenario where we convert the salary to the range [0, 1]. The salary value v = $73,600, and the min and max of the original variable values are $12,000 and $98,000 respectively; the mean and standard deviation are $54,000 and $16,000. Calculate the normalized value of v using the min-max, z-score, and decimal scaling normalization methods.
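A quick worked check of this exercise, assuming the usual definitions (z-score subtracts the mean and divides by the standard deviation; decimal scaling divides by the smallest power of 10 that brings the absolute value below 1):

```python
# Given values from the exercise.
v, v_min, v_max = 73_600, 12_000, 98_000
mean, std = 54_000, 16_000
new_min, new_max = 0, 1

# Min-max: (73600 - 12000) / (98000 - 12000) ≈ 0.716
min_max = (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

# Z-score: (73600 - 54000) / 16000 = 1.225
z_score = (v - mean) / std

# Decimal scaling: divide by 10^5 so the result falls below 1 → 0.736
decimal = v / 10 ** 5

print(min_max, z_score, decimal)
```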
Dimensionality reduction
Essentially data science models are about finding relationships that exist among variables. Due to the presence of many unrelated variables, this process may be negatively affected. Dimensionality reduction is applied to the dataset to reduce the number of random variables so that a data mining or machine learning process can be focused on variables that are useful for this purpose.
The curse of dimensionality
The search space grows with the addition of each dimension, increasing the sparsity of the representation.
Most documents have a large number of null or zero values. The data set can be highly sparse.
The figure illustrates that as the number of dimensions increases, the difference between a document's Euclidean distance to its nearest neighbours and its distance to the furthest neighbour becomes less distinct, because the data is sparse and most values are zero.
A similarity/distance computation between two objects is dominated by the zero values, which makes the similarity/distance value nearly the same for every pair of objects. Therefore, producing reliable classification or clustering results becomes a challenging task.
Lower Figure: Distance Becomes Uniform
This graph shows how Euclidean distances behave as the number of dimensions increases (log scale):
Blue: Distance to nearest neighbor
Green: Distance to 10th neighbor
Red: Distance to farthest neighbor
✅ Problems Identified:
Distance Concentration:
As dimensionality increases, the distances to the nearest and farthest neighbors converge.
This means all points start to look equally far apart, making nearest-neighbor search ineffective.
Loss of Contrast:
Algorithms like k-NN, k-means, or DBSCAN rely on distance to differentiate between data points.
With high dimensions, those differences fade, leading to poor clustering or classification accuracy.
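A small experiment (my own sketch, not from the tutorial) that reproduces this effect with random points: as the dimensionality grows, the nearest and farthest Euclidean distances from a query point converge, so the contrast ratio shrinks towards 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))        # 500 random points in d dimensions
    query = rng.random(d)                # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    print(f"dim={d:4d}  nearest={dists.min():.3f}  "
          f"farthest={dists.max():.3f}  ratio={dists.max() / dists.min():.2f}")
```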
Data partitioning
In predictive mining, it is very important to select a representative sample dataset for training and testing the models. The data partitioning process is to find data subsets that can be used as training and test datasets during modelling. The non-overlapping train and test datasets allow comparing different algorithms and different training parameters.
In data partitioning, we try to identify the amount of data that we will use for training, testing, and validation of the data mining/machine learning models on a given problem. Moreover, it allows us to build model on a smaller dataset and evaluate the model on an independent dataset. There are two main types of data partitioning methods.
Data partitioning: Batch testing
The data is split into test and train datasets by keeping a portion of the labelled samples for testing and using the remaining samples to train the model. For instance, a test set can be allocated 10%, 20%, or any other percentage of the data. The percentage of the split between test and train should be decided depending on the nature of the data. For example, when the data set is small, we cannot let the training happen on 80% of the data, as this would leave the testing to run on a very small subset. The N-fold cross-validation scheme becomes useful for dividing a dataset into multiple subsets.
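A minimal hold-out split sketch with scikit-learn (the 80/20 ratio and the placeholder X and y arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with a binary label.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Keep 20% of the labelled samples aside for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))   # 80 20
```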
Data partitioning: N-fold cross-validation
The N-fold cross-validation scheme divides the data into N equal subsets or folds and uses N − 1 folds for training and the remaining fold for testing. This is repeated N times until every fold has been used exactly once as the testing set. For example, if N is 10, the model is trained on 9 folds and tested on the remaining fold, and this is repeated 10 times with a distinct test fold each time.
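A sketch of the 10-fold scheme using scikit-learn's KFold (the toy array is an assumption; each sample ends up in the test fold exactly once):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(50).reshape(-1, 1)   # 50 toy samples

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train on 9 folds (45 samples), test on the remaining fold (5 samples).
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```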
There are two ways to choose samples in the representative subsets created by a data partitioning method.
Stratified sampling is a technique used to ensure that all classes of a dataset can be adequately represented within the sample. In the context of a minority class in a dataset, stratified sampling can help to ensure that the minority class is proportionally represented in the training and test sets.
Stratified sampling for a minority class:
Divide the Data: group your dataset into strata based on the class labels.
Determine Sample Sizes: decide on the number of samples you need from each class.
Typical stratified sampling: you would want to ensure that the number of samples for each class is proportional to its representation in the full dataset.
Alternatively, specifically for imbalanced data: you can oversample the minority class, i.e., increase the instances of the minority class by duplicating cases.
Please note: only oversample training data, never oversample the test set!
Sampling: for each class, randomly select the determined number of samples.
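A sketch of a stratified hold-out split on an imbalanced toy label vector (10% minority class); passing stratify=y to scikit-learn's train_test_split keeps the class proportions roughly equal in the train and test sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 majority-class and 10 minority-class samples.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_train.mean(), y_test.mean())   # both ≈ 0.10 minority proportion
```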
What is meant by the dimensionality of data?
(a) The size of the data
(b) The structure of data in a file
(c) The number of attributes or features
(d) The levels of abstractions used to represent the data
Answer: (c)
An observation that is extreme, being distant from the rest of the data is termed a
(a) Feature
(b) Outlier
(c) Predictor
(d) Class
Answer: (b)
Which of the following is a good technique to evaluate the performance of a data analytics model?
(a) Sampling
(b) Parameter tuning
(c) Cross-validation
(d) Stratification
The correct answer is:
✅ (c) Cross-validation
Here’s why:
(c) Cross-validation
- Purpose: It is a robust method for evaluating the performance and generalizability of a data analytics or machine learning model.
- How it works: The data is split into multiple folds (e.g., 5 or 10). The model is trained on a subset and tested on the remaining part, repeated for each fold.
- Advantage: Reduces the risk of overfitting and gives a better estimate of model performance on unseen data.
Why the others are not correct:
(a) Sampling
- Sampling is used for data preparation, not performance evaluation.
- It may help reduce data size or balance classes, but it doesn’t directly evaluate how well a model performs.
(b) Parameter tuning
- Parameter tuning (e.g., grid search) is about optimizing model performance, not evaluating it.
- It often uses cross-validation to find the best parameters.
(d) Stratification
- Stratification is a technique used during sampling or splitting data to ensure equal representation of classes.
- It supports better evaluation but is not an evaluation method by itself.