ML - Preprocessing & Evaluation Flashcards
What are the three types of error in an ML model? Briefly describe them.
- Bias - error caused by choosing an algorithm that cannot accurately model the signal in the data, i.e. the model is too general or was incorrectly selected. For example, selecting a simple linear regression to model highly non-linear data would result in error due to bias.
- Variance - error from an estimator being too specific and learning relationships that are specific to the training set but do not generalize to new samples well. Variance can come from fitting too closely to noise in the data, and models with high variance are extremely sensitive to changing inputs. Example: Creating a decision tree that splits the training set until every leaf node only contains 1 sample.
- Irreducible error (a.k.a. noise) - error caused by randomness or inherent noise in the data that cannot be removed through modeling. Example: inaccuracy in data collection causes irreducible error. Example: trying to predict if someone will sneeze tomorrow. Even the best model can’t account for a rogue pollen grain.
What is the bias-variance trade-off?
**Bias** refers to error from an estimator that is too general and does not learn relationships from a data set that would allow it to make better predictions.
**Variance** refers to error from an estimator being too specific and learning relationships that are specific to the training set but will not generalize to new records well.
In short, the bias-variance trade-off is the trade-off between underfitting and overfitting: as you decrease variance, you tend to increase bias, and as you decrease bias, you tend to increase variance.
Your goal is to create models that minimize the overall error through careful model selection and tuning to ensure there is a balance between bias and variance: general enough to make good predictions on new data but specific enough to pick up as much signal as possible.
What are some naive approaches to classification that can be used as a baseline for results?
- Predict only the most common class: if the majority of samples have a target of 1, predict 1 for the entire validation set. This is extremely useful as a baseline for imbalanced data sets.
- Predict a random class: if you have two classes, 1 and 0, randomly select either 1 or 0 for each sample in the validation set.
- Randomly draw from a distribution matching that of the target variable in the training set: if 70% of training samples are class A and 30% are class B, randomly sample from this distribution to create predictions for your validation set.
These baseline results are good to calculate at the start, and you should include at least one when making any assertions about the efficacy of your model, e.g. “our model was 50% more accurate than the naive approach of recommending the most popular car to every customer.”
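These three baselines map naturally onto scikit-learn’s DummyClassifier strategies; a minimal sketch, using a synthetic dataset purely for illustration:

```python
# Minimal sketch of the three naive baselines with scikit-learn's DummyClassifier.
# The dataset is synthetic and only for illustration.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Map each naive approach to the corresponding DummyClassifier strategy.
strategies = {
    "most common class": "most_frequent",
    "random class": "uniform",
    "match training distribution": "stratified",
}

for name, strategy in strategies.items():
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(X_train, y_train)
    acc = accuracy_score(y_val, baseline.predict(X_val))
    print(f"{name}: accuracy = {acc:.3f}")
```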
Explain the classification metric Area Under the Curve (AUC)
AUC (Area Under the Curve) is the area under the ROC (Receiver Operating Characteristic) curve of a classification model. It measures the model’s ability to distinguish between classes: AUC is the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one (in terms of the probability it assigns to the positive class). AUC ranges from 0 to 1, where 1 indicates perfect classification and 0.5 indicates no discrimination (equivalent to random guessing).
Explain the classification metric Gini.
Gini is a related metric that rescales AUC to the range -1 to 1, so that 0 represents a model making random predictions: Gini = 2*AUC - 1
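A short sketch computing AUC and Gini with scikit-learn’s roc_auc_score; the labels and predicted probabilities below are illustrative placeholders:

```python
# Compute AUC from predicted probabilities, then derive Gini.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                     # actual classes
y_scores = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]   # predicted P(class = 1)

auc = roc_auc_score(y_true, y_scores)
gini = 2 * auc - 1                                     # Gini = 2*AUC - 1
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```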
What’s the difference between bagging and boosting?
Bagging and boosting are both ensemble methods, meaning they combine many weak predictors to create a strong predictor. One key difference is that bagging builds independent models in parallel, whereas boosting builds models sequentially, at each step emphasizing the observations that were missed in previous steps.
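A hedged sketch contrasting the two with scikit-learn’s BaggingClassifier (independent bootstrap-sampled trees) and GradientBoostingClassifier (sequentially built trees); the synthetic dataset is only for illustration:

```python
# Compare a bagging ensemble and a boosting ensemble on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: each tree is trained independently on a bootstrap sample.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: trees are added sequentially, each one emphasizing earlier mistakes.
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```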
How can you tell if your model is underfitting your data?
If your training and validation error are both relatively equal and very high, then your model is most likely underfitting your training data.
How can you tell if your model is overfitting your data?
If your training error is low and your validation error is high, then your model is most likely overfitting your training data.
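A rough sketch of both diagnoses by comparing training and validation scores; the synthetic data and the choice of decision-tree depths are illustrative assumptions:

```python
# A shallow tree (max_depth=1) tends to underfit: train and validation scores
# are both low. An unconstrained tree tends to overfit: train score is high,
# validation score is noticeably lower.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in [1, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")
```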
Name and briefly explain several evaluation metrics that are useful for classification problems.
- Accuracy - measures the percentage of the time you correctly classify samples: (true positive + true negative) / all samples
- Precision - measures the percentage of predicted positives that were correctly classified: true positives / (true positives + false positives)
- Recall - measures the percentage of actual positives that were correctly classified: true positives / (true positives + false negatives)
- F1 - the harmonic mean of precision and recall (or you can think of it as balancing Type I and Type II error)
- AUC - describes the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
- Gini - a scaled and centered version of AUC
- Log-loss - similar to accuracy but increases the penalty for incorrect classifications that are “further” away from their true class. For log-loss, lower values are better.
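A brief sketch computing these metrics with scikit-learn; y_true, y_pred, and y_proba are made-up placeholders:

```python
# Hard predictions (y_pred) feed accuracy/precision/recall/F1;
# predicted probabilities (y_proba) feed AUC, Gini, and log-loss.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]                     # hard class predictions
y_proba = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6, 0.7, 0.9]    # predicted P(class = 1)

auc = roc_auc_score(y_true, y_proba)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", auc)
print("Gini     :", 2 * auc - 1)
print("log-loss :", log_loss(y_true, y_proba))
```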
Name and briefly explain several evaluation metrics that are useful for regression problems.
- Mean squared error (MSE) - the average of the squared error of each prediction
- Root mean squared error (RMSE) - square root of MSE
- Mean absolute error (MAE) - the average of the absolute error of each prediction
- Coefficient of determination (R^2) - proportion of variance in the target that is predictable from the features
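A brief sketch computing these with scikit-learn (RMSE taken as the square root of MSE); y_true and y_pred are made-up placeholders:

```python
# Compute the four regression metrics on a tiny illustrative example.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
```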
When should you reduce the number of features used by your model?
Some instances when feature selection is necessary:
• When there is strong collinearity between features
• There are an overwhelming number of features
• There is not enough computational power to process all features
• The algorithm forces the model to use all features, even when they are not useful (most often in parametric or linear models)
• When you wish to make the model simpler for any reason, e.g. easier to explain, less computational power needed, etc
When is feature selection unnecessary?
Some instances when feature selection is not necessary:
• There are relatively few features
• All features contain useful and important signal
• There is no collinearity between features
• The model will automatically select the most useful features
• The computing resources can handle processing all of the features
• Thoroughly explaining the model to a non-technical audience is not critical
What are the three types of feature selection methods?
• Filter Methods - feature selection is done independent of the learning algorithm, before any modelling is done. One example is finding the correlation between every feature and the target and throwing out those that don’t meet a threshold. Easy, fast, but naive and not as performant as other methods.
• Wrapper Methods - train models on subsets of the features and use the subset that results in the best performance. Examples are stepwise selection and recursive feature elimination. The advantage is that each feature is considered in the context of the other features, but this can be computationally expensive.
• Embedded Methods - learning algorithms have built-in feature selection e.g. L1 regularization
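A hedged sketch showing one example from each family with scikit-learn (SelectKBest as a filter, RFE as a wrapper, Lasso as an embedded method); the synthetic regression data is only for illustration:

```python
# One example per feature-selection family on the same synthetic dataset.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Filter: score each feature against the target, keep the top k (model-agnostic).
filter_selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)

# Wrapper: recursively drop the weakest features according to a fitted model.
wrapper_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)

# Embedded: L1 regularization drives uninformative coefficients to zero.
embedded_model = Lasso(alpha=1.0).fit(X, y)

print("filter  :", filter_selector.get_support(indices=True))
print("wrapper :", wrapper_selector.get_support(indices=True))
print("embedded:", (embedded_model.coef_ != 0).nonzero()[0])
```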
What are two common ways to automate hyperparameter tuning?
- Grid search - test every possible combination of pre-defined hyperparameter values and select the best one
- Randomized search - randomly test possible combinations of pre-defined hyperparameter values and select the best tested one
What are the pros and cons of grid search?
Pros: Grid search is great when you need to fine-tune hyperparameters over a small search space automatically. For example, if you have 100 different datasets that you expect to be similar (e.g. solving the same problem repeatedly with different populations), you can use grid search to automatically fine-tune the hyperparameters for each model.
Cons: Grid search is computationally expensive and inefficient, often searching over regions of the parameter space that have very little chance of being useful, which makes it extremely slow. It is especially slow when searching a large space, since its complexity increases exponentially as more hyperparameters are optimized.
What are the pros and cons of randomized search?
Pros: Randomized search does a good job finding near-optimal hyperparameters over a very large search space relatively quickly and doesn’t suffer from the same exponential scaling problem as grid search.
Cons: Randomized search does not fine-tune the results as much as grid search does since it typically does not test every possible combination of parameters.
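A minimal sketch of both approaches using scikit-learn’s GridSearchCV and RandomizedSearchCV; the random forest, parameter grid, and distributions are illustrative choices, not recommendations:

```python
# Grid search exhaustively tries every combination; randomized search samples
# a fixed number of combinations from the given distributions.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# Grid search: 3 x 3 = 9 combinations, each cross-validated.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=3,
).fit(X, y)

# Randomized search: only n_iter sampled combinations, regardless of space size.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": [3, 5, None]},
    n_iter=10,
    cv=3,
    random_state=0,
).fit(X, y)

print("grid best      :", grid.best_params_)
print("randomized best:", random_search.best_params_)
```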
What are some naive feature engineering techniques that improve model efficacy?
- Summary statistics (mean, median, mode, min, max, std) for each group of similar records, e.g. all male customers between the ages of 32 and 44 would get their own set of summary stats
- Interactions or ratios between features, e.g. var1/var2 or var1*var2
- Summaries of features, e.g. the number of purchases a customer made in the last 30 days (raw features may be last 10 purchase dates)
- Splitting feature information manually, e.g. whether a customer is taller than 6 feet may be a critical piece of information when recommending a car vs. an SUV
- kNN - use the nearest records in the training set to produce a “kNN” feature that is fed into another model
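An illustrative sketch of a few of these techniques with pandas; the toy customer table and column names are assumptions:

```python
# Group summary statistics, a feature ratio, and a manual split as new columns.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "gender": ["M", "F", "M", "F", "M", "F"],
    "age": [35, 41, 29, 52, 38, 44],
    "income": [60_000, 85_000, 48_000, 120_000, 72_000, 95_000],
    "purchases_30d": [2, 5, 1, 7, 3, 4],
})

# Summary statistic per group of similar records (mean income by gender).
df["mean_income_by_gender"] = df.groupby("gender")["income"].transform("mean")

# Ratio between two features.
df["income_per_purchase"] = df["income"] / df["purchases_30d"]

# Manually split feature information into a binary flag.
df["is_over_40"] = (df["age"] > 40).astype(int)

print(df)
```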
What are three methods for scaling your data?
- “Normalization” or “scaling” - general terms that refer to transforming your input data to a new scale (often a linear transformation) such as to 0 to 1, -1 to 1, 0 to 10, etc
- Min-Max - linear transformation of data that maps the minimum value to 0 and the maximum value to 1
- Standardization - transforms each feature to a normal distribution with a mean of 0 and standard deviation of 1. May also be referred to as Z-score transformation
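A small sketch of Min-Max scaling and standardization with scikit-learn; the toy column (with a deliberate outlier) is only for illustration:

```python
# Min-Max maps the data to [0, 1]; standardization gives mean 0 and std 1.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier

print("min-max     :", MinMaxScaler().fit_transform(X).ravel())
print("standardized:", StandardScaler().fit_transform(X).ravel())
```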
Explain one major drawback to each of the three scaling methods.
General normalization scaling is sensitive to outliers, since a single extreme value compresses most other values into a narrow portion of the new range, making them appear extremely close together.
Min-Max scaling has the same sensitivity to outliers: because the minimum and maximum define the endpoints of the new scale, one extreme value squeezes the remaining values close together.
Standardization (or Z-score transformation) rescales to an unbounded interval which can be problematic for certain algorithms, e.g. some neural networks, that expect input values to be inside a specific range.
When should you scale your data? Why?
When your algorithm weights each input (e.g. gradient descent used by many neural nets) or uses distance metrics (e.g. kNN), model performance can often be improved by normalizing, standardizing, or otherwise scaling your data so that each feature is given relatively equal weight.
It is also important when features are measured in different units, e.g. feature A is measured in inches, feature B is measured in feet, and feature C is measured in dollars, that they are scaled in a way that they are weighted and/or represented equally.
In some cases, efficacy will not change but perceived feature importance may change, e.g. coefficients in a linear regression.
Scaling your data typically does not change performance or feature importance for tree-based models since the split points will simply shift to compensate for the scaled data.
Describe basic feature encoding for categorical variables.
Feature encoding involves replacing classes in a categorical variable with new values such as integers or real numbers. For example, [red, blue, green] could be encoded to [8, 5, 11].
When should you encode your features? Why?
You should encode categorical features so that they can be processed by numerical algorithms, e.g. so that machine learning algorithms can learn from them.
What are three methods for encoding categorical data?
• Label encoding (non-ordinal) - each category is assigned a numeric value not representing any ordering. Example: [red, blue, green] could be encoded to [8, 5, 11].
• Label encoding (ordinal) - each category is assigned a numeric value representing an ordering. Example: [small, medium, large] could be encoded to [1, 2, 3]
• One-hot encoding aka binary encoding - each category is transformed into a new binary feature, with all records being marked 1/True or 0/False. Example: color = [red, blue, green] could be encoded to color_red = [1, 0, 0], color_blue = [0, 1, 0], color_green = [0, 0, 1]
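A short sketch of all three encodings with pandas; the color/size columns and the ordinal mapping are illustrative:

```python
# Non-ordinal label encoding, ordinal label encoding, and one-hot encoding.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green"],
                   "size": ["small", "medium", "large"]})

# Label encoding (non-ordinal): arbitrary integer codes.
df["color_code"] = df["color"].astype("category").cat.codes

# Label encoding (ordinal): codes that respect a meaningful ordering.
df["size_code"] = df["size"].map({"small": 1, "medium": 2, "large": 3})

# One-hot encoding: one binary column per category.
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

print(df)
```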
What are some common uses of decision tree algorithms?
- Classification
- Regression
- Measuring feature importance
- Feature selection
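A quick sketch of a decision tree used for classification and for measuring feature importance via feature_importances_; the synthetic data and tree depth are illustrative:

```python
# Fit a classification tree, then rank features by their learned importance,
# which can also drive feature selection.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

for i, importance in enumerate(tree.feature_importances_):
    print(f"feature {i}: importance = {importance:.3f}")
```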