Machine Learning Pipeline Design + Param and Architecture Tuning Flashcards
(17 cards)
What are the full machine learning pipeline stages?
Data Collection → Data Cleaning → Feature Engineering → Model Selection → Hyperparameter Tuning → Model Training → Model Evaluation → Deployment → Monitoring
What is the distinction between feature engineering and data cleaning?
Data cleaning is like “fixing” the data.
E.g. removing duplicates, handling missing values, fixing inconsistent formatting, correcting data types, outlier removal
Feature engineering is adding “signal” to the data. Transforming the data to create better predictors.
E.g. encoding categorical vars, creating aggregate features, extracting features from text/images, normalization
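A minimal pandas sketch of the distinction (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up raw data: a duplicate row, trailing whitespace, a missing price
df = pd.DataFrame({
    "price": [10.0, 10.0, None, 250.0],
    "city": ["NY", "NY", "LA ", "SF"],
})

# Data cleaning: "fixing" the data
df = df.drop_duplicates()
df["city"] = df["city"].str.strip()                      # inconsistent formatting
df["price"] = df["price"].fillna(df["price"].median())   # missing values

# Feature engineering: adding "signal"
df = pd.get_dummies(df, columns=["city"])                # encode categorical vars
df["log_price"] = np.log1p(df["price"])                  # transform into a better predictor
```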
What are the types of ways to handle missing data? When should you use each?
- Fill missing values with a constant: null/0/-1, etc.
  - Use when you want the model itself to handle missingness (tree-based models like XGBoost handle missing values natively).
  - Very simple and fast, and your model may learn from the missingness pattern. However, be careful: it may distort statistical relationships.
- Impute the missing values
  - Fill in the missing values with some kind of educated guess via mean, mode, model-based methods, etc.
  - Use when missingness is likely random, or mildly dependent on other variables.
  - However, you can introduce bias, as imputed values do not reflect true data.
- Drop missing values
  - Use when missingness is non-informative and very few rows are affected; when a column is mostly missing, you could drop the column instead.
  - Loses data, but simple and safe.
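A minimal scikit-learn sketch of the three strategies on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # toy data

# Constant fill: missingness becomes its own (learnable) signal
constant = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# Imputation: educated guess, for when missingness is (mostly) random
imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Dropping: keep only rows with no missing values
dropped = X[~np.isnan(X).any(axis=1)]
```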
What is data leakage? What are the 2 types?
When information outside of the training dataset is accidentally used to train the model.
- Target Leakage - model has access to information that is not available at prediction time, but is correlated with the target
- Train-test contamination - information from the validation/test set leaks into training, usually through preprocessing or data splits.
How do we know data leakage is occurring?
- Check validation/test accuracy - is it too good?
- Analyze feature importance - if a single feature is suspiciously dominant or almost perfectly correlated with the target, check the pipeline for data leakage
- If we want to be thorough, go through the entire data pipeline, ensuring leakage is not occurring
How do we mitigate data leakage?
- Drop any features that contain target info – i.e. only use features that exist at prediction time
- Rebuild feature engineering pipeline after the train-test split
- Re-train the model
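For example, a minimal scikit-learn sketch of rebuilding a scaling step after the split, so test-set statistics never reach training:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit preprocessing statistics on the training split only...
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
# ...and apply (never re-fit) them to the test split
X_test = scaler.transform(X_test)
```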
How can we ensure that data leakage will not occur?
- Strictly isolate transformations to the training split. Never fit any statistical operations (scaling, encoding, imputation) before splitting.
- Use versioned feature stores, i.e. tracking feature versions
- Use ML pipelines
- Enforce access control on production data
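A minimal sketch of the "use ML pipelines" point, assuming scikit-learn: bundling preprocessing with the model means cross-validation re-fits everything inside each fold, so held-out data can never inform the transforms.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# The pipeline re-fits the scaler inside each CV fold, so statistics
# from held-out folds never leak into training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```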
Why are imbalanced datasets an issue?
- When label classes are imbalanced, depending on the severity of the imbalance, the model can “cheat” by only predicting the majority class.
- Accuracy also becomes meaningless as a metric
How can we deal with imbalanced datasets?
- Oversampling → duplicating or synthesizing more samples from the minority class (e.g. SMOTE)
- Undersampling → randomly remove samples from the majority class
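A sketch of both techniques, assuming the third-party imbalanced-learn package:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy 90/10 imbalanced dataset
X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# Oversampling: synthesize new minority-class samples (SMOTE)
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)

# Undersampling: randomly drop majority-class samples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
```

Note that resampling should happen only on the training split; resampling before the split lets (near-)copies of test rows leak into training.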
What kinds of metrics would you use for imbalanced datasets?
Recall, F1, and PR-AUC, as these are all imbalance-aware metrics.
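A minimal sketch with scikit-learn's implementations of these metrics (labels and scores are made up):

```python
from sklearn.metrics import average_precision_score, f1_score, recall_score

y_true  = [0, 0, 0, 0, 1, 1]                  # made-up imbalanced labels
y_pred  = [0, 0, 0, 0, 1, 0]                  # hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.9, 0.4]      # predicted probabilities

print(recall_score(y_true, y_pred))              # of actual positives, fraction found
print(f1_score(y_true, y_pred))                  # harmonic mean of precision and recall
print(average_precision_score(y_true, y_score))  # PR-AUC (average precision)
```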
What are the methods of hyperparameter tuning?
Grid Search: systematically tries all combinations of hyperparameter values in a grid
Random Search: randomly samples hyperparameter combinations from specified ranges or distributions
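A minimal sketch of both methods using scikit-learn's search utilities (the model and parameter values are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(random_state=0)
model = LogisticRegression(max_iter=1000)

# Grid search: exhaustively tries every combination in the grid
grid = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)

# Random search: samples n_iter combinations from a distribution
rand = RandomizedSearchCV(model, {"C": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)
print(grid.best_params_, rand.best_params_)
```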
What are the key hyperparameters used in deep learning algorithms for optimization?
learning rate - step size in weight updates, usually denoted by alpha
Optimizer - algorithm used to minimize the loss i.e. Adam, SGD, RMSprop, etc
momentum - accumulates past gradients to smooth and accelerate updates
Weight decay - regularization term that penalizes large weights (typically L2)
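In PyTorch, these knobs all live on the optimizer; a minimal sketch with illustrative values:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model

# Learning rate (alpha), momentum, and weight decay (L2) set in one place
sgd = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=1e-4)

# Swapping optimizers is just a different constructor
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```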
What are the key hyperparameters used in deep learning algorithms for architecture?
- Number of layers → determines the model capacity; if too shallow, it could underfit, if too deep, it could overfit
- Activation functions (for non-linearity)
- Dropout rate (regularization)
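A minimal PyTorch sketch in which all three knobs appear explicitly (the layer sizes are arbitrary):

```python
import torch.nn as nn

# Depth, activation, and dropout rate as explicit architecture knobs
model = nn.Sequential(
    nn.Linear(32, 64),   # layer 1
    nn.ReLU(),           # activation for non-linearity
    nn.Dropout(p=0.5),   # dropout rate for regularization
    nn.Linear(64, 64),   # layer 2; add/remove layers to change capacity
    nn.ReLU(),
    nn.Linear(64, 1),    # output layer
)
```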
What is precision?
Of all positive predictions, how many were correct?
Precision = True Positives / (True Positives + False Positives)
What is recall?
Of all actual positives, how many were identified?
Recall = True Positives / (True Positives + False Negatives)
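Both formulas as a small worked example, with made-up confusion-matrix counts:

```python
def precision(tp, fp):
    # Of all positive predictions, how many were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many were identified?
    return tp / (tp + fn)

# Toy counts: 8 true positives, 2 false positives, 4 false negatives
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # 0.666...
```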
How do we mitigate overfitting?
- Simplify the model
i.e. for polynomial regression, reduce the degree of the polynomial; for a NN, reduce the number of trainable parameters
- Use regularization to penalize the weights
- Add more data, so the model fits a more representative distribution instead of memorizing noise
- Use ensemble learning
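A sketch of the regularization point: penalizing the weights with an L2 term (Ridge) on a toy dataset deliberately prone to overfitting (many features, few samples).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Many features, few samples: a setup that invites overfitting
X, y = make_regression(n_samples=50, n_features=40, noise=10, random_state=0)

# Unregularized fit vs. an L2 penalty on the weights (larger alpha = stronger)
plain = cross_val_score(LinearRegression(), X, y, cv=5).mean()
ridged = cross_val_score(Ridge(alpha=10.0), X, y, cv=5).mean()
print(plain, ridged)
```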
How do we mitigate underfitting?
- Increase complexity of the model
i.e. more layers for a NN, a higher-degree polynomial for polynomial regression
- Train longer → more epochs
- Reduce regularization
- Remove noise, to have a cleaner dataset
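A sketch of increasing model complexity: raising the polynomial degree on a toy non-linear target that a straight line underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel()  # non-linear target a straight line cannot capture

# A degree-1 model underfits; raising the degree adds capacity
line = make_pipeline(PolynomialFeatures(degree=1), LinearRegression()).fit(X, y)
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)
print(line.score(X, y), cubic.score(X, y))  # R^2 improves with capacity
```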