Machine Learning Pipeline Design + Param and Architecture Tuning Flashcards
(17 cards)
What are the full machine learning pipeline stages?
Data Collection → Data Cleaning → Feature Engineering → Model Selection → Hyperparameter Tuning → Model Training → Model Evaluation → Deployment → Monitoring
What is the distinction between feature engineering and data cleaning?
Data cleaning is like “fixing” the data.
E.g. removing duplicates, handling missing values, fixing inconsistent formatting, correcting data types, outlier removal
Feature engineering is adding “signal” to the data. Transforming the data to create better predictors.
E.g. encoding categorical vars, creating aggregate features, extracting features from text/images, normalization
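A minimal pandas sketch of the distinction (the DataFrame and column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up raw data: a duplicate row, trailing whitespace, a missing price
df = pd.DataFrame({
    "price": [10.0, 10.0, None, 250.0],
    "city": ["NY", "NY", "LA ", "SF"],
})

# Data cleaning: "fixing" the data
df = df.drop_duplicates()
df["city"] = df["city"].str.strip()                      # inconsistent formatting
df["price"] = df["price"].fillna(df["price"].median())   # missing values

# Feature engineering: adding "signal"
df = pd.get_dummies(df, columns=["city"])                # encode categorical vars
df["log_price"] = np.log1p(df["price"])                  # transform into a better predictor
```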
What are the types of ways to handle missing data? When should you use each?
- Fill missing values with a constant: null/0/-1, etc.
  - Use when you want the model itself to handle missingness (tree-based models like XGBoost handle missing values natively).
  - Very simple and fast, and your model may learn from the missingness pattern. However, be careful: it may distort statistical relationships.
- Impute the missing values
  - Fill in the missing values with some kind of educated guess via mean, mode, model-based methods, etc.
  - Use when missingness is likely random, or mildly dependent on other variables.
  - However, you can introduce bias, as imputed values do not reflect true data.
- Drop missing values
  - Use when missingness is non-informative and very few rows are affected; when a column is mostly missing, you could drop the column instead.
  - Loses data, but simple and safe.
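A minimal scikit-learn sketch of the three strategies on toy data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # toy data

# Constant fill: missingness becomes its own (learnable) signal
constant = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# Imputation: educated guess, for when missingness is (mostly) random
imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Dropping: keep only rows with no missing values
dropped = X[~np.isnan(X).any(axis=1)]
```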
What is data leakage? What are the 2 types?
When information outside of the training dataset is accidentally used to train the model.
- Target Leakage - model has access to information that is not available at prediction time, but is correlated with the target
- Train-test contamination - information from the validation/test set leaks into training, usually through preprocessing or data splits.
How do we know data leakage is occurring?
- Check validation/test accuracy - is it too good?
- Analyze feature importance - if a single feature is suspiciously dominant or almost perfectly correlated with the target, check the pipeline for data leakage
- If we want to be thorough, go through the entire data pipeline, ensuring leakage is not occurring
How do we mitigate data leakage?
- Drop any features that contain target info – i.e. only use features that exist at prediction time
- Rebuild feature engineering pipeline after the train-test split
- Re-train the model
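For example, a minimal scikit-learn sketch of rebuilding a scaling step after the split, so test-set statistics never reach training:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit preprocessing statistics on the training split only...
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
# ...and apply (never re-fit) them to the test split
X_test = scaler.transform(X_test)
```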
How can we ensure that data leakage will not occur?
- Strictly isolate transformations to the training split. Never fit any statistical operations (scaling, encoding, imputation) before splitting.
- Use versioned feature stores, i.e. tracking feature versions
- Use ML pipelines
- Enforce access control on production data
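A minimal sketch of the "use ML pipelines" point, assuming scikit-learn: bundling preprocessing with the model means cross-validation re-fits everything inside each fold, so held-out data can never inform the transforms.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# The pipeline re-fits the scaler inside each CV fold, so statistics
# from held-out folds never leak into training
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```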
Why are imbalanced datasets an issue?
- When label classes are imbalanced, depending on the severity of the imbalance, the model can “cheat” by only predicting the majority class.
- Accuracy also becomes meaningless as a metric
How can we deal with imbalanced datasets?
- Oversampling → duplicating or synthesizing more samples from the minority class (e.g. SMOTE)
- Undersampling → randomly remove samples from the majority class
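A sketch of both techniques, assuming the third-party imbalanced-learn package:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy 90/10 imbalanced dataset
X, y = make_classification(weights=[0.9, 0.1], random_state=0)

# Oversampling: synthesize new minority-class samples (SMOTE)
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)

# Undersampling: randomly drop majority-class samples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
```

Note that resampling should happen only on the training split; resampling before the split lets (near-)copies of test rows leak into training.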
What kinds of metrics would you use for imbalanced datasets?
Recall, F1, and PR-AUC, as these are all imbalance-aware metrics.
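A minimal sketch with scikit-learn's implementations of these metrics (labels and scores are made up):

```python
from sklearn.metrics import average_precision_score, f1_score, recall_score

y_true  = [0, 0, 0, 0, 1, 1]                  # made-up imbalanced labels
y_pred  = [0, 0, 0, 0, 1, 0]                  # hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.9, 0.4]      # predicted probabilities

print(recall_score(y_true, y_pred))              # of actual positives, fraction found
print(f1_score(y_true, y_pred))                  # harmonic mean of precision and recall
print(average_precision_score(y_true, y_score))  # PR-AUC (average precision)
```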
What are the methods of hyperparameter tuning?
Grid Search: systematically tries all combinations of hyperparameter values in a grid
Random Search: randomly samples hyperparameter combinations from specified ranges or distributions
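A minimal sketch of both methods using scikit-learn's search utilities (the model and parameter values are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(random_state=0)
model = LogisticRegression(max_iter=1000)

# Grid search: exhaustively tries every combination in the grid
grid = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)

# Random search: samples n_iter combinations from a distribution
rand = RandomizedSearchCV(model, {"C": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)
print(grid.best_params_, rand.best_params_)
```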
What are the key hyperparameters used in deep learning algorithms for optimization?
learning rate - step size in weight updates, usually denoted by alpha
Optimizer - algorithm used to minimize the loss i.e. Adam, SGD, RMSprop, etc
momentum - accumulates past gradients to smooth and accelerate updates
Weight decay - regularization term that penalizes large weights (typically L2)
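In PyTorch, these knobs all live on the optimizer; a minimal sketch with illustrative values:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model

# Learning rate (alpha), momentum, and weight decay (L2) set in one place
sgd = torch.optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=1e-4)

# Swapping optimizers is just a different constructor
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```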
What are the key hyperparameters used in deep learning algorithms for architecture?
- Number of layers → determines the model capacity; if too shallow, it could underfit, if too deep, it could overfit
- Activation functions (for non-linearity)
- Dropout rate (regularization)
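A minimal PyTorch sketch in which all three knobs appear explicitly (the layer sizes are arbitrary):

```python
import torch.nn as nn

# Depth, activation, and dropout rate as explicit architecture knobs
model = nn.Sequential(
    nn.Linear(32, 64),   # layer 1
    nn.ReLU(),           # activation for non-linearity
    nn.Dropout(p=0.5),   # dropout rate for regularization
    nn.Linear(64, 64),   # layer 2; add/remove layers to change capacity
    nn.ReLU(),
    nn.Linear(64, 1),    # output layer
)
```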
What is precision?
Of all positive predictions, how many were correct?
Precision = True Positives / (True Positives + False Positives)
What is recall?
Of all actual positives, how many were identified?
Recall = True Positives / (True Positives + False Negatives)
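Both formulas as a small worked example, with made-up confusion-matrix counts:

```python
def precision(tp, fp):
    # Of all positive predictions, how many were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many were identified?
    return tp / (tp + fn)

# Toy counts: 8 true positives, 2 false positives, 4 false negatives
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # 0.666...
```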
How do we mitigate overfitting?
- Simplify the model
i.e. for polynomial regression, reduce the degree of the polynomial; for a NN, reduce the number of trainable parameters
- Use regularization to penalize the weights
- Add more data, so the model fits a more representative distribution instead of memorizing noise
- Use ensemble learning
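A sketch of the regularization point: penalizing the weights with an L2 term (Ridge) on a toy dataset deliberately prone to overfitting (many features, few samples).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Many features, few samples: a setup that invites overfitting
X, y = make_regression(n_samples=50, n_features=40, noise=10, random_state=0)

# Unregularized fit vs. an L2 penalty on the weights (larger alpha = stronger)
plain = cross_val_score(LinearRegression(), X, y, cv=5).mean()
ridged = cross_val_score(Ridge(alpha=10.0), X, y, cv=5).mean()
print(plain, ridged)
```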
How do we mitigate underfitting?
- Increase complexity of the model
i.e. more layers for a NN, a higher-degree polynomial for polynomial regression
- Train longer → more epochs
- Reduce regularization
- Remove noise, to have a cleaner dataset
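A sketch of increasing model complexity: raising the polynomial degree on a toy non-linear target that a straight line underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel()  # non-linear target a straight line cannot capture

# A degree-1 model underfits; raising the degree adds capacity
line = make_pipeline(PolynomialFeatures(degree=1), LinearRegression()).fit(X, y)
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)
print(line.score(X, y), cubic.score(X, y))  # R^2 improves with capacity
```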