Machine Learning Pipeline Design + Param and Architecture Tuning Flashcards

(17 cards)

1
Q

What are the full machine learning pipeline stages?

A

Data Collection –> Data Cleaning –> Feature Engineering –> Model Selection –> Hyperparameter Tuning –> Model Training –> Model Evaluation –> Deployment –> Monitoring

2
Q

What is the distinction between feature engineering and data cleaning?

A

Data cleaning is “fixing” the data.
E.g. removing duplicates, handling missing values, fixing inconsistent formatting, correcting data types, removing outliers.

Feature engineering is adding “signal” to the data: transforming the data to create better predictors.
E.g. encoding categorical variables, creating aggregate features, extracting features from text/images, normalization.
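A minimal sketch of the split in pandas (toy data, hypothetical column names):

```python
import pandas as pd

# Hypothetical toy data with a duplicate, a missing value, and an outlier
df = pd.DataFrame({"city": ["NY", "NY", None, "LA"],
                   "price": ["100", "100", "250", "9999"]})

# Data cleaning: "fixing" the data
df = df.drop_duplicates()                  # remove duplicates
df["city"] = df["city"].fillna("unknown")  # handle missing values
df["price"] = df["price"].astype(int)      # correct data types
df = df[df["price"] < 1000]                # crude outlier removal

# Feature engineering: adding "signal"
df = pd.get_dummies(df, columns=["city"])           # encode categorical vars
df["price_norm"] = df["price"] / df["price"].max()  # normalization
```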

3
Q

What are the types of ways to handle missing data? When should you use each?

A
  1. Fill missing values with a constant: null/0/-1 etc.
    - Use this when you want the model itself to handle missingness (tree-based models like XGBoost handle nulls natively).
    - It’s simple and fast, and your model may learn from the missingness itself. However, be careful: it may distort statistical relationships.
  2. Impute the missing values
    - Fill in the missing values with some kind of educated guess via mean, mode, model-based methods, etc.
    - Use when missingness is likely random, or mildly dependent on other variables.
    - However, you can introduce bias, as imputed values do not reflect the true data.
  3. Drop missing values
    - Use when missingness is non-informative and very few rows are missing; if most of a column is missing, you can drop the column instead.
    - Loses data, but simple and safe.
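
A minimal sketch of all three options in pandas (the `age` column is a hypothetical toy example):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None, 31]})

# 1. Constant fill: a sentinel value the model can learn missingness from
df["age_const"] = df["age"].fillna(-1)

# 2. Imputation: educated guess (mean here; median/mode/model-based also work)
df["age_imputed"] = df["age"].fillna(df["age"].mean())

# 3. Dropping: rows with missing values (or the whole column if mostly missing)
df_dropped = df.dropna(subset=["age"])
```
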
4
Q

What is data leakage? What are the 2 types?

A

When information from outside the training dataset is accidentally used to train the model.

  1. Target leakage - the model has access to information that is not available at prediction time but is correlated with the target.
  2. Train-test contamination - information from the validation/test set leaks into training, usually through preprocessing or data splits.
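
A minimal sketch of train-test contamination, assuming a standard-scaling step with scikit-learn:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)

# LEAKY: the scaler sees test-set statistics before the split
X_leaky = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y)

# SAFE: split first, then fit the scaler on training data only
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```
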
5
Q

How do we know data leakage is occurring?

A
  1. Check validation/test accuracy - is it suspiciously good?
  2. Analyze feature importance - if a feature is very highly correlated with the target or dominates importance, check the pipeline for leakage (see the sketch below).
  3. To be thorough, audit the entire data pipeline to ensure leakage is not occurring.
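
A sketch of check 2, assuming scikit-learn and a tree model; one feature hogging almost all the importance is a classic red flag:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Inspect per-feature importances for a single dominant feature
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```
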
6
Q

How do we mitigate data leakage?

A
  1. Drop any features that contain target info – i.e. only use features that exist at prediction time.
  2. Rebuild the feature engineering pipeline after the train-test split.
  3. Re-train the model.
7
Q

How can we ensure that data leakage will not occur?

A
  1. Strictly isolate train/test transformations: never fit statistical operations (scaling, encoding, imputation) on the data before splitting.
  2. Use versioned feature stores, i.e. track feature versions.
  3. Use ML pipelines, so preprocessing is re-fit on training data only (see the sketch below).
  4. Enforce access control on production data.
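
A minimal sketch of point 3, assuming scikit-learn: the Pipeline re-fits the scaler inside every CV fold, so test folds never influence the preprocessing statistics.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Scaling happens inside the pipeline, after each train/test split
pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5))
```
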
8
Q

Why are imbalanced datasets an issue?

A
  • When label classes are imbalanced, depending on the severity of the imbalance, the model can “cheat” by only predicting the majority class.
  • Accuracy also becomes meaningless as a metric.
9
Q

How can we deal with imbalanced datasets?

A
  • Oversampling –> duplicating or synthesizing more samples from the minority class,
    e.g. SMOTE
  • Undersampling –> randomly removing samples from the majority class
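
A minimal sketch of both, assuming the imbalanced-learn package:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # roughly 9:1 imbalance

# Oversampling: synthesize new minority-class samples
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_over))  # balanced

# Undersampling: randomly drop majority-class samples
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))  # balanced, but smaller
```
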
10
Q

What kinds of metrics would you use for imbalanced datasets?

A

Recall, F1, and PR-AUC, as these are all imbalance-aware metrics.
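
A quick sketch with scikit-learn (toy labels) of how these metrics catch the majority-class “cheat” that accuracy misses:

```python
from sklearn.metrics import average_precision_score, f1_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10                 # majority-class "cheat": 80% accuracy
y_score = [0.1] * 8 + [0.4, 0.6]  # predicted probabilities, for PR-AUC

print(recall_score(y_true, y_pred))              # 0.0 - cheat exposed
print(f1_score(y_true, y_pred, zero_division=0)) # 0.0
print(average_precision_score(y_true, y_score))  # PR-AUC uses scores, not labels
```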

11
Q

What are the methods of hyperparameter tuning?

A

Grid Search: systematically tries all combinations of hyperparameter values in a grid.

Random Search: randomly samples hyperparameter combinations.
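
A minimal sketch of both with scikit-learn (toy search space):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)
params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: exhaustively tries all 3 x 3 = 9 combinations
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
grid.fit(X, y)

# Random search: samples only n_iter of the combinations
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), params,
                          n_iter=5, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```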

12
Q

What are the key hyperparameters used in deep learning algorithms for optimization?

A

Learning rate - step size of weight updates, usually denoted by alpha

Optimizer - algorithm used to minimize the loss, e.g. Adam, SGD, RMSprop, etc.

Momentum - helps smooth updates by accumulating past gradients

Weight decay - regularization term (typically L2)
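
A minimal sketch of where each knob lives, assuming PyTorch (values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model

# SGD exposes all four knobs from this card directly
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,            # learning rate (alpha): step size of weight updates
    momentum=0.9,       # smooths updates by accumulating past gradients
    weight_decay=1e-4,  # L2 regularization on the weights
)

# Swapping the optimizer itself is also a hyperparameter choice
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```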

13
Q

What are the key hyperparameters used in deep learning algorithms for architecture?

A
  • Number of layers –> determines model capacity: too shallow can underfit, too deep can overfit
  • Activation functions (for non-linearity)
  • Dropout rate (regularization)
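
A minimal sketch wiring these three knobs into a model, assuming PyTorch (sizes are placeholders):

```python
import torch.nn as nn

# Hypothetical architecture hyperparameters
n_layers, n_units, dropout_rate = 2, 64, 0.5

layers, in_features = [], 10
for _ in range(n_layers):                 # number of layers -> capacity
    layers += [nn.Linear(in_features, n_units),
               nn.ReLU(),                 # activation for non-linearity
               nn.Dropout(dropout_rate)]  # regularization
    in_features = n_units
layers.append(nn.Linear(in_features, 1))

model = nn.Sequential(*layers)
```
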
14
Q

What is precision?

A

Of all positive predictions, how many were correct?

Precision = True Positives / (True Positives + False Positives)

15
Q

What is recall?

A

Of all actual positives, how many were identified?

Recall = True Positives / (True Positives + False Negatives)
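
A quick numeric check of this card and the previous one, using scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # 2 TP, 1 FP, 1 FN

print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.67
```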

16
Q

How do we mitigate overfitting?

A
  • Simplify the model
    i.e. for polynomial regression, reduce the degree of the polynomial; for a NN, reduce the number of trainable parameters
  • Use regularization to penalize the weights (see the sketch below)
  • Add more data, so the model fits a more representative dataset
  • Use ensemble learning
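
A minimal sketch of the first two fixes, assuming scikit-learn: a modest polynomial degree plus an L2 (Ridge) penalty on the weights.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, 30)  # noisy quadratic, toy data

# Simpler model (low degree) + regularization (Ridge alpha penalizes weights)
model = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))
model.fit(X, y)
```
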
17
Q

How do we mitigate underfitting?

A
  • Increase the complexity of the model (see the sketch below)
    i.e. more layers for a NN, a higher-degree polynomial for polynomial regression
  • Train longer –> more epochs
  • Reduce regularization
  • Remove noise to get a cleaner dataset
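
The mirror image of the overfitting sketch above: raise capacity and weaken the penalty (values are placeholders).

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# More complex hypothesis class + much weaker regularization
model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-4))
```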