# Feature Engineering Flashcards

1
Q

When is a value missing not at random (MNAR)?

A

A value is MNAR when the reason it is missing depends on the true value itself. For instance, some respondents do not disclose their income because of what that income is

2
Q

When is a value missing at random (MAR)?

A

A value is MAR when the reason it is missing is not the value itself but another observed variable. For example, age is often missing for respondents of gender A, because people of gender A generally do not like to disclose their age

3
Q

When is a value missing completely at random (MCAR)?

A

A value is MCAR when there is no pattern in which values are missing. For instance, respondents simply forgot to fill in the value in a survey

4
Q

What are the two ways of dealing with missing values?

A
1. Deletion
2. Imputation
5
Q

What are the types of deletion when dealing with missing values and when do you use which?

A
• Column deletion (if a variable has too many missing values and you are confident the variable can be dropped)
• Row deletion (if the values are MCAR and the share of affected examples is small, e.g. 0.1%)
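
The two kinds of deletion can be sketched in plain Python. This is a minimal illustration, not a recommended pipeline: the toy `rows` data, the column names, and the 50% threshold in `drop_columns` are all assumptions made for the example.

```python
# Hypothetical toy dataset: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "fax": None},
    {"age": None, "income": 61000, "fax": None},
    {"age": 29, "income": 48000, "fax": None},
    {"age": 41, "income": 75000, "fax": "555-0100"},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def drop_columns(rows, max_missing=0.5):
    """Column deletion: drop columns whose missing share exceeds the threshold.

    The 0.5 default is an arbitrary value chosen for this example.
    """
    keep = [c for c in rows[0] if missing_fraction(rows, c) <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]

def drop_rows(rows):
    """Row deletion: drop every row that still contains a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

after_cols = drop_columns(rows)      # "fax" is missing in 3 of 4 rows, so it goes
after_rows = drop_rows(after_cols)   # the row with a missing age goes
```

Row deletion here removes a quarter of the toy data, far more than the small share the card has in mind; on real data, check the affected fraction before deleting.
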
6
Q

What are the types of imputation when dealing with missing values?

A
• Default values (e.g. an empty string for a missing text field)
• Mean, median, or mode
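
A small sketch of these imputation strategies, using only the standard library. The `impute` helper and the sample `ages` and `cities` lists are hypothetical; in practice the fill statistic would be computed on the training split only.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", default=""):
    """Replace None entries with a default value or a summary statistic."""
    observed = [v for v in values if v is not None]
    if strategy == "default":
        fill = default
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 30, 60]             # hypothetical numeric feature
cities = ["Oslo", None, "Oslo", "Rome"]   # hypothetical categorical feature

ages_mean = impute(ages, "mean")          # fills with the mean of 20, 30, 30, 60
ages_median = impute(ages, "median")      # fills with the median of the same values
cities_mode = impute(cities, "mode")      # fills with the most frequent city
cities_default = impute(cities, "default")  # fills with the empty string
```

Mean and median only make sense for numeric features, while the mode and default-value strategies also work for categorical ones.
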
7
Q

What is feature scaling?

A

Transforming features so that they fall within similar ranges

8
Q

How do you scale features to get them to be in the range [0, 1] given variable x?

A

x_scaled = (x - min(x)) / (max(x) - min(x))
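
As a minimal sketch, min-max scaling shifts each value by the minimum and divides by the range. The function name and the handling of a constant feature (mapped to 0.0) are assumptions made for this example.

```python
def min_max_scale(xs):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # Assumption for this sketch: a constant feature maps to 0.0
        # to avoid dividing by zero.
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

scaled = min_max_scale([10, 20, 40])  # smallest value maps to 0, largest to 1
```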

9
Q

What is standardization and when should it be used in feature scaling?

A

A process that rescales features so that they have zero mean and unit variance. It is a good choice when the variables appear to follow a normal distribution. x_standardized = (x - mean(x)) / std(x)
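
A short sketch of standardization with the standard library. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead is an equally common choice. The function name and the zero-variance fallback are hypothetical.

```python
from statistics import mean, pstdev

def standardize(xs):
    """Rescale values to zero mean and unit variance."""
    mu = mean(xs)
    sigma = pstdev(xs)  # population standard deviation (an assumption)
    if sigma == 0:
        # Assumption for this sketch: a constant feature maps to 0.0.
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]

standardized = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```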

10
Q

What are two points of attention when applying feature scaling?

A
• It’s a common source of data leakage
• It often requires global statistics. You need to look at the whole training set to calculate the min, max, or mean. If these statistics are different at inference time than they were during training, the training statistics are no longer useful
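
Both points can be seen in a small sketch: the statistics are computed on the training split only and then reused unchanged on other data. The function names and the sample values are hypothetical.

```python
def fit_min_max(train):
    """Compute scaling statistics from the training split only."""
    return min(train), max(train)

def apply_min_max(xs, stats):
    """Scale values with previously fitted statistics."""
    lo, hi = stats
    if hi == lo:
        return [0.0 for _ in xs]  # assumption: constant feature maps to 0.0
    return [(x - lo) / (hi - lo) for x in xs]

train = [10, 20, 30]
test = [25, 40]

stats = fit_min_max(train)                # never computed on `test`
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)  # 40 lies outside the training range
```

The value 40 scales to 1.5, outside [0, 1], which shows how statistics that drift after training make the fitted scaler less useful.
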
11
Q

What is discretization?

A

The process of turning a continuous feature into a discrete feature
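
A minimal sketch of discretization by bucketing. The income boundaries and bucket labels are made-up values for illustration.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries and labels for an income feature.
BOUNDARIES = [35_000, 100_000]
LABELS = ["low", "medium", "high"]

def bucketize(x):
    """Map a continuous value to the label of the bucket it falls into."""
    return LABELS[bisect_right(BOUNDARIES, x)]

buckets = [bucketize(v) for v in (20_000, 35_000, 250_000)]
```

With `bisect_right`, a value equal to a boundary falls into the higher bucket, so 35,000 is labeled "medium".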

12
Q

What is the hashing trick?

A

A hash function is used to generate a hashed value of each category, and that hashed value becomes the category's index. Because the hash space has a fixed size, this solves the problem of not knowing the number of categories in advance. A drawback of hash functions is collisions (two categories mapped to the same index), but in practice the impact on model performance is often small
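
The idea can be sketched as follows. The choice of SHA-256, the bucket count of 16, and the `brand_…` category names are assumptions for this example; any stable hash function would do.

```python
import hashlib

def hash_bucket(category, n_buckets=16):
    """Map a category string to an index in a fixed-size hash space."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then fold into the bucket range.
    return int.from_bytes(digest[:8], "big") % n_buckets

# A category never seen before still gets a valid index.
known = hash_bucket("brand_a")
unseen = hash_bucket("brand_never_seen_before")

# With more categories than buckets, at least one collision is unavoidable.
indices = [hash_bucket(f"brand_{i}") for i in range(17)]
```

Python's built-in `hash` is deliberately avoided here because string hashes are randomized per process by default, which would make the indices unstable between training and serving.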

13
Q

What is feature crossing?

A

A technique to combine two or more features to generate new features. This is useful to model the nonlinear relationships between features
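
A minimal sketch of crossing two categorical features. The feature names, the sample rows, and the `_x_` separator are hypothetical.

```python
def cross(*values):
    """Combine several feature values into one crossed feature value."""
    return "_x_".join(str(v) for v in values)

# Hypothetical examples with two features to cross.
examples = [
    {"marital_status": "single", "children": 0},
    {"marital_status": "married", "children": 2},
    {"marital_status": "single", "children": 2},
]

for example in examples:
    example["marital_x_children"] = cross(
        example["marital_status"], example["children"]
    )

crossed = [e["marital_x_children"] for e in examples]
```

Each distinct combination becomes its own category, which lets a linear model learn a separate weight per combination.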

14
Q

What is an embedding?

A

A vector that represents a piece of data. One of the most common uses of embeddings is word embeddings, where it’s possible to represent each word with a vector

15
Q

What is an embedding space?

A

The set of all possible embeddings generated by the same algorithm for a type of data. All embedding vectors in the same space are of the same size

16
Q

What is data leakage?

A

Data leakage is the phenomenon where a form of the label “leaks” into the set of features used for making predictions, and this same information is not available during inference

17
Q

What is an example of data leakage?

A

A model that predicts disease risk from CT scans is found to be picking up on the text font that certain hospitals use to label their scans. As a result, fonts from hospitals with more serious caseloads become predictors of disease risk

18
Q

What are common causes of data leakage?

A
• Splitting time-correlated data randomly instead of by time
• Scaling before splitting
• Filling in missing data with statistics from the test split
• Poor handling of data duplication before splitting
• Group leakage (a group of examples with strongly correlated labels is divided across different splits)
• Leakage from the data generation process
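
The first cause in the list can be avoided by splitting on time instead of at random. A minimal sketch, in which the `timestamp` key, the 80/20 ratio, and the sample data are assumptions:

```python
def split_by_time(examples, train_fraction=0.8, key="timestamp"):
    """Put the oldest examples in train and the newest in test."""
    ordered = sorted(examples, key=lambda e: e[key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical examples, deliberately out of chronological order.
examples = [{"timestamp": t, "label": t % 2} for t in (7, 2, 9, 4, 1, 8, 3, 10, 6, 5)]

train, test = split_by_time(examples)
```

Every training example is older than every test example, so the model is never evaluated on data that precedes what it was trained on.
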
19
Q

What are two ways to detect data leakage?

A
• Check whether a feature has an unusually high correlation with the label
• Check whether removing a feature causes the model’s performance to deteriorate significantly
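
The correlation check can be sketched with a hand-written Pearson correlation. The feature names, the toy data, and the 0.95 threshold are all hypothetical; a flagged feature is only a candidate for investigation, not proof of leakage.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

labels = [0, 0, 1, 1, 0, 1]
features = {
    "scanner_font_id": [0, 0, 1, 1, 0, 1],   # mirrors the label exactly
    "age": [40, 55, 60, 38, 47, 52],
}

THRESHOLD = 0.95  # arbitrary cutoff chosen for this example
suspicious = [
    name for name, values in features.items()
    if abs(pearson(values, labels)) > THRESHOLD
]
```

The second check, an ablation, would retrain the model without each feature and compare evaluation metrics, which needs a real training loop and is not shown here.
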
20
Q

What are the downsides of having too many features?

A
• More opportunities for data leakage
• Can cause overfitting
• Can increase memory required to serve a model
• Can increase inference latency
• Useless features become technical debt (whenever the data pipeline changes, all the affected features need to be adjusted accordingly)
21
Q

What is bagging?

A

Short for bootstrap aggregating. It’s a type of ensemble designed to improve both the training stability and the accuracy of ML algorithms. It reduces variance and helps to avoid overfitting. Bootstraps are created by sampling the training data with replacement, and a separate model is trained on each bootstrap. If the problem is classification, the final prediction is decided by majority vote. If the problem is regression, the final prediction is the average of all models’ predictions
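
A minimal sketch of bagging for classification. The weak learner (a threshold halfway between the two class means), the toy data, and the choice of 25 models are assumptions for illustration only.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def train_threshold_model(sample):
    """Hypothetical weak learner: threshold midway between the class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # The bootstrap contains a single class: always predict that class.
        constant = 1 if ones else 0
        return lambda x: constant
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > threshold)

def bagging_fit(data, n_models=25, seed=0):
    """Train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_threshold_model(bootstrap(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Classification: the final prediction is the majority vote."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Toy data: (feature, label) pairs where larger features belong to class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data)
low_prediction = bagging_predict(models, 1.5)
high_prediction = bagging_predict(models, 8.5)
```

For regression, `bagging_predict` would return the average of the models' outputs instead of the most common vote.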

22
Q

What is boosting?

A

A family of iterative ensemble algorithms that convert weak learners into strong ones. Each learner in the ensemble is trained on the same set of samples, but the samples are weighted differently across iterations. As a result, later weak learners focus more on the examples that earlier weak learners misclassified
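
The reweighting step can be illustrated with the update rule from AdaBoost, one well-known member of the boosting family (the card itself does not name a specific algorithm). This sketch shows a single round only; the `reweight` helper and the four-example setup are hypothetical.

```python
from math import exp, log

def reweight(weights, correct):
    """One AdaBoost-style round: raise weights of misclassified examples.

    `weights` must sum to 1, and the weighted error must lie strictly
    between 0 and 1 for the logarithm to be defined.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1 - error) / error)  # the weak learner's vote strength
    updated = [
        w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.25, 0.25, 0.25, 0.25]       # start with uniform weights
correct = [True, True, True, False]      # the weak learner missed example 4

new_weights, alpha = reweight(weights, correct)
```

After the update, the misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.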

23
Q

What is stacking?

A

Stacking is a type of ensemble where base learners are trained on the training data, then a meta-learner is created that combines the outputs of the base learners to produce the final predictions. The meta-learner can be as simple as a heuristic: majority vote (classification) or average (regression) over all base learners. It can also be another model (e.g. a logistic or linear regression model)
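
A minimal sketch of stacking with heuristic meta-learners. The base learners below are hand-written rules standing in for trained models, and all names are hypothetical.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Heuristic meta-learner for classification."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Heuristic meta-learner for regression."""
    return mean(predictions)

def stack_predict(base_learners, meta_learner, x):
    """Feed the base learners' outputs into the meta-learner."""
    return meta_learner([learner(x) for learner in base_learners])

# Stand-ins for trained base models (assumptions for this example).
classifiers = [lambda x: int(x > 3), lambda x: int(x > 5), lambda x: int(x > 7)]
regressors = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: 2 * x - 4]

class_prediction = stack_predict(classifiers, majority_vote, 6)  # votes: 1, 1, 0
value_prediction = stack_predict(regressors, average, 3)         # outputs: 6, 7, 2
```

Replacing `majority_vote` or `average` with a trained model that takes the base predictions as input gives the model-based variant of the meta-learner.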

24
Q

What are some metrics worth tracking for each experiment during its training process?

A
• The loss curve corresponding to the train split and each of the eval splits
• The model performance metrics on all non-test splits, such as accuracy, F1, perplexity
• The log of corresponding sample, prediction, and ground truth label (for ad hoc analytics and sanity checks)
• The speed of the model (number of steps per second, or number of tokens processed per second)
• System performance metrics (memory, CPU/GPU usage)
• The values over time of any (hyper)parameter whose changes can affect your model’s performance
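
A few of these metrics can be recorded with a very small in-memory log. This is only a sketch: the `record_step` helper, the field names, and the loss and learning-rate values are hypothetical, and a real setup would typically use an experiment tracking tool.

```python
import time

def record_step(history, step, train_loss, eval_loss, learning_rate, started_at):
    """Append one row of training metrics to the in-memory history."""
    elapsed = time.perf_counter() - started_at
    history.append({
        "step": step,
        "train_loss": train_loss,
        "eval_loss": eval_loss,
        "learning_rate": learning_rate,
        "steps_per_second": step / elapsed if elapsed > 0 else None,
    })

history = []
started_at = time.perf_counter()

# Hypothetical values standing in for a real training loop.
fake_losses = [(0.9, 1.0), (0.6, 0.8), (0.4, 0.7)]
for step, (train_loss, eval_loss) in enumerate(fake_losses, start=1):
    learning_rate = 0.01 * 0.5 ** (step - 1)  # a decaying hyperparameter
    record_step(history, step, train_loss, eval_loss, learning_rate, started_at)

train_curve = [row["train_loss"] for row in history]
eval_curve = [row["eval_loss"] for row in history]
```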