3. Feature Engineering Flashcards

1
Q

What is feature engineering?

A

The process of transforming raw data into features that are useful for model training.

2
Q

What are the primary reasons for data transformation?

A

Data compatibility, e.g., converting string-type data to numerical data.
Data quality, e.g., converting text to lowercase.

2
Q

What are the approaches for consistent data preprocessing?

A

Pretraining data transformation: the data is transformed before training.
Adv: the transformation is performed only once.
Disadv: any change to the transformation requires rerunning it over the whole dataset.
Inside-model data transformation: the transformation is part of the model code.
Adv: easy to decouple the data from the transformation.
Disadv: increases model latency.

3
Q

What is encoding for structured data types?

A

Converting categorical data to numerical data, because most models can’t handle categorical data directly.

4
Q

What two kinds of transformation may be needed for integer or floating-point data?

A

Normalization and bucketing

5
Q

Why do we need to normalize data with various ranges?

A

Models trained with gradient descent converge slowly when features have very different ranges.
A wide range of values in a single feature can cause NaN errors in some models.
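
As an illustration of normalization, here is a minimal sketch of two common scaling schemes, min-max scaling and z-score standardization. The function names and the `incomes` values are hypothetical, chosen only for this example.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [20_000, 40_000, 60_000, 100_000]  # hypothetical raw feature
scaled = min_max_scale(incomes)              # [0.0, 0.25, 0.5, 1.0]
standardized = z_score(incomes)
```

After either transformation the feature lives on a small, comparable scale, which is what helps gradient descent take evenly sized steps across features.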

6
Q

What are the two ways of bucketing?

A

Bucketing transforms numeric data into categorical data. The two ways are:
Buckets with equal-spaced boundaries
Buckets with quantile boundaries
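
The two bucketing strategies can be compared on a skewed feature. This is a rough sketch that assumes Python 3.8+ for `statistics.quantiles`; the helper names and the `ages` data are made up for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def equal_width_boundaries(values, num_buckets):
    """Boundaries that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    return [lo + width * i for i in range(1, num_buckets)]


def quantile_boundaries(values, num_buckets):
    """Boundaries chosen so each bucket holds roughly the same number of values."""
    return quantiles(values, n=num_buckets)


def bucketize(value, boundaries):
    """Return the index of the bucket that the value falls into."""
    return bisect_right(boundaries, value)


ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical, with one extreme value
equal = equal_width_boundaries(ages, 4)
quant = quantile_boundaries(ages, 4)

equal_ids = [bucketize(a, equal) for a in ages]  # nine values share bucket 0
quant_ids = [bucketize(a, quant) for a in ages]  # values spread over all buckets
```

With equally spaced boundaries the single extreme value leaves most buckets empty, while quantile boundaries keep the buckets roughly balanced.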

7
Q

What is label encoding?

A

Convert text categories to numeric while preserving the order, e.g., small, medium, big to 1, 2, 3
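
A minimal sketch of label encoding for the size example above. The names `SIZE_ORDER`, `size_to_id`, and `label_encode` are hypothetical; the key assumption is that the category order is supplied explicitly rather than inferred.

```python
# The order is stated explicitly so that the numeric codes preserve it.
SIZE_ORDER = ["small", "medium", "big"]
size_to_id = {name: rank for rank, name in enumerate(SIZE_ORDER, start=1)}


def label_encode(values, mapping):
    """Replace each category with its ordered integer code."""
    return [mapping[v] for v in values]


encoded = label_encode(["big", "small", "medium", "small"], size_to_id)
# encoded == [3, 1, 2, 1]
```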

8
Q

What is Out of Vocabulary (OOV)?

A

It is a single catch-all category for rare or unseen values (outliers). The ML system then doesn’t waste time training on each rare value individually.
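
One way to picture the OOV category is a vocabulary lookup with a fallback ID. In this sketch the frequency cutoff `min_count`, the reserved `OOV_ID`, and the `colors` data are all assumptions made for illustration.

```python
from collections import Counter

OOV_ID = 0  # reserved ID shared by every rare or unseen value


def build_vocab(values, min_count):
    """Assign IDs (starting at 1) only to values seen at least min_count times."""
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c >= min_count)
    return {v: i for i, v in enumerate(frequent, start=1)}


def lookup(value, vocab):
    """Map a value to its vocabulary ID, or to the OOV ID if it is not listed."""
    return vocab.get(value, OOV_ID)


colors = ["red", "blue", "red", "green", "blue", "red", "teal"]
vocab = build_vocab(colors, min_count=2)            # {"blue": 1, "red": 2}
ids = [lookup(c, vocab) for c in colors + ["magenta"]]
# ids == [2, 1, 2, 0, 1, 2, 0, 0]
```

The rare values "green" and "teal" and the unseen value "magenta" all collapse into the same OOV bucket.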

8
Q

What is one-hot encoding?

A

Create one binary dummy variable per category. It is used for categorical variables where order does not matter.
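
A minimal sketch of one-hot encoding in plain Python. The category list and the helper name `one_hot` are hypothetical.

```python
def one_hot(value, categories):
    """Return a vector with a single 1 at the position of the matching category."""
    return [1 if value == c else 0 for c in categories]


CATEGORIES = ["cat", "dog", "bird"]  # unordered categories
vectors = [one_hot(v, CATEGORIES) for v in ["dog", "bird", "cat"]]
# vectors == [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

No category is numerically larger than another here, which is why this encoding suits unordered categories.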

9
Q

What is feature hashing?

A

Apply a hash function to a categorical feature and use the hash value as the feature ID (bucket index), so no vocabulary needs to be stored.
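
A rough sketch of feature hashing. It assumes MD5 as the hash function purely because it is deterministic across runs (Python's built-in `hash` is salted per process for strings); the bucket count and the function name `hash_bucket` are hypothetical.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a categorical value to a bucket index in [0, num_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


NUM_BUCKETS = 8  # hypothetical size of the hashed feature space
bucket_ids = {city: hash_bucket(city, NUM_BUCKETS)
              for city in ["Paris", "Lagos", "Lima"]}
```

Different values may land in the same bucket (a collision). That is the price paid for not keeping a vocabulary.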

10
Q

What is the hybrid of hashing and vocabulary?

A

Use vocabulary for important features
Use hashing for the less important features
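
The hybrid approach might be sketched as follows: values in a small vocabulary keep dedicated IDs, and everything else is hashed into a fixed number of extra buckets. `VOCAB`, `NUM_HASH_BUCKETS`, and the use of MD5 are assumptions made for this example.

```python
import hashlib

VOCAB = {"US": 0, "IN": 1, "BR": 2}  # hypothetical important values
NUM_HASH_BUCKETS = 4                 # shared buckets for everything else


def encode(value):
    """Use the vocabulary ID if available, otherwise a hashed bucket ID."""
    if value in VOCAB:
        return VOCAB[value]
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    # Offset by len(VOCAB) so hashed IDs never collide with vocabulary IDs.
    return len(VOCAB) + int(digest, 16) % NUM_HASH_BUCKETS


known_id = encode("US")   # 0, taken from the vocabulary
hashed_id = encode("ZZ")  # some ID in the range 3..6
```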

11
Q

What is embedding?

A

An embedding represents a categorical feature as a continuous-valued feature, typically a dense, low-dimensional vector.

12
Q

What is feature selection?

A

Select a subset of features that are most useful to a model in order to predict the target variable.

13
Q

What are the benefits of dimensionality reduction?

A

Reduces noise in the data
Reduces overfitting

13
Q

What are the two ways to achieve dimensionality reduction?

A

Use feature importance
Use combinations of features, e.g., PCA, t-SNE

13
Q

What are the key outcomes of classification models?

A

True positive: Predict positive class correctly
True negative: Predict negative class correctly
False positive: Predict positive class incorrectly
False negative: Predict negative class incorrectly
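
The four outcomes can be counted directly from labels and predictions. In this sketch 1 denotes the positive class and 0 the negative class; the data and the function name `confusion_counts` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            # Predicted positive: correct if the example really is positive.
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            # Predicted negative: correct if the example really is negative.
            counts["TN" if actual == 0 else "FN"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
# counts == {"TP": 3, "TN": 3, "FP": 1, "FN": 1}
```

A false negative is a positive example that the model predicted as negative; a false positive is the reverse.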

13
Q

What is the classification threshold?

A

It is the probability cutoff that separates the positive class from the negative class. By default, it is commonly set at 0.5.

14
Q

What is AUC ROC used for?

A

Balanced datasets in classification problems

15
Q

What will happen if you raise and lower the classification threshold?

A

Raise: reduce false positives, increase false negatives, increase precision
Lower: reduce false negatives, increase false positives, increase recall
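
The trade-off can be checked on a toy example. This sketch assumes that a score at or above the threshold counts as a positive prediction, and that precision is reported as 1.0 when nothing is predicted positive; the scores and labels are made up.

```python
def precision_recall(y_true, scores, threshold):
    """Compute (precision, recall) when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention for no positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.6, 0.7, 0.9]

low = precision_recall(y_true, scores, threshold=0.3)    # (0.6, 1.0)
high = precision_recall(y_true, scores, threshold=0.65)  # precision 1.0, recall 2/3
```

On this data, raising the threshold from 0.3 to 0.65 removes both false positives but introduces one false negative.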

16
Q

What is AUC PR used for?

A

Imbalanced datasets in classification problems

16
Q

What is AUC ROC?

A

The ROC curve is a graph showing the performance of a classification model at all classification thresholds.
It plots true positive rate against false positive rate.
AUC is the area under this curve; AUC = 1 means perfect class separation.
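
A minimal sketch of how the ROC curve and its area could be computed by hand. It assumes all scores are distinct (ties would need to be grouped) and that both classes are present; the function names and data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples from highest to lowest score (assumes distinct scores).
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.6, 0.7, 0.9]
auc = area_under_curve(roc_points(y_true, scores))  # 8/9, about 0.889

perfect = area_under_curve(roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```

The value 8/9 matches the fraction of (positive, negative) pairs in which the positive example has the higher score: 8 of the 9 pairs.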

17
Q

What is AUC PR?

A

The PR curve plots precision values against recall values; AUC PR is the area under it.
It gives more attention to the minority class.

18
Q

What is feature crossing?

A

Multiply (cross) two or more features to create a synthetic feature.

19
Q

What are the two ways to use feature cross?

A

Cross two features: to get a more predictive feature
Cross two or more features: to represent nonlinearity
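
An XOR-style toy problem shows how a cross can represent nonlinearity. In this sketch the label is 1 when the two features have opposite signs, so neither feature alone can separate the classes, but their product can. All names and data are hypothetical.

```python
def cross(x1, x2):
    """Feature cross of two numeric features: their product."""
    return x1 * x2


points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [0, 0, 1, 1]  # 1 when the two features have opposite signs

crossed = [cross(x1, x2) for x1, x2 in points]  # [1, 1, -1, -1]

# A single threshold on the crossed feature now separates the classes.
predictions = [1 if c < 0 else 0 for c in crossed]
```

No single threshold on `x1` or `x2` alone yields the right labels here, whereas the crossed feature makes the problem linearly separable.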

20
Q

What is TensorFlow Data API (tf.data)?

A

It is an API for building efficient data input pipelines.

21
Q

What is TensorFlow Transform?

A

The TensorFlow Transform library is a part of TensorFlow Extended. It performs transformations prior to training the model.
tf.Transform helps avoid training-serving skew, because the same transformations are applied at training and at serving time.

21
Q

What are the best practices for making an efficient data input pipeline?

A

tf.data.Dataset.interleave: parallelizes data reading.
tf.data.Dataset.cache: caches a dataset in memory or local storage.
tf.data.Dataset.prefetch: overlaps preprocessing with training, so the next batch of preprocessed data is ready when the model needs it.
Vectorize user-defined functions so that they operate on a whole batch at a time.
Reduce memory usage when applying the interleave, prefetch, and shuffle transformations.

22
Q

What does TensorFlow Transform do?

A

You can create transform pipelines using Cloud Dataflow.
Analyze training data
Transform training data
Transform evaluation data
Produce metadata
Feed the model
Serve data

23
Q

What are the steps and libraries used in a TFX pipeline?

A

Data extraction & validation: TFDV (Dataflow)
Data transformation: TF Transform (Dataflow)
Model training & tuning: tf.estimator & tf.keras (Vertex AI Training)
Model evaluation & validation: TF Model Analysis (Dataflow)
Model serving for prediction: TF Serving (Vertex AI Prediction)

23
Q

What are the two tools that help with data transformation?

A

Data Fusion:
Code-free UI-based managed service for ETL or ELT pipelines from various sources.
Dataprep:
Code-free UI-based serverless tool for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning at any scale.