2. Exploring Data and Building Data Pipelines Flashcards

1
Q

What is data visualization?

A

It is an exploratory technique for finding trends and outliers in data.
It helps with data cleaning and feature engineering.

2
Q

What are the two ways to visualize data?

A

Univariate analysis (range and outliers of a single feature)
Bivariate analysis (correlation between two features)

3
Q

What does a box plot consist of?

A

Minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum.
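The five-number summary a box plot draws can be computed directly. A minimal sketch using Python's standard library (the sample values are made up):

```python
import statistics

# Hypothetical sample values
data = [1, 2, 2, 3, 4, 4, 5, 5, 6, 9]

# quantiles(n=4) returns the 25th, 50th, and 75th percentile cut points
q1, median, q3 = statistics.quantiles(data, n=4)

five_number_summary = {
    "min": min(data),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(data),
}
print(five_number_summary)
```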

4
Q

What is a line plot for?

A

It shows the relationship between two variables and is used to analyze trends.

5
Q

What is a bar plot for?

A

It is for analyzing trends and comparing categorical data.

6
Q

What is a scatter plot for?

A

Visualizing clusters and showing the relationship between two variables.

7
Q

What are the three measures of central tendency?

A

Mean
Median
Mode

8
Q

Which one of the three measures is affected by outliers?

A

Mean

9
Q

What is standard deviation?

A

It is the square root of the variance.
It is a good way to identify outliers.

10
Q

What is covariance?

A

It measures how much two variables vary together.

11
Q

What is correlation?

A

It is a normalized form of covariance ranging from -1 to +1.
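The relationship between covariance and correlation can be made concrete in a few lines. A minimal sketch of the Pearson correlation coefficient (function name and sample data are illustrative):

```python
import math

def pearson(xs, ys):
    # Correlation = covariance normalized by both standard deviations,
    # which forces the result into [-1, +1].
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly positively correlated
```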

12
Q

Can correlation be used to detect label leakage?

A

Yes. For example, hospital name shouldn't be used as a feature, because a name like "cancer hospital" may give away the label.

13
Q

What are the two elements determining your model quality?

A

Data quality
Data reliability (no missing values, duplicate values, or bad features)

14
Q

How do you make sure a dataset is reliable?

A

Check for label errors
Check for noise in features
Check for outliers and data skew

15
Q

What is normalization?

A

It transforms features so that they are on a similar scale.
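One common form of normalization is min-max scaling. A minimal sketch (function name is illustrative):

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    # Linearly map values from [min(values), max(values)] onto [lo, hi].
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) / (vmax - vmin) * (hi - lo) for v in values]
```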

16
Q

What does data skew mean?

A

It means the distribution is not symmetric; values pile up on one side and outliers sit in a long tail.
If the skew is in the target variable, you can use oversampling or undersampling.

17
Q

What is scaling?

A

Converting floating-point feature values from their natural range into a standard range.

18
Q

What are the benefits of scaling?

A

Helps gradient descent converge faster in deep neural networks
Helps avoid NaN traps
Prevents features with wider ranges from being given too much importance

19
Q

What is log scaling?

A

When feature values follow a power law or grow very large, taking the log brings them into a comparable range.
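A small illustration, assuming made-up view counts that span several orders of magnitude:

```python
import math

# Hypothetical power-law-like values spanning several orders of magnitude
views = [3, 30, 300, 3000, 30000]

# log10 compresses them into a narrow, comparable range
log_views = [math.log10(v) for v in views]
```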

20
Q

What is clipping?

A

Cap all feature values above (or below) a certain threshold to that threshold.
It can be applied before or after normalization.
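A minimal clipping sketch (function name and bounds are illustrative):

```python
def clip(values, lower, upper):
    # Cap every value into the [lower, upper] range.
    return [min(max(v, lower), upper) for v in values]
```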

21
Q

What is Z-score?

A

scaled value = (value - mean) / stddev
The value is calculated as standard deviations away from the mean.
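The formula above, as a minimal standard-library sketch:

```python
import statistics

def z_scores(values):
    # scaled value = (value - mean) / stddev
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stddev for v in values]
```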

22
Q

What visualization or statistical techniques can be used to detect outliers?

A

Detection: box plots, Z-scores, interquartile range (IQR)
Handling: remove the outliers, impute them, or clip them
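The interquartile-range rule mentioned above can be sketched as follows (the 1.5 multiplier is the conventional default):

```python
import statistics

def iqr_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]
```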

23
Q

What are the purposes of data analysis and exploration?

A

Lead to key insights
Define a schema

24
Q

What is TensorFlow Data Validation for?

A

Understanding, validating, and monitoring ML data at scale to detect data and schema anomalies.

25
Q

What are the benefits of having a schema?

A

Enable metadata-driven preprocessing
Validate new data and catch anomalies

26
Q

What are the key TFX libraries?

A

TF Data Validation: data analysis and schema validation
TF Transform: data processing and feature engineering
TF Model Analysis: model evaluation and analysis
TF Serving: Serving models

27
Q

What are the uses of TFDV?

A

Producing a data schema
Defining the baseline used to detect skew or drift between training and serving

28
Q

What are the characteristics of TFDV?

A

Built on Apache Beam, which supports batch and streaming pipelines; it can run on Google Cloud Dataflow.
Dataflow is a managed service for data processing.
Dataflow integrates with serverless services such as BigQuery, Cloud Storage, and Vertex AI Pipelines.

29
Q

What is imbalanced data?

A

The classes in a dataset are not equally represented.
You can perform oversampling (of the minority class) or undersampling (of the majority class).
Alternatively, downsample the majority class and upweight the downsampled examples; this helps the model converge faster.
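A minimal sketch of downsampling with upweighting (names and the factor are illustrative): each kept majority example carries a weight equal to the downsampling factor, so the class's total contribution to the loss is preserved.

```python
import random

def downsample_and_upweight(examples, labels, majority_label, factor):
    # Keep roughly 1/factor of the majority class; give survivors a
    # weight of `factor` so the class's overall influence is unchanged.
    weighted = []
    for x, y in zip(examples, labels):
        if y == majority_label:
            if random.random() < 1.0 / factor:
                weighted.append((x, y, float(factor)))
        else:
            weighted.append((x, y, 1.0))  # minority examples keep weight 1
    return weighted
```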

30
Q

What is dataset splitting?

A

Training: train the model
Validation: hyperparameter tuning
Test: evaluate the performance
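A common way to produce these three splits is a shuffled random split. A minimal sketch (fractions, seed, and function name are illustrative):

```python
import random

def split_dataset(rows, val_frac=0.15, test_frac=0.15, seed=42):
    # Shuffle a copy, then carve off the test and validation slices.
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test
```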

31
Q

How do you split a dataset for online systems?

A

Split the data by time, since the training data will be older than the serving data.
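A minimal time-based split sketch (field name and cutoff are illustrative):

```python
def time_split(records, timestamp_key, cutoff):
    # Train on everything before the cutoff; evaluate on what comes after,
    # mirroring production, where training data is older than serving data.
    train = [r for r in records if r[timestamp_key] < cutoff]
    test = [r for r in records if r[timestamp_key] >= cutoff]
    return train, test
```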

32
Q

What are the ways to handle missing data?

A

Delete a row if it has more than one missing feature value
Delete a column if more than 50% of its values are missing
Replace missing numeric data with the mean, median, or mode
Replace missing categorical data with the most frequent category
Replace with the last observation (last observation carried forward)
Use interpolation for time series
Some ML algorithms can handle missing values natively
Use a machine learning model to predict the missing values
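Mean imputation, the simplest option from the list above, as a minimal sketch (None marks a missing value here):

```python
import statistics

def impute_mean(values):
    # Replace missing entries (None) with the mean of the observed values.
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]
```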

33
Q

What is data leakage?

A

Exposing test data during training.
It leads to overfitting and overly optimistic evaluation.

34
Q

What are the reasons for data leakage?

A

Adding the target variable as a feature
Including test data in the training data
Using information about the target variable that only becomes available after deployment
Applying preprocessing techniques to the entire dataset before splitting

35
Q

What are the situations indicating data leakage?

A

The predicted output is suspiciously close to the actual output.
Some features are very highly correlated with the target.

31
Q

How do you prevent data leakage?

A

Drop features that are suspiciously highly correlated with the target
Split data into train, validation, and test sets before preprocessing
Preprocess training and test data separately
Use a time cutoff for time-series data
Use cross-validation when you have limited data
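"Preprocess training and test data separately" means fitting preprocessing statistics on the training split only. A minimal sketch with made-up numbers:

```python
import statistics

train = [10.0, 20.0, 30.0]
test = [25.0, 40.0]

# Fit scaling statistics on the training split ONLY...
mean = statistics.mean(train)
stddev = statistics.pstdev(train)

# ...then apply the same statistics to both splits. Fitting on the full
# dataset would leak information about the test set into training.
train_scaled = [(v - mean) / stddev for v in train]
test_scaled = [(v - mean) / stddev for v in test]
```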