L6 - Preprocessing (Cleaning, Transformation, Visualisation) Flashcards

1
Q

What are the 3 issues that require data cleaning?

A

Missing Values
Outliers
Errors

2
Q

What are missing values in the context of data preprocessing?

A

Data that we expect to have but that is absent.

This can be due to human error, software bugs, etc.

3
Q

What are the 3 types of missing values? Define each…

A

MCAR (Missing Completely At Random) - A value that is missing purely by chance, unrelated to any other data. The model can often account for this.

MAR (Missing At Random) - Certain data values are more likely to be missing. The reason for the missing data is related to other observed data, but not to the missing value itself. For example, high wind speed breaks an air quality sensor.

MNAR (Missing Not At Random) - The reason for the missing value is related to the value that would have been recorded. E.g. an air quality sensor can’t take a measurement because the air quality is too poor.

4
Q

When handling missing values, what are the 2 principles we need to keep in mind?

A
  • Prioritise data information preservation
  • Minimise bias introduction
5
Q

What are the 4 solutions to missing values?

A
  • Keep as is
  • Remove rows
  • Remove columns
  • Impute values
6
Q

When should we use each of the missing value solutions?

A

Keep as is - Use when sharing the data, so that a collective decision can be made about what to do.

Remove rows - Use as a last resort when dealing with MCAR. Don’t use with MAR or MNAR, because it introduces bias.

Remove columns - If the miss rate of a column is above 25%, the column can be removed.

Impute values - Replace the missing values with a calculated value, e.g. the mean of the column.
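
As a rough illustration of the "remove columns" and "impute values" options, here is a minimal sketch in plain Python. The table, the column names and the use of `None` as the missing-value marker are all assumptions made for the example; the 25% threshold is the rule of thumb from this card.

```python
from statistics import mean

# Hypothetical toy table: column name -> values, with None marking a missing value.
table = {
    "temperature": [21.0, None, 23.0, 22.0],
    "humidity": [None, None, 40.0, None],
}

def miss_rate(values):
    """Fraction of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_columns(table, threshold=0.25):
    """Keep only columns whose miss rate does not exceed the threshold."""
    return {name: vals for name, vals in table.items()
            if miss_rate(vals) <= threshold}

def impute_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

kept = drop_sparse_columns(table)          # "humidity" (75% missing) is dropped
cleaned = {name: impute_mean(vals) for name, vals in kept.items()}
# cleaned == {"temperature": [21.0, 22.0, 23.0, 22.0]}
```

Mean imputation is only one choice here; the median or mode can be swapped in the same way.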

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the 3 methods for imputed values?

A

Average (mean, mode, median)

Regression - Use regression to predict missing values.

Interpolation
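
To make the interpolation method concrete, the sketch below fills gaps in an evenly spaced series by drawing a straight line between the nearest observed neighbours. The function name and the sample readings are hypothetical, and gaps at the very start or end of the series are deliberately left unfilled.

```python
def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest observed
    neighbours. Leading and trailing gaps are left as None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant step between two consecutive observed points.
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

readings = [10.0, None, None, 16.0, None, 20.0]  # hypothetical sensor readings
filled = interpolate_linear(readings)
# filled == [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```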

8
Q

What is the purpose of imputing values?

A

To fill in missing values with estimates while minimising the bias this introduces.

9
Q

In data cleaning, what are outliers? How are they detected?

A

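Anomalies in the dataset, i.e. values that lie far from the rest of the data.

Detected using quartiles. The interquartile range (IQR) describes the central spread of the data, and values far outside it (commonly more than 1.5 × IQR below the first quartile or above the third quartile) are flagged as outliers.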
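
A quartile-based check can be sketched as follows, assuming the common 1.5 × IQR rule (the card itself does not fix a multiplier). The sample data and the function name are hypothetical; `statistics.quantiles` is from the Python standard library.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 95]      # hypothetical measurements
outliers = iqr_outliers(data)
# outliers == [95]
```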

10
Q

What are the 4 possible responses to outliers? When should each be used?

A

Do nothing - Use when the model is robust against outliers.

Replace outliers with an upper or lower cap - Use when all data objects are needed.

Log transformation - Use when the data is heavily skewed, so that a few values are far larger than the rest.

Remove data objects with outliers - Worst option due to loss of information. Done only if the other methods aren’t possible.
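
The capping response can be sketched like this. The cap values are hypothetical and would normally come from an outlier rule such as quartile bounds; every data object is kept, only the extreme values change.

```python
def cap(values, lower, upper):
    """Clamp every value into [lower, upper], keeping all data objects."""
    return [min(max(v, lower), upper) for v in values]

incomes = [1, 50, 500]                      # hypothetical values
capped = cap(incomes, lower=10, upper=100)
# capped == [10, 50, 100]
```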

11
Q

What are the 2 types of errors in Data Cleaning?

A

Random Errors - Unpredictable errors due to inconsistency in the data.

Systematic Errors - Repeatable errors that can be traced to a source.

12
Q

What is the purpose of Data Transformation?

A

To ensure that the data is compatible for input into the model. It must have the correct encoding and data ranges.

13
Q

What are the 3 methods to bring data ranges to the same scale?

A

Standardisation - Rescale data to have a mean of 0 and a std. deviation of 1. For each feature, subtract the mean and divide by the std. deviation.

Normalisation - Rescale data to lie within a range, usually 0 to 1. For each feature, subtract the minimum and divide by the range (maximum minus minimum).

Log - Addresses skewed data caused by extreme values. Simply apply the log function to the data. Useful when we want to keep the ratios between values whilst scaling them down.
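
The three scaling methods can be compared side by side in a short sketch. It assumes min-max normalisation (subtract the minimum, divide by the range), the population standard deviation, and a base-10 log; none of these choices is fixed by the card, and the function names are hypothetical. Constant features (zero range or zero deviation) and non-positive values for the log are not handled.

```python
import math
from statistics import mean, pstdev

def standardise(values):
    """Rescale to mean 0 and standard deviation 1 (population std. deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalise(values):
    """Min-max rescale into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_scale(values):
    """Base-10 log: equal ratios between values become equal differences."""
    return [math.log10(v) for v in values]

z = standardise([2.0, 4.0, 6.0])       # mean of z is 0, std. deviation is 1
unit = normalise([2.0, 4.0, 6.0])      # [0.0, 0.5, 1.0]
logged = log_scale([1.0, 10.0, 100.0, 1000.0])  # approximately [0, 1, 2, 3]
```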

14
Q

Why and how do we perform Data Encoding?

A

To transform data into numerical form. Used for categorical data.

One-hot encoding: transform each category into its own binary column. Ordinal encoding: give each category a rank and construct a numeric attribute column from those ranks.
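
Both encoding approaches can be sketched in a few lines of plain Python. The category values, the rank order and the function names are assumptions chosen for the example.

```python
def one_hot(values):
    """One binary column per category: 1 where the row has that category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def ordinal(values, order):
    """Replace each category with its rank in the given order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

sizes = ["small", "large", "medium", "small"]   # hypothetical categorical column
binary_columns = one_hot(sizes)
# {"large": [0, 1, 0, 0], "medium": [0, 0, 1, 0], "small": [1, 0, 0, 1]}
ranks = ordinal(sizes, order=["small", "medium", "large"])
# [0, 2, 1, 0]
```

Ranks only make sense when the categories have a natural order; for unordered categories, binary columns avoid implying one.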

15
Q

What is Smoothing? When is it used?

A

Reduces noise and fluctuations in data by replacing each point with the average of its neighbours.

Used with time-series data.
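
One simple form of neighbour averaging is a centred moving average, sketched below. The window size of 3 and the sample series are assumptions; points near the edges are averaged over whichever neighbours exist.

```python
def moving_average(values, window=3):
    """Replace each point with the mean of the points in a centred window.
    Near the edges the window shrinks to the neighbours that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        neighbours = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed

series = [1.0, 2.0, 9.0, 4.0, 5.0]   # hypothetical time series with a spike
smooth = moving_average(series)
# smooth == [1.5, 4.0, 5.0, 6.0, 4.5]
```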

16
Q

Define Data Cleaning…

A

The process of detecting and correcting corrupt or inaccurate records in the data.
