Lecture 4 Flashcards

1
Q

4.1 Why is data pre-processing and data cleaning important?

A

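Analysis is only as reliable as the data it uses, so data quality has to be checked and improved first. Data quality is measured by: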
• Accuracy (Correct or wrong, accurate or not)
• Completeness (Not recorded, unavailable)
• Consistency (E.g. discrepancies in representation)
• Timeliness (Updated in a timely way)
• Believability (Do I trust the data is correct?)
• Interpretability (How easily can I understand the data?)

2
Q

4.2 What is the need for and what is involved in each of the major data pre-processing activities (cleaning, integration, reduction and transformation)?

A
Cleaning:
• Incomplete (missing) data
• Noisy data
• Inconsistent data
• Intentionally disguised data

Integration: Bringing data from multiple sources together.
• Resolve conflicts
• Detect duplicates

Reduction: Decrease the number of features or instances.
• Sampling strategies
• Remove irrelevant features and reduce noise
• Easier to visualise, faster to analyse

Transformation: Change the scale or representation of the feature values, e.g. scaling, normalisation
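
Sketch (not from the lecture): two common transformations, assuming "scaling" means min-max scaling to [0, 1] and "normalisation" means z-score standardisation. The function names and the height values are made up for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift and scale values to mean 0 and population standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

heights_cm = [150, 160, 170, 180, 190]  # hypothetical feature values
scaled = min_max_scale(heights_cm)
standardised = z_score(heights_cm)
```

Both transformations keep the order of the values and only change their scale.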

3
Q

4.3 Define the terms: features, attributes, instances, objects.

A

Features, Attributes: Columns of data

Instances, Objects: data items or rows
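
Sketch (not from the lecture): a tiny made-up dataset stored as a list of rows, to show which part is a feature and which is an instance. The column names and values are hypothetical.

```python
# Each dict is one instance (object, row); each key is one feature (attribute, column).
dataset = [
    {"height_cm": 170, "eye_colour": "brown"},
    {"height_cm": 182, "eye_colour": "blue"},
    {"height_cm": 165, "eye_colour": "green"},
]

features = list(dataset[0].keys())   # the column names
n_instances = len(dataset)           # the number of rows

# One feature read across all instances gives one column of data.
heights = [row["height_cm"] for row in dataset]
```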

4
Q

4.4 What is the difference between categorical and continuous features?

A

Categorical features: Can only take one of a limited set of discrete values

Continuous features: Can take any value within a range

5
Q

4.5 What are some reasons why data might be missing? What are possible causes?

A
• Malfunction of equipment (e.g. sensors)
• Not recorded due to a misunderstanding
• Not considered important at the time of entry
• Deliberately left out
6
Q

4.6 What is the difference between data that are missing at random and data that are not missing at random?

A

Missing completely at random: Data are missing independently of both observed and unobserved data.
- E.g. flipping a coin to decide whether or not to answer an exam question.
Not missing at random: There is a reason the data are missing, so whether a value is missing depends on the data themselves.
- E.g. an exam question is set in a hard-to-understand language. People who don’t answer probably don’t understand that language.
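
Sketch (not from the lecture): a small simulation of the two cases using made-up exam scores. The score list, the 0.5 coin-flip probability and the cut-off of 40 are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical exam scores: 1000 students, scores 0..99, true mean 49.5.
scores = [i % 100 for i in range(1000)]

# Random missingness: a coin flip decides whether each score is recorded,
# independently of the score itself.
observed_random = [s for s in scores if rng.random() < 0.5]

# Non-random missingness: assume students scoring below 40 skip the question,
# so whether a value is missing depends on the value itself.
observed_not_random = [s for s in scores if s >= 40]

true_mean = mean(scores)
random_mean = mean(observed_random)          # stays close to the true mean
not_random_mean = mean(observed_not_random)  # biased upwards
```

With random missingness the observed values are still a fair sample; with non-random missingness the observed values systematically misrepresent the full data.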

7
Q

4.7 What are the advantages/disadvantages of the following strategy for handling missing data: delete all instances with a missing value?

A

+ Easy to analyse the new complete dataset
- May bias the analysis if the sample size is small or if there is structure in the missing data (e.g. the data are not missing at random)
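
Sketch (not from the lecture): deleting every instance that has a missing value, with `None` standing in for "missing". The rows and the `drop_incomplete` helper are hypothetical.

```python
# Hypothetical dataset; None marks a missing value.
rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 41, "income": None},
    {"age": 33, "income": 61000},
]

def drop_incomplete(rows):
    """Keep only the rows in which every feature has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

complete = drop_incomplete(rows)
n_dropped = len(rows) - len(complete)  # how much data the deletion cost
```

Here half of the instances are lost even though only two individual values were missing, which is why this strategy hurts most on small samples.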

8
Q

4.7 What are the advantages/disadvantages of the following strategy for handling missing data: manual correction?

A

+ Can use domain knowledge to get a close approximation of the true value
- Time consuming
- Impractical for large data sets
- Quality depends on the level of domain knowledge
9
Q

4.7 What are the strategies for handling missing data?

A
1. Delete all instances with a missing value: Drop every row that has at least one missing value
2. Manual correction: A human inspects each missing value and fills it in using their expert knowledge
3. Imputation: Replace each missing value with a substitute. Once all missing values are imputed, standard analysis techniques for complete datasets can be used
10
Q

4.8 What are the advantages/disadvantages of the following strategy for imputing missing values: fill in with zeros?

A

Fill in with zeros
+ Simple
+ Won’t break application programs
- Limited utility for analysis
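
Sketch (not from the lecture): zero filling on made-up temperature readings, with `None` standing in for "missing". It shows one way the result has limited use for analysis: the filled-in zeros drag the mean away from the recorded values.

```python
from statistics import mean

readings = [20.0, None, 22.0, None, 21.0]  # hypothetical temperatures

# Replace every missing reading with 0.0; the list is now complete.
zero_filled = [0.0 if r is None else r for r in readings]

observed = [r for r in readings if r is not None]
observed_mean = mean(observed)         # mean of the values actually recorded
zero_filled_mean = mean(zero_filled)   # pulled towards zero by the fill-ins
```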

11
Q

4.8 What are the advantages/disadvantages of the following strategy for imputing missing values: fill in with mean/median value?

A

Fill in with mean/median value
+ Can be good for supervised classification
+ Apply separately to each attribute
- Reduces the variance of the feature
- Gives an incorrect view of the distribution of that attribute
- Changes the feature’s relationships with other features

Note:
• Use the median instead of the mean if the distribution is skewed
• Use mode (most frequent value) imputation for categorical features
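
Sketch (not from the lecture): mean and mode imputation on made-up data, with `None` standing in for "missing". The `impute` helper is hypothetical; passing `statistics.median` instead of `mean` gives median imputation.

```python
from statistics import mean, mode, pvariance

def impute(values, summarise):
    """Replace each None with summarise(observed values), e.g. mean, median or mode."""
    observed = [v for v in values if v is not None]
    fill = summarise(observed)
    return [fill if v is None else v for v in values]

# Numeric feature: fill with the mean of the observed values.
values = [2, 4, None, 6, 8, None]
observed = [v for v in values if v is not None]
filled = impute(values, mean)

# The mean is unchanged, but the variance shrinks because the
# imputed points sit exactly on the mean.
variance_before = pvariance(observed)
variance_after = pvariance(filled)

# Categorical feature: fill with the most frequent value.
colours = ["red", None, "blue", "red"]
filled_colours = impute(colours, mode)
```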

12
Q

4.8 What are the advantages/disadvantages of the following strategy for imputing missing values: fill in with category mean?

A

Fill in with category mean
+ Helps preserve the relationship between the category and the feature being imputed.
- Can create a relationship between the category and the feature where none existed before.
- Using more information (the category) is not guaranteed to give a better estimate.
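
Sketch (not from the lecture): category-mean imputation on made-up data, with `None` standing in for "missing". Each missing value is replaced by the mean of the observed values in the same category. The helper name, the keys and the animal weights are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_by_category(rows, category_key, value_key):
    """Fill each missing value with the mean of its own category."""
    groups = defaultdict(list)
    for r in rows:
        if r[value_key] is not None:
            groups[r[category_key]].append(r[value_key])
    means = {cat: mean(vals) for cat, vals in groups.items()}

    filled = []
    for r in rows:
        new_row = dict(r)
        if new_row[value_key] is None:
            # Stays None if the category has no observed values at all.
            new_row[value_key] = means.get(new_row[category_key])
        filled.append(new_row)
    return filled

rows = [
    {"species": "cat", "weight_kg": 4},
    {"species": "cat", "weight_kg": 5},
    {"species": "cat", "weight_kg": None},
    {"species": "dog", "weight_kg": 20},
    {"species": "dog", "weight_kg": 30},
    {"species": "dog", "weight_kg": None},
]
filled = impute_by_category(rows, "species", "weight_kg")
```

The missing cat is given a cat-like weight and the missing dog a dog-like weight, rather than both receiving the overall mean.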

13
Q

4.8 What are the strategies for imputation of missing values?

A
1. Fill in with zeros
2. Fill in with mean/median value
3. Fill in with category mean