Lecture 4 Flashcards

Question 1

Q

4.1 Why is data pre-processing and data cleaning important?

Answer

A

Pre-processing and cleaning improve data quality, which is measured by:
• Accuracy (Correct or wrong, accurate or not)
• Completeness (Not recorded, unavailable)
• Consistency (E.g. discrepancies in representation)
• Timeliness (Updated in a timely way)
• Believability (Do I trust the data is correct?)
• Interpretability (How easily can I understand the data?)

Question 2

Q

4.2 What is the need for and what is involved in each of the major data pre-processing activities (cleaning, integration, reduction and transformation)?

Answer

A

Cleaning:
• Incomplete (missing) data
• Noisy data
• Inconsistent data
• Intentionally disguised data

Integration: Bringing data from multiple sources together.
• Resolve conflicts
• Detect duplicates

Reduction: Decrease the number of features or instances.
• Sampling strategies
• Remove irrelevant features and reduce noise
• Easier to visualise, faster to analyse

Transformation: Changing the representation or scale of feature values, e.g. scaling, normalisation
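The two transformations named above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the height values are made up for the example and are not from the lecture.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardise to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical feature: heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

scaled = min_max_scale(heights_cm)   # [0.0, 0.25, 0.5, 0.75, 1.0]
standardised = z_score(heights_cm)   # centred on 0, unit spread

print(scaled)
print(standardised)
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers, because a single extreme value sets the range.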

Question 3

Q

4.3 Define the terms: features, attributes, instances, objects.

Answer

A

Features, Attributes: Columns of data

Instances, Objects: Rows of data (data items)

Question 4

Q

4.4 What is the difference between categorical and continuous features?

Answer

A

Categorical features: Can take only a limited set of discrete values (e.g. colour, blood type)

Continuous features: Can take any value within a range (e.g. height, temperature)

Question 5

Q

4.5 What are some reasons why data might be missing? What are possible causes?

Answer

A

Malfunction of equipment (e.g. sensors)
Not recorded due to misunderstanding
May not be considered important at time of entry
Deliberately omitted

Question 6

Q

4.6 What is the difference between data being missing randomly and data not being missing randomly?

Answer

A

Missing completely at random: Data are missing independently of both observed and unobserved data.
- E.g. flipping a coin to decide whether or not to answer an exam question.
Not missing at random: The data are missing for a reason related to the data themselves.
- E.g. an exam question is set in a hard-to-understand language. People who don’t answer probably don’t understand that language.
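A small simulation can make the difference concrete. The sketch below uses synthetic exam scores and two assumed missingness rules: a 30% "coin flip" for the random case, and "students who would score below 50 do not answer" for the non-random case. The numbers and thresholds are illustrative assumptions, not part of the lecture.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic exam scores for 10,000 students, clipped to the range 0-100.
scores = [min(100.0, max(0.0, rng.gauss(60, 15))) for _ in range(10_000)]
true_mean = mean(scores)

# Missing completely at random: each answer is dropped with probability 0.3,
# regardless of the score itself.
mcar_observed = [s for s in scores if rng.random() >= 0.3]

# Not missing at random: whether a score is missing depends on the score.
# Here, students who would have scored below 50 leave the question blank.
nmar_observed = [s for s in scores if s >= 50]

mcar_bias = mean(mcar_observed) - true_mean
nmar_bias = mean(nmar_observed) - true_mean

print(f"true mean:             {true_mean:.2f}")
print(f"bias when random:      {mcar_bias:+.2f}")
print(f"bias when not random:  {nmar_bias:+.2f}")
```

In the random case the observed mean stays close to the true mean, while in the non-random case it is pushed upwards because the low scores are systematically absent.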

Question 7

Q

4.7 What are the advantages/disadvantages of the following strategy for handling missing data: delete all instances with a missing value?

Answer

A

+ Easy to analyse the new complete dataset
- May bias the analysis if the sample size is small or if there is structure in the missing data (e.g. the data are not missing at random)
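Deleting incomplete instances can be sketched as a simple filter. The rows and the helper name `drop_incomplete` below are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical survey data; None marks a missing value.
rows = [
    {"age": 34,   "income": 52000, "city": "Leeds"},
    {"age": None, "income": 48000, "city": "York"},
    {"age": 29,   "income": None,  "city": "Bath"},
    {"age": 41,   "income": 61000, "city": None},
    {"age": 38,   "income": 57000, "city": "Hull"},
]

def drop_incomplete(rows):
    """Keep only the rows in which every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

complete = drop_incomplete(rows)

missing_cells = sum(v is None for r in rows for v in r.values())
total_cells = sum(len(r) for r in rows)

print(f"{missing_cells} of {total_cells} cells are missing")
print(f"{len(complete)} of {len(rows)} rows survive deletion")
```

Here only 3 of the 15 cells are missing, yet 3 of the 5 rows are discarded, which shows how quickly this strategy can shrink a small sample.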

Question 8

Q

4.7 What are the advantages/disadvantages of the following strategy for handling missing data: manual correction?

Answer

A

+ Can use domain knowledge to get a close approximation of the true value
- Time-consuming
- Impractical for large data sets
- Quality depends on the level of domain knowledge

Question 9

Q

4.7 What are the strategies for handling missing data?

Answer

A

Delete all instances with a missing value: Analyse only the remaining complete instances.
Manual correction: A human inspects the missing value and fills it in using their expert knowledge.
Imputation: Replace the missing value with a substitute value. After imputing all missing values, standard analysis techniques for complete datasets can be used.

Question 10

Q

4.8 What are the advantages/disadvantages of the following strategy for imputation of missing values: fill in with zeros?

Answer

A

Fill in with zeros
+ Simple
+ Won’t break application programs
- Limited utility for analysis
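The limited analytical value of zero-filling shows up as soon as a summary statistic is computed. A minimal sketch, assuming hypothetical temperature readings with `None` marking a missing value:

```python
from statistics import mean

# Hypothetical sensor readings in degrees Celsius; None marks a missing value.
temperatures = [21.0, None, 19.0, None, 20.0]

# Zero-filling: every missing reading becomes 0.0.
zero_filled = [0.0 if t is None else t for t in temperatures]

observed = [t for t in temperatures if t is not None]
observed_mean = mean(observed)        # 20.0
zero_filled_mean = mean(zero_filled)  # 12.0

print(observed_mean, zero_filled_mean)
```

The filled list can be passed to any program that expects numbers, but the zeros are treated as real readings and drag the mean from 20.0 down to 12.0.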

Question 11

Q

4.8 What are the advantages/disadvantages of the following strategy for imputation of missing values: fill in with mean/median value?

Answer

A

Fill in with mean/median value
+ Can be good for supervised classification
+ Apply separately to each attribute
- Reduces the variance of the feature
- Gives an incorrect view of the distribution of that attribute
- Changes the relationships to other features

Note:
• Can also use the median instead of the mean (if the distribution is skewed)
• Use mode (most frequent value) imputation for categorical features
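The points above can be checked on a tiny example. The sketch below uses made-up income values with one large outlier and a made-up colour feature; `impute` is a hypothetical helper, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median, mode, pvariance

def impute(values, fill):
    """Replace every missing entry (None) with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical skewed numeric feature (one large outlier).
incomes = [20.0, 22.0, None, 24.0, 26.0, None, 108.0]
observed = [v for v in incomes if v is not None]

mean_filled = impute(incomes, mean(observed))      # fills with 40.0
median_filled = impute(incomes, median(observed))  # fills with 24.0

# Imputed values sit exactly at the centre, so the spread shrinks.
variance_before = pvariance(observed)
variance_after = pvariance(mean_filled)

# Hypothetical categorical feature: use the most frequent value.
colours = ["red", None, "blue", "red", None]
mode_filled = impute(colours, mode([c for c in colours if c is not None]))

print(variance_before, variance_after)
print(median_filled)
print(mode_filled)
```

Because of the outlier, the mean (40.0) is larger than four of the five observed values, while the median (24.0) is a more typical fill value for this skewed feature.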

Question 12

Q

4.8 What are the advantages/disadvantages of the following strategy for imputation of missing values: fill in with category mean?

Answer

A

Fill in with category mean
+ Helps preserve the relationship between the imputed feature and the category used to compute the mean.
- Can create a relationship between the imputed feature and that category if none existed previously.
- Using more information for the imputation does not guarantee a more helpful estimate.
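Category-mean imputation can be sketched as "group, average, fill". The animal data and the helper name `impute_by_category` are hypothetical; `None` is assumed to mark a missing value, and a category with no observed values is left unfilled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: weight is missing for one cat and one dog.
rows = [
    {"species": "cat", "weight_kg": 4.0},
    {"species": "cat", "weight_kg": None},
    {"species": "cat", "weight_kg": 5.0},
    {"species": "dog", "weight_kg": 20.0},
    {"species": "dog", "weight_kg": None},
    {"species": "dog", "weight_kg": 30.0},
]

def impute_by_category(rows, category_key, value_key):
    """Fill each missing value with the mean of its own category."""
    groups = defaultdict(list)
    for r in rows:
        if r[value_key] is not None:
            groups[r[category_key]].append(r[value_key])
    group_means = {k: mean(v) for k, v in groups.items()}

    filled = []
    for r in rows:
        new_row = dict(r)  # copy so the input rows are not modified
        if new_row[value_key] is None:
            # .get returns None if the category has no observed values.
            new_row[value_key] = group_means.get(new_row[category_key])
        filled.append(new_row)
    return filled

filled = impute_by_category(rows, "species", "weight_kg")
overall_mean = mean(r["weight_kg"] for r in rows if r["weight_kg"] is not None)

print([r["weight_kg"] for r in filled])
print(overall_mean)
```

The missing cat is filled with 4.5 and the missing dog with 25.0, whereas the overall mean of 14.75 would have been a poor guess for both. The same mechanism is what can manufacture a species-weight relationship if the grouping is not actually informative.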

Question 13

Q

4.8 What are the strategies for imputation of missing values?

Answer

A

Fill in with zeros
Fill in with mean/median value
Fill in with category mean
