Lecture 4 Flashcards
(13 cards)
4.1 Why are data pre-processing and data cleaning important?
Measuring data quality:
• Accuracy (Correct or wrong, accurate or not)
• Completeness (Not recorded, unavailable)
• Consistency (E.g. discrepancies in representation)
• Timeliness (Updated in a timely way)
• Believability (Do I trust the data is correct?)
• Interpretability (How easily can I understand the data?)
4.2 What is the need for and what is involved in each of the major data pre-processing activities (cleaning, integration, reduction and transformation)?
Cleaning:
• Incomplete (missing) data
• Noisy data
• Inconsistent data
• Intentionally disguised data
Integration: Bringing data from multiple sources together.
• resolve conflicts
• detect duplicates
Reduction: Decrease the number of features or instances.
• Sampling strategies
• Remove irrelevant features and reduce noise
• Easier to visualise, faster to analyse
Transformation: Change the values or representation of the data to suit the analysis, e.g. scaling, normalisation
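Scaling and normalisation can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the height data are made up, and min-max scaling and z-score standardisation are assumed here as two common examples of transformation, not as the lecture's prescribed methods.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on 0 with unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up data
scaled = min_max_scale(heights_cm)        # [0.0, 0.25, 0.5, 0.75, 1.0]
standardised = z_score(heights_cm)        # mean 0, standard deviation 1
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values; z-scores are often preferred when features have very different units.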
4.3 Define the terminologies: features, attributes, instances, objects.
Features, Attributes: Columns of the data
Instances, Objects: Data items, i.e. rows of the data
4.4 What is the difference between categorical and continuous features?
Categorical features: Can only take values from a fixed set of categories
Continuous features: Can take any value within a range
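A small Python sketch of the distinction, using hypothetical features (blood type and body temperature) and made-up limits chosen purely for illustration:

```python
# Hypothetical feature definitions for illustration only.
ALLOWED_BLOOD_TYPES = {"A", "B", "AB", "O"}  # categorical: fixed set of values
TEMPERATURE_RANGE_C = (30.0, 45.0)           # continuous: any value in a range

def is_valid_blood_type(value):
    """A categorical value is valid only if it is one of the allowed categories."""
    return value in ALLOWED_BLOOD_TYPES

def is_valid_temperature(value):
    """A continuous value is valid if it lies anywhere within the range."""
    lo, hi = TEMPERATURE_RANGE_C
    return lo <= value <= hi
```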
4.5 What are some reasons why data might be missing? What are possible causes?
- Malfunction of equipment (e.g. sensors)
- Not recorded due to misunderstanding
- Not considered important at the time of entry
- Deliberately left out
4.6 What is the difference between data being missing randomly and data not being missing randomly?
Missing completely at random: Whether a value is missing is independent of both the observed and the unobserved data.
- E.g. flipping a coin to decide whether or not to answer an exam question.
Missing not completely at random: There is a reason the value is missing, so whether it is missing depends on the data.
- E.g. an exam question is set in a hard-to-understand language. People who don’t answer probably don’t understand that language.
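The two cases above can be simulated roughly as follows. This is a sketch under stated assumptions: the scores are randomly generated, and the cut-off of 40 and the 20% answer rate for low scorers are arbitrary choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up exam scores for 10,000 hypothetical students.
scores = [rng.uniform(0, 100) for _ in range(10_000)]

# Completely at random: a fair coin flip decides whether each score is recorded.
mcar_observed = [s for s in scores if rng.random() < 0.5]

# Not at random: students scoring below 40 usually skip the question,
# so whether a value is missing depends on the value itself.
mnar_observed = [s for s in scores if s >= 40 or rng.random() < 0.2]

true_mean = mean(scores)
mcar_mean = mean(mcar_observed)  # stays close to true_mean
mnar_mean = mean(mnar_observed)  # noticeably higher than true_mean
```

In the coin-flip case the observed mean stays close to the true mean, while in the second case the low scores are under-represented and the observed mean is biased upwards.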
4.7 What are the advantages/disadvantages of the following strategy for handling missing data: delete all instances with a missing value?
+ Easy to analyse the new complete dataset
- May bias the analysis if the sample size is small or if there is structure in the missing data (e.g. it is not missing at random)
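A minimal sketch of deleting every instance with a missing value, assuming each instance is a dictionary and a missing value is represented by None (the records are made up):

```python
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
]

def drop_incomplete(records):
    """Keep only records in which no feature is None (treated as missing)."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = drop_incomplete(rows)
fraction_kept = len(complete) / len(rows)  # 0.5: half of the instances are lost
```

Even with only one missing value per affected row, half of this small dataset is discarded, which shows how quickly the sample size can shrink.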
4.7 What are the advantages/disadvantages of the following strategy for handling missing data: manual correction?
+ Can use domain knowledge to get a close approximation of the true value
- Time consuming
- Impractical for large datasets
- Quality depends on the level of domain knowledge
4.7 What are the strategies for handling missing data?
- Delete all instances with a missing value
- Manual correction: A human inspects the missing value and fills it in using their expert knowledge
- Imputation: Replace the missing value with a substitute one. Once all missing values are imputed, standard analysis techniques for complete datasets can be used
4.8 What are the advantages/disadvantages of the following strategy for imputation of missing values: fill in with zeros?
Fill in with zeros
+ Simple
+ Won’t break application programs
- Limited utility for analysis
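A short sketch of why zero filling has limited utility for analysis, assuming missing readings are represented by None (the temperatures are made up):

```python
from statistics import mean

temps_c = [21.0, None, 23.0, None, 22.0]  # None marks a missing reading

# Replace every missing reading with zero.
zero_filled = [0.0 if t is None else t for t in temps_c]

observed_mean = mean(t for t in temps_c if t is not None)  # 22.0
zero_filled_mean = mean(zero_filled)                        # 13.2
```

The filled list no longer contains None, so downstream programs run without errors, but the mean is pulled well below any temperature that was actually observed.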
4.8 What are the advantages/disadvantages of the following strategy for imputation of missing values: fill in with mean/median value?
Fill in with mean/median value
+ Can be good for supervised classification
+ Apply separately to each attribute
- Reduces the variance of the feature
- Gives an incorrect view of the distribution of that attribute
- Relationships to other features change
Note:
• Use the median instead of the mean if the distribution is skewed
• Use mode (most frequent value) imputation for categorical features
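Mean and median imputation for a single attribute can be sketched as follows, assuming missing values are represented by None. The helper name `impute` and the skewed income data are hypothetical, chosen for illustration.

```python
from statistics import mean, median, pvariance

def impute(values, strategy=mean):
    """Fill each None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

incomes = [30.0, 32.0, None, 35.0, None, 200.0]  # skewed made-up data

mean_filled = impute(incomes, mean)      # missing values become 74.25
median_filled = impute(incomes, median)  # missing values become 33.5

observed_var = pvariance([v for v in incomes if v is not None])
filled_var = pvariance(mean_filled)      # smaller than observed_var
```

The single large income drags the mean far above the typical value, which is why the median is the safer fill for skewed data. The variance also drops after mean filling, because every imputed value sits exactly at the centre of the distribution.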
4.8 What are the advantages/disadvantages of the following strategy for imputation of missing values: fill in with category mean?
Fill in with category mean
+ Helps preserve the relationship between the feature being imputed and the category used to compute the mean.
- Can create a relationship between the feature and the category if none existed previously.
- Having the extra category information does not guarantee that it will be helpful.
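Category mean imputation can be sketched as follows, assuming each instance is a dictionary and missing values are represented by None. The helper name `impute_by_category` and the height data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

people = [
    {"group": "child", "height": 120.0},
    {"group": "child", "height": 130.0},
    {"group": "child", "height": None},
    {"group": "adult", "height": 170.0},
    {"group": "adult", "height": 180.0},
    {"group": "adult", "height": None},
]

def impute_by_category(records, category_key, value_key):
    """Fill each missing value with the mean of its own category."""
    observed = defaultdict(list)
    for r in records:
        if r[value_key] is not None:
            observed[r[category_key]].append(r[value_key])
    means = {cat: mean(vals) for cat, vals in observed.items()}
    # A category with no observed values at all keeps None.
    return [
        {**r, value_key: means.get(r[category_key])} if r[value_key] is None else dict(r)
        for r in records
    ]

filled = impute_by_category(people, "group", "height")
# The missing child height becomes 125.0 and the missing adult height 175.0.
```

Using the overall mean instead would give both missing people the same height of 150.0, which blurs the difference between the two groups.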
4.8 What are the strategies for imputation of missing values?
- fill in with zeros
- fill in with mean/median value
- fill in with category mean