4 Cleaning and Processing Data Flashcards
What is the primary issue with duplicate data in a dataset?
Duplicate data can skew or bias your results, or even invalidate your analysis entirely.
Define duplicate data.
Duplicate data is when a specific data point recurs multiple times within a dataset.
What is the impact of duplicate data on descriptive statistics?
It can distort averages and percentages, leading to incorrect conclusions about the dataset.
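As a rough illustration of the distortion described above, the sketch below compares a mean computed with and without repeated rows. It assumes pandas; the `orders` table and its column names are hypothetical toy data, not part of the original material.

```python
import pandas as pd

# Hypothetical toy data: the order with order_id 3 was recorded three times.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 3],
    "amount": [10.0, 20.0, 90.0, 90.0, 90.0],
})

# Mean over all rows, repeats included.
mean_with_duplicates = orders["amount"].mean()

# Mean after keeping only one copy of each fully identical row.
mean_deduplicated = orders.drop_duplicates()["amount"].mean()

print(mean_with_duplicates, mean_deduplicated)
```

In this toy case the repeated large order pulls the average upward, which is the kind of distortion the card refers to.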
What is redundant data?
Redundant data refers to columns that can be used to perfectly predict other columns.
How does redundant data differ from duplicate data?
Duplicate data is a copy of a row, whereas redundant data is a copy of a column.
What is multicollinearity?
Multicollinearity occurs when multiple independent variables in a model are highly correlated.
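One simple way to spot highly correlated columns is a pairwise correlation matrix. The sketch below assumes pandas; the `housing` table is hypothetical, and `area_sqm` is deliberately built as a unit conversion of `area_sqft` so that the two columns are redundant.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
housing = pd.DataFrame({
    "area_sqft": [800.0, 1000.0, 1200.0, 1500.0, 2000.0],
    "rooms": [3, 3, 2, 4, 3],
})

# A redundant column: the same measurement expressed in square metres.
housing["area_sqm"] = housing["area_sqft"] * 0.092903

# Pairwise Pearson correlations between all numeric columns.
corr = housing.corr()
print(corr.round(3))
```

A correlation at or very near 1 (or -1) between two predictors suggests that one of them adds no new information. A correlation matrix only reveals pairwise relationships, so it is a first check rather than a complete diagnosis.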
What is a common approach to handle duplicate data?
The most common approach is to delete the duplicate rows, keeping a single copy of each record.
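A minimal sketch of finding and removing duplicate rows, assuming pandas. The `customers` table is hypothetical; `duplicated()` and `drop_duplicates()` are standard DataFrame methods.

```python
import pandas as pd

# Hypothetical toy data: the first and third rows are identical.
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "city": ["Lima", "Oslo", "Lima", "Rome"],
})

# True for every row that repeats an earlier row; the first copy is False.
duplicate_mask = customers.duplicated()
n_duplicates = int(duplicate_mask.sum())

# Keep the first copy of each record and renumber the index.
deduplicated = customers.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Counting the duplicates before dropping them is a useful habit, since an unexpectedly large count can point to a problem in how the data was collected or merged.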
What are some potential issues with having redundant data in a statistical model?
It can make results harder to interpret and can lead to inaccurate models when applied to the population.
What is missing data?
Missing data refers to gaps in a dataset where no information is available for certain entries.
Why is missing data problematic for data analysts?
Many analyses fail or drop incomplete rows when they encounter null values, which causes errors or reduces statistical power.
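To see how much data is missing and how many complete rows remain, something like the following can help. It assumes pandas and NumPy; the `survey` table is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps marked as NaN.
survey = pd.DataFrame({
    "age": [25, 31, np.nan, 40],
    "income": [50.0, np.nan, np.nan, 70.0],
})

# Number of missing entries in each column.
missing_per_column = survey.isna().sum()

# Rows with no missing values at all (listwise deletion).
complete_rows = survey.dropna()

print(missing_per_column)
print(len(survey), "rows before,", len(complete_rows), "rows after dropping gaps")
```

Here half of the rows are lost once incomplete rows are dropped, which shows how missing values shrink the usable sample.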
What are the three main categories of missing data?
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)
What does Missing Completely at Random (MCAR) mean?
Data is MCAR when the missingness has no connection to any values in the dataset, whether observed or missing.
What does Missing at Random (MAR) imply?
MAR means the likelihood that a value is missing is related to another recorded variable, but not to the missing value itself.
Describe Missing Not at Random (MNAR).
MNAR occurs when the missing data is related to the missing value itself or to some unrecorded variable or factor.
What is a recommended practice when working with datasets?
It is generally good practice to work on a copy of your data instead of the original.
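A minimal sketch of working on a copy, assuming pandas. The `raw` table is hypothetical, and filling gaps with 0 is only a stand-in for whatever cleaning step is actually appropriate.

```python
import pandas as pd

# Hypothetical original data that should stay untouched.
raw = pd.DataFrame({"score": [1.0, None, 3.0]})

# copy() returns an independent DataFrame (a deep copy by default).
working = raw.copy()

# Illustrative cleaning step applied to the copy only.
working["score"] = working["score"].fillna(0.0)

print(raw["score"].isna().sum(), working["score"].isna().sum())
```

Because the changes are made on `working`, the original `raw` data still contains its gap and can be revisited if the cleaning choice turns out to be wrong.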
What can happen if too much redundant data is included in a dataset?
It can lead to multicollinearity, complicating the interpretation of statistical models.
Fill in the blank: Redundant data can lead to _______ in statistical models.
multicollinearity
True or False: All methods for dealing with missing data are universally accepted.
False
How can one create a subset of data excluding redundant columns?
By using functions like drop() to exclude the redundant variables.
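Assuming the `drop()` mentioned above is the pandas DataFrame method, a subset without a redundant column could be created like this. The `housing` table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: area_sqm repeats the information in area_sqft.
housing = pd.DataFrame({
    "area_sqft": [800.0, 1000.0, 1200.0],
    "area_sqm": [74.3, 92.9, 111.5],
    "price": [150, 190, 230],
})

# drop() returns a new DataFrame; the original is left unchanged.
subset = housing.drop(columns=["area_sqm"])

print(list(subset.columns))
```

Since `drop()` does not modify `housing` unless told to, the full table remains available alongside the reduced subset.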
What is the consequence of having missing data that is not random?
It can introduce bias into the results.
What is the main reason for identifying the type of missing data?
It helps determine how much the missing data will influence the outcome and potential bias.
What does MNAR stand for?
Missing Not at Random.
What is a key characteristic of MNAR data?
It has a connection to some variable or type of information that was not recorded.
Why is MNAR data considered problematic?
It is the most likely to cause bias in results.
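The bias from MNAR data can be illustrated with a small constructed example. The sketch assumes pandas and NumPy; the incomes and the rule that high earners leave the field blank are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical true incomes (in thousands), normally unknown to the analyst.
true_income = pd.Series([20.0, 30.0, 40.0, 50.0, 100.0, 200.0])

# Assumed MNAR mechanism: respondents earning 100 or more do not answer,
# so the missingness depends on the missing value itself.
reported_income = true_income.where(true_income < 100, np.nan)

true_mean = true_income.mean()
observed_mean = reported_income.mean()  # pandas skips NaN by default

print(round(true_mean, 2), observed_mean)
```

The observed mean is well below the true mean, and nothing in the recorded data signals the problem, which is why this category is the hardest to correct for.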