W6: Missing Data Flashcards
Name 2 reasons why missing data is problematic.
- Results based only on the non-missing data may be biased
- Loss of efficiency (fewer cases, so less precise estimates)
What is list-wise deletion?
Analysing only complete cases
E.g. discarding a case's data on x and z because its observation on y is missing
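A minimal sketch of list-wise deletion, assuming a toy dataset where None marks a missing value. The data and the function name listwise_delete are hypothetical, for illustration only.

```python
# Hypothetical toy data: each row is one case; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 4.0, "y": None, "z": 6.0},   # y is missing, so x and z are discarded too
    {"x": 7.0, "y": 8.0, "z": None},   # z is missing, so the whole case goes
    {"x": 10.0, "y": 11.0, "z": 12.0},
]


def listwise_delete(cases):
    """Keep only the cases with no missing value on any variable."""
    return [case for case in cases if all(v is not None for v in case.values())]


complete_cases = listwise_delete(rows)
print(len(rows), "cases before,", len(complete_cases), "complete cases after")
```

Note that the observed x and z values in the second case are thrown away along with the missing y, which is where the loss of efficiency comes from.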
What is MCAR (missing completely at random)?
Missingness is completely independent of the estimate of our parameter(s) of interest
E.g. the dog (the missingness mechanism) eats homework regardless of the value of the homework
Is there an empirical way to determine which missing data assumption is correct (i.e. MCAR / MAR / MNAR)?
No
What is 1 way to assess missing data assumptions?
Sensitivity analysis: see how the results change by comparing analyses of imputed data with list-wise deletion
What is MAR (missing at random)?
When missingness is conditionally independent of the unobserved estimate of our parameter(s) of interest
Conditionally dependent on the values of variables we observed
E.g. the dog eats homework only if the student is female
What is MNAR (missing not at random)?
Missingness is associated with the estimate of our parameter(s) of interest
Dependent on unobserved/uncollected values of the outcome of interest
E.g. the dog eats bad homework, so missingness is directly related to the parameter(s) of interest
Will listwise deletion yield biased or unbiased estimates for
a) MCAR?
b) MAR?
c) MNAR?
a) Unbiased
b) Possible to recover unbiased estimates if the variables that missingness is conditioned on are observed
But generally biased if only complete cases are used (i.e. list-wise deletion)
c) Cannot recover unbiased estimates (data needed to recover them is itself missing)
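The three answers above can be checked with a small simulation. This is only a sketch under stated assumptions: the variable names, sample size, missingness rates and score distribution are all made up to mirror the dog-and-homework examples, and the parameter of interest is taken to be the mean homework score.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20000
# Hypothetical data: 'female' is always observed; 'score' is the homework mark.
female = [random.random() < 0.5 for _ in range(n)]
score = [random.gauss(60 + 10 * f, 10) for f in female]  # females score higher on average


def complete_case_mean(missing_flags):
    """Mean score after list-wise deletion of the cases flagged as missing."""
    return statistics.mean(s for s, m in zip(score, missing_flags) if not m)


# MCAR: the dog eats homework at random, regardless of anything.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: the dog only eats homework from female students (an observed variable).
mar = [f and random.random() < 0.6 for f in female]
# MNAR: the dog eats bad homework (depends on the unobserved score itself).
mnar = [s < 60 and random.random() < 0.6 for s in score]

true_mean = statistics.mean(score)
cc_mcar = complete_case_mean(mcar)
cc_mar = complete_case_mean(mar)
cc_mnar = complete_case_mean(mnar)


def group_mean(is_female):
    """Observed mean score within one sex under the MAR mechanism."""
    return statistics.mean(
        s for s, f, m in zip(score, female, mar) if f == is_female and not m
    )


# Under MAR, conditioning on the observed variable recovers the mean.
p_female = sum(female) / n
mar_adjusted = p_female * group_mean(True) + (1 - p_female) * group_mean(False)

for label, value in [("true", true_mean), ("MCAR complete cases", cc_mcar),
                     ("MAR complete cases", cc_mar), ("MAR conditioned on sex", mar_adjusted),
                     ("MNAR complete cases", cc_mnar)]:
    print(f"{label:>24}: {value:.2f}")
```

In this setup the MCAR complete-case mean should sit close to the true mean, the MAR complete-case mean is pulled down until sex is conditioned on, and the MNAR mean is pulled up with nothing observed that could correct it.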
What are multiple imputations?
Estimating replacement values for each missing value across multiple datasets
What is the outcome of multiple imputations?
Pooled estimates (they include sampling uncertainty (V hat), so they are not called average estimates)
What type of uncertainty/variation is calculated in each imputed dataset?
Sampling variation, overall uncertainty estimate (V hat)
What 3 types of variation/uncertainty does the average of a set of imputed estimates (i.e. pooled estimates) have? These add up to total uncertainty
- Sampling variation (V hat)
- Between-imputed dataset variation / Missing data uncertainty (estimate of uncertainty from changes in imputed missing data itself across imputed datasets)
- Uncertainty from not generating an infinite number of imputed datasets (B/M)
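A minimal sketch of how the three components could be added up, assuming the standard pooling rules (Rubin's rules). It assumes V hat is taken as the average of the per-dataset sampling variances and B as the variance of the estimates across the M imputed datasets. The function name pool_estimates and the numbers are hypothetical.

```python
import statistics


def pool_estimates(estimates, variances):
    """Pool one parameter across M imputed datasets.

    estimates: the parameter estimate from each imputed dataset
    variances: the sampling variance of that estimate in each dataset
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)   # pooled point estimate
    v_hat = statistics.mean(variances)    # sampling variation (assumed: average within-dataset variance)
    b = statistics.variance(estimates)    # between-imputed dataset variation (divides by m - 1)
    total = v_hat + b + b / m             # total uncertainty = V hat + B + B/M
    return pooled, v_hat, b, total


# Hypothetical results from M = 5 imputed datasets.
estimates = [2.0, 2.2, 1.8, 2.1, 1.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, v_hat, b, total = pool_estimates(estimates, variances)
print(f"pooled = {pooled:.3f}, V hat = {v_hat:.3f}, B = {b:.3f}, B/M = {b / 5:.3f}, total = {total:.3f}")
```

The square root of the total would be the pooled standard error.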
What can you do to lower the level of uncertainty in multiple imputations?
Is it a good solution?
Increase m, the number of imputed datasets
No: returns diminish and the analysis slows down
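A quick arithmetic sketch of the diminishing returns, assuming the only term that shrinks as more datasets are imputed is B/M. The value of B is hypothetical.

```python
# Hypothetical between-imputation variance B; only the B/M term depends on M.
B = 0.025

extra_uncertainty = {m: B / m for m in (5, 25, 100, 1000)}

for m, extra in extra_uncertainty.items():
    print(f"M = {m:>4}: B/M = {extra:.6f}")

# Gain from each increase in M gets smaller and smaller.
gain_5_to_25 = extra_uncertainty[5] - extra_uncertainty[25]
gain_25_to_100 = extra_uncertainty[25] - extra_uncertainty[100]
gain_100_to_1000 = extra_uncertainty[100] - extra_uncertainty[1000]
```

Each jump in M costs more computation but removes less uncertainty than the previous one.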
How many imputed datasets are recommended?
25-100
MI steps in R (step 1)
What dataset do you start with?
Raw dataset with missing data
MI steps in R (step 2)
What are missing values filled with?
Initial estimates, e.g. the mean or random numbers from the observed data range
MI steps in R (step 3)
What model is built for each variable with missing data using other variables?
Prediction model; GLMs are commonly used, but any model can be used
MI steps in R (step 4)
What is the prediction model used for?
To predict missing data
These predictions do not capture uncertainty. We only want uncertainty in the imputations, not in the data (predictions) used to build the imputation model
MI steps in R (step 5)
What steps are repeated to get a complete dataset (until convergence is reached)?
step 2 (fill in missing values with initial estimates)
step 3 (build a prediction model based on step 2)
Are iterations and multiple imputed datasets the same thing?
No. Iterations occur within each imputed dataset.
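The steps can be sketched in plain Python to show how iterations sit inside each imputed dataset. This is an illustration under loud assumptions, not what any R package does internally: two variables only, a simple linear regression as the prediction model, the observed mean as the initial estimate, and random noise with the residual spread added to each prediction so that the imputations carry uncertainty. All names (fit_line, impute_once, raw) are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(residuals)


def impute_once(data, n_iter=10, rng=random):
    """Build ONE imputed dataset from two equal-length columns (None = missing)."""
    names = list(data)
    missing = {k: [v is None for v in data[k]] for k in names}
    # Step 2: fill missing values with an initial estimate (the observed mean).
    filled = {}
    for k in names:
        observed_mean = statistics.mean(v for v in data[k] if v is not None)
        filled[k] = [observed_mean if v is None else v for v in data[k]]
    # Step 5: repeat the model-and-predict cycle for a number of iterations.
    for _ in range(n_iter):
        for target in names:
            if not any(missing[target]):
                continue
            predictor = names[1 - names.index(target)]
            # Step 3: build the prediction model on cases where the target was observed.
            observed = [i for i, m in enumerate(missing[target]) if not m]
            a, b, sd = fit_line([filled[predictor][i] for i in observed],
                                [filled[target][i] for i in observed])
            # Step 4: predict the missing values (assumption: plus residual noise).
            for i, m in enumerate(missing[target]):
                if m:
                    filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled


# Step 1: raw dataset with missing data (hypothetical values).
raw = {
    "x": [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0],
    "y": [2.1, None, 6.2, 7.9, None, 12.1, 13.8, 16.0],
}

rng = random.Random(1)
# Multiple imputed datasets: the whole iterative procedure is run once per dataset.
imputed_datasets = [impute_once(raw, n_iter=10, rng=rng) for _ in range(5)]
print(len(imputed_datasets), "imputed datasets, each built with 10 iterations")
```

Here n_iter counts iterations inside a single dataset, while the list comprehension at the bottom controls how many imputed datasets are produced.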
What is the problem with prediction models with GLMs when using small datasets (e.g 100 people)?
GLMs require a sample size larger than the number of predictors
In small datasets it is possible to have more variables than people
What is the problem with GLM prediction models when it comes to interactions?
GLMs only include interactions between variables when the analyst specifies them, and analysts often don’t know which interactions are needed
What does the margin plot (augmented scatter plot) show?
The values of variable X when variable Y is missing (conditional missingness)
A scatterplot of the non-missing data for 2 continuous variables, with margins showing where the missing data are
What are 2 ways in which identifying patterns of missing data is useful?
- Whether there’s enough overlap of relevant variables to predict missingness
- Unexpected patterns that might indicate problems with data
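A minimal sketch of tabulating missing data patterns, assuming a toy dataset where None marks a missing value. The variable names and values are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 23, "income": 30000, "score": 71},
    {"age": 31, "income": None, "score": 64},
    {"age": 45, "income": None, "score": None},
    {"age": 52, "income": 41000, "score": 80},
    {"age": 38, "income": None, "score": 58},
]
variables = ["age", "income", "score"]


def missing_pattern(case):
    """Return a tuple of flags, True where that variable is missing."""
    return tuple(case[v] is None for v in variables)


# Count how many cases share each pattern of missingness.
patterns = Counter(missing_pattern(case) for case in rows)

for pattern, count in patterns.most_common():
    label = ", ".join(v for v, is_missing in zip(variables, pattern) if is_missing) or "none"
    print(f"{count} case(s) missing: {label}")
```

A table like this shows at a glance whether the cases missing one variable still have other variables observed to predict from, and whether any pattern looks odd enough to suggest a data problem.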