W6: Missing Data Flashcards by Val Y

Name 2 reasons why missing data is problematic.

Results based only on the non-missing data may be biased
It causes a loss of efficiency (less data, so less precise estimates)

What is list-wise deletion?

Analysing only complete cases
E.g. discarding a case's observed values on x and z because its observation on y is missing
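
A minimal Python sketch of list-wise deletion, assuming a hypothetical toy dataset in which None marks a missing value. The variable names x, y and z follow the example above; the data and the helper listwise_delete are illustrative only.

```python
# Hypothetical toy data: each row is one case with variables x, y and z.
# None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 4.0, "y": None, "z": 6.0},   # y is missing, so x and z are discarded too
    {"x": 7.0, "y": 8.0, "z": None},   # z is missing, so x and y are discarded too
    {"x": 10.0, "y": 11.0, "z": 12.0},
]


def listwise_delete(data):
    """Keep only complete cases: rows with no missing value on any variable."""
    return [row for row in data if all(value is not None for value in row.values())]


complete_cases = listwise_delete(rows)
print(len(rows), "cases before,", len(complete_cases), "after list-wise deletion")
```

Half of the cases are lost here even though only 2 of the 12 individual values are missing, which is the loss of efficiency at work.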

What is MCAR (missing completely at random)?

Missingness is completely independent of the data, observed or unobserved, and therefore of the estimates of our parameter(s) of interest
E.g. the dog (the missingness mechanism) eats homework regardless of the value of the homework

Is there an empirical way to determine which missing data assumption is correct? (I.e MCAR / MAR / MNAR?)

No. The observed data alone cannot confirm which assumption holds; in particular, MAR and MNAR cannot be told apart because the values needed to distinguish them are themselves missing

What is 1 way to assess missing data assumptions?

Sensitivity analysis: see how the results change by comparing analyses of imputed data with list-wise deletion

What is MAR (missing at random)?

Missingness is independent of the unobserved values (and so of the estimates of our parameter(s) of interest) once we condition on variables we observed
It is conditionally dependent on the values of variables we observed
E.g. the dog eats homework only if the student is female

What is MNAR (missing not at random)?

Missingness is associated with the estimate of our parameter(s) of interest
It depends on unobserved/uncollected values of the outcome of interest
E.g. the dog eats bad homework, so missingness is directly related to the parameters of interest

Will list-wise deletion yield biased or unbiased estimates for
a) MCAR?
b) MAR?
c) MNAR?

a) Unbiased
b) Unbiased estimates can be recovered if the observed variables that missingness depends on are available and conditioned on, but estimates are biased if only complete cases are used (i.e. list-wise deletion)
c) Unbiased estimates cannot be recovered (the data needed to recover them is itself missing)
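
The three answers can be illustrated with a small simulation. This is a sketch under assumptions, not a proof: the population, the deletion probabilities and the helper names are all hypothetical, chosen to mirror the dog-eats-homework examples in the earlier cards.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 20000

# Hypothetical population: homework score depends on sex, and sex is observed for everyone.
students = []
for _ in range(n):
    female = rng.random() < 0.5
    score = rng.gauss(60.0 if female else 50.0, 10.0)
    students.append((female, score))

true_mean = mean(score for _, score in students)

# MCAR: the dog eats 30% of homework regardless of anything.
mcar = [(f, s) for f, s in students if rng.random() >= 0.3]
# MAR: the dog eats 60% of homework, but only from female students.
mar = [(f, s) for f, s in students if not (f and rng.random() < 0.6)]
# MNAR: the dog eats 80% of the homework that scores below 45.
mnar = [(f, s) for f, s in students if not (s < 45.0 and rng.random() < 0.8)]


def complete_case_mean(sample):
    """Mean score using only the cases that survived (list-wise deletion)."""
    return mean(s for _, s in sample)


def sex_adjusted_mean(sample):
    """Average the within-sex means, weighted by the known share of each sex."""
    share_female = mean(1.0 if f else 0.0 for f, _ in students)
    mean_f = mean(s for f, s in sample if f)
    mean_m = mean(s for f, s in sample if not f)
    return share_female * mean_f + (1.0 - share_female) * mean_m


mcar_mean = complete_case_mean(mcar)
mar_cc = complete_case_mean(mar)
mar_adj = sex_adjusted_mean(mar)
mnar_cc = complete_case_mean(mnar)
mnar_adj = sex_adjusted_mean(mnar)

print("true mean           ", round(true_mean, 2))
print("MCAR, complete cases", round(mcar_mean, 2))
print("MAR, complete cases ", round(mar_cc, 2))
print("MAR, sex-adjusted   ", round(mar_adj, 2))
print("MNAR, complete cases", round(mnar_cc, 2))
print("MNAR, sex-adjusted  ", round(mnar_adj, 2))
```

In this setup the complete-case mean stays close to the truth under MCAR, drifts low under MAR unless sex is conditioned on, and stays too high under MNAR even after adjusting for sex.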

What are multiple imputations?

Estimating a new value for each missing value several times, producing multiple completed datasets

What is the outcome of multiple imputations?

Pooled estimates (they include sampling uncertainty (V hat), which is why they are not simply called average estimates)

What type of uncertainty/variation is calculated in each imputed dataset?

Sampling variation, overall uncertainty estimate (V hat)

What 3 types of variation/uncertainty does the average of a set of imputed estimates (i.e. pooled estimates) have? These add up to total uncertainty

1. Sampling variation (V hat)
2. Between-imputed-dataset variation / missing data uncertainty (B): an estimate of the uncertainty from changes in the imputed missing data itself across imputed datasets
3. Uncertainty from not generating an infinite number of imputed datasets (B/M)
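
The way these three parts add up can be sketched in Python using Rubin's rules. The function name pool_estimates and the input numbers are hypothetical; the dictionary keys borrow the names estimate, ubar, b and t from the pool() output covered in a later card.

```python
from statistics import mean, variance


def pool_estimates(estimates, sampling_variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    qbar = mean(estimates)            # pooled (average) estimate
    ubar = mean(sampling_variances)   # average sampling variance (V hat)
    b = variance(estimates)           # between-imputation variance (divides by m - 1)
    t = ubar + b + b / m              # total variance: the three parts added together
    return {"estimate": qbar, "ubar": ubar, "b": b, "t": t}


# Hypothetical results for one coefficient from m = 3 imputed datasets.
pooled = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled)
```

The pooled standard error would be the square root of t, so it is larger than the standard error from any single imputed dataset whenever the estimates differ across datasets.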

What can you do to lower the level of uncertainty in multiple imputations?
Is it a good solution?

Increase m, the number of imputed datasets
No: the returns diminish and the analysis becomes slow
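
A short Python sketch of the diminishing returns, assuming the extra uncertainty from a finite number of imputed datasets is B/m as in the previous card. The value of b is hypothetical.

```python
def simulation_variance(b, m):
    """Extra variance from using only m imputed datasets: B / m."""
    return b / m


b = 1.0  # hypothetical between-imputation variance
for m in (5, 25, 100, 1000):
    print(m, "imputed datasets -> extra variance", simulation_variance(b, m))

# Reduction bought by each increase in m.
gain_5_to_25 = simulation_variance(b, 5) - simulation_variance(b, 25)
gain_25_to_100 = simulation_variance(b, 25) - simulation_variance(b, 100)
print(gain_5_to_25, gain_25_to_100)
```

Going from 5 to 25 datasets removes far more of this term than going from 25 to 100, while the computing time keeps growing with m.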

How many imputed datasets are recommended?

25-100

MI steps in R (step 1)
What dataset do you start with?

Raw dataset with missing data

MI steps in R (step 2)
What are missing values filled with?

Initial estimates, e.g. the mean or random numbers drawn from the range of the observed data

MI steps in R (step 3)
What model is built for each variable with missing data using other variables?

A prediction model. GLMs are commonly used for this, but any model can be used

MI steps in R (step 4)
What is the prediction model used for?

To predict the missing data
These predictions do not capture uncertainty. We only want uncertainty in the imputations, not in the data (predictions) used to build the imputation model

MI steps in R (step 5)
Which steps are repeated to obtain a complete dataset (until convergence is reached)?

step 2 (fill in missing values with initial estimates)
step 3 (build a prediction model based on step 2)

Are iterations and multiple imputed datasets the same thing?

No. Iterations occur within each imputed dataset: every imputed dataset goes through its own set of iterations
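
The difference between iterations and imputed datasets can be sketched in Python. This is a simplified illustration under assumptions, not the algorithm the R function implements: it uses two variables, a straight-line prediction model, and random noise added to each prediction so that imputations differ across datasets. All names and data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd


def impute_once(data, n_iter, rng):
    """Build ONE imputed dataset from two columns in which None marks missing values."""
    names = list(data)
    missing = {v: [i for i, val in enumerate(data[v]) if val is None] for v in names}
    # Fill missing values with an initial estimate (here the observed mean).
    filled = {}
    for v in names:
        obs_mean = mean(val for val in data[v] if val is not None)
        filled[v] = [obs_mean if val is None else val for val in data[v]]
    # Iterations: refit the models and refresh the imputations several times.
    for _ in range(n_iter):
        for target in names:
            if not missing[target]:
                continue
            other = names[1] if target == names[0] else names[0]
            observed = [i for i, val in enumerate(data[target]) if val is not None]
            # Build a prediction model from the cases where the target is observed.
            a, b, sd = fit_line([filled[other][i] for i in observed],
                                [filled[target][i] for i in observed])
            # Predict the missing values; the added noise is an assumption of this sketch.
            for i in missing[target]:
                filled[target][i] = a + b * filled[other][i] + rng.gauss(0.0, sd)
    return filled


rng = random.Random(1)
x_full = [float(i) for i in range(1, 21)]
raw = {
    "x": [None if i in (3, 11) else x for i, x in enumerate(x_full)],
    "y": [None if i in (5, 8, 15) else 2.0 * x + rng.gauss(0.0, 1.0)
          for i, x in enumerate(x_full)],
}

m, n_iter = 5, 10  # 5 imputed datasets, each built with its own 10 iterations
imputed_datasets = [impute_once(raw, n_iter, rng) for _ in range(m)]
print(len(imputed_datasets), "imputed datasets,", n_iter, "iterations each")
```

The outer list comprehension produces the m imputed datasets; the loop over n_iter inside impute_once is the iteration that happens separately for each of them.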

What is the problem with prediction models with GLMs when using small datasets (e.g 100 people)?

GLMs require a sample size larger than the number of predictors
In small datasets it is possible to have more variables than people

What is the problem with GLM prediction models when it comes to specifying interactions?

GLMs only include interactions between variables when the analyst specifies them, and analysts often don’t know which interactions are needed

What does the margin plot (augmented scatter plot) show?

The values of variable X when variable Y is missing (conditional missingness)
It is a scatterplot of the non-missing data for 2 continuous variables, with margins showing where the missing data are

What are 2 ways in which identifying patterns of missing data is useful?

Whether there’s enough overlap of relevant variables to predict missingness
Unexpected patterns that might indicate problems with data

What are 2 tests to conduct to see if 1 variable is significantly/systematically related to another variable's missingness?

1. Welch two-sample t-test
2. Pearson's chi-square test
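
A minimal Python sketch of the first test, assuming hypothetical data in which x is fully observed and y has missing values. It splits x by whether y is missing and computes the Welch t statistic and its degrees of freedom by hand; it does not compute a p-value, which would need the t distribution.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical data: x is fully observed, y has missing values (None).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0]
y = [3.1, 2.9, 4.0, 3.5, 4.2, None, None, None, None, None]

# Split x by the missingness of y, then compare the two groups.
x_when_y_observed = [xi for xi, yi in zip(x, y) if yi is not None]
x_when_y_missing = [xi for xi, yi in zip(x, y) if yi is None]

t_stat, df = welch_t(x_when_y_observed, x_when_y_missing)
print("t =", round(t_stat, 3), "df =", round(df, 3))
```

A large absolute t would suggest that x differs systematically between cases with and without y, i.e. that x is related to the missingness of y.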

What does it mean when data reaches convergence?

No clear systematic change with additional iterations

When using densityplot() or xyplot() to plot imputed vs. observed data, what colour are they shown in?

Imputed = red
Observed = blue

If observed and imputed distributions are similar, what is this indicative of?

Data being MCAR, or relevant variables being missing from the model

What do these outputs of the pool() function mean: estimate, ubar, b, t, dfcom, df, riv, lambda, fmi?

estimate: pooled (average) regression coefficients
ubar: average sampling variance (V hat)
b: between-imputation variance (missing data uncertainty)
t: total variance (ubar, b, simulation error)
dfcom: df for the complete-data analysis (if there had not been missing data)
df: estimated df accounting for missingness
riv: relative increase in variance due to missing data
lambda: proportion of total variance due to missing data
fmi: fraction of missing information
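
How riv and lambda relate to ubar, b and t can be sketched in Python, assuming the standard Rubin's-rules definitions. The function name and the input numbers are hypothetical, and the sketch leaves out df and fmi, which need further formulas.

```python
def missing_info_ratios(ubar, b, m):
    """Return (t, riv, lambda) from the average sampling variance ubar,
    the between-imputation variance b and the number of imputed datasets m."""
    extra = b + b / m    # variance attributable to missing data, incl. simulation error
    t = ubar + extra     # total variance
    riv = extra / ubar   # relative increase in variance due to missing data
    lam = extra / t      # proportion of total variance due to missing data
    return t, riv, lam


# Hypothetical values for one coefficient pooled over m = 5 imputed datasets.
t, riv, lam = missing_info_ratios(ubar=0.8, b=0.1, m=5)
print("t =", round(t, 4), "riv =", round(riv, 4), "lambda =", round(lam, 4))
```

With these numbers the missing data inflate the variance by 15% relative to ubar and account for about 13% of the total variance.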

What do these outputs of the "summary(pool(mi.reg), conf.int = TRUE)" call mean: estimate, std.error, df, p-value, 2.5% and 97.5%?

estimate: pooled (average) regression coefficients
std.error: SE of the estimates, incorporating all 3 uncertainties (ubar, b, simulation error)
df: estimated df incorporating uncertainty due to missing data
p-value: probability of obtaining an estimate at least this extreme in the sample if the true population value were 0
2.5% and 97.5%: lower and upper limits of the 95% CI

What does the function aggr() do?

Creates plot of missingness by variable and patterns of missing data for an entire dataset
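
The two summaries behind such a plot, missingness per variable and the patterns of missing data, can be tabulated in a few lines of Python. This is a sketch on a hypothetical toy dataset in which None marks a missing value; it counts rather than plots.

```python
from collections import Counter

# Hypothetical toy data: None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 4.0, "y": None, "z": 6.0},
    {"x": 7.0, "y": None, "z": None},
    {"x": None, "y": 1.0, "z": 2.0},
    {"x": 2.0, "y": None, "z": 5.0},
]
variables = ["x", "y", "z"]

# Missingness by variable: how many cases lack each variable.
missing_per_variable = {v: sum(row[v] is None for row in rows) for v in variables}

# Patterns of missing data: which combinations of variables are missing together.
# Each pattern is a tuple of booleans in the order of `variables` (True = missing).
patterns = Counter(tuple(row[v] is None for v in variables) for row in rows)

print(missing_per_variable)
for pattern, count in patterns.most_common():
    print(pattern, count)
```

Here y is missing most often, and the most common incomplete pattern is y missing on its own.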

What does the mice() function do?

Performs multiple imputation
Takes the dataset, the number of imputations and the number of iterations

W6: Missing Data Flashcards

(32 cards)