W6: Missing Data Flashcards

1
Q

Name 2 reasons why missing data is problematic.

A
  1. Results based only on the non-missing data can be biased
  2. Loss of efficiency (less data means less precise estimates)
2
Q

What is list-wise deletion?

A

Analysing only complete cases
E.g. discarding a case's observed values of x and z because its observation on y is missing

3
Q

What is MCAR (missing completely at random)?

A

Missingness is completely independent of the data, both observed and unobserved, and so of the estimates of our parameter(s) of interest
E.g. the dog (the missingness mechanism) eats homework regardless of the value of the homework

4
Q

Is there an empirical way to determine which missing data assumption is correct? (I.e MCAR / MAR / MNAR?)

A

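No: evidence against MCAR can show up in the observed data, but MAR and MNAR cannot be told apart because the difference depends on the unobserved values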

5
Q

What is 1 way to assess missing data assumptions?

A

Sensitivity analysis: compare results from the imputed data with results from list-wise deletion and see how much they change

6
Q

What is MAR (missing at random)?

A

Missingness is conditionally independent of the unobserved values (and so of the estimates of our parameter(s) of interest)
It depends only on values of variables we observed
E.g. the dog eats homework only if the student is female

7
Q

What is MNAR (missing not at random)?

A

Missingness is associated with the estimates of our parameter(s) of interest
It depends on unobserved/uncollected values of the outcome of interest
E.g. the dog eats bad homework, so missingness is directly related to the parameters of interest

8
Q

Will listwise deletion yield biased or unbiased estimates for
a) MCAR?
b) MAR?
c) MNAR?

A

a) Unbiased
b) Biased if only complete cases are used (i.e. list-wise deletion)
Unbiased estimates can still be recovered if the variables that the missingness is conditioned on are observed
c) Cannot recover unbiased estimates (data needed to recover them is itself missing)
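
The three answers above can be checked with a small simulation. This is only an illustrative sketch in Python (standard library only), not the course's R workflow: the function name `simulate`, the variables `x` and `y`, and the missingness probabilities are all assumptions made up for the example, and the quantity estimated is simply the mean of `y`.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Compare the full-data mean of y with estimates that only see observed y."""
    rng = random.Random(seed)
    rows = []  # each row is (x, y, y_is_missing); x is always observed
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)
        if mechanism == "MCAR":
            missing = rng.random() < 0.5            # ignores x and y entirely
        elif mechanism == "MAR":
            missing = x > 0 and rng.random() < 0.8  # depends only on observed x
        elif mechanism == "MNAR":
            missing = y > 0 and rng.random() < 0.8  # depends on y itself
        else:
            raise ValueError(mechanism)
        rows.append((x, y, missing))

    full = statistics.fmean(y for _, y, _ in rows)
    # List-wise deletion: keep only rows where y was observed.
    complete_case = statistics.fmean(y for _, y, miss in rows if not miss)
    # Condition on x: average the observed y within each x stratum,
    # weighting each stratum by its share of the whole sample.
    adjusted = 0.0
    for in_stratum in (lambda v: v > 0, lambda v: v <= 0):
        share = sum(1 for x, _, _ in rows if in_stratum(x)) / n
        observed_mean = statistics.fmean(
            y for x, y, miss in rows if in_stratum(x) and not miss
        )
        adjusted += share * observed_mean
    return {"full": full, "complete_case": complete_case, "adjusted": adjusted}

for mechanism in ("MCAR", "MAR", "MNAR"):
    print(mechanism, simulate(mechanism))
```

Under these assumptions the complete-case mean should sit close to the full-data mean for MCAR and clearly below it for MAR and MNAR, while conditioning on `x` should repair the MAR case but not the MNAR case.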

9
Q

What are multiple imputations?

A

Estimating replacement values for each missing value several times, giving multiple completed datasets

10
Q

What is the outcome of multiple imputations?

A

Pooled estimates (they also carry the sampling uncertainty (V hat), so they are not just called average estimates)

11
Q

What type of uncertainty/variation is calculated in each imputed dataset?

A

Sampling variation: the usual uncertainty estimate for that dataset's own estimate (V hat)

12
Q

What 3 types of variation/uncertainty does the average of a set of imputed estimates (i.e. pooled estimates) have? These add up to total uncertainty

A
  1. Sampling variation (V hat, averaged across the imputed datasets)
  2. Between-imputed dataset variation / missing data uncertainty (B)
    (estimate of uncertainty from changes in the imputed values themselves across imputed datasets)
  3. Simulation error from not generating an infinite number of imputed datasets (B/m)
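
Written out as a formula, the three components add up as follows. This is the standard pooling rule (Rubin's rules); the subscripted symbols are notation assumed here for illustration, with \hat{V}_j the sampling variance and \hat{\theta}_j the estimate from imputed dataset j, \bar{\theta} the average of the m estimates, and T the total variance.

```latex
\bar{U} = \frac{1}{m}\sum_{j=1}^{m}\hat{V}_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{\theta}_j - \bar{\theta}\bigr)^2, \qquad
T = \bar{U} + B + \frac{B}{m}
```

Only the last term shrinks directly as m grows, which is why adding more imputed datasets gives diminishing returns.
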
13
Q

What can you do to lower the level of uncertainty in multiple imputations?
Is it a good solution?

A

Increase m, the number of imputed datasets
Not really: returns diminish and the analysis slows down

14
Q

How many imputed datasets are recommended?

A

25-100

15
Q

MI steps in R (step 1)
What dataset do you start with?

A

Raw dataset with missing data

16
Q

MI steps in R (step 2)
What are missing values filled with?

A

Initial estimates, e.g. the mean or random numbers from the observed data range

17
Q

MI steps in R (step 3)
What model is built for each variable with missing data using other variables?

A

A prediction model. GLMs are commonly used, but any model can be used

18
Q

MI steps in R (step 4)
What is the prediction model used for?

A

To predict the missing data
On their own these predictions do not capture uncertainty. We only want uncertainty in the imputations, not in the data (predictions) used to build the imputation model

19
Q

MI steps in R (step 5)
Which steps are repeated until convergence to give a completed dataset?

A

step 2 (fill in missing values with initial estimates)
step 3 (build a prediction model based on step 2)

20
Q

Are iterations and multiple imputed datasets the same thing?

A

No. Iterations occur within each imputed dataset: every one of the m datasets goes through its own iterations.
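
The difference between iterations and imputed datasets can be sketched as two nested loops. The Python below (standard library only) is a deliberately simplified illustration, not what mice() actually does: it handles just two variables, uses a straight-line model, and adds residual noise without also drawing the model parameters. The names `fit_line`, `impute_once` and `multiple_imputation` are hypothetical, and `None` marks a missing value.

```python
import math
import random
import statistics

def fit_line(pred, target):
    """Least-squares line target ~ pred: returns intercept, slope, residual SD."""
    mp, mt = statistics.fmean(pred), statistics.fmean(target)
    sxx = sum((p - mp) ** 2 for p in pred)
    slope = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / sxx
    intercept = mt - slope * mp
    rss = sum((t - intercept - slope * p) ** 2 for p, t in zip(pred, target))
    return intercept, slope, math.sqrt(rss / (len(pred) - 2))

def impute_once(x, y, iterations, rng):
    """Build ONE completed dataset; the loop over `iterations` lives in here."""
    raw = {"x": x, "y": y}
    cols = {}
    # Step 2: fill every missing value with an initial estimate (observed mean).
    for name, values in raw.items():
        start = statistics.fmean(v for v in values if v is not None)
        cols[name] = [start if v is None else v for v in values]
    for _ in range(iterations):
        for target, pred in (("x", "y"), ("y", "x")):
            observed = [i for i, v in enumerate(raw[target]) if v is not None]
            # Step 3: prediction model fitted on rows where the target is observed.
            a, b, sd = fit_line([cols[pred][i] for i in observed],
                                [raw[target][i] for i in observed])
            # Step 4: replace the missing values with prediction plus noise.
            for i, v in enumerate(raw[target]):
                if v is None:
                    cols[target][i] = a + b * cols[pred][i] + rng.gauss(0, sd)
    return cols["x"], cols["y"]

def multiple_imputation(x, y, m=5, iterations=10, seed=0):
    """Outer loop: m separate completed datasets, each with its own iterations."""
    rng = random.Random(seed)
    return [impute_once(x, y, iterations, rng) for _ in range(m)]
```

Calling `multiple_imputation(x, y, m=3, iterations=5)` on two equal-length lists would return three `(x, y)` pairs in which the observed values are untouched and only the imputed values differ between datasets.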

21
Q

What is the problem with prediction models with GLMs when using small datasets (e.g 100 people)?

A

GLMs require a sample size larger than the number of predictors
In small datasets it is possible to have more variables than people

22
Q

What is the problem with prediction models with GLMs that specify interactions?

A

GLMs only include interactions between variables when the analyst specifies them, and we often don’t know which ones are needed

23
Q

What does the margin plot (augmented scatter plot) show?

A

The values of variable X when variable Y is missing (conditional missingness)
A scatterplot of the non-missing data for 2 continuous variables, with margins showing where the missing data are

24
Q

What are 2 ways in which identifying patterns of missing data is useful?

A
  1. Shows whether there’s enough overlap of relevant variables to predict missingness
  2. Reveals unexpected patterns that might indicate problems with the data
25
Q

What are 2 tests to conduct to see if 1 variable is significantly/systematically related to another variable’s missingness?

A
  1. Welch two-sample t-test (continuous variable)
  2. Pearson’s chi-square test (categorical variable)
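
As a rough sketch of the first test, the Python below (standard library only) splits one variable by whether another is missing and computes the Welch t statistic with its Welch-Satterthwaite degrees of freedom. The function names `welch_t` and `missingness_t` are hypothetical, `None` marks a missing value, and no p-value is produced because the standard library has no t distribution.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch t statistic and degrees of freedom (unequal variances allowed)."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.fmean(group_a) - statistics.fmean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

def missingness_t(x, y):
    """Compare fully observed x between rows where y is missing and observed."""
    x_when_y_missing = [xi for xi, yi in zip(x, y) if yi is None]
    x_when_y_observed = [xi for xi, yi in zip(x, y) if yi is not None]
    return welch_t(x_when_y_missing, x_when_y_observed)
```

A large absolute t value would suggest that missingness in `y` is systematically related to `x`, which is evidence against MCAR.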
26
Q

What does it mean when data reaches convergence?

A

No clear systematic change with additional iterations

27
Q

When using densityplot() or xyplot() to plot imputed vs. observed data, what colour are they shown in?

A

Imputed = red
Observed = blue

28
Q

If observed and imputed distributions are similar, what is this indicative of?

A

Either the data being MCAR, or relevant variables being missing from the imputation model

29
Q

What do these outputs mean using the pool() function:
estimate, ubar, b, t, dfcom, df, riv, lambda, fmi

A

estimate: pooled (average) regression coefficients
ubar: average sampling variance (V hat)
b: between-imputation variance (missing data uncertainty)
t: total variance (ubar + b + simulation error, b/m)
dfcom: df for the complete-data analysis (if there had not been missing data)
df: estimated df accounting for missingness
riv: relative increase in variance due to missing data
lambda: proportion of total variance due to missing data
fmi: fraction of missing information
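
To make the variance-related outputs concrete, here is a minimal Python sketch (standard library only) that computes them from the per-dataset estimates and sampling variances of a single coefficient. The function name `pool_rubin` is hypothetical and this is not mice's own code; df and fmi are left out because they need further formulas.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets."""
    m = len(estimates)
    estimate = statistics.fmean(estimates)  # pooled (average) coefficient
    ubar = statistics.fmean(variances)      # average sampling variance
    b = statistics.variance(estimates)      # between-imputation variance
    t = ubar + b + b / m                    # total variance
    riv = (b + b / m) / ubar                # relative increase in variance
    lam = (b + b / m) / t                   # share of total variance from missingness
    return {"estimate": estimate, "ubar": ubar, "b": b,
            "t": t, "riv": riv, "lambda": lam}

# Three imputed datasets, each with coefficient estimate and sampling variance.
print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The pooled standard error is the square root of `t`, so it reflects all three sources of uncertainty rather than the sampling variance alone.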

30
Q

What do these outputs mean when using “summary(pool(mi.reg), conf.int = TRUE)”:
estimate, std.error, df, p-value, 2.5% and 97.5%

A

estimate: pooled (average) regression coefficients
std.error: SE of the estimates, incorporating all 3 uncertainties (ubar, b, simulation error)
df: estimated df incorporating uncertainty due to missing data
p-value: probability of obtaining an estimate at least this extreme in the sample if the true population value were 0
2.5% and 97.5%: lower and upper limits of the 95% CI

31
Q

What does the aggr() function do?

A

Creates plot of missingness by variable and patterns of missing data for an entire dataset

32
Q

What does the mice() function do?

A

Performs multiple imputation
Takes a dataset, the number of imputations (m) and the number of iterations (maxit)