Week 5 - Missing Data Flashcards

(24 cards)

1
Q

What are some sources of missing data?

A

  • Failure of system – recording equipment
  • Subject/researcher forgets/held up/sick etc.
  • Out-of-range values – detection limits
  • Single questionnaire items (people filling out the Q ….
      • may be inapplicable,
      • lapse of attention, person filling out the Q just missed it
      • too hard to answer
      • doesn’t see the item
  • Sections of, or full, questionnaires
      • added later (decide after having already assessed a number of people, that something needs to be added)
      • subject turns two pages (helped somewhat by online surveys, which can require each question to be answered before continuing, though this sometimes leads to the participant stopping entirely)
  • Whole occasions in a longitudinal study
      • can’t be contacted
      • refuses
      • unavailable for various reasons
2
Q

What factors affect the seriousness of the missing data?

A
  • Number of missing data
  • Relation of data to other variables (this affects how easily we can ‘predict’ or make an ‘educated guess’ at what the data would have been)
  • The relationship between the missing values and the likelihood of their being missing (MCAR, MAR, MNAR), e.g., a depression questionnaire goes missing because the person became depressed.
  • Threats to representativeness e.g., too few females in a sample relative to in population (can use weighting)
3
Q

What is the best ‘solution’ for missing data?

A

AVOID IT

4
Q

What are some of the ways you can avoid missing data?

A
    • pre-test questionnaires thoroughly to avoid common sources of missing data (not applicables, skips, omissions)
    • supervise the subjects
    • have a (targeted) follow-up mechanism for longitudinal studies
    • impress ethics committees and research associates with the importance of avoiding missing data
    • use a data-gathering technique which is most congenial for the subjects, given the content of the study - get feedback whether they felt the questionnaire really enabled them to express their feelings etc.
5
Q

What does it mean to say the data is MCAR?

A

Missing completely at random
- missingness is unrelated to Y (the missing value) and X (other variables in the study).

  • i.e., the missingness (whether a person was missing at time 2) is unrelated both to the measurement at time 1 and to the value they would have given at time 2
  • E.g., BP subjects measured at time 2 were a random sample of those measured at time 1.
    • e.g., imagine the researchers ran out of money to test everyone a second time, so they took a random sample of people to measure at time 2.

Data are considered missing completely at random when the probability of whether or not an individual is missing a value on a given measurement is unpredictable. That is, there is no systematic underlying process (except for random variation) as to why individuals are missing for a given measurement. It may be that a page of the questionnaire was accidentally dropped for one participant, or that some individuals inadvertently skipped a question, or that other individuals were momentarily distracted. Data would be MCAR if (in a perfect world) we could measure all possible reasons why we might suspect individuals might choose to skip a given question and then, upon testing these explanations for missingness, we find that there is no relationship between these reasons and the pattern of missingness observed. For example, if there was no way to predict whether or not someone was missing on attention to news, then attention to news would be MCAR.

6
Q

What does it mean to say the data is MAR?

A

Missing at Random.
- Missingness is related to some other variable in the study (but not due to the missing variable itself).

  • unrelated to Y (missing variable), adjusted for Xi - being missing is unrelated to the measurement that would have occurred at time 2, once you adjust for X
    • E.g., BP subjects measured at time 2 were those with a BP > 140 at time 1. Their BP at time 2 is unrelated to their being missing, once we adjust for time 1 BP.
  • introduces some bias into the testing at time 2.

The second pattern is data missing at random. Data are considered MAR if they are missing because of some potentially observable, nonrandom, systematic process. The title Missing at Random may be a bit of an intuitive trap; however, the pattern is not difficult to understand in spite of this misnomer. Essentially, data are MAR if the probability of missingness for some variable (Y) is predictable based on the value of another variable or set of variables (X). Thus, if we were able to measure all potential X’s, data would be MAR if we could predict the probability that an individual with given characteristics would be missing on Y with this set of X’s. So, for example, if people who had low education were more likely to be missing on attention to news, then attention to news would be MAR.

7
Q

What does it mean to say the data is MNAR?

A

Missing Not At Random; Non-ignorable missing

  • The missing data (the fact that it is missing) is related to the value we WOULD HAVE obtained had we got a value (instead of missing data).
  • e.g., we are trying to measure depression (Y) and we get a missing value because someone was depressed (Y); the higher they would have rated on our measure, the more likely it is that their data would be missing!
  • MNAR is sometimes called non-ignorable missing, because the mechanism which led to the missingness cannot be ignored when analysing the data.
  • The essence of MNAR is that being missing is related to the value that the variable would have had, if we had been able to measure it.

Data are considered missing not at random if they are missing due to the value of the variable being considered. That is, if we are considering the pattern of missing variables on variable Y, it would be MNAR if individuals choose not to respond because of their true value of Y. A classic example is income. Income may often be MNAR because individuals who make an extremely high or low income might choose not to report the value of their income. Thus, the pattern of missingness of the income variable is dependent upon the value of an individual’s income and is MNAR. Considering our example of attention to news, if people who rarely attended to news were more likely to decline to answer a question about attention to news, then attention to news would be MNAR.

8
Q

What are the 3 types of missing data?

A
    1. No missing data – all subjects had scores at time 1 and time 2.
    2. MCAR – some subjects, chosen randomly, had missing time 2 scores.
    3. MAR – the subjects with the highest depression scores at time 1 were set to missing on the measure at time 2.
    4. MNAR – the subjects with the highest scores at time 2 were set to missing at time 2.

(simulation: - 100 subjects were scored at time 1 and time 2 on a measure of depression. A higher score indicates greater depression.
- At time 1 the mean score in the population was 50, sd 5. At time 2 the mean score was 55, sd 5.
- All subjects had a time 1 score, but in some conditions some subjects had no time 2 score.)
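
A rough sketch of the four conditions, for intuition only. It uses the means and SDs from the simulation described above, but the sample size (much larger than 100, to reduce noise), the correlation of 0.3, the 40% missing rate and all function names are assumptions made for this illustration.

```python
import random
import statistics

random.seed(1)

N = 20_000           # assumed: far more than the 100 subjects in the cards
RHO = 0.3            # assumed time 1 / time 2 correlation
PROP_MISSING = 0.4   # assumed proportion of time 2 scores set to missing

def simulate(n, rho):
    """Draw (time1, time2) pairs: means 50 and 55, SDs 5, correlation rho."""
    pairs = []
    for _ in range(n):
        z1 = random.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
        pairs.append((50 + 5 * z1, 55 + 5 * z2))
    return pairs

def observed_time2_mean(pairs, condition, prop):
    """Mean of the time 2 scores that remain after imposing missingness."""
    n_keep = len(pairs) - int(len(pairs) * prop)
    if condition == "none":
        kept = pairs
    elif condition == "MCAR":      # drop a random subset
        kept = random.sample(pairs, n_keep)
    elif condition == "MAR":       # drop those with the highest time 1 scores
        kept = sorted(pairs, key=lambda p: p[0])[:n_keep]
    elif condition == "MNAR":      # drop those with the highest time 2 scores
        kept = sorted(pairs, key=lambda p: p[1])[:n_keep]
    else:
        raise ValueError(condition)
    return statistics.fmean(p[1] for p in kept)

data = simulate(N, RHO)
results = {c: observed_time2_mean(data, c, PROP_MISSING)
           for c in ("none", "MCAR", "MAR", "MNAR")}
for name, value in results.items():
    print(f"{name:5s} observed time 2 mean = {value:.2f}")
```

With these settings the MCAR mean should sit close to the no-missing mean, the MAR mean somewhat below it, and the MNAR mean well below it.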

9
Q

Look at picture file MISSING DATA1,

Describe how the type of missing data, the number of data missing, and the correlation affect the findings.

The simulation

  • 100 subjects were scored at time 1 and time 2 on a measure of depression. A higher score indicates greater depression.
  • At time 1 the mean score in the population was 50, sd 5. At time 2 the mean score was 55, sd 5.
  • All subjects had a time 1 score, but in some conditions some subjects had no time 2 score.

Four different missing data conditions were used.

    1. No missing data – all subjects had scores at time 1 and time 2.
    2. MCAR – some subjects, chosen randomly, had missing time 2 scores.
    3. MAR – the subjects with the highest depression scores at time 1 were set to missing on the measure at time 2.
    4. MNAR – the subjects with the highest scores at time 2 were set to missing at time 2.
    • Four amounts of missing data at time 2 were used – none, 20%, 40% and 60%
      • Remember, data were never missing at time 1
        • Example: In the MCAR 20% condition, a randomly-selected 20% of subjects had their time 2 scores set to missing.
        • Example: In the MAR 40% condition, the subjects with time 1 scores in the top 40% had their time 2 scores set to missing.
        • Example: In the MNAR 60% condition, the subjects with time 2 scores in the top 60% had their time 2 scores set to missing.
    • Three different correlations between the time 1 and time 2 scores were tested – 0.0, 0.3 and 0.6 (a range of conditions, from no correlation to strong correlation).
A
  • The differences for the no-missing and MCAR conditions are nearly identical in all conditions
  • Results based on wide (listwise deletion) data are misleading when data are MAR or MNAR, regardless of condition, but less so with higher correlation and a smaller amount of missing data
  • Results based on MAR data are accurate with the long dataset and Maximum-Likelihood analysis
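
A hedged illustration of why MAR data can still give accurate results when the time 1 scores are used. This is not the long-dataset mixed model from the lecture. It swaps in a simple regression adjustment, and the sample size, the correlation of 0.6 and the 40% missing rate are assumptions for the sketch.

```python
import random
import statistics

random.seed(2)

# Simulate time 1 and time 2 scores (means 50 and 55, SDs 5, correlation 0.6).
rho = 0.6
n = 20_000
time1, time2 = [], []
for _ in range(n):
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)
    time1.append(50 + 5 * z1)
    time2.append(55 + 5 * z2)

# MAR: the 40% of subjects with the highest time 1 scores lose their time 2 score.
cutoff = sorted(time1)[int(0.6 * n)]
observed = [(x, y) for x, y in zip(time1, time2) if x < cutoff]

# Listwise deletion: average only the time 2 scores that were observed.
listwise_mean = statistics.fmean(y for _, y in observed)

# Regress time 2 on time 1 among the complete cases ...
xs = [x for x, _ in observed]
ys = [y for _, y in observed]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
slope = sum((x - mx) * (y - my) for x, y in observed) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# ... then evaluate the line at the time 1 mean of ALL subjects.
adjusted_mean = intercept + slope * statistics.fmean(time1)

print(f"listwise time 2 mean: {listwise_mean:.2f}")
print(f"adjusted time 2 mean: {adjusted_mean:.2f}")
```

The listwise mean should fall noticeably below 55, while the adjusted mean should land close to it, because the adjustment uses every time 1 score rather than discarding incomplete cases.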
10
Q

What are some strategies for dealing with missing data?

A

Avoid it!
otherwise
- listwise deletion
- using available data (Use the data you do have e.g., you have XYZ, and some people are missing on X; so these cases are only excluded from correlations using X)
- substitution (imputation and maximum likelihood)
- multiple imputation

11
Q

What are some issues with listwise deletion?

Listwise deletion is biased unless data are _____.

A
  • Lose power
  • unethical (Waste data -> i.e., people’s time and effort)
  • biased unless data are MCAR
12
Q

What does it mean to use ‘available data’ when dealing with missing data? (NOT listwise deletion)

What are some problems with it?

A

No attempt to impute missing data. Use the data you do have e.g., you have XYZ, and some people are missing on X; so these cases are only excluded from correlations using X. Would select ‘exclude cases pairwise’.

It can lead to ‘impossible’ correlation matrices.

Difficult to get estimates of variability if combining results from different subsets of the data.
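
To make the contrast with listwise deletion concrete, here is a small sketch using made-up data for three variables X, Y and Z, where only X has missing values. The data and the helper names (`pearson`, `pairwise_r`, `listwise_r`) are hypothetical and exist only for this illustration.

```python
import statistics

def pearson(pairs):
    """Pearson correlation of a list of (a, b) pairs."""
    a = [p[0] for p in pairs]
    b = [p[1] for p in pairs]
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in pairs)
    ss_a = sum((u - ma) ** 2 for u in a)
    ss_b = sum((v - mb) ** 2 for v in b)
    return cov / (ss_a * ss_b) ** 0.5

# Hypothetical toy data; None marks a missing value.
rows = [
    {"X": 1.0,  "Y": 2.0, "Z": 1.0},
    {"X": 2.0,  "Y": 1.0, "Z": 3.0},
    {"X": None, "Y": 4.0, "Z": 2.0},
    {"X": 4.0,  "Y": 3.0, "Z": 5.0},
    {"X": None, "Y": 5.0, "Z": 6.0},
    {"X": 5.0,  "Y": 6.0, "Z": 4.0},
]

def pairwise_r(rows, v1, v2):
    """Available data: drop a case only if v1 or v2 is missing."""
    pairs = [(r[v1], r[v2]) for r in rows
             if r[v1] is not None and r[v2] is not None]
    return pearson(pairs), len(pairs)

def listwise_r(rows, v1, v2):
    """Listwise deletion: drop a case if ANY variable is missing."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    return pearson([(r[v1], r[v2]) for r in complete]), len(complete)

for v1, v2 in (("X", "Y"), ("X", "Z"), ("Y", "Z")):
    r_pw, n_pw = pairwise_r(rows, v1, v2)
    r_lw, n_lw = listwise_r(rows, v1, v2)
    print(f"{v1}-{v2}: pairwise r={r_pw:.2f} (n={n_pw}), "
          f"listwise r={r_lw:.2f} (n={n_lw})")
```

The Y-Z correlation uses all six cases under the available-data approach but only four under listwise deletion. Because each entry of the matrix can rest on a different subset of people, the entries need not be mutually consistent, which is how ‘impossible’ matrices arise.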

13
Q

What are the three methods of substitution and their limitations?

A
  1. Replace missing values with the sample mean for that variable
    • Potentially very biased - imagine the damage you will do to subjects who are towards one end or the other of a distribution if, for a variable they are missing, you just give them the mean.
    • Probably wouldn’t seriously consider using
  2. Replace the missing values of items in a scale with the mean of the other items for that subject: compute anx_mean = mean.4(anx1 to anx5).
    • See Schafer & Graham, 2002 and Peyre, Lepege, & Coste, 2011.
    • mostly only use this strategy if they have answered some significant portion of the items
  3. Last value carried forward (longitudinal studies)
    • Potentially distorting - at least this is conservative.
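
The three substitution methods can be sketched in a few lines. The function names and the toy numbers below are invented for illustration; `person_mean` only mirrors the idea behind `mean.4(anx1 to anx5)`, which returns a mean when at least four of the items are valid.

```python
import statistics

def sample_mean_substitution(values):
    """Replace each None with the mean of the observed values for that variable."""
    observed = [v for v in values if v is not None]
    m = statistics.fmean(observed)
    return [m if v is None else v for v in values]

def person_mean(items, min_valid=4):
    """Mean of one subject's answered items, or None if fewer than
    min_valid items were answered."""
    answered = [v for v in items if v is not None]
    return statistics.fmean(answered) if len(answered) >= min_valid else None

def last_value_carried_forward(series):
    """Fill each None with the most recent earlier observed value, if any."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

scores = [40, 45, None, 55, 60, None]
filled = sample_mean_substitution(scores)
print(filled)                                   # gaps replaced by the mean, 50.0
# Mean substitution shrinks the spread relative to the observed scores.
print(statistics.stdev(filled) < statistics.stdev([40, 45, 55, 60]))
print(person_mean([3, 4, None, 2, 5]))          # four valid items, so 3.5
print(person_mean([3, None, None, 2, 5]))       # only three valid, so None
print(last_value_carried_forward([12, None, 9, None, None]))
```

The shrinking standard deviation in the first example is one way to see the distortion that sample-mean substitution introduces.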
14
Q

What are some methods of substitution (of missing data)?

What are the issues with these methods?

A
  1. Replace missing values with the sample mean for that variable - this is potentially very biased. Probably SHOULD NOT USE
  2. Replace missing values of items in a scale with the mean of the other items for that subject (compute anx_mean = mean.4(anx1 to anx5).) – mostly only use this strategy if they have answered a significant portion of the items.
  3. Last value carried forward (longitudinal studies) - potentially distorting - at least this is conservative!
15
Q

What is single shot imputation?

What are some methods of single-shot imputation?

A
  • Replace missing values with values derived from some function of the non-missing data

Can do in SPSS using:

  • Regression modelling (predict missing variable, based on other variables)
  • Expectation-Maximisation (EM) algorithm - a ‘best guess’ that is compatible with the rest of the data as to what a person might have answered

++ AMOS methods

  • Regression
  • Stochastic regression
  • Bayes imputation
16
Q

What is the ‘regression’ method of single-shot imputation?

A

A regression model is constructed to predict the missing variable, based on other variables.

Can be done through point and click missing values analysis. ‘Noise’ can also be added to the imputed value to reflect uncertainty - this is often based on the residuals.
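
A minimal sketch of the idea, not of what SPSS does internally. It assumes one predictor, made-up data, and noise drawn from a normal distribution with the residual SD (sampling the actual residuals would be another option). The names `fit_line` and `regression_impute` are hypothetical.

```python
import random
import statistics

random.seed(3)

def fit_line(pairs):
    """Least-squares intercept, slope and residual SD for y regressed on x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in pairs]
    resid_sd = (sum(e ** 2 for e in resid) / (len(pairs) - 2)) ** 0.5
    return intercept, slope, resid_sd

def regression_impute(xs, ys, add_noise):
    """Fill None entries of ys with predictions from xs, optionally adding noise."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope, resid_sd = fit_line(complete)
    out = []
    for x, y in zip(xs, ys):
        if y is None:
            y = intercept + slope * x
            if add_noise:
                y += random.gauss(0, resid_sd)   # noise scaled by the residual SD
        out.append(y)
    return out

# Hypothetical data: y depends on x, and a random 30% of y values are missing.
xs = [random.gauss(50, 5) for _ in range(2000)]
ys_full = [5 + x + random.gauss(0, 4) for x in xs]
ys = [None if random.random() < 0.3 else y for y in ys_full]

plain = regression_impute(xs, ys, add_noise=False)
noisy = regression_impute(xs, ys, add_noise=True)
print(f"SD of complete data:         {statistics.stdev(ys_full):.2f}")
print(f"SD after plain regression:   {statistics.stdev(plain):.2f}")
print(f"SD after stochastic version: {statistics.stdev(noisy):.2f}")
```

Without noise, every imputed value sits exactly on the regression line, so the filled-in variable is less variable than the real one. Adding residual-based noise restores roughly the right spread.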

17
Q

What is the EM method of single shot imputation?

A

The EM method produces an estimated covariance matrix and means for the variables by an iterative process (see Enders, 2001 and Newman, 2003). These can be used to provide imputed values, which are a ‘best guess’ that is compatible with the rest of the data as to what a person might have answered.

Can be done through point and click missing values analysis.
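
For intuition, a toy version of the iterative process for just two variables, assuming bivariate normality, with x always observed and y missing for the cases with the highest x. The data, the 200-iteration cap and the variable names are assumptions for this sketch. It is not the SPSS implementation.

```python
import random
import statistics

random.seed(4)

# Hypothetical data: x always observed, y missing for the highest 30% of x (MAR).
n = 500
xs = [random.gauss(50, 5) for _ in range(n)]
ys = [55 + 0.6 * (x - 50) + random.gauss(0, 4) for x in xs]
cutoff = sorted(xs)[int(0.7 * n)]
ys = [y if x < cutoff else None for x, y in zip(xs, ys)]

# x is complete, so its mean and (maximum-likelihood) variance never change.
mu_x = statistics.fmean(xs)
var_x = statistics.pvariance(xs)

# Starting values for the y parameters, taken from the observed y scores.
obs_y = [y for y in ys if y is not None]
mu_y, var_y, cov_xy = statistics.fmean(obs_y), statistics.pvariance(obs_y), 0.0

for _ in range(200):
    # E-step: expected y and y**2 for each case, given x and current parameters.
    slope = cov_xy / var_x
    resid_var = var_y - slope * cov_xy
    sum_y = sum_yy = sum_xy = 0.0
    for x, y in zip(xs, ys):
        if y is None:
            ey = mu_y + slope * (x - mu_x)
            eyy = ey ** 2 + resid_var
        else:
            ey, eyy = y, y ** 2
        sum_y += ey
        sum_yy += eyy
        sum_xy += x * ey
    # M-step: update the parameters from the expected sufficient statistics.
    mu_y = sum_y / n
    var_y = sum_yy / n - mu_y ** 2
    cov_xy = sum_xy / n - mu_x * mu_y

# Single-shot imputed values: the conditional expectation of y for each missing case.
slope = cov_xy / var_x
imputed = [y if y is not None else mu_y + slope * (x - mu_x)
           for x, y in zip(xs, ys)]
print(f"EM estimate of the y mean: {mu_y:.2f}")
print(f"mean of observed y only:   {statistics.fmean(obs_y):.2f}")
```

The estimated means and covariances come out of the loop. The imputed values are then read off from them in the final step, which matches the order described in the answer above.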

18
Q

What is the best use for single-shot imputations?

A

It is good for small amounts of missing data (e.g., 5%) and it is quick to use!

19
Q

What is the most appropriate general method for dealing with large amounts of missing data?

A

Multiple imputation

20
Q

What is multiple imputation?

What are its advantages over single-shot imputation?

A

Imputation involves replacing missing data with values which will allow the data to be analysed with complete-data methods.
• The inserted values are usually predicted values based on the observed data
• If only one imputed value (a single-shot imputation) is obtained for each missing value, we have no idea how much faith to put in that value.
• By obtaining multiple imputed values, we can assess the variability of the values and incorporate this in analyses
• Estimates of the variability (i.e., the standard errors) of parameters come from within- and between-dataset variation.
• The information from the two sources is combined according to rules developed by Rubin (variability of a coefficient both within and between datasets; the variability between multiply imputed datasets gives us an idea of uncertainty).
• The estimates of the parameters themselves are simply the average of the parameters obtained for each MI dataset.
• The larger the number of MI datasets, the better the combined estimates will be, especially if there are a lot of missing data.
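
A short sketch of the pooling step. The estimates and standard errors are invented numbers for one coefficient across five imputed datasets, and `pool_rubin` is a hypothetical helper name.

```python
import statistics

def pool_rubin(estimates, standard_errors):
    """Combine one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)                           # average estimate
    within = statistics.fmean(se ** 2 for se in standard_errors)   # mean squared SE
    between = statistics.variance(estimates)                       # variance across datasets
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical results for one regression coefficient from m = 5 imputed datasets.
estimates = [0.48, 0.52, 0.50, 0.47, 0.53]
standard_errors = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled, pooled_se = pool_rubin(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than the within-dataset component alone, because the between-dataset variation carries the extra uncertainty due to the missing data.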

21
Q

What are the 3 stages of multiple imputation?

A
  1. Development of the imputation model (SPSS does this for us, based on the variables we provide)
  2. Running of the imputation model to produce MI datasets (SPSS does this, producing the actual imputed values)
  3. Carrying out analyses with the MI datasets
22
Q

When developing the imputation model for multiple imputation in SPSS, what variables should we include in the model?

A

All the variables which will be used in the final analyses should be included in the imputation model, including derived variables such as interactions. The DV has no special status and should be included in the imputation model.

It may also be beneficial to include other variables.

23
Q

SPSS uses ‘full conditional specification’ when producing imputed values in multiple imputation - what does this mean?

A
  • Uses a specific univariate model for each variable (e.g., linear regression, logistic regression)
  • For a specific imputed dataset, iteratively imputes each variable with missing values, then uses the imputed values in the imputation of other variables. Order matters (put the variables with the least missing data first).
  • Continues in each dataset until reaches iteration limit.
  • The variability between MI datasets comes from random draws from the distribution of regression coefficients (the PPD) at the beginning of the process for each MI dataset. See: van Buuren (2007).
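
The cycling described above can be sketched for two continuous variables. This is a simplified illustration and not SPSS's algorithm: it assumes a linear regression model for each variable, made-up data, and it redraws the regression coefficients on every pass rather than once at the start. All names (`draw_line`, `impute_once`) are hypothetical.

```python
import random
import statistics

random.seed(5)

def draw_line(pairs):
    """Draw regression parameters for b on a from their approximate posterior.
    Returns the centre of a, the line's value at that centre, slope and residual SD."""
    n = len(pairs)
    ma = statistics.fmean(a for a, _ in pairs)
    mb = statistics.fmean(b for _, b in pairs)
    saa = sum((a - ma) ** 2 for a, _ in pairs)
    slope_hat = sum((a - ma) * (b - mb) for a, b in pairs) / saa
    sse = sum((b - mb - slope_hat * (a - ma)) ** 2 for a, b in pairs)
    sigma2 = sse / random.gammavariate((n - 2) / 2, 2)   # SSE / chi-square(n - 2)
    level = random.gauss(mb, (sigma2 / n) ** 0.5)
    slope = random.gauss(slope_hat, (sigma2 / saa) ** 0.5)
    return ma, level, slope, sigma2 ** 0.5

def impute_once(rows, n_iter=10):
    """Return one completed copy of rows (lists of two values, None = missing)."""
    missing = [[v is None for v in r] for r in rows]
    means = [statistics.fmean(r[j] for r in rows if r[j] is not None) for j in (0, 1)]
    # Start by filling every gap with the observed mean of its variable.
    filled = [[means[j] if r[j] is None else r[j] for j in (0, 1)] for r in rows]
    # Visit the variable with the least missing data first.
    order = sorted((0, 1), key=lambda j: sum(m[j] for m in missing))
    for _ in range(n_iter):
        for target in order:
            predictor = 1 - target
            train = [(r[predictor], r[target])
                     for r, m in zip(filled, missing) if not m[target]]
            centre, level, slope, sd = draw_line(train)
            for r, m in zip(filled, missing):
                if m[target]:
                    r[target] = level + slope * (r[predictor] - centre) + random.gauss(0, sd)
    return filled

# Hypothetical data: 10% of the first variable and 30% of the second are missing.
rows = []
for _ in range(300):
    x = random.gauss(50, 5)
    y = 55 + 0.6 * (x - 50) + random.gauss(0, 4)
    rows.append([x if random.random() > 0.1 else None,
                 y if random.random() > 0.3 else None])

datasets = [impute_once(rows) for _ in range(5)]
for k, d in enumerate(datasets, 1):
    print(f"dataset {k}: mean of second variable = {statistics.fmean(r[1] for r in d):.2f}")
```

Each call to `impute_once` uses fresh random draws, so the five completed datasets differ from one another. That between-dataset variation is what the pooling rules later pick up.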

An example of MI in SPSS, using the previous dataset, with 200 subjects measured at time 1 and time 2. In this example, 30 percent of the observations are MAR at time 2. The correlation between the time 1 and time 2 scores is .30 in the population.

24
Q

Fill in the blanks:

in multiple imputation…

• Include ____(all/none/some) variables to be used in the analysis
+ others which may improve imputations

  • Don’t impute interactions, but do use them in the ____.
  • Too _____(FEW/MANY) variables may produce worse estimates; as will use of _____.

• ___ (wide/long) longitudinal datasets for MI, then ___ (wide/long) for mixed analyses

• The best method for missing data is ____

• When preparing to analyse your data, make an assessment of how much _____ there is and how it came to be ____

  • If the amount of missing data is relatively small, it ____ (may not/does) matter which method is used
  • If either missing data is MCAR or MAR _____ maximum likelihood and _____ methods are optimal

• If the data are ____ (MCAR, MAR, MNAR), models which incorporate a mechanism for the missing data may be necessary

A

Include ALL variables to be used in the analysis
+ others which may improve imputations

  • Don’t impute interactions, but do use them in the MODEL
  • Too MANY variables may produce worse estimates; as will use of constraints.
  • WIDE longitudinal datasets for MI, then LONG/stacked for mixed analyses
  • the best method for missing data is AVOIDANCE

• When preparing to analyse your data, make an assessment of how much missing data there is and how it came to be missing

  • If the amount of missing data is relatively small, it MAY NOT matter which method is used
  • If either missing data is MCAR or MAR, maximum likelihood and multiple imputation methods are optimal

• If the data are MNAR, models which incorporate a mechanism for the missing data may be necessary