Week 1, Topic 1: Screening and Cleaning Data Flashcards
1) Why is Data Screening Important?
- An important (often time-consuming!) precursor to any serious data analysis.
- Crucial to check relevant statistical assumptions for any subsequent analyses.
- Data screening can often provide an important first insight into the key variables of your study.
2) Data Screening:
What should be your first step?
• The first step in data screening should involve checking for the accuracy of the data entry:
– Out of range values
– Plausible values
– Check accuracy of coding in SPSS
• The Frequencies window in SPSS (under Descriptive Statistics) is often useful for the above checks.
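A minimal sketch of the same data-entry checks outside SPSS, assuming a pandas DataFrame loaded from a hypothetical file with a 1-7 Likert item called item1:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")   # hypothetical file name

# Frequency table (like SPSS Frequencies): reveals out-of-range or miscoded values
print(df["item1"].value_counts(dropna=False).sort_index())

# Flag cases outside the plausible 1-7 range for this item
print(df[(df["item1"] < 1) | (df["item1"] > 7)])
```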
3) Missing Data:
It is problematic because it may reduce the representativeness of your data.
It is a problem in many areas of psychological research - give 4 examples:
- Participant attrition (drop-out over the course of the study)
- Items/tasks not completed
- Completed data misplaced
- Equipment malfunctions
4) Why is meticulous data collection important?
– Ensure participants complete all tasks/ questionnaires
– Remind them to check that they’ve completed everything
– Program your task to minimize missing data
5) Describe missing completely at random (MCAR):
- Cause of missing data is independent of other variables in the study
- Non-missing data is representative of total data
- Best possible situation, but rare
- Should not cause any problems if relatively small loss (<5%) in moderate and large datasets.
6) Describe - missing at random (MAR):
- The pattern of missingness is predictable from other variables in the dataset.
- For instance, older patients might be less likely to complete a certain questionnaire, so missingness can be predicted from age (another variable in the dataset).
- If <5% most missing value procedures yield similar results.
7) Describe missing not at random (MNAR):
- Non-random missingness.
- Value of variable is related to reason why it’s missing.
- Patients are less likely to complete a questionnaire because of their scores on the questionnaire.
- Can seriously bias results if left as is.
8) SPSS Missing Value Analysis:
What is a logical approach?
- Separate cases into missing and non-missing groups and compare them on other variables, e.g. a t-test with group (missing vs. non-missing) as the IV on all other variables.
- Little’s MCAR test: if non-significant, assume MCAR; if significant but missingness is predictable from other variables (other than the DV), assume MAR; if missingness is related to the DV itself (e.g. the t-test is significant across DVs), assume MNAR.
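A hedged illustration of the group-comparison idea (not SPSS's Missing Value Analysis itself), using scipy; the column names questionnaire and age are hypothetical:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("study_data.csv")
missing = df["questionnaire"].isna()

# Do cases with missing questionnaire scores differ on another variable (here, age)?
t, p = stats.ttest_ind(df.loc[missing, "age"].dropna(),
                       df.loc[~missing, "age"].dropna())
print(f"t = {t:.2f}, p = {p:.3f}")   # a significant result suggests the data are not MCAR
```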
9) Dealing With Missing Data:
Omission (but still report it)
– Usually ok if low frequency and MCAR/MAR (<5%), and dataset is moderate to large (default option in SPSS).
– Can be problematic in small datasets and in experimental designs, e.g. unequal group sizes, loss of power, etc.
– If MNAR, can distort results.
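A minimal listwise-deletion sketch in pandas (analogous to SPSS's default omission of incomplete cases); the file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")
print(df.isna().mean())            # proportion missing per variable (the <5% rule of thumb)

complete = df.dropna()             # omit any case with a missing value
print(f"{len(df) - len(complete)} of {len(df)} cases dropped")
```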
10) Dealing With Missing Data
Estimate (‘impute’) missing data: knowledge of the area/previous research is helpful.
Another option is mean substitution - what are the consequences?
– Mean substitution:
- Easy to implement
- A conservative option
- Reduces the variance of the variable
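A mean-substitution sketch in pandas (not SPSS), using a hypothetical anxiety variable; note how the variance shrinks, which is the consequence listed above:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")
print("variance before:", df["anxiety"].var())

df["anxiety_imputed"] = df["anxiety"].fillna(df["anxiety"].mean())
print("variance after: ", df["anxiety_imputed"].var())   # smaller, because imputed cases sit at the mean
```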
11) Dealing With Missing Data
Estimate (‘impute’) missing data using Regression
- Use complete cases to derive a regression equation with a series of IVs predicting relevant DV.
- Use equation to predict scores for missing cases.
- Tends to reduce variance and inflate relationships with other variables.
- Relies on relationships between DV and potential IVs.
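A regression-imputation sketch using scikit-learn rather than SPSS; the DV anxiety and the IVs age and stress are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("study_data.csv")
ivs = ["age", "stress"]

# Fit the regression equation on complete cases only
complete = df.dropna(subset=["anxiety"] + ivs)
model = LinearRegression().fit(complete[ivs], complete["anxiety"])

# Predict scores for cases missing the DV but complete on the IVs
to_fill = df["anxiety"].isna() & df[ivs].notna().all(axis=1)
df.loc[to_fill, "anxiety"] = model.predict(df.loc[to_fill, ivs])
```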
12) Dealing With Missing Data
Estimate (‘impute’) missing data using Expectation maximisation (EM)
- Calculates missing values for DVs when missing data is random.
- Uses a maximum likelihood approach to iteratively generate values, usually assuming a normal distribution.
- Produces biased standard errors for hypothesis testing.
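SPSS's EM routine has no one-line pandas equivalent; as a rough analogue of the same iterative estimate-and-refine idea, the sketch below uses scikit-learn's IterativeImputer (chained equations, not EM proper). Column names are hypothetical.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

df = pd.read_csv("study_data.csv")
cols = ["anxiety", "age", "stress"]

# Iteratively model each variable from the others and refine the imputed values
df[cols] = IterativeImputer(random_state=0).fit_transform(df[cols])
```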
13) Dealing With Missing Data
Estimate (‘impute’) missing data using Multiple imputation
- Makes no assumptions about randomness of missing data.
- Complex to undertake in SPSS.
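A multiple-imputation sketch (a stand-in for SPSS's procedure, not the same implementation) using scikit-learn's IterativeImputer in posterior-sampling mode: several imputed datasets are drawn and the estimates pooled. Column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

df = pd.read_csv("study_data.csv")
cols = ["anxiety", "age", "stress"]

estimates = []
for seed in range(5):                                    # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
    estimates.append(completed["anxiety"].mean())

print("pooled estimate of the anxiety mean:", np.mean(estimates))
```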
14) Contrasting different methods
List 3 recommendations:
- It is worth repeating your analysis with and without missing data when using any type of imputation strategy.
- Discrepant results will be a cause for concern.
- The method you select should not be based on the analysis outcome.
15)
Define Outliers
Outliers are extreme values on one variable (univariate outlier) or a combination of variables (multivariate outlier), that distort or obscure the results of analyses.
16) Outliers can be the result of:
(list 4 answers)
- Data entry errors
- Invalid missing data coding
- Case sampled is not from the intended population
- Case sampled is from the intended population, but simply represents an extreme value within that population, i.e. a genuine outlier.
17)
How should you check for univariate outliers?
– Standardise the variable and look for absolute values of z > 3.29 (expected for about 0.1% of cases in a normal distribution)
– Use graphical methods to inspect for outliers e.g. histograms, boxplots, etc.
– The selection of an outlier detection method should be independent of the results.
– Address univariate outliers as a starting point, as this will often limit the number of multivariate outliers.
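A sketch of the z-score check in pandas rather than SPSS; the variable name is hypothetical, and 3.29 corresponds to two-tailed p < .001:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")
z = (df["anxiety"] - df["anxiety"].mean()) / df["anxiety"].std()
print(df[z.abs() > 3.29])          # potential univariate outliers
```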
18)
How can we check for multivariate outliers?
- Best assessed using formal statistical procedures rather than graphical methods of detection.
- Mahalanobis Distance is the distance of a case from the centroid of the remaining cases (centroid is the intersection of the variable means).
- The MD is tested using a χ2 distribution, with a conservative value of alpha (usually p < .001).
19)
How can the Mahalanobis Distance be assessed in SPSS?
Using regression:
– Use any DV, with relevant variables as predictor IVs.
– Use the Save dialog to request MD values.
– Evaluate MD values using a χ2 distribution at p < .001, with df equal to the number of IVs in the model.
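The same calculation can be sketched directly with numpy/scipy instead of the SPSS Save dialog; the variable names are hypothetical, and each squared distance is compared with the chi-square critical value at p < .001:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

data = pd.read_csv("study_data.csv")
X = data[["age", "stress", "anxiety"]].dropna().to_numpy()

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
md_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distance per case

critical = chi2.ppf(0.999, df=X.shape[1])               # p < .001, df = number of variables
print(np.where(md_sq > critical)[0])                    # indices of potential multivariate outliers
```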
20)
When are Leverage, Discrepancy and Influence used?
– Most often used in the context of multiple regression.
21) Detecting Outliers
What is leverage similar to?
– Leverage (identifying potentially influential cases) is similar to Mahalanobis Distance (MD), but measured on a different scale.
22) Detecting Outliers
What does discrepancy measure?
– Discrepancy measures the extent to which a case deviates from the others (usually its deviation from the regression line).
23) Detecting Outliers
What is influence?
- Influence is the product of both leverage and discrepancy. It is the extent to which regression coefficients change when a case is deleted.
- Influence can be assessed using Cook’s Distance, obtained in the Save dialog of SPSS Regression.
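A Cook's Distance sketch using statsmodels rather than the SPSS Save dialog; the model formula and variable names are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("study_data.csv").dropna(subset=["anxiety", "age", "stress"])
model = smf.ols("anxiety ~ age + stress", data=data).fit()

cooks_d, _ = model.get_influence().cooks_distance
print(data.assign(cooks_d=cooks_d).nlargest(5, "cooks_d"))   # the most influential cases
```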
24) Detecting Outliers
Describe the plot for leverage, discrepancy and influence in terms of high, moderate and low.
[Plot not reproduced in these notes.]
Answer:
- High leverage
- Low discrepancy
- Moderate influence



















