Week 1, Topic 1: Screening and Cleaning Data Flashcards
1) Why is Data Screening Important?
- An important (often time-consuming!) precursor to any serious data analysis.
- Crucial to check relevant statistical assumptions for any subsequent analyses.
- Data screening can often provide an important first insight into the key variables of your study.
2) Data Screening:
What should be your first step?
• The first step in data screening should involve checking for the accuracy of the data entry:
– Out of range values
– Plausible values
– Check accuracy of coding in SPSS
• The Frequencies window in SPSS (under Descriptive Statistics) is often useful for the above checks.
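A minimal sketch of the same data-entry checks outside SPSS, assuming a pandas DataFrame loaded from a hypothetical file with a 1-7 Likert item called item1:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")   # hypothetical file name

# Frequency table (like SPSS Frequencies): reveals out-of-range or miscoded values
print(df["item1"].value_counts(dropna=False).sort_index())

# Flag cases outside the plausible 1-7 range for this item
print(df[(df["item1"] < 1) | (df["item1"] > 7)])
```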
3) Missing Data:
It is problematic because it may reduce the representativeness of your data.
It is a problem in many areas of psychological research - give 4 examples:
- Participant attrition (drop-out over the course of the study)
- Items/tasks not completed
- Completed data misplaced
- Equipment malfunctions
4) Why is meticulous data collection important?
– Ensure participants complete all tasks/ questionnaires
– Remind them to check that they’ve completed everything
– Program your task to minimize missing data
5) Describe missing completely at random (MCAR):
- Cause of missing data is independent of other variables in the study
- Non-missing data is representative of total data
- Best possible situation, but rare
- Should not cause any problems if relatively small loss (<5%) in moderate and large datasets.
6) Describe - missing at random (MAR):
- The pattern of missingness is predictable from other variables in the dataset.
- For instance, older patients might be less likely to complete a certain questionnaire, so missingness can be predicted from age (another variable in the dataset).
- If <5% most missing value procedures yield similar results.
7) Describe missing not at random (MNAR):
- Non-random missingness.
- Value of variable is related to reason why it’s missing.
- Patients are less likely to complete a questionnaire because of their scores on the questionnaire.
- Can seriously bias results if left as is.
8) SPSS Missing Value Analysis:
What is a logical approach?
- Separate cases into missing and non-missing groups and compare them on other variables, e.g. a t-test with group (missing vs. non-missing) as the IV on all other variables.
- Little’s MCAR test: if non-significant, assume MCAR; if significant but missingness is predictable from other variables (other than the DV), assume MAR; if missingness is related to the DV itself (e.g. the t-test is significant across DVs), assume MNAR.
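A hedged illustration of the group-comparison idea (not SPSS's Missing Value Analysis itself), using scipy; the column names questionnaire and age are hypothetical:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("study_data.csv")
missing = df["questionnaire"].isna()

# Do cases with missing questionnaire scores differ on another variable (here, age)?
t, p = stats.ttest_ind(df.loc[missing, "age"].dropna(),
                       df.loc[~missing, "age"].dropna())
print(f"t = {t:.2f}, p = {p:.3f}")   # a significant result suggests the data are not MCAR
```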
9) Dealing With Missing Data:
Omission (but still report it)
– Usually ok if low frequency and MCAR/MAR (<5%), and dataset is moderate to large (default option in SPSS).
– Can be problematic in small datasets and in experimental designs, e.g. unequal group sizes, loss of power, etc.
– If MNAR, can distort results.
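A minimal listwise-deletion sketch in pandas (analogous to SPSS's default omission of incomplete cases); the file name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")
print(df.isna().mean())            # proportion missing per variable (the <5% rule of thumb)

complete = df.dropna()             # omit any case with a missing value
print(f"{len(df) - len(complete)} of {len(df)} cases dropped")
```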
10) Dealing With Missing Data
Estimate (‘impute’) missing data: knowledge of the area/previous research is helpful.
Another option is mean substitution - what are the consequences?
– Mean substitution:
- Easy to implement
- A conservative option
- Reduces the variance of the variable
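A mean-substitution sketch in pandas (not SPSS), using a hypothetical anxiety variable; note how the variance shrinks, which is the consequence listed above:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")
print("variance before:", df["anxiety"].var())

df["anxiety_imputed"] = df["anxiety"].fillna(df["anxiety"].mean())
print("variance after: ", df["anxiety_imputed"].var())   # smaller, because imputed cases sit at the mean
```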
11) Dealing With Missing Data
Estimate (‘impute’) missing data using Regression
- Use complete cases to derive a regression equation with a series of IVs predicting relevant DV.
- Use equation to predict scores for missing cases.
- Tends to reduce variance and inflate relationships with other variables.
- Relies on relationships between DV and potential IVs.
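A regression-imputation sketch using scikit-learn rather than SPSS; the DV anxiety and the IVs age and stress are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("study_data.csv")
ivs = ["age", "stress"]

# Fit the regression equation on complete cases only
complete = df.dropna(subset=["anxiety"] + ivs)
model = LinearRegression().fit(complete[ivs], complete["anxiety"])

# Predict scores for cases missing the DV but complete on the IVs
to_fill = df["anxiety"].isna() & df[ivs].notna().all(axis=1)
df.loc[to_fill, "anxiety"] = model.predict(df.loc[to_fill, ivs])
```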
12) Dealing With Missing Data
Estimate (‘impute’) missing data using Expectation maximisation (EM)
- Calculates missing values for DVs when missing data is random.
- Uses a maximum likelihood approach to iteratively generate values, usually assuming a normal distribution.
- Produces biased standard errors for hypothesis testing.
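SPSS's EM routine has no one-line pandas equivalent; as a rough analogue of the same iterative estimate-and-refine idea, the sketch below uses scikit-learn's IterativeImputer (chained equations, not EM proper). Column names are hypothetical.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

df = pd.read_csv("study_data.csv")
cols = ["anxiety", "age", "stress"]

# Iteratively model each variable from the others and refine the imputed values
df[cols] = IterativeImputer(random_state=0).fit_transform(df[cols])
```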
13) Dealing With Missing Data
Estimate (‘impute’) missing data using Multiple imputation
- Makes no assumptions about randomness of missing data.
- Complex to undertake in SPSS.
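A multiple-imputation sketch (a stand-in for SPSS's procedure, not the same implementation) using scikit-learn's IterativeImputer in posterior-sampling mode: several imputed datasets are drawn and the estimates pooled. Column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

df = pd.read_csv("study_data.csv")
cols = ["anxiety", "age", "stress"]

estimates = []
for seed in range(5):                                    # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
    estimates.append(completed["anxiety"].mean())

print("pooled estimate of the anxiety mean:", np.mean(estimates))
```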
14) Contrasting different methods
List 3 recommendations:
- It is worth repeating your analysis with and without missing data when using any type of imputation strategy.
- Discrepant results will be a cause for concern.
- The method you select should not be based on the analysis outcome.
15)
Define Outliers
Outliers are extreme values on one variable (univariate outlier) or a combination of variables (multivariate outlier), that distort or obscure the results of analyses.
16) Outliers can be the result of:
(list 4 answers)
- Data entry errors
- Invalid missing data coding
- Case sampled is not from the intended population
- Case sampled is from the intended population, but simply represents an extreme value within that population, i.e. a genuine outlier.
17)
How should you check for univariate outliers?
– Standardise the variable and look for absolute values of z > 3.29 (expected for about 0.1% of cases in a normal distribution)
– Use graphical methods to inspect for outliers e.g. histograms, boxplots, etc.
– The selection of an outlier detection method should be independent of the results.
– Address univariate outliers as a starting point, as this will often limit the number of multivariate outliers.
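A sketch of the z-score check in pandas rather than SPSS; the variable name is hypothetical, and 3.29 corresponds to two-tailed p < .001:

```python
import pandas as pd

df = pd.read_csv("study_data.csv")
z = (df["anxiety"] - df["anxiety"].mean()) / df["anxiety"].std()
print(df[z.abs() > 3.29])          # potential univariate outliers
```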
18)
How can we check for multivariate outliers?
- Best assessed using formal statistical procedures rather than graphical methods of detection.
- Mahalanobis Distance is the distance of a case from the centroid of the remaining cases (centroid is the intersection of the variable means).
- The MD is tested using a χ2 distribution, with a conservative value of alpha (usually p < .001).
19)
How can the Mahalanobis Distance be assessed in SPSS?
Using regression:
– Use any DV, with relevant variables as predictor IVs.
– Use the Save dialog to request MD values.
– Evaluate MD values using a χ2 distribution at p < .001, with df equal to the number of IVs in the model.
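The same calculation can be sketched directly with numpy/scipy instead of the SPSS Save dialog; the variable names are hypothetical, and each squared distance is compared with the chi-square critical value at p < .001:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

data = pd.read_csv("study_data.csv")
X = data[["age", "stress", "anxiety"]].dropna().to_numpy()

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
md_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distance per case

critical = chi2.ppf(0.999, df=X.shape[1])               # p < .001, df = number of variables
print(np.where(md_sq > critical)[0])                    # indices of potential multivariate outliers
```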
20)
When are Leverage, Discrepancy and Influence used?
– Most often used in the context of multiple regression.
21) Detecting Outliers
What is leverage similar to?
– Leverage (identifying potentially influential cases) is similar to Mahalanobis Distance (MD), but measured on a different scale.
22) Detecting Outliers
What does discrepancy measure?
– Discrepancy measures the extent to which a case deviates from the others (usually its deviation from the regression line).
23) Detecting Outliers
What is influence?
- Influence is the product of both leverage and discrepancy. It is the extent to which regression coefficients change when a case is deleted.
- Influence can be assessed using Cook’s Distance, obtained in the Save dialog of SPSS Regression.
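A Cook's Distance sketch using statsmodels rather than the SPSS Save dialog; the model formula and variable names are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("study_data.csv").dropna(subset=["anxiety", "age", "stress"])
model = smf.ols("anxiety ~ age + stress", data=data).fit()

cooks_d, _ = model.get_influence().cooks_distance
print(data.assign(cooks_d=cooks_d).nlargest(5, "cooks_d"))   # the most influential cases
```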
24) Detecting Outliers
Describe the plot for leverage, discrepancy and influence in terms of high, moderate and low.
[Plot not reproduced in these notes.]
Answer:
- High leverage
- Low discrepancy
- Moderate influence



















