Week 3 Data Cleaning Flashcards

Question

What do we use to identify skew or kurtosis (i.e. non-normal distributions) in data?

Answer 1

We note violations with the Kolmogorov-Smirnov or Shapiro-Wilk test, which show whether skewness or kurtosis in the data is causing a significant departure from normality.

Answer 2

Skewness = positively or negatively skewed.
Kurtosis = too peaked (leptokurtic) (remember Leap).
Kurtosis = too flat (platykurtic) (remember Plateau).
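A minimal sketch of how these two statistics can be computed, assuming the simple moment-based formulas (statistics packages often apply small-sample corrections, so their values may differ slightly). The function names and sample scores are hypothetical.

```python
def skewness(xs):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(xs):
    """Moment-based kurtosis minus 3, so a normal curve scores about 0."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]          # hypothetical scores
right_tailed = [1, 1, 2, 2, 3, 10]   # hypothetical scores with a long right tail

print(skewness(symmetric))          # zero: no skew
print(skewness(right_tailed))       # positive: positively skewed
print(excess_kurtosis(symmetric))   # negative: too flat (platykurtic)
```

A positive excess kurtosis would indicate a distribution that is too peaked (leptokurtic).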

Answer 3

* Do the data fit a normal distribution curve (NDC)?
  - This assumption underpins all parametric statistics.
  - Several ways of assessing this:
    - Stem & leaf plots, box plots, histograms
    - Skewness and kurtosis

Answer 4

Linearity: is there a straight-line relationship between the two variables? Non-linearity is diagnosed either from residuals plots in analyses involving a predicted variable or from bivariate scatterplots between pairs of variables.

Answer 5

For ungrouped data:
* the variability in scores for one continuous variable is roughly the same at all values of another continuous variable.
For grouped data:
* this is the same as the assumption of homogeneity of variance, when one of the variables is discrete (the grouping variable) and the other is continuous (the DV);
* the variability in the DV is expected to be about the same at all levels of the grouping variable.

Answer 6

Multicollinearity and singularity are problems with a correlation matrix that occur when variables are too highly correlated. With multicollinearity, the variables are very highly correlated (say, .90 and above). With singularity, the variables are redundant; one of the variables is a combination of two or more other variables (T&F, p. 88). They cause both logical and statistical problems. See another card!

Answer 7

* Multicollinearity: the variables are highly correlated (say, .90 and above);
* Singularity: the variables are redundant; one of the variables is a combination of two or more of the other variables.
NB: Hills suggests that intercorrelations of .8, or perhaps even .7, should also be avoided when using regression, to avoid problems in interpretation.
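A rough sketch of screening a set of variables for multicollinearity, assuming a pairwise Pearson correlation check against the .90 rule of thumb. The variable names and scores are hypothetical, and `flag_collinear` is an illustrative helper, not a library function.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(variables, threshold=0.90):
    """Return (name_a, name_b, r) for every pair with |r| at or above threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(variables.items(), 2):
        r = pearson_r(xs, ys)
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical data: height recorded twice in different units, plus an unrelated score.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "anxiety": [12, 7, 15, 9, 11],
}
print(flag_collinear(data))  # only the two height variables are flagged
```

Lowering `threshold` to .8 or .7 gives the stricter screen. Note that a pairwise check like this will not necessarily reveal singularity when one variable is a combination of several others.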

Answer 8

* Parametric analyses (e.g. t-tests, ANOVA) assume that the sample tested is representative of the population and that variance is equal across groups.
* Levene's test checks this assumption by evaluating whether homogeneity of variance is violated in your sample. Greater is good: remember p > .05, and therefore not significant.
* This then indicates that the variability across the groups is similar and is not what is causing the emergent differences between groups.
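A minimal sketch of the statistic behind Levene's test, assuming the classic version based on absolute deviations from each group mean. It returns only the W statistic; a p-value would come from comparing W with an F distribution on k - 1 and N - k degrees of freedom, which a statistics package normally does for you. The group scores are hypothetical.

```python
def levene_w(*groups):
    """Levene's W statistic, using absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean.
    z = [[abs(y - sum(g) / len(g)) for y in g] for g in groups]
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # One-way ANOVA on the deviations: between-group vs within-group variation.
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


tight = [10, 11, 12, 13, 14]     # hypothetical group with little spread
spread = [2, 7, 12, 17, 22]      # hypothetical group with much more spread
shifted = [20, 21, 22, 23, 24]   # same spread as `tight`, different mean

print(round(levene_w(tight, spread), 2))   # 6.33
print(levene_w(tight, shifted))            # 0.0
```

With 1 and 8 degrees of freedom the tabled F critical value at .05 is about 5.32, so W = 6.33 would be significant (p < .05, assumption violated), while W = 0 shows that a difference in means alone does not trigger the test.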

Answer 9

* The sample size required is dictated by the specific type of analysis you undertake and by whether parametric or non-parametric statistics are used.
* If parametric statistics are used, your sample size may also affect whether you are able to use univariate or multivariate analyses.

Answer 10

* Parametric statistics: more stringent set of assumptions.
* Non-parametric statistical analysis: less stringent criteria.
Non-parametric tests should be used when:
- Ordinal level data is analysed
- Sample size is small (<10 per cell)
- Sample size is small and unequal
* Nonparametric tests can be used as an alternative strategy for dealing with outliers in parametric tests, due to the ordinal nature of the analysis of the data.
* Nonparametric statistics are based on ranks, which are not affected by extreme scores, and they do not require normality.
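A small sketch of why rank-based analysis resists outliers, assuming the usual average-rank convention for ties. The `ranks` helper and the scores are hypothetical.

```python
def ranks(xs):
    """Average ranks (1-based); tied scores share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied scores.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


scores = [12, 15, 11, 14, 13]
extreme = [12, 150, 11, 14, 13]   # same data, but the top score is now an outlier

print(sum(scores) / len(scores), sum(extreme) / len(extreme))  # 13.0 40.0
print(ranks(scores))    # [2.0, 5.0, 1.0, 4.0, 3.0]
print(ranks(extreme))   # [2.0, 5.0, 1.0, 4.0, 3.0]
```

The outlier triples the mean but leaves the ranks untouched, which is the property the rank-based tests rely on.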

Answer 11

EM stands for Expectation Maximization. EM methods are available for randomly missing data. EM forms a missing data correlation (or covariance) matrix by assuming the shape of a distribution (such as normal) for the partially missing data and basing inferences about missing values on the likelihood under that distribution. IT IS AN ITERATIVE PROCEDURE WITH TWO STEPS: Expectation & Maximisation (T&F, p. 68).

Answer 12

EM is a repetitious process with two steps:
1. Expectation: the E step finds the conditional expectation of the "missing data" given the observed values and the current estimate of the parameters, such as correlations. These expectations are then substituted for the missing data.
2. Maximisation: the M step performs maximum likelihood estimation as though the missing data had been filled in.
Finally, after convergence is achieved, the EM variance-covariance matrix is provided...the filled in data saved in the data set.
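The two steps can be sketched for the simplest case: two variables, where x is fully observed and some y values are missing at random. This is an illustrative sketch under a bivariate normal assumption, not the routine any particular statistics package uses; `em_bivariate` and the data are hypothetical.

```python
def em_bivariate(xs, ys, n_iter=200, tol=1e-10):
    """EM estimates for two variables where some ys are None (missing).

    Assumes xs is fully observed. Returns (mu_y, s_xy, s_yy, filled_ys).
    """
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Starting values from the complete cases only.
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox = sum(x for x, _ in obs) / len(obs)
    mu_y = sum(y for _, y in obs) / len(obs)
    s_yy = sum((y - mu_y) ** 2 for _, y in obs) / len(obs)
    s_xy = sum((x - ox) * (y - mu_y) for x, y in obs) / len(obs)

    for _ in range(n_iter):
        # E step: conditional expectation of each missing y (and of y squared)
        # given its x and the current parameter estimates.
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        ey, eyy = [], []
        for x, y in zip(xs, ys):
            if y is None:
                pred = mu_y + slope * (x - mu_x)
                ey.append(pred)
                eyy.append(pred ** 2 + resid_var)
            else:
                ey.append(y)
                eyy.append(y ** 2)
        # M step: maximum likelihood estimates as though the data were complete.
        new_mu_y = sum(ey) / n
        new_s_yy = sum(eyy) / n - new_mu_y ** 2
        new_s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_s_yy - s_yy) + abs(new_s_xy - s_xy)
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:   # estimates have stopped moving: convergence
            break
    return mu_y, s_xy, s_yy, ey


# Hypothetical data lying exactly on y = 2x, with the last two y values missing.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, None, None]
mu_y, s_xy, s_yy, filled = em_bivariate(xs, ys)
print(round(mu_y, 3))                  # 7.0
print([round(v, 3) for v in filled])   # [2, 4, 6, 8, 10.0, 12.0]
```

The filled-in values are conditional means, so they sit exactly on the fitted line; no random error is added to them.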

Answer 13

Because error is not added to the transformed data set.

Answer 14

EM stands for Expectation Maximization. EM methods are available for randomly missing data. EM forms a missing data correlation (or covariance) matrix by assuming the shape of a distribution (such as normal) for the partially missing data and basing inferences about missing values on the likelihood under that distribution. IT IS AN ITERATIVE PROCEDURE WITH TWO STEPS: Expectation & Maximisation.

Answer 15

EM is a repetitious process with two steps:
1. Expectation: the E step finds the conditional expectation of the "missing data" given the observed values and the current estimate of the parameters, such as correlations. These expectations are then substituted for the missing data.
2. Maximisation: the M step performs maximum likelihood estimation as though the missing data had been filled in.
Finally, after convergence is achieved, the EM variance-covariance matrix is provided...the filled in data saved in the data set.

Answer 16

Because error is not added to the data set.

Answer 17

Univariate outliers are cases with an outlandish value on one variable: a z-score in excess of +/- 3.29. They are easier to spot than multivariate outliers. Among dichotomous variables, the cases on the "wrong" side of a very uneven split are likely to be univariate outliers. Among continuous variables, univariate outliers are cases with very large standardized scores (z-scores in excess of 3.29) on one or more variables, disconnected from the other z-scores.
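A short sketch of the z-score screen for a continuous variable, assuming the sample standard deviation (n - 1 denominator). The helper name and the scores are hypothetical.

```python
import statistics


def univariate_outliers(xs, cutoff=3.29):
    """Return (index, score, z) for cases whose |z| exceeds the cutoff."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)   # sample SD, n - 1 denominator
    flagged = []
    for i, x in enumerate(xs):
        z = (x - mean) / sd
        if abs(z) > cutoff:
            flagged.append((i, x, round(z, 2)))
    return flagged


# Hypothetical scores: twenty ordinary cases and one extreme case at the end.
scores = [50, 52, 48, 51, 49] * 4 + [120]
print(univariate_outliers(scores))   # only the last case (index 20) is flagged
```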

Answer 18

Multivariate outliers are cases with an unusual combination of scores on two or more variables. E.g. being 15 years old is within normal bounds, and someone earning $45,000.00/yr is normal, BUT a 15-year-old who earns $45,000.00 a year is not normal: a multivariate outlier.
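One common way to screen for such cases is Mahalanobis distance, sketched here for two variables. This is an assumption about method (the card itself names none), as is the p < .001 chi-square cutoff; the sample of ages and incomes is made up so that income rises with age.

```python
import math


def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2   # determinant of the 2 x 2 covariance matrix
    return [
        ((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
        for x, y in points
    ]


# Hypothetical sample: ages 15-54 with income (dollars) rising steadily with age.
cases = [(age, 1000 * (age - 10 + (1 if age % 2 else -1))) for age in range(15, 55)]
cases.append((15, 45000))   # the 15-year-old earning $45,000 a year

# Chi-square critical value at p = .001 with 2 df (this closed form holds only for 2 df).
cutoff = -2 * math.log(0.001)

d2 = mahalanobis_sq(cases)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
print(flagged)   # [40]: only the added case exceeds the cutoff
```

In this sample neither the age of 15 nor the income of $45,000 is extreme on its own (both z-scores are well under 3.29); only the combination is flagged.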

to provide information from Slides 9 - onwards (separate slides for decision trees) (42 cards)