6. Regression Assumptions, Diagnostics and Influential Cases Flashcards

1
Q

How many assumptions are there for multiple linear regression?

A

9 mathematical
+2 design
=11

2
Q

What are the 2 design assumptions of multiple linear regression?

A

Independence (each participant only 1 score on each IV)

Interval Scale on IV and DV (or dichotomous IV)

3
Q

What are the 9 mathematical assumptions of multiple linear regression?

A

Normality (6 sub assumptions)

No multicollinearity (3 ways to check)
Linearity 

Normal distribution of residuals
Independent Residuals
Residuals unrelated to predictors

Homogeneity of Variance

4
Q

What are the 6 tests of normality?

A
Symmetry
Modality 
Skew 
Kurtosis 
Outliers 
Shapiro-Wilk
5
Q

What do you check with the assumption of symmetry?

A

Mean = Median = Mode

6
Q

What do you check for in modality?

A

Only 1 most frequently occurring score (Unimodal not multi/bimodal)

7
Q

What do you check for in skew and kurtosis?

A

Skew / SE Skew and Kurtosis / SE Kurtosis (a ratio beyond ±1.96 indicates significant skew or kurtosis)
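These ratios can be computed directly; a minimal Python sketch (numpy/scipy, illustrative only, using the usual large-sample SE approximations):

```python
import numpy as np
from scipy import stats

def skew_kurtosis_z(x):
    """z-scores for skew and kurtosis: statistic divided by its SE.
    Uses the large-sample approximations SE(skew) ~ sqrt(6/n) and
    SE(kurtosis) ~ sqrt(24/n); |z| > 1.96 suggests a problem at p < .05."""
    n = len(x)
    z_skew = stats.skew(x) / np.sqrt(6 / n)
    z_kurt = stats.kurtosis(x) / np.sqrt(24 / n)  # excess kurtosis
    return z_skew, z_kurt

rng = np.random.default_rng(42)
z_s, z_k = skew_kurtosis_z(rng.exponential(size=500))
print(abs(z_s) > 1.96)  # exponential data are heavily right-skewed
```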

8
Q

What constitutes an outlier?

A

95% of cases should have standardized values within ±1.96

No more than 3% of cases should be >2.58

If there are… they are outliers
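The same check in Python (a numpy sketch, illustrative only): convert to z-scores and count cases beyond each cutoff.

```python
import numpy as np

def outlier_counts(x):
    """Count cases whose standardized (z) score exceeds the 1.96 / 2.58 cutoffs."""
    z = (x - x.mean()) / x.std(ddof=1)
    return int(np.sum(np.abs(z) > 1.96)), int(np.sum(np.abs(z) > 2.58))

x = np.append(np.arange(100.0), 500.0)  # 100 well-behaved cases + one extreme
n196, n258 = outlier_counts(x)
print(n196, n258)  # only the injected case is flagged at both cutoffs
```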

9
Q

What do you check for in the Shapiro-Wilk statistic?

A

That it is not sig. (>.05)
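In Python the same test is available via scipy (illustrative sketch; a clearly skewed sample should come out significant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

stat_n, p_n = stats.shapiro(rng.normal(size=100))       # roughly normal sample
stat_e, p_e = stats.shapiro(rng.exponential(size=100))  # heavily skewed sample

# assumption met when p > .05; the skewed sample should fail the test
print(p_e < 0.05)
```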

10
Q

What are the 3 checks for multicollinearity?

A

Pearson correlations between the IVs

Tolerance (= 1/VIF)

VIF

11
Q

What does VIF stand for?

A

Variance inflation factor
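VIF can be computed by regressing each IV on the others: VIF_j = 1 / (1 − R²_j). A numpy-only sketch (illustrative; the VIF > 10 cutoff is a common rule of thumb, not from this deck):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on all the other columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 500))
x3 = x1 + 0.5 * x2 + rng.normal(scale=0.01, size=500)  # near-duplicate IV
v = vif(np.column_stack([x1, x2, x3]))
print(v)  # the collinear predictors produce huge VIFs
```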

12
Q

Where do you look, and what do you look for, to check whether the residuals have a normal distribution?

A
Mean of residuals = 0 
No skew (Snaking) and No Kurtosis (Sag) in the P-P plot and histogram
No outliers in the histogram
13
Q

Why are the residual statistics so important?

A

Because if they aren’t normally distributed, we can’t say that 68% of cases will fall within ±1 RMSE of the regression line

14
Q

How do we check linearity and why do we check it?

A

Using Pearson correlations (each IV with the DV)

Because if the IV is not related to the DV, it can’t be a good predictor

15
Q

How is the Independence of Residuals tested?

A

Using the Durbin-Watson

16
Q

What does the Durbin Watson show?

A

The independence of residuals

17
Q

When reading the Durbin-Watson, what are we looking for to meet our assumption of independent residuals?

A

Values between 1.5-2.5

Actual range runs from 0 (strong positive autocorrelation) to 4 (strong negative autocorrelation); 2 = independent
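The statistic itself is simple to compute from the residuals (a numpy sketch, illustrative only):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals over their total sum of squares (range 0-4, 2 = independent)."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print(durbin_watson(np.ones(10)))              # 0.0: extreme positive autocorrelation
print(durbin_watson(np.tile([1.0, -1.0], 5)))  # 3.6: strong negative autocorrelation
```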

18
Q

What do we look at to test homogeneity of variance?

A

The scatterplot of standardized predicted value against the studentized residual
(don’t want any funnelling or patterns)

19
Q

What is it called if there is funnelling (and an unequal distribution on either side of x= 0) where the assumption of homogeneity is not met?

A

Heteroscedasticity

20
Q

What is evidence of homoscedasticity?

A

No pattern or funnelling

Equal distribution on either side of x=0 (divide graph in half)

21
Q

How is the ‘Residuals unrelated to predictors’ assumption checked?

A

By obtaining a Pearson correlation between each IV and the unstandardized residuals (RES_1)
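In SPSS this means saving RES_1 and correlating it with each IV; the same check in Python (numpy/scipy sketch, illustrative — for predictors included in an OLS model this correlation is 0 by construction, so the check mainly matters for variables measured but left out of the model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 2 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

# fit the regression and save the unstandardized residuals (RES_1 in SPSS)
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

r, p = stats.pearsonr(x1, resid)
print(abs(r) < 1e-8)  # effectively zero for an included predictor
```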

22
Q

What should the correlation between the predictors and the residuals be?

A

0 and non sig.

23
Q

What should we do if the assumptions are violated?

A

Question the validity of the model and be cautious about interpreting it

24
Q

What three violations of assumptions cause the most problems for linear regression?

A

Normality (especially on the DV)
Homogeneity of Variance
Presence of outliers

25
What are the 3 options if normality is violated?
Transformation: NO (if sig. skew)
Bootstrap: YES (less biased)
Outliers: check FIRST (check influence)
26
What are considered extreme cases in a data set?
>2SD from the mean
27
Why are outliers a problem?
They affect the value of the estimated regression coefficients = biased model
28
Where are the problem cases located in SPSS output?
Casewise Diagnostics
29
What should you look at to determine the amount of influence the outliers are having?
Studentized residuals (Y − Ypred: the error)
Influential cases
30
What does it mean if a case has a large residual?
It doesn't fit the model well and should be checked as a possible outlier
31
What are the 3 types of residuals?
Unstandardized
Standardized
Studentized (most precise)
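An internally studentized residual divides each raw residual by an estimate of its standard deviation that accounts for that case's leverage; a numpy sketch (illustrative only):

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals: e_i / (s * sqrt(1 - h_i)),
    where h_i is the case's leverage (hat-matrix diagonal)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T
    h = np.diag(H)
    resid = y - H @ y
    s = np.sqrt(resid @ resid / (n - A.shape[1]))
    return resid / (s * np.sqrt(1 - h))

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
y[0] += 10                               # make case 0 a gross outlier
t = studentized_residuals(x[:, None], y)
print(abs(t[0]) > 2)  # the outlying case stands out
```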
32
What are the 8 statistics that can be used to assess the influence of a particular case on a model?
``` Adjusted predicted value. Deleted residual and the studentized deleted residual. DFFit and standardized DFFit. Cook’s distance. Leverage. Mahalanobis distances. DFBeta and Standardized DFBeta. Covariance ratio. ```
33
What is the rule for Adj Pred Value?
It should be approximately equal to the predicted value
34
What is the rule for the studentized deleted residual?
Within the range of -2 to 2
35
What is the rule for Mahalanobis Distance?
Depends on N and k:
N = 500: 25+ = bad
N = 100, k = 3: 15+ = bad
N = 30, k = 2: 11+ = bad
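Mahalanobis distance measures how far a case sits from the centroid of the IVs while accounting for their correlations; a numpy sketch (illustrative — note that the published cutoffs refer to the squared distance, which is what SPSS reports):

```python
import numpy as np

def mahalanobis(X):
    """Squared Mahalanobis distance of each case from the IV centroid."""
    d = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # quadratic form d_i' S^-1 d_i for every row i
    return np.einsum('ij,jk,ik->i', d, cov_inv, d)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))  # N = 100, k = 3 predictors
dist = mahalanobis(X)
print((dist > 15).sum())       # cases beyond the N=100, k=3 cutoff
```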
36
What is the rule for Cook's distance?
1.0+ = bad | Close to 0 = good
37
What is the rule for Leverage values?
Problematic if a value is more than 2× the average leverage value
Average leverage value = (k + 1) / n
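Leverage values are the diagonal of the hat matrix, and their mean is exactly (k + 1)/n; a numpy sketch of the 2× rule (illustrative only):

```python
import numpy as np

def leverage(X):
    """Hat-matrix diagonal for a design matrix with an intercept column."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T
    return np.diag(H)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))         # n = 50 cases, k = 2 IVs
h = leverage(X)
avg = (X.shape[1] + 1) / X.shape[0]  # average leverage = (k + 1) / n
flags = h > 2 * avg                  # cases with more than twice the average
print(abs(h.mean() - avg) < 1e-12)   # mean leverage equals (k + 1) / n
```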
38
What is the rule for the covariance ratio?
If above the upper end of the range (> 1 + [3(k + 1)/n]): DON'T DELETE
If below the lower end of the range (< 1 − [3(k + 1)/n]): deleting the case may improve the precision of the model's parameters
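The CVR bounds depend only on k and n; a tiny helper (illustrative only):

```python
def cvr_bounds(k, n):
    """Acceptable covariance-ratio range: 1 +/- 3(k + 1)/n."""
    d = 3 * (k + 1) / n
    return 1 - d, 1 + d

lo, hi = cvr_bounds(k=3, n=100)
print(lo, hi)  # lower ~0.88, upper ~1.12
```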
39
What is the rule for DFFit?
The DFFit value should be small relative to the range of the DV's scale (e.g. on a 0-1 scale a value of 0.5 is terrible, but on a 1-100 scale it's nothing)
40
What is the rule for the SD DFFit?
Should be between -2 and 2
41
What is the rule for SD Df Beta?
Values beyond ±2 = bad
42
What should we do if we remove the outliers?
Run the regression again and compare the new and old
43
What should happen if the outliers have been correctly removed?
The RMSE should shrink
The Rsq should get larger
Assumptions should be closer to being met
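The before/after comparison can be sketched like this (numpy, illustrative only — a simulated dataset with one injected outlier):

```python
import numpy as np

def fit_stats(x, y):
    """Simple-regression RMSE and R-squared."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return np.sqrt(np.mean(resid ** 2)), 1 - resid.var() / y.var()

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(scale=0.5, size=100)
y[0] += 20                                  # one gross outlier

rmse_old, r2_old = fit_stats(x, y)
rmse_new, r2_new = fit_stats(x[1:], y[1:])  # rerun with the outlier removed
print(rmse_new < rmse_old, r2_new > r2_old)  # True True
```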