Topic 5: Linear Regression Flashcards

1
Q

Given bivariate data, what are the steps for a linear regression framework?

A

1) Produce scatter plot

2) Produce regression line

3) Calculate correlation coefficient

4) Produce residual plot

5) Check assumptions

6) Perform predictions

2
Q

What does checking assumptions involve? What happens if the assumptions are true?

A

Check that the scatter plot looks linear

Check that the residual plot looks random (no pattern)

If both hold, the linear model is appropriate for use

3
Q

What does a residual plot look at?

A

It shows the vertical gaps (residuals) between the regression line and the data points

4
Q

What is a scatter plot?

A

It is the graphical summary of 2 variables on the same plane, resulting in a cloud of points

5
Q

What is linear association between 2 variables?

A

Describes how tightly the points cluster around a line

6
Q

What are strong and weak associations?

A

Strong: Cloud of points tightly clustered around a line

Weak: Points aren’t tightly clustered around the line

7
Q

What is a positive and negative association?

A

Positive association: as one variable increases, the other tends to increase as well

Negative association: as one variable increases, the other tends to decrease

8
Q

What are the 5 things that a scatter plot can be summarised by?

A

mean of x
mean of y
sd of x
sd of y
correlation coefficient (r)

9
Q

What is the centre of the cloud represented by?

A

By the point of averages (mean of x, mean of y)

10
Q

What is the horizontal spread of cloud measured by?

A

sd of x

11
Q

What is the vertical spread of cloud measured by?

A

sd of y

12
Q

What is the correlation coefficient?

A

A numerical summary which measures clustering around a line. Indicates sign and strength of linear association. It is between -1 and 1

13
Q

What is the population correlation coefficient?

A

Mean of the product of variables in standard units
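
As a minimal sketch of this definition, the Python below converts each variable to standard units using the population SD and then averages the products. The function name `correlation` and the data are made up for illustration and are not part of the flashcards.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Population correlation: mean of the products of standard units."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # population SDs
    # Standard units: how many SDs each value sits above or below its mean.
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]
    return fmean(a * b for a, b in zip(zx, zy))

# Illustrative data only (an assumption, not from the deck).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))
```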

14
Q

How does r (correlation coefficient) measure association?

A

The point of averages (centre) divides the scatter plot into 4 quadrants, and r reflects which quadrants most of the points fall in

Majority of points in the upper right and lower left quadrants –> overall positive r

Majority of points in the upper left and lower right quadrants –> overall negative r

15
Q

What are some properties of the correlation coefficient (r)?

A

It is a pure number (no units)

It lies between -1 and 1

r = 0 means the points show no linear association; the variables can still be related in other (nonlinear) ways

The correlation coefficient isn’t affected by interchanging the variables (switching the x and y axes)

The correlation coefficient is shift and scale invariant (it doesn’t change when a constant is added to a variable or when a variable is multiplied by a positive constant)
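
A quick numerical check of the symmetry and invariance properties, as a sketch on made-up data (the helper `correlation` is a hypothetical name). One caveat worth seeing in code: multiplying a variable by a negative constant keeps the size of r but flips its sign.

```python
from statistics import fmean, pstdev

def correlation(xs, ys):
    """Mean of the products of the two variables in standard units."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return fmean(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                 for x, y in zip(xs, ys))

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                                # switch x and y
r_shifted_scaled = correlation([3 * x + 10 for x in xs], ys)   # shift and positive scale
r_negated = correlation([-x for x in xs], ys)                  # negative scale flips the sign

print(r, r_swapped, r_shifted_scaled, r_negated)
```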

16
Q

What are the two options for a line which represents relationship between two variables?

A

SD line and regression line

17
Q

What is the SD line and why isn’t it preferred?

A

Connects the point of averages (mean of x, mean of y) to (mean of x + sd of x, mean of y + sd of y) (for r > 0)

OR

(mean of x, mean of y) to (mean of x + sd of x, mean of y - sd of y) (for r < 0)

It isn’t preferred because it is insensitive to the amount of clustering around the line, so it underestimates on the left-hand side (LHS) and overestimates on the right-hand side (RHS) at the extremes

18
Q

What is the regression line and why is it a better option?

A

Connects the point of averages to (mean of x + SD of x, mean of y + r * SD of y)

Accounts for extremes and clustering through use of the correlation coefficient
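
The two points above pin down the line: it follows that the slope is r x (SD of y) / (SD of x) and that the line passes through the point of averages. A minimal sketch under those definitions, with a hypothetical function name and made-up data:

```python
from statistics import fmean, pstdev

def regression_line(xs, ys):
    """Return (slope, intercept, r) for the regression line of y on x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    r = fmean(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
              for x, y in zip(xs, ys))
    slope = r * sd_y / sd_x              # rise of r SDs of y per 1 SD of x
    intercept = mean_y - slope * mean_x  # forces the line through the point of averages
    return slope, intercept, r

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r = regression_line(xs, ys)

# The line should also pass through (mean of x + SD of x, mean of y + r * SD of y).
x1 = fmean(xs) + pstdev(xs)
y1 = slope * x1 + intercept
print(slope, intercept, y1)
```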

19
Q

What is the point of averages?

A

coordinate of (mean of x, mean of y)

20
Q

What is the graph of averages?

A

Plots the average y for each x value

Regression line is a smoothed out version of a graph of averages

21
Q

What is a residual?

A

vertical distance or ‘gap’ of a point above or below the regression line

Represents error between actual value and prediction

22
Q

What do residual plots graph?

A

Graphs residuals vs x
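
To make this concrete, the sketch below computes the residuals that such a plot would show against x. The data are made up, and the slope 0.6 and intercept 2.2 are the regression line for this particular data set; plotting is left out so that the example needs only the standard library.

```python
def residuals(xs, ys, slope, intercept):
    """Vertical gap of each point from the line: actual y minus predicted y."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Illustrative data only, with its regression line y = 0.6 x + 2.2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys, slope=0.6, intercept=2.2)

# A residual plot is just these (x, residual) pairs drawn as points.
for x, e in zip(xs, res):
    print(f"x={x}: residual={e:+.2f}")
```

For a regression line, the residuals sum to zero (up to rounding), so the points in the residual plot scatter around the horizontal axis.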

23
Q

How do we know if linear fit is appropriate based on residual plots?

A

The residual plot shouldn’t show a pattern: the residuals should look random, with roughly constant spread across x. If there is a pattern, the residuals aren’t random and the assumptions of the linear model are violated

24
Q

What are 11 common mistakes of regression?

A

1) r isn’t a percentage, i.e. r = 0.8 doesn’t mean 80% of the points are clustered around the line

2) r = 0.8 doesn’t mean that the points are twice as tightly clustered as when r = 0.4

3) Outliers can overly influence the correlation coefficient

4) Nonlinear associations can’t be detected by the correlation coefficient

5) The same correlation coefficient can arise from very different data, so we still need to be careful and look at the data

6) Rates or averages tend to inflate the correlation coefficient, i.e. a line fitted to group means tends to overestimate the strength of association between the two variables

7) Association doesn’t mean causation

8) Small SDs can make the correlation look bigger

9) Beware of extrapolating beyond the range of the data used to fit the regression line

10) A high correlation coefficient doesn’t guarantee that the data are linear, even if the regression line appears to fit

11) Beware of refitting: even though the correlation coefficient is the same if x and y are switched, the regression line is not, so the model needs to be refitted depending on which variable is being predicted in context

25
Q

What is the prediction error?

A

Difference between the actual value of a point and the value predicted for it by the regression line

26
Q

What can be used to measure prediction error?

A

RMS error. It represents the typical gap between the points and the regression line, i.e. the ‘standard deviation for the line’

27
Q

What is the formula for RMS error (population)?

A

root of (mean of (gaps)^2)

= root(1 - r^2) x (SD of y)
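
The two forms of the population RMS error, the root of the mean squared gap and the shortcut root(1 - r^2) x (SD of y), should give the same number. A small sketch checking this on made-up data (all variable names are illustrative):

```python
from math import sqrt
from statistics import fmean, pstdev

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x, mean_y = fmean(xs), fmean(ys)
sd_x, sd_y = pstdev(xs), pstdev(ys)  # population SDs
r = fmean(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
          for x, y in zip(xs, ys))
slope = r * sd_y / sd_x
intercept = mean_y - slope * mean_x

# Form 1: root of the mean of the squared gaps (residuals).
gaps = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
rms_from_gaps = sqrt(fmean(g * g for g in gaps))

# Form 2: shortcut using r and the population SD of y.
rms_from_r = sqrt(1 - r ** 2) * sd_y

print(rms_from_gaps, rms_from_r)
```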

28
Q

What are some important notes of RMS error?

A

Perfect correlation (r = -1 or r = 1) –> RMS error = 0

r = 0 –> RMS error = SD of y

If we use mean of y for any x (baseline prediction) –> RMS error = SD of y

The sample and population RMS errors differ only in which SD of y is used in the formula: the population SD (popsd) for the population version and the sample SD (sd) for the sample version

29
Q

What are 4 methods of making predictions?

A

Baseline prediction

Prediction in strip

Regression line

Predicting percentile ranks

30
Q

What does baseline prediction involve?

A

Given a certain value of x, the basic prediction of y is the average of y over all x values in the data

31
Q

What does prediction in strip involve?

A

Average of all y values in the data whose x values fall in the strip around that x value
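
A minimal sketch contrasting the baseline prediction with the prediction in a strip. The function names, the `half_width` parameter that defines the strip, and the data are all assumptions made for illustration.

```python
from statistics import fmean

def baseline_prediction(ys):
    """Ignore x entirely: predict the overall average of y."""
    return fmean(ys)

def strip_prediction(xs, ys, x0, half_width):
    """Average the y values whose x lies within half_width of x0."""
    in_strip = [y for x, y in zip(xs, ys) if abs(x - x0) <= half_width]
    if not in_strip:
        raise ValueError("no data points fall inside the strip")
    return fmean(in_strip)

# Illustrative data only.
xs = [1, 1, 2, 2, 3, 3]
ys = [2, 4, 3, 5, 6, 8]

baseline = baseline_prediction(ys)                      # same for every x
strip = strip_prediction(xs, ys, x0=2, half_width=0.5)  # uses only points near x = 2
print(baseline, strip)
```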

32
Q

What does prediction through regression line involve?

A

Use the given equation for regression line to predict y

33
Q

What does prediction through percentile ranks involve?

A

If x is at a certain percentile of all x’s, find the percentile at which we would predict y to be

Steps:
Find the z score in the x direction (Zx)

Find the predicted z score in the y direction ( = r x Zx)

Translate the z score in the y direction back to a percentile in the y direction
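
The three steps can be sketched with the standard library's `NormalDist`. This assumes both variables are roughly normally distributed, so that percentiles and z scores can be converted using the standard normal curve; the function name and the example numbers are made up.

```python
from statistics import NormalDist

def predict_percentile(x_percentile, r):
    """Predict the percentile rank of y from the percentile rank of x."""
    std_normal = NormalDist()  # mean 0, SD 1
    # Step 1: percentile of x -> z score in the x direction.
    z_x = std_normal.inv_cdf(x_percentile / 100)
    # Step 2: predicted z score in the y direction.
    z_y = r * z_x
    # Step 3: z score in the y direction -> percentile of y.
    return 100 * std_normal.cdf(z_y)

pct = predict_percentile(90, 0.5)
print(round(pct, 1))
```

Unless r is 1 or -1, the predicted percentile is pulled towards the 50th, because r x Zx is closer to 0 than Zx.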

34
Q

What does homoscedastic mean?

A

Variance of residual or error term is constant

35
Q

What does heteroscedastic mean?

A

Unequal variance of residual

36
Q

How do we know if something is homoscedastic?

A

As a rule of thumb, look at the ratio between the largest variance and the smallest variance. If the ratio is 1.5 or smaller, the regression is homoscedastic.

If vertical strips on a scatter plot show equal spread in the y direction it is homoscedastic

RMS error can be used as a measure of spread for individual strips
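
One way to apply the variance-ratio rule of thumb is to compare the spread of y across vertical strips. The sketch below treats each distinct x value as its own strip, which is an assumption that suits this made-up data; real data would usually need strips that cover a range of x values. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def strip_variance_ratio(xs, ys):
    """Ratio of the largest to the smallest variance of y across strips."""
    strips = defaultdict(list)
    for x, y in zip(xs, ys):
        strips[x].append(y)  # one strip per distinct x value
    # Strips with a single point carry no spread information, so skip them.
    variances = [pvariance(values) for values in strips.values() if len(values) > 1]
    # Note: fails with ZeroDivisionError if some strip has zero variance.
    return max(variances) / min(variances)

# Illustrative data only: the spread of y grows with x.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [2, 3, 4, 4, 5, 6, 5, 7, 9]

ratio = strip_variance_ratio(xs, ys)
looks_homoscedastic = ratio <= 1.5
print(ratio, looks_homoscedastic)
```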

37
Q

How do we know if something is heteroscedastic?

A

If vertical strips don’t show equal spread in the y direction on the scatter plot (or the residual plot)

38
Q

What are the implications of homoscedasticity?

A

Normal approximation can be used within the vertical strips

39
Q

How do we get the normal distribution of the strip?

A

New mean of y in new strip = mean of y + r x (Z score of x) x (SD of y)

New SD of y = RMS error (assume population)
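
A minimal sketch of using this strip distribution to estimate a probability, assuming homoscedasticity and the population RMS error root(1 - r^2) x (SD of y). The function name and the summary statistics are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def strip_distribution(mean_y, sd_y, r, z_x):
    """Normal approximation for y within the strip at z score z_x."""
    strip_mean = mean_y + r * z_x * sd_y  # regression prediction for the strip
    strip_sd = sqrt(1 - r ** 2) * sd_y    # population RMS error
    return NormalDist(strip_mean, strip_sd)

# Illustrative summary statistics only.
strip = strip_distribution(mean_y=70, sd_y=10, r=0.6, z_x=1.0)

# Estimated chance that y exceeds 80 among points with x one SD above average.
p_above_80 = 1 - strip.cdf(80)
print(strip.mean, strip.stdev, round(p_above_80, 3))
```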

40
Q
A