Topic 5: Linear Regression Flashcards
Given bivariate data, what are the steps for a linear regression framework?
1) Produce scatter plot
2) Produce regression line
3) Calculate correlation coefficient
4) Produce residual plot
5) Check assumptions
6) Perform predictions
What does checking assumptions involve? What happens if the assumptions are true?
Checking that the scatter plot looks linear
Ensuring that the residual plot looks random (no pattern)
If both assumptions hold, the linear model is appropriate for use
What does a residual plot look at?
Looks at the gaps (residuals) between the regression line and the data points
What is a scatter plot?
It is a graphical summary of 2 variables on the same plane, resulting in a cloud of points
What is linear association between 2 variables?
Describes how tightly the points cluster around a line
What are strong and weak associations?
Strong: Cloud of points tightly clustered around a line
Weak: Points aren’t tightly clustered around the line
What is a positive and negative association?
Positive association: as one variable increases, the other tends to increase as well
Negative association: as one variable increases, the other tends to decrease
What are the 5 things that a scatter plot can be summarised by?
mean of x
mean of y
sd of x
sd of y
correlation coefficient (r)
What is the centre of the cloud represented by?
By the point of averages (mean of x, mean of y)
What is the horizontal spread of cloud measured by?
sd of x
What is the vertical spread of cloud measured by?
sd of y
What is the correlation coefficient?
A numerical summary which measures clustering around a line. Indicates sign and strength of linear association. It is between -1 and 1
What is the population correlation coefficient?
The mean of the products of the two variables in standard units
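A minimal sketch of this definition, assuming population SDs (dividing by n). The function names `standard_units` and `correlation` and the data values are hypothetical, chosen only for illustration:

```python
from statistics import mean, pstdev

def standard_units(values):
    """Convert values to standard units: (value - mean) / population SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """r = mean of the products of x and y in standard units."""
    zx, zy = standard_units(xs), standard_units(ys)
    return mean(a * b for a, b in zip(zx, zy))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 3))
```

A point far above both means, or far below both, contributes a large positive product, which is why tight clustering around an upward-sloping line pushes r towards 1.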
How does r (correlation coefficient) measure association?
The point of averages (centre) divides the scatter plot into 4 quadrants
Majority of points in the upper right and lower left quadrants –> overall positive r
Majority of points in the upper left and lower right quadrants –> overall negative r
What are some properties of the correlation coefficient (r)?
It is a pure number (no units)
lies between -1 and 1
r = 0 means there is no linear association: the points don’t cluster around a line, although the variables can still be related in a nonlinear way
Correlation coefficient isn’t affected by interchanging the variables (switching x and y axis)
Correlation coefficient is shift and scale invariant (adding a constant to a variable, or multiplying it by a positive constant, doesn’t change r; multiplying by a negative constant only flips the sign)
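These properties can be checked numerically. The sketch below uses made-up data and a hypothetical helper `correlation` (population SDs assumed). The negative-scale case is a standard fact about r, included to show the limit of scale invariance:

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """r as the mean product of deviations divided by both population SDs."""
    mx, my = mean(xs), mean(ys)
    cov = mean((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)                     # interchange x and y
r_shifted = correlation([v + 10 for v in x], y)   # shift x by a constant
r_scaled = correlation([3 * v for v in x], y)     # scale x by a positive constant
r_flipped = correlation([-2 * v for v in x], y)   # scale x by a negative constant

print(r, r_swapped, r_shifted, r_scaled, r_flipped)
```

The first four values agree up to floating-point rounding, and the last has the same size with the opposite sign.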
What are the two options for a line which represents relationship between two variables?
SD line and regression line
What is the SD line and why isn’t it preferred?
Connects the point of averages (mean of x, mean of y) to (mean of x + sd of x, mean of y + sd of y) (for r > 0)
OR
(mean of x, mean of y) to (mean of x + sd of x, mean of y - sd of y) (for r < 0)
It isn’t preferred because it is insensitive to the amount of clustering around the line, so (for r > 0) it underestimates y on the left-hand side and overestimates y on the right-hand side at the extremes
What is the regression line and why is it a better option?
Connects the point of averages to (mean of x + SD of x, mean of y + r * SD of y)
Accounts for extremes and clustering through use of the correlation coefficient
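A short sketch comparing the two lines, assuming population SDs. The helper names (`summary`, `sd_line`, `regression_line`) and the data are hypothetical. Both lines pass through the point of averages; the SD line has slope ±(sd of y / sd of x), while the regression line has slope r * (sd of y / sd of x):

```python
from statistics import mean, pstdev

def summary(xs, ys):
    """Return the five summaries: mean x, mean y, sd x, sd y, r."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    r = mean((a - mx) * (b - my) for a, b in zip(xs, ys)) / (sx * sy)
    return mx, my, sx, sy, r

def sd_line(xs, ys):
    """(slope, intercept) of the SD line; the slope takes the sign of r."""
    mx, my, sx, sy, r = summary(xs, ys)
    slope = sy / sx if r >= 0 else -sy / sx
    return slope, my - slope * mx

def regression_line(xs, ys):
    """(slope, intercept) of the regression line for predicting y from x."""
    mx, my, sx, sy, r = summary(xs, ys)
    slope = r * sy / sx
    return slope, my - slope * mx

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sd_slope, sd_intercept = sd_line(x, y)
reg_slope, reg_intercept = regression_line(x, y)
print(sd_slope, sd_intercept)
print(reg_slope, reg_intercept)
```

Because |r| is at most 1, the regression line is never steeper than the SD line, which is what pulls its predictions back towards the mean of y at extreme x values.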
What is the point of averages?
The coordinate (mean of x, mean of y)
What is the graph of averages?
Plots the average y for each x value
The regression line is a smoothed-out version of the graph of averages
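A minimal sketch of a graph of averages, assuming x takes a small number of repeated values (with continuous x, the values would first need to be grouped into bins). The function name `graph_of_averages` and the data are hypothetical:

```python
from collections import defaultdict
from statistics import mean

def graph_of_averages(xs, ys):
    """Return sorted (x, average y) pairs, one per distinct x value."""
    groups = defaultdict(list)
    for a, b in zip(xs, ys):
        groups[a].append(b)
    return [(a, mean(bs)) for a, bs in sorted(groups.items())]

# Hypothetical data, for illustration only.
x = [1, 1, 2, 2, 3, 3]
y = [2, 4, 3, 5, 6, 8]
averages = graph_of_averages(x, y)
print(averages)
```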
What is a residual?
vertical distance or ‘gap’ of a point above or below the regression line
Represents error between actual value and prediction
What do residual plots graph?
Graphs residuals vs x
How do we know if linear fit is appropriate based on residual plots?
There shouldn’t be a pattern: the residuals should look random, with roughly constant spread across x. A visible pattern (e.g. a curve or a fan shape) means the residuals aren’t random, which violates the assumptions
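As an illustration of a residual pattern, the sketch below fits a line to deliberately curved, made-up data (y = x squared) and lists the residuals. The helper names `fit_line` and `residuals` are hypothetical:

```python
from statistics import mean, pvariance

def fit_line(xs, ys):
    """Least-squares (slope, intercept) for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    cov = mean((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = cov / pvariance(xs)
    return slope, my - slope * mx

def residuals(xs, ys):
    """Actual y minus predicted y for each point."""
    slope, intercept = fit_line(xs, ys)
    return [b - (slope * a + intercept) for a, b in zip(xs, ys)]

# Hypothetical curved data: y = x^2, so a straight line is a poor fit.
x = [-2, -1, 0, 1, 2]
y_curved = [4, 1, 0, 1, 4]
res = residuals(x, y_curved)
print(res)
```

Plotted against x, these residuals trace a U shape rather than a random scatter, which signals that a linear fit is not appropriate here.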
What are 11 common mistakes of regression?
1) r isn’t a percentage. I.e. r = 0.8 doesn’t mean 80% of points are clustered around the line
2) r = 0.8 doesn’t mean that points are twice as tightly clustered as with r = 0.4
3) Outliers can overly influence the correlation coefficient
4) Nonlinear associations can’t be detected by correlation coefficients
5) The same correlation coefficient can arise from very different data –> still need to be careful
6) Rates or averages tend to inflate the correlation coefficient. I.e. a correlation computed from group means (or rates) tends to overestimate the strength of association between the two variables for individuals
7) Association doesn’t mean causation
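8) Small SDs can make the correlation look bigger: in a scatter plot, a cloud with small SDs appears more tightly clustered even when r is unchanged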
9) Beware of extrapolating beyond the range of the regression line
10) A high correlation coefficient doesn’t guarantee the data are linear: r can be high even when the relationship is curved, so still check the scatter plot and residual plot
11) Beware of refitting: even though the correlation coefficient is the same if x and y are switched, the regression line is not, so we need to refit the model depending on what fits the context
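Mistake 11 can be seen numerically. In this sketch (made-up data, hypothetical helper `slope_intercept`), the line for predicting x from y is fitted separately and compared with simply inverting the line for predicting y from x:

```python
from statistics import mean, pvariance

def slope_intercept(xs, ys):
    """Least-squares (slope, intercept) for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    cov = mean((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = cov / pvariance(xs)
    return slope, my - slope * mx

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b_yx, a_yx = slope_intercept(x, y)   # regression of y on x
b_xy, a_xy = slope_intercept(y, x)   # regression of x on y (refitted)

# Slope obtained by algebraically inverting the y-on-x line instead of refitting.
naive_inverse_slope = 1 / b_yx

print(b_yx, b_xy, naive_inverse_slope)
```

The refitted slope differs from the inverted one unless r is exactly 1 or -1; the product of the two fitted slopes equals r squared.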