L21 Part 2 - Single regression (chapter 8 part 1) Flashcards
What is linear regression?
Models the relationship between a scalar dependent variable y and one or more explanatory variables x
↪ outcome = model prediction + error
- One explanatory variable → single linear regression
- The relationship is modelled using linear predictor functions whose unknown parameters are estimated from the data
What is the formula for linear regression?
Picture 1 - expresses how our model predicts the outcome (how much is accurate prediction vs. error?)
Y - outcome variable
Bs - parameters; they represent the quantities we’re interested in estimating
- B0 - intercept, baseline level that we are predicting with
- B1 - regression coefficient for our single predictor variable and it quantifies how strong the association is between our predictor and outcome variable
↪ we multiply this with our predictor variable X and this product gives us the model prediction
↪ to calculate B1, we use correlation between the two variables, so the higher the correlation the stronger the predictive value the predictor has
- when we add hats to the bs, they are sample-based estimates of the population parameters
E - errors in the prediction of our sample model (residuals)
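As a minimal sketch of the formula outcome = model prediction + error, with made-up values for b0, b1 and one observation (none of these numbers come from the slides):

```python
# Hypothetical single-regression prediction: y = b0 + b1*x + e
b0 = 2.0   # intercept: baseline prediction when the predictor is 0
b1 = 0.5   # regression coefficient: strength of the association
x = 10.0   # predictor value for one case
y = 7.5    # observed outcome for that case

y_hat = b0 + b1 * x      # model prediction (y with a hat)
residual = y - y_hat     # error for this case: observed minus predicted
print(y_hat, residual)   # 7.0 0.5
```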
Assumptions for linear regression
picture 14 - general procedure of fitting a regression model but it shows what to do with each assumption
- Continuous variables
- linearity
- independent errors of the observations
- Sensitivity (outliers)
- Homoscedasticity (equivalent to equal variances in anova)
- Normality (model residuals are normally distributed; visualised with QQ plots)
What is linearity?
For this assumption to hold, the predictors must have a linear relation to the outcome variable
- checked through: correlations, matrix scatterplot with predictors and outcome variable
What is sensitivity?
Potential influence of outliers
We look at outliers through:
- Extreme residuals
- Cook’s distance
- Check Q-Q plots, residual plots, casewise diagnostics (Cook’s distance)
What is the difference between unstandardized residuals and standardized residuals?
Residuals represent the error present in the model (small residual = model fits the sample data well)
Unstandardized residuals - raw differences between predicted and observed values of the outcome variable
↪ measured in the same units as the outcome variable which makes it difficult to generalize
Standardized residuals - residuals converted to z-scores and so are expressed in SD units (mean 0, sd 1)
- With standardized residuals we can assess which data points are outside of the general pattern of the data set
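A quick sketch of the conversion, using made-up raw residuals (the values and the |z| > 2 cutoff are illustrative assumptions, not from the slides):

```python
import statistics

# Made-up raw residuals, in the same units as the outcome variable
residuals = [1.2, -0.8, 0.3, -2.9, 0.5, 1.7]

# Standardize: subtract the mean, divide by the SD -> z-scores (mean 0, SD 1)
mean = statistics.fmean(residuals)
sd = statistics.stdev(residuals)
z_resid = [(r - mean) / sd for r in residuals]

# Cases with |z| above ~2 fall outside the general pattern of the data
flagged = [r for r, z in zip(residuals, z_resid) if abs(z) > 2]
```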
Diagnostic statistics
What is leverage?
It gauges the influence of the observed value of the outcome variable over the predicted values
- The average leverage is (k+1)/n, where k is the number of predictors in the model and n is the number of cases
- Can vary from 0 (no influence) to 1 (the case has complete influence over predictions)
- If no cases exert undue influence over the model, all leverage values should be close to the average value of (k+1)/n
- Values greater than twice the average should be investigated
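The twice-the-average rule can be sketched like this (k, n and the leverage values are hypothetical):

```python
# Hypothetical check of leverage values against the twice-the-average rule
k, n = 3, 100                  # 3 predictors, 100 cases (made-up)
avg_leverage = (k + 1) / n     # average leverage = (k+1)/n = 0.04
threshold = 2 * avg_leverage   # investigate anything above 0.08

leverages = [0.03, 0.05, 0.09, 0.02]   # hypothetical per-case leverage values
to_investigate = [h for h in leverages if h > threshold]
print(to_investigate)   # [0.09]
```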
What is cook’s distance?
A measure of the overall influence of a case on the model
↪ computed for every observation separately; it assesses to what extent our results would change with vs. without that observation (if removing a participant noticeably changes the results, that case has a high Cook’s distance and is a likely outlier)
↪ Cook’s distance should be < 1 for this assumption to be met
↪ combines the point’s leverage and its residual; a point with high leverage and a high residual has a large Cook’s distance, so a strong influence on the fitted values of the model
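One common formulation combines the squared (studentized) residual with leverage; the function and the example values below are an illustrative sketch, not the slides’ computation:

```python
def cooks_distance(std_resid, leverage, k):
    """One common formulation of Cook's distance: combines a case's
    (studentized) residual with its leverage; k = number of predictors."""
    return (std_resid ** 2 / (k + 1)) * (leverage / (1 - leverage))

# High residual AND high leverage -> large Cook's distance (investigate if > 1)
d_influential = cooks_distance(3.0, 0.5, 1)   # hypothetical extreme case -> 4.5
d_typical = cooks_distance(0.5, 0.04, 1)      # hypothetical ordinary case, well below 1
```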
How does outlier affect our correlation?
Picture 2
The correlation is higher when the outlier is removed because it’s not a follower of the general pattern of the data
What do we do when there is an outlier?
We must always follow up on the outlier and investigate why it occurred; we should never just remove it from our data
- see outliers as a source of information, not as an annoyance
What is homoscedasticity? How do we assess it?
Variance of residuals should be equal (equally distributed) across all expected values → no systematic errors
- Assess by looking at scatterplot of standardized residuals: predicted values vs. residuals → roughly round shape is needed (spread out points equally, no pattern in the errors)
- After the analysis is complete because it’s based on the residuals
- picture 2
What is cross-validation? Why do we use it?
Assessing the accuracy of a model across different samples
- To generalise, model must be able to accurately predict the same outcome variable from the same set of predictors in a different group of people
- If the model is applied to a different sample and there is a severe drop in its predictive power, then the model does not generalise
How large should our sample be?
It depends on the size of effects that we’re trying to detect and how much power we want to detect these effects
Why does it matter what size of the sample we have in terms of R?
The estimate of R that we get from regression is dependent on the number of predictors, k, and the sample size, N
- the expected R for random data (which should be 0) is k/(N−1), so with small sample sizes random data can appear to show a strong effect
- E.g. 6 predictors, 21 cases of data; R = 6/(21-1) = 0.3
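The rule of thumb as a one-liner, reproducing the example above and showing how a larger sample shrinks the spurious R:

```python
def expected_r_random(k, n):
    # Expected R for purely random data (should ideally be 0):
    # grows with the number of predictors k, shrinks with sample size n
    return k / (n - 1)

print(expected_r_random(6, 21))    # 6 predictors, 21 cases -> 0.3
print(expected_r_random(6, 201))   # same predictors, ten times the cases -> 0.03
```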
What can we do as a first step in the analysis of our data?
Create a scatterplot to see whether the data are somehow associated and in which direction
- The strength of the correlation is decided later; this is just to get an idea of the data
What is the regression line/least squares line?
A line that is as close as possible to all of our data points (picture 15 - the orange line is the least squares line; the blue line is the null model, i.e. what we predict for y without using any predictors)
- calculated with the regression formula using the regression weights (bs) that minimize the sum of squares
- it’s the optimal fit to our data points using the regression weights
- the distances between the data points and the orange line (what we predict based on our regression formula) are the model error (unexplained variance)
How do we get to the model sums of squares?
We can turn the model error sum of squares into a proportion of the total variance (e.g. 7.68/10.24 = 0.75) and then take the complement (e.g. 1 − (7.68/10.24) = 0.25): this represents the proportion of explained variance
- the trick is that if we take the square root of that, we get the correlation between our two variables (√0.25 = 0.5) → this works because we have a single predictor; once we increase the number of predictors, it’s no longer the same
Look at picture 15 to see the numbers
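The arithmetic above, step by step (the 7.68 and 10.24 are the numbers quoted from picture 15):

```python
import math

ss_error = 7.68    # model error sum of squares (from picture 15)
ss_total = 10.24   # total sum of squares (from picture 15)

prop_unexplained = ss_error / ss_total   # 0.75: proportion of unexplained variance
r_squared = 1 - prop_unexplained         # 0.25: proportion of explained variance
r = math.sqrt(r_squared)                 # 0.5: the correlation (single predictor only)
```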
How do we calculate b1?
Picture 3
It’s based on the correlation between the two variables
- it’s a type of standardised statistic, but it’s not bound between −1 and 1
- quantifies how strong the association is between our two variables (no association - b1 would be 0)
B1 is the slope coefficient - determines the slope of the line (positive - increasing slope, negative - decreasing slope, close to 0 - no association)
How do we calculate b0?
Picture 4
The average of our outcome variable minus b1 multiplied by the average of our predictor variable
It determines at what point we cross the Y axis, i.e. the prediction when the predictor is 0
- if our predictor is 0, the model predicts b0 for the outcome (with no predictor at all, we would just predict the average value of y)
What is the interpretation of the slope coefficient?
Represents the change in the outcome associated with a unit change in the predictor
- If we increase our predictor by 1 unit, we predict the outcome variable to increase by b1 units
Picture 5 and picture 6
Now that we have b0 and b1, what can we do?
We can quantify the prediction for the outcome variable based on the value of x (airplay in picture 7) and then see how close our predictions are to our observed data (y with a hat vs. y)
- Our regression line tells us what our model predicts (the same thing that group means told us in anova, but now we have a continuous variable so it’s a regression line)
Now that we have our predictions what is the next step?
We can look at the error/residuals, which is the difference between the model predictions and observed values
- and now we can test for homoscedasticity
How do we test for homoscedasticity?
Using a scatterplot to check whether there is an association between the size of our model error and the size of the prediction (predictions vs. residuals)
- we want 0 correlation - no systematic error going on
- picture 8
How do we compare our observed data to our predicted ones?
We create a scatter plot (picture 16) and calculate the correlation, i.e. fit of the model (!in multiple regression the slope in the scatter plot is not equal to the correlation of the model fit!)
- the stronger the correlation, the better our model did