Linear regression Flashcards
(31 cards)
What are the assumptions of linear regression?
**Linearity** - the relationship between the predictors and the outcome is linear (residuals vs fitted plot)
**Independence** - residuals (differences between observed and fitted values) should appear randomly scattered with no patterns when plotted (residuals vs fitted plot)
**Constant variance (homoscedasticity)** - residuals show similar variance across all levels of the fitted values
**Normality** - regression residuals (prediction errors) are well described by a normal distribution (Q-Q plot)
What are residuals?
The difference between the actual and predicted values. Sometimes referred to as “errors”
The larger the residual, the worse the model is at predicting the outcome.
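As a minimal sketch (with made-up observed and fitted values):

```python
# Residuals: observed minus predicted values (made-up example data).
observed  = [3.0, 5.0, 7.5, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]  # from some hypothetical fitted model

residuals = [y - yhat for y, yhat in zip(observed, predicted)]
print(residuals)  # [0.5, -0.5, 0.5, -1.0]
```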
What is homoscedasticity?
Variability. In linear regression we assume constant variance: regression residuals should show similar variance across all levels of the fitted values.
What do we mean by fitted values?
Values predicted by the regression model.
What do we mean by independence of the model?
Residuals (errors) are independent of each other. Knowing the error of one data point should tell you nothing about the error of another.
How do we check for linearity in a regression plot?
Look at the residuals vs fitted values plot. If the line is mostly horizontal, we can assume linearity.
By default, the plot labels the 3 largest absolute-value residuals.
plot 1
How do we check for homoscedasticity (constant variance) in a model using the residuals vs fitted plot?
- Looking for the residuals to be equally distributed around the line as fitted values increase
- Problems if there is a “fan” shape (residuals increase or decrease as fitted values increase)
Plot 1 (residual vs fitted) or Plot 3 (scale-location)
How do we check for independence in a model?
Looking for random scatter of values
Problems if there is a pattern, trend, or clump of values
If data are measured over time, check especially for evidence of patterns that might suggest they are not independent (e.g., plot residuals against time to look for patterns)
Plot 1
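Beyond eyeballing the plot, one simple numeric check is the lag-1 autocorrelation of the residuals; values near 0 are consistent with independence. A sketch with illustrative (not course) data:

```python
# Lag-1 autocorrelation of residuals: values near 0 suggest no serial
# dependence; values near +/-1 suggest the residuals are not independent.
def lag1_autocorr(resid):
    n = len(resid)
    mean = sum(resid) / n
    num = sum((resid[i] - mean) * (resid[i + 1] - mean) for i in range(n - 1))
    den = sum((r - mean) ** 2 for r in resid)
    return num / den

# A trending, time-ordered residual series is clearly not independent:
trending = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(round(lag1_autocorr(trending), 2))  # 0.5 -> noticeable serial dependence
```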
How do we check for normality in a model?
- Assess with Normal Q-Q plot (and possibly histogram) of the residuals
- Settle for the residuals satisfying the “nearly Normal” condition
- Problems if standardized residual values high (generally >2 or < -2)
Normality assumption becomes less important as the sample size grows
Plot 2
How do you assess for collinearity of predictors?
- Assess with scatterplot matrix and correlation coefficient of predictors
- Also, after fitting model, look at variance inflation factors (VIF); values >5 suggest a concern
- If collinearity is a concern, then drop a predictor
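In the two-predictor case the VIF reduces to 1 / (1 - r²), where r is the predictors' correlation (with more predictors, r² becomes the R² from regressing each predictor on the others). A sketch with made-up numbers:

```python
import math

# For exactly two predictors, VIF = 1 / (1 - r^2), where r is their
# Pearson correlation coefficient.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def vif_two_predictors(x1, x2):
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]          # nearly collinear with x1 (made-up)
print(vif_two_predictors(x1, x2) > 5)   # True: VIF far above the >5 concern threshold
```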
What is MAE?
Mean absolute prediction error; lower is better
What is RMSE?
Root mean squared prediction error; lower is better
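Both are easy to compute from the residuals; a minimal sketch with made-up data:

```python
import math

# MAE: mean of |error|. RMSE: square root of the mean squared error.
# RMSE penalizes large errors more heavily than MAE; lower is better for both.
def mae(observed, predicted):
    return sum(abs(y - yhat) for y, yhat in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    return math.sqrt(sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted)) / len(observed))

observed  = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]   # made-up fitted values
print(mae(observed, predicted))     # (1 + 0 + 1 + 2) / 4 = 1.0
print(rmse(observed, predicted))    # sqrt((1 + 0 + 1 + 4) / 4) ~ 1.22
```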
What do AIC and BIC tell you in a model?
Akaike Information Criterion and Bayesian Information Criterion
Indexes where a lower value = better model
What is Adjusted R2?
R2 adjusted down for the number of covariates in the model; no longer a proportion of variation, just an index where higher is better
What is Nominal R2?
Proportion of variation in the outcome explained by the model, higher is better
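Both versions can be computed directly from the residuals; a sketch with made-up data (n observations, k predictors):

```python
# R^2 = 1 - SS_residual / SS_total.
# Adjusted R^2 penalizes extra predictors:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def r_squared(observed, predicted):
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(observed, predicted, k):
    n = len(observed)
    r2 = r_squared(observed, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

observed  = [2.0, 4.0, 6.0, 8.0, 10.0]
predicted = [2.5, 3.5, 6.0, 8.5, 9.5]   # made-up fitted values, k = 1 predictor
print(r_squared(observed, predicted))               # 0.975
print(adjusted_r_squared(observed, predicted, k=1)) # slightly lower, ~0.967
```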
Key things to determine which model is best?
Nominal R2 = higher is better
Adjusted R2 = higher is better
AIC and BIC = lower is better
RMSE = lower is better
MAE = lower is better
What do residuals look like in a model that meets all assumptions?
standardized residuals (in addition to following a Normal distribution in general) will also have mean zero and standard deviation 1, with approximately 95% of values between -2 and +2, and approximately 99.74% of values between -3 and +3
Residuals 113 and 72 are beyond these bounds, so we need to look at leverage/influence
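A quick sketch of flagging unusual standardized residuals (made-up values; under normality only about 5% should fall outside ±2):

```python
# Flag standardized residuals beyond +/-2 (about 5% expected under normality)
# or beyond +/-3 (about 0.26% expected) as candidates for a leverage/influence check.
def flag_outliers(std_resid, cutoff=2.0):
    return [i for i, z in enumerate(std_resid) if abs(z) > cutoff]

std_resid = [0.3, -1.1, 2.4, 0.8, -2.7, 1.5]   # made-up standardized residuals
print(flag_outliers(std_resid))                 # [2, 4] -> inspect those points
```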
How do we assess homoscedasticity using the scale-location plot?
Plots the square root of the standardized residuals against the fitted values.
Looking for the loess smooth to be flat; problem if it has a slope.
How do we assess leverage?
A residuals vs leverage plot or Cook’s distance plot.
The most highly leveraged points are shown furthest to the right in this plot (unusual predictor values).
Points with high influence are shown in the upper or lower right section outside of the dotted line (Cook’s distance > 0.5).
Plot 5
What can we do if assumption of linearity is not met in a linear regression model?
- Look for a potential transformation of outcome Y with a Box-Cox plot
- Look at residual plots against each of the individual predictors
- This will help identify the specific predictor (or predictors) where we need to use a transformation
- If all predictors are problematic, we might transform the outcome Y
- If a plot of x vs. y suggests non-linearity (curves), then consider a polynomial or spline for the predictor
What can we do if assumption of homoscedasticity is not met in a linear regression model?
- Consider potential transformations of the predictors, possibly fitting some sort of polynomial term or cubic spline term in the predictors
What can we do if assumption of independence is not met in a linear regression model?
- No specific solution outlined in the course notes or slides; probably just use a different model
What can we do if assumption of normality is not met in a linear regression model?
Consider a transformation of the outcome Y
(use Box-Cox; on the ladder of powers, lambda = 1 leaves Y unchanged)
What transformations may be useful for right-skewed data?
Consider power transformations below 1: square root (lambda = 0.5), natural log (lambda = 0), inverse (lambda = -1)
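A sketch of applying rungs of the ladder of powers to a single right-tail value (pure arithmetic; lambda = 0 is defined as the natural log):

```python
import math

# Ladder-of-powers transformations for reducing right skew:
# lambda = 0.5 -> square root, lambda = 0 -> natural log, lambda = -1 -> inverse.
def power_transform(y, lam):
    if lam == 0:
        return math.log(y)      # the log occupies the lambda = 0 rung
    return y ** lam

y = 100.0                        # a large right-tail value (made up)
print(power_transform(y, 0.5))   # 10.0 (square root)
print(power_transform(y, 0))     # ~4.6 (natural log)
print(power_transform(y, -1))    # 0.01 (inverse)
```

Each step down the ladder pulls large values in more aggressively, which is why lower powers suit stronger right skew.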