Linear regression Flashcards

(31 cards)

1
Q

What are the assumptions of linear regression?

A

**Linearity** - the relationship between the predictors and the outcome is linear (check the residuals vs. fitted plot).
**Independence** - residuals (differences between observed and fitted values) are independent of one another; they should appear randomly scattered with no patterns in the residuals vs. fitted plot.
**Constant variance (homoscedasticity)** - residuals show similar variance across all levels of the fitted values.
**Normality** - regression residuals (prediction errors) are well described by a Normal distribution (Q-Q plot).
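
These cards appear to describe R's built-in diagnostic plots; purely as an illustration, here is a minimal Python/statsmodels sketch (on simulated data) of fitting a model and extracting the residuals and fitted values that the checks on the following cards use.

```python
# Minimal sketch, not from the cards: fit an OLS model and pull out the
# pieces the later assumption checks rely on. Data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100)
y = 2 + 1.5 * x + rng.normal(scale=2, size=100)   # linear truth plus noise

fit = sm.OLS(y, sm.add_constant(x)).fit()

fitted = fit.fittedvalues   # values predicted by the model
resid = fit.resid           # observed minus fitted (the residuals)
print(fit.summary())
```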

2
Q

What are residuals?

A

The difference between the actual and predicted values; sometimes referred to as "errors".

The larger the residual, the worse the model is at predicting that outcome.

3
Q

What is homoscedasticity?

A

Constant variability of the residuals. In linear regression we assume constant variance: regression residuals should show similar variance across all levels of the fitted values.

4
Q

What do we mean by fitted values?

A

Values predicted by the regression model.

5
Q

What do we mean by independence of the model?

A

Residuals (errors) are independent of each other. Knowing the error of one data point should tell you nothing about the error of another.

6
Q

How do we check for linearity in a regression plot?

A

Look at the residuals vs. fitted values plot. If the smooth line is mostly horizontal, we can assume linearity.

By default, the plot labels the 3 largest absolute-value residuals.

Plot 1
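
A hedged Python sketch of this check on simulated data, including labels for the 3 largest absolute residuals as described above.

```python
# Sketch of the residuals vs. fitted check ("Plot 1") on simulated data.
import numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100); y = 2 + 1.5*x + rng.normal(scale=2, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, ax = plt.subplots()
ax.scatter(fit.fittedvalues, fit.resid, alpha=0.7)
ax.axhline(0, color="grey", linestyle="--")   # want points scattered around 0

# Label the 3 largest absolute residuals, mimicking the default described here
worst = np.argsort(np.abs(fit.resid))[-3:]
for i in worst:
    ax.annotate(str(i), (fit.fittedvalues[i], fit.resid[i]))

ax.set_xlabel("Fitted values"); ax.set_ylabel("Residuals")
ax.set_title("Residuals vs. Fitted")
plt.show()
```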

7
Q

How do we check for homoscedasticity (constant variance) in a model using the residuals vs. fitted plot?

A
  • Look for the residuals to be equally spread around the zero line as the fitted values increase
  • Problems if there is a “fan” shape (the spread of the residuals increases or decreases as the fitted values increase)

Plot 1 (residual vs fitted) or Plot 3 (scale-location)
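
A possible Python sketch of this check on simulated data: a scale-location style plot mirroring Plot 3, plus the Breusch-Pagan test, which is an extra numeric check not named on these cards.

```python
# Sketch of a constant-variance check on simulated data.
import numpy as np, statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100); y = 2 + 1.5*x + rng.normal(scale=2, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

# Scale-location style plot: sqrt(|standardized residuals|) vs. fitted values
std_resid = fit.get_influence().resid_studentized_internal
plt.scatter(fit.fittedvalues, np.sqrt(np.abs(std_resid)))
plt.xlabel("Fitted values"); plt.ylabel("sqrt(|standardized residuals|)")
plt.show()

# Breusch-Pagan test: a small p-value suggests non-constant variance
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, fit.model.exog)
print(f"Breusch-Pagan p-value: {lm_p:.3f}")
```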

8
Q

How do we check for independence in a model?

A

Look for random scatter of the residuals.
Problems if there is a pattern, trend, or clump of values.
If data are measured over time, check especially for evidence of patterns that might suggest the errors are not independent (e.g., plot residuals against time to look for patterns).

Plot 1
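
The cards rely on the residual plot for this; as an extra, assumed check (not from the course notes), the Durbin-Watson statistic is a common numeric screen for autocorrelated residuals in time-ordered data.

```python
# Sketch on simulated data: Durbin-Watson values near 2 suggest little
# autocorrelation in the residuals; values far from 2 suggest dependence.
import numpy as np, statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100); y = 2 + 1.5*x + rng.normal(scale=2, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

print("Durbin-Watson:", durbin_watson(fit.resid))
```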

9
Q

How do we check for normality in a model?

A
  • Assess with Normal Q-Q plot (and possibly histogram) of the residuals
  • Settle for the residuals satisfying the “nearly Normal” condition
  • Problems if standardized residual values high (generally >2 or < -2)

Normality assumption becomes less important as the sample size grows

Plot 2
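
A hedged Python sketch of this check on simulated data: a Normal Q-Q plot of the residuals plus a quick count of large standardized residuals.

```python
# Sketch of the normality check ("Plot 2") on simulated data.
import numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100); y = 2 + 1.5*x + rng.normal(scale=2, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

# Points close to the reference line => "nearly Normal" residuals
sm.qqplot(fit.resid, line="45", fit=True)
plt.show()

std_resid = fit.get_influence().resid_studentized_internal
print("standardized residuals beyond +/-2:", np.sum(np.abs(std_resid) > 2))
```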

10
Q

How do you assess for collinearity of predictors?

A
  • Assess with scatterplot matrix and correlation coefficient of predictors
  • Also, after fitting model, look at variance inflation factors (VIF); values >5 suggest a concern
  • If collinearity is a concern, then drop a predictor
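
A sketch of the VIF part of this check in Python/statsmodels, using two deliberately correlated simulated predictors.

```python
# Sketch of a collinearity check with variance inflation factors.
import numpy as np, statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(431)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)      # strongly correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))  # design matrix with intercept

# VIF for each predictor column (column 0 is the intercept); > 5 is a concern
for j, name in enumerate(["x1", "x2"], start=1):
    print(name, round(variance_inflation_factor(X, j), 1))
```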
11
Q

What is MAE?

A

Mean absolute (prediction) error; lower is better.

12
Q

What is RMSE?

A

Root mean squared (prediction) error; lower is better.
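
A small sketch of both prediction-error summaries (MAE from the previous card and RMSE from this one), on made-up observed and predicted values.

```python
# Sketch of computing MAE and RMSE; the arrays are made up for illustration.
import numpy as np

y_obs  = np.array([10.0, 12.0, 15.0, 11.0])
y_pred = np.array([ 9.5, 13.0, 14.0, 11.5])

mae  = np.mean(np.abs(y_obs - y_pred))            # mean absolute error
rmse = np.sqrt(np.mean((y_obs - y_pred) ** 2))    # root mean squared error
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")      # lower is better for both
```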

13
Q

What do AIC and BIC tell you in a model?

A

Akaike Information Criterion and Bayesian Information Criterion.

Indexes for comparing models, where a lower value indicates a better model.

14
Q

What is Adjusted R2?

A

R2 adjusted downward for the number of covariates in the model. It is no longer a proportion of variation explained, just an index where higher is better.

15
Q

What is Nominal R2?

A

Proportion of variation in the outcome explained by the model, higher is better

16
Q

What are the key criteria for determining which model is best?

A

Nominal R2 = higher is better
Adjusted R2 = higher is better
AIC and BIC = lower is better
RMSE = lower is better
MAE = lower is better
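
A sketch, on simulated data, of pulling several of these indexes from two candidate statsmodels fits for comparison (MAE and RMSE would be computed from predictions as on the earlier cards).

```python
# Sketch comparing two candidate models on R2, adjusted R2, AIC, and BIC.
import numpy as np, statsmodels.api as sm

rng = np.random.default_rng(431)
x1 = rng.uniform(0, 10, 100); x2 = rng.uniform(0, 5, 100)
y = 2 + 1.5*x1 + rng.normal(scale=2, size=100)   # x2 is actually irrelevant

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

for name, m in [("x1 only", m1), ("x1 + x2", m2)]:
    print(f"{name}: R2={m.rsquared:.3f}  adjR2={m.rsquared_adj:.3f}  "
          f"AIC={m.aic:.1f}  BIC={m.bic:.1f}")
```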

17
Q

Residuals in a model that meet all assumptions

A

standardized residuals (in addition to following a Normal distribution in general) will also have mean zero and standard deviation 1, with approximately 95% of values between -2 and +2, and approximately 99.74% of values between -3 and +3

In the example discussed, residuals 113 and 72 are beyond these bounds, so we need to look at leverage/influence.
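
A sketch of this check on simulated data: list the standardized residuals that fall outside the ±2 and ±3 bounds (roughly 5% and 0.26% are expected if all assumptions hold).

```python
# Sketch: identify standardized residuals beyond +/-2 and +/-3.
import numpy as np, statsmodels.api as sm

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100); y = 2 + 1.5*x + rng.normal(scale=2, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

std_resid = fit.get_influence().resid_studentized_internal
print("beyond +/-2:", np.where(np.abs(std_resid) > 2)[0])   # indexes to inspect
print("beyond +/-3:", np.where(np.abs(std_resid) > 3)[0])
```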

18
Q

How do we assess homoscedasticity using the scale-location plot?

A

Plots the square root of the standardized residuals against the fitted values.
Look for the loess smooth to be flat; there is a problem if it has a slope.

19
Q

How do we assess leverage?

A

Use a residuals vs. leverage plot or a Cook’s distance plot.
The most highly leveraged points (unusual predictor values) are shown furthest to the right in this plot.
Points with high influence are shown in the upper-right or lower-right sections outside of the dotted line (Cook’s distance > 0.5).

Plot 5
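
A sketch of extracting the quantities behind this plot (leverage and Cook's distance) from a simulated statsmodels fit.

```python
# Sketch: pull leverage (hat values) and Cook's distance for each observation.
import numpy as np, statsmodels.api as sm

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100); y = 2 + 1.5*x + rng.normal(scale=2, size=100)
fit = sm.OLS(y, sm.add_constant(x)).fit()

infl = fit.get_influence()
leverage = infl.hat_matrix_diag        # high values = unusual predictor values
cooks_d = infl.cooks_distance[0]       # high values = influential points

print("highest-leverage point:", int(np.argmax(leverage)))
print("points with Cook's distance > 0.5:", np.where(cooks_d > 0.5)[0])
```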

20
Q

What can we do if assumption of linearity is not met in a linear regression model?

A

  • Look for a potential transformation of the outcome Y with a Box-Cox plot
  • Look at residual plots against each of the individual predictors
  • This will help identify the specific predictor (or predictors) where we need to use a transformation
  • If all predictors are problematic, we might transform the outcome Y
  • If a plot of x vs. y suggests non-linearity (curves), then consider a polynomial or spline for the predictor (see the sketch below)
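
A sketch of the last bullet on simulated data: when the x-y relationship is curved, adding a polynomial term for the problematic predictor improves the fit. The data and variable names are made up for illustration.

```python
# Sketch: compare a straight-line fit with one adding a squared term
# when the true relationship is curved.
import numpy as np, statsmodels.api as sm

rng = np.random.default_rng(431)
x = rng.uniform(0, 10, 100)
y = 1 + 0.5 * x**2 + rng.normal(scale=2, size=100)   # truly curved in x

linear = sm.OLS(y, sm.add_constant(x)).fit()
quad   = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

print("linear fit    adj R2:", round(linear.rsquared_adj, 3))
print("quadratic fit adj R2:", round(quad.rsquared_adj, 3))
```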

21
Q

What can we do if assumption of homoscedasticity is not met in a linear regression model?

A

  • Consider potential transformations of the predictors, possibly fitting a polynomial term or cubic spline term in the predictors

22
Q

What can we do if assumption of independence is not met in a linear regression model?

A

  • No specific solution is outlined in the course notes or slides; most likely, use a different type of model

23
Q

What can we do if assumption of normality is not met in a linear regression model?

A

Consider a transformation of the outcome Y.
(Use Box-Cox; on the ladder of powers, λ = 1 means the outcome is left unchanged.)
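
A sketch of letting Box-Cox suggest a power, using scipy on a simulated, strictly positive outcome; a λ estimate near 0 points to a log transformation.

```python
# Sketch: estimate the Box-Cox lambda for a right-skewed, positive outcome.
import numpy as np
from scipy import stats

rng = np.random.default_rng(431)
y = np.exp(rng.normal(size=200))           # lognormal: positive, right-skewed

lam = stats.boxcox_normmax(y)              # estimated lambda (ladder of powers)
print("suggested lambda:", round(lam, 2))  # near 0 suggests a log transform

y_transformed, lam_used = stats.boxcox(y)  # applies the transformation
```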

24
Q

What transformation may be useful for right-skewed data?

A

Consider power transformations with λ below 1: square root (λ = 0.5), natural log (λ = 0), inverse (λ = -1).

25
Q

What transformation may be useful for left-skewed data?

A

Consider power transformations with λ above 1: square (λ = 2), cube (λ = 3).
26
Q

What transformation may be useful for data with heavy tails (outlier-prone) or light tails (fewer extreme values than a Normal distribution)?

A

None. If a value has high influence (very different from the rest of the values), it may merit removal before modeling, with an explanation in our report. Acceptable reasons include:
  • This observation was included in the sample in error (for instance, because the subject was not eligible)
  • An error was made in transcribing the value of this observation to the final data set
  • This observation is part of a meaningful subgroup of patients that I had always intended to model separately

“I ran my model and it doesn’t fit this point well” is NOT a good reason.
27
Q

What does skew look like in a Normal Q-Q plot?

A

A curved (rather than straight-line) pattern of points: right (positive) skew bends upward away from the line at the upper end (a concave-up curve), while left (negative) skew bends downward at the lower end (a concave-down curve).
28
Need to include review of interpretation of a linear regression
29
Q

You are building a linear regression model for a quantitative outcome called out with only a limited number of observations, and you need to include four predictors: age (in years), prior (1 = had prior surgery, 0 = no prior surgery), severity (three categories: High, Medium, Low), and length (in centimeters). Suppose you are permitted to spend an additional four degrees of freedom beyond the five accounted for by the intercept term and the main effects of these four predictors. Based on the Q14 Spearman ρ² plot (not shown here), which of these models best does this additional spending?

A

i. The Spearman ρ² plot pushes us to age and then severity as candidates.
ii. age is a quantitative variable.
iii. severity is a categorical variable with three categories.
iv. To spend an additional four degrees of freedom, we can use a 4-knot spline in age (which adds 2 df) and an interaction of the linear age effect with severity (which also adds 2 df), as in option d.
30
Q

Consider the information (not shown here) on the distribution of a potential outcome variable in a linear regression model to be built. Based on the three pieces of output provided, which transformation of the outcome data would be most appropriate?

A

A logarithmic transformation is likely to be helpful. The Box-Cox plot favors power λ = 0, indicating a logarithm. There is right skew in the boxplot, and all of the values are strictly positive, with the mean well above the median.
31
* Suppose you want to include five predictor variables (dm, group, symptom, sys_bp, and tot_chol) in a linear model for result. dm and group are categorical variables with 2 and 4 categories, respectively. symptom, sys_bp, and tot_chol are continuous variables. * The main effects model including these five predictors uses 7 degrees of freedom. Suppose you are now willing to spend 12 df total (an additional 5 df in an augmented model). Use the Spearman rho-square plot to motivate your choices.
* Our augmented model should add 5 df to the main effects model, and the most logical choices are to add nonlinear terms to tot_chol and group according to the plot. * Therefore, we will add a 4-knot spline for tot_chol for 2 additional df (the spline uses K-1 = 3 df, or 2 more df than the main effects model). We will also add an interaction between group and the main effects of tot_chol for an additional 3 df (df1 × df2 = 3 × 1 = 3 more df).