Linear Regression Flashcards

1
Q

What does ‘strength of a relationship’ in regression refer to?

A

It is an indication of how well one can predict the response variable (e.g., sales) from the predictor (e.g., advertising budget). A strong relationship implies high predictive accuracy, whereas a weak relationship implies a prediction only slightly better than random guessing.

2
Q

Definition: Simple Linear Regression

A

A linear model with one predictor (X) used to predict an outcome (Y).

3
Q

What does the symbol ‘≈’ mean in a regression/statistical context?

A

It can be read as ‘is approximately modeled as,’ indicating an approximate relationship rather than an exact equality.

4
Q

sales ≈ β0 + β1 × TV

What does β0 represent in this equation?

A

Intercept

5
Q

Definition: Intercept (β0)

A

Represents the predicted value of Y when X=0.

6
Q

sales ≈ β0 + β1 × TV

What does β1 represent in this equation?

A

Slope

7
Q

Definition: Slope (β1)

A

Represents the average change in Y for a one-unit increase in X.

8
Q

sales ≈ β0 + β1 × TV

What are terms used to refer to β0 and β1 collectively?

A
  • Coefficients
  • Parameters
9
Q

Ordinary Least Squares (OLS) Estimation

A

A method to estimate β0 and β1 by minimizing the sum of squared residuals.

10
Q

Residual (ε)

A

eᵢ = yᵢ − ŷᵢ

The difference between an observed value (Y) and the model’s fitted value (Ŷ).

11
Q

Residual sum of squares (RSS) equation

Include simple form and full form

A

Simple form: RSS = e₁² + e₂² + ⋯ + eₙ², where eᵢ = yᵢ − ŷᵢ. Full form: RSS = Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ)².
12
Q

Equation for slope (β1)

A

β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
13
Q

Equation for intercept (β0)

A

β̂₀ = ȳ − β̂₁x̄
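
A minimal R sketch (not part of the original deck) illustrating the formulas in cards 11–13; the variables x and y are simulated, hypothetical data.

```
# Compute the least squares estimates by hand, then compare with lm().
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))  # should match the manual estimates
```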
14
Q

Definition: least squares coefficient estimates

A

They are the intercept and slope estimates chosen to minimize the sum of squared residuals (differences between observed and predicted values), providing the best linear fit to the data under the least squares criterion.

15
Q

Best-Fit Line

A

The fitted linear function Ŷ = β̂₀ + β̂₁X that minimizes the sum of squared residuals.

16
Q

Interpretation of β1 in Linear Regression

Hint: Consider MLR as well as SLR

A

Indicates how much Y is expected to change when X increases by one unit, holding other factors constant (if any).

17
Q

Assumption: Linearity

A

Y is assumed to be linearly related to X.

18
Q

Assumption: Independence of Errors

A

The residuals are assumed to be uncorrelated with one another.

19
Q

Assumption: Exogeneity

A

The error term ε is assumed to be independent of (uncorrelated with) the predictor X, so that E(ε | X) = 0.

20
Q

Assumption: Homoscedasticity

A

The variance of residuals is constant across all values of X.

21
Q

Assumption: Normality of Errors

A

Residuals are assumed to follow a normal distribution (especially important for inference).

22
Q

Definition: Population regression line

A

The population regression line is the true (but typically unknown) underlying linear relationship between X and Y.

23
Q

Definition: least squares line

A

The least squares line is our estimated linear relationship based on a specific sample of data.

24
Q

What is the distinction between the least squares line and population regression line?

A

Different samples yield slightly different least squares lines, but the population regression line remains fixed (and unobserved).

25
What is an unbiased estimator in statistics?
An unbiased estimator is one whose expected value equals the true parameter across many samples, meaning it does not systematically over- or under-estimate the parameter.
26
Are least squares estimates unbiased?
Yes. If we repeatedly draw different samples and compute the least squares estimates, the average of those estimates will equal the true coefficients. Hence they do not systematically over- or under-estimate the true parameters.
27
What is the standard error of an estimator?
It is a measure of the estimator’s variability—how far the estimator (e.g., a sample mean or a regression coefficient) is likely to deviate from the true parameter value on average.
28
Formula for the variance of the sample mean μ̂
Var(μ̂) = SE(μ̂)² = σ² / n
29
# Var(μ̂) = SE(μ̂)² = σ² / n What are the conditions under which this formula for the variance of the sample mean μ̂ holds?
The n observations must be independent and identically distributed (i.i.d.) with finite variance σ².
30
# Var(μ̂) = SE(μ̂)² = σ² / n What does the variance equation for the sample mean μ̂ tell us?
The variability of the sample mean decreases as sample size grows
31
Formula for Var(β̂₀) in simple linear regression
SE(β̂₀)² = σ² * [ 1/n + ( x̄² / Σᵢ (xᵢ - x̄)² ) ]
32
Formula for Var(β̂₁) in simple linear regression
SE(β̂₁)² = σ² / Σᵢ (xᵢ - x̄)²
33
# SE(β̂₁)² = σ² / Σᵢ (xᵢ - x̄)² What does the variance equation for β̂₁ in simple linear regression tell us?
SE(β̂₁) is smaller when the xᵢ are more spread out; intuitively we have more leverage to estimate a slope when this is the case
34
What is a confidence interval?
It is a range of values that, with a specified level of confidence (e.g. 95%), is expected to contain the true (but unknown) parameter.
35
What is the approximate 95% confidence interval for β₁ in simple linear regression? How is it approximate?
β̂₁ ± 2 · SE(β̂₁). (Strictly speaking, we use the t-distribution quantile with n−2 degrees of freedom, but 2 is a close approximation.)
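A quick sketch (simulated, hypothetical data) comparing the ±2·SE approximation with the exact t-based interval that confint() reports.

```
set.seed(1)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
c(est - 2 * se, est + 2 * se)   # approximate 95% CI for beta_1
confint(fit)["x", ]             # exact interval using the t quantile (n - 2 df)
```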
36
What is the null hypothesis (H₀) when testing for a relationship between X and Y?
H₀: β₁ = 0 (no relationship between X and Y).
37
What is the alternative hypothesis (Hₐ) when testing for a relationship between X and Y?
Hₐ: β₁ ≠ 0 (some relationship between X and Y).
38
What distribution does the test statistic, (β̂₁ - 0) / SE(β̂₁), follow when testing for a relationship between X and Y?
t-distribution with n - 2 degrees of freedom.
39
How do we compute the t-statistic for β₁ when testing for a relationship between X and Y?
t = (β̂₁ - 0) / SE(β̂₁), which measures how many standard deviations β̂₁ is away from zero.
40
What does the p-value represent in the context of testing for a relationship between X and Y?
It is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A small p-value suggests that β₁ ≠ 0.
41
How do we typically decide to reject H₀?
If the p-value is below a chosen significance level (e.g., 0.05), we reject H₀ and conclude there is likely a relationship between X and Y.
42
Using Linear Models for Inference
We can make statements about the relationship between X and Y (e.g., whether β₁ ≠ 0) based on statistical tests.
43
Definition: Residual Standard Error (RSE)
An estimate of the standard deviation of the error terms in a regression model, measuring how far observed values typically deviate from the true regression line.
44
Formula: RSE | Simple linear regression
RSE = sqrt( (1/(n - 2)) * Σ( yᵢ - ŷᵢ )² ), where n is the number of observations.
45
Definition: Total Sum of Squares (TSS)
Represents the total variability in the response variable Y before regression.
46
Formula: TSS
TSS = Σ( yᵢ - ȳ )²
47
# Include equation Definition: Residual Sum of Squares (RSS)
RSS = Σ( yᵢ - ŷᵢ )². It measures the variability in Y left unexplained by the regression model.
48
Definition: R² Statistic
R² measures the proportion of variability in Y that is explained by the model; it always lies between 0 and 1.
49
Formula: R²
R² = 1 - (RSS / TSS). It compares unexplained variability to total variability in the data.
50
Why might we use R² instead of RSE?
R² is a scale-free measure of the proportion of variance in the response explained by the model, always lying between 0 and 1. RSE, in contrast, is on the scale of Y and can be harder to interpret across different contexts.
51
What is considered a 'good' R²?
It depends on the context and field of application. In some physical sciences, values near 1 might be realistic. In many social or biological settings, much lower R² values (e.g. 0.1 or 0.2) may still be considered informative.
52
Definition: Correlation between X and Y | Include a verbal explanation of how it's computed
A measure of the linear relationship between X and Y, computed as the covariance of X and Y divided by the product of their standard deviations.
53
Formula: Correlation between X and Y
Cor(X,Y) = (∑(xᵢ - x̄)(yᵢ - ȳ)) / √[∑(xᵢ - x̄)² * ∑(yᵢ - ȳ)²]
54
Relationship: R² and correlation in simple linear regression
In a simple linear regression with one predictor, R² equals the square of the correlation between X and Y.
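A small illustrative check (simulated, hypothetical data) that R² equals the squared correlation in simple linear regression.

```
set.seed(1)
x <- rnorm(100)
y <- 2 - x + rnorm(100)
fit <- lm(y ~ x)

summary(fit)$r.squared   # R^2 from the fitted model
cor(x, y)^2              # squared correlation; identical in SLR
```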
55
Definition: F-Statistic (for regression)
A ratio of explained variance to unexplained variance, used to test whether at least one predictor is significantly related to the response.
56
What is Multiple Linear Regression (MLR)?
A statistical technique for modeling the relationship between one response (dependent) variable and multiple predictor (independent) variables, using a linear function.
57
What is the general MLR model equation?
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε, where Y is the response, Xᵢ are predictors, βᵢ are unknown coefficients, and ε is the error term.
58
How are the coefficients in MLR typically estimated?
By minimizing the Residual Sum of Squares (RSS) = Σ(yᵢ - ŷᵢ)², where ŷᵢ is the model’s predicted value for observation i.
59
# Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε What does βⱼ represent in MLR?
βⱼ represents the average change in the response Y for a one-unit change in Xⱼ, holding all other predictors constant.
60
Why might multiple predictors be used instead of just one?
Including additional relevant predictors often improves predictions and reveals more nuanced relationships, controlling for the effects of other variables.
61
Why might a predictor appear significant when analyzed alone but not in a multiple regression?
Because in simple regression we do not control for the effects of other predictors. Once we include additional variables, the apparent significance can disappear if the predictor’s effect was actually due to correlation with those other predictors.
62
What are some important questions we may seek to answer with MLR? | 4 questions
1. Is at least one of the predictors X₁, X₂, ..., Xₚ useful in predicting the response? 2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? 3. How well does the model fit the data? 4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
63
Definition: RSE in MLR
An estimate of the standard deviation of the error terms, measuring how far observed values typically deviate from the fitted regression hyperplane. It quantifies the average unexplained variability per observation.
64
Formula: RSE in MLR
RSE = √[ RSS / (n - p - 1) ], where p is the number of predictors and n is the sample size
65
Definition: Multiple R²
It is the proportion of variability in the response Y explained by the model. It ranges from 0 to 1.
66
Formula: Multiple R²
R² = 1 - (RSS / TSS)
67
Why might we use Adjusted R² instead of R² in MLR?
R² can increase by simply adding more predictors, even if they are only marginally useful. Adjusted R² penalizes for extra predictors, preventing misleadingly high R² values.
68
Formula: Adjusted R²
Adjusted R² = 1 - [ (RSS / (n - p - 1)) / (TSS / (n - 1)) ]
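A sketch (simulated data; the predictors in X are hypothetical) computing adjusted R² from the formula and comparing it with the value summary() reports.

```
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] + rnorm(n)
fit <- lm(y ~ X)

rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
1 - (rss / (n - p - 1)) / (tss / (n - 1))   # adjusted R^2 from the formula
summary(fit)$adj.r.squared                  # value reported by summary()
```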
69
What is the overall F-test in MLR?
A hypothesis test checking whether at least one of the predictors has a non-zero coefficient. H₀: all βⱼ = 0 vs. Hₐ: at least one βⱼ ≠ 0.
70
# F = [ (TSS - RSS)/p ] / [ RSS/(n - p - 1) ] What does F-statistic in MLR tell you?
A large F suggests that the model with predictors explains significantly more variance than a model with no predictors.
71
Formula: F-statistic in MLR
F = [ (TSS - RSS)/p ] / [ RSS/(n - p - 1) ]
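A sketch (simulated, hypothetical data) computing the overall F-statistic from TSS and RSS and checking it against summary().

```
set.seed(1)
n <- 100; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic from the formula
summary(fit)$fstatistic["value"]          # F-statistic reported by summary()
```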
72
What distribution does the F-statistic follow in multiple linear regression?
Under the classical assumptions (normal errors, etc.), the F-statistic follows an F-distribution with p and n−p−1 degrees of freedom.
73
The F-statistic in multiple linear regression follows an F distribution under the assumption that the errors ε_i have a normal distribution. Does this still hold if the errors are not perfectly normal?
Yes, if the sample size n is sufficiently large, the F-statistic is approximately F-distributed due to asymptotic robustness, even if the errors deviate from normality.
74
What is a partial F-test in multiple linear regression?
It compares a 'full' model (with all predictors) to a 'reduced' model (omitting a subset of q predictors), determining whether those q predictors significantly improve the fit.
75
What is the formula for the partial F-statistic?
F = [ (RSS₀ - RSS) / q ] / [ RSS / (n - p - 1) ], where RSS₀ is the RSS of the reduced model, RSS is the RSS of the full model, and p is the total number of predictors in the full model.
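A sketch (simulated data; x1, x2, x3 are hypothetical predictors) of a partial F-test via anova(), comparing a reduced model to the full model.

```
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)      # omits q = 2 predictors (x2 and x3)
anova(reduced, full)       # reports the partial F-statistic and its p-value
```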
76
How do we interpret the partial F-test result?
A large F (with a small p-value) indicates that dropping the q predictors increases the residual error enough to conclude those predictors matter. If F is near 1, there's little evidence that the omitted predictors improve the model.
77
What do the individual t-tests check in MLR?
They test whether each coefficient βⱼ is significantly different from zero, holding the other predictors constant.
78
What does p-value mean in the context of each predictor’s t-test?
It is the probability, under the null hypothesis (βⱼ = 0), of observing a test statistic at least as extreme as the one computed from the data.
79
Why do we look at the overall F-statistic rather than just individual t-tests?
Because with many predictors (p large), some t-tests may be significant by chance (false positives). The F-statistic adjusts for the number of predictors, so the probability of incorrectly rejecting H₀ remains at the chosen significance level (e.g. 5%), regardless of how many predictors there are.
80
When can the usual F-statistic not be used in multiple linear regression?
If the number of predictors (p) exceeds the number of observations (n), we cannot fit the model using ordinary least squares — there are not enough degrees of freedom — and thus we cannot perform the usual F-test. Specialized high-dimensional methods are required instead.
81
What are the three classical approaches for variable selection in MLR?
1. Forward selection 2. Backward selection 3. Mixed selection
82
What is forward selection?
A stepwise approach that starts with the null model (only an intercept), then adds one predictor at a time — whichever reduces RSS the most — until a stopping criterion is reached.
83
What is backward selection?
A stepwise approach that begins with all predictors in the model and removes the predictor with the largest p-value at each step, continuing until a stopping criterion is met.
84
What is mixed (stepwise) selection?
A combination of forward and backward selection. Start with no predictors, adding the best one at a time (like forward), but also remove any predictors that have become insignificant (like backward), iterating until no more improvements can be made.
85
What does 'controlling for other variables' mean?
In MLR, the coefficient βⱼ reflects the effect of Xⱼ on Y after accounting for (holding constant) all other included predictors.
86
How do we interpret a negative coefficient for a predictor in MLR?
It indicates that, after holding other predictors constant, increases in that predictor are associated with a decrease in the response.
87
When do we reject the null hypothesis in the overall F-test?
If the F-statistic is sufficiently large (or the corresponding p-value is sufficiently small), indicating that at least one predictor is significantly related to Y.
88
What is the difference between the overall F-test and individual t-tests?
The F-test checks if any predictor is relevant, while t-tests check if each specific predictor’s coefficient differs from zero, given the others in the model.
89
What two primary metrics are used to assess model fit in multiple linear regression?
The Residual Standard Error (RSE) and R² (proportion of variance explained).
90
Why does R² always increase (or stay the same) when new predictors are added?
Because adding predictors can only reduce (or leave unchanged) the Residual Sum of Squares (RSS), thereby increasing R²—even if those predictors are only weakly related to the response.
91
How can adding a predictor sometimes increase RSE even though RSS decreases?
RSE = √[RSS / (n - p - 1)]. Adding a predictor lowers RSS but also lowers the denominator n - p - 1; if the drop in RSS is too small to offset the smaller denominator, the ratio RSS / (n - p - 1), and hence RSE, increases.
92
What does the 3D plot of TV, radio, and sales suggest?
It indicates a non-linear pattern in the residuals, implying that a simple linear model may underestimate sales in certain regions (e.g., where budgets are split), suggesting possible interaction or synergy between TV and radio advertising.
93
What is meant by a 'synergy' or 'interaction effect' between predictors?
An effect in which combining multiple predictors (e.g., TV and radio advertising) yields a greater (or different) impact on the response than the sum of their individual effects alone.
94
What are the three main sources of uncertainty in multiple regression predictions?
1) Uncertainty in the coefficient estimates (reducible error). 2) Model bias if the linear form is not exactly correct. 3) Irreducible error due to random variation in the outcome.
95
How does a confidence interval differ from a prediction interval in MLR?
A confidence interval targets the average response for given predictor values, while a prediction interval encompasses the possible range for a single future observation. Prediction intervals are always wider.
96
Why do we call the linear model an approximation of reality?
Real relationships can be more complex or nonlinear. The chosen linear form introduces 'model bias' if it doesn't capture all the underlying structure.
97
Why is the prediction interval wider than the confidence interval?
Because the prediction interval includes both uncertainty in estimating the mean response (reducible error) and the additional variability of an individual outcome (irreducible error).
98
How do confidence intervals for βⱼ in MLR differ from simple linear regression?
They use the same logic (estimate ± critical value × SE), but the SE takes into account correlations among predictors and the degrees of freedom for MLR.
99
What is multicollinearity in MLR?
A situation where two or more predictors are highly correlated, making it difficult to distinguish their individual effects on the response.
100
Why is multicollinearity problematic?
It inflates the variance of the coefficient estimates, leading to unstable estimates and wider confidence intervals (less precision).
101
What is the role of qualitative (categorical) predictors in MLR?
They are included via dummy (indicator) variables that take on values 0/1, allowing the model to estimate different intercepts for each category.
102
What is one-hot encoding?
A method for handling qualitative (categorical) predictors by creating separate indicator (dummy) variables for each category, each taking values of 0 or 1.
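A tiny sketch of R's default dummy coding for a factor; the region values here are hypothetical.

```
region <- factor(c("East", "South", "West", "East"))
model.matrix(~ region)   # one dummy per non-baseline level; East is the baseline
```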
103
Given that y_i is the credit card balance for an individual and region is dummy-coded with East as the baseline, what can β_0 be interpreted as?
The average credit card balance for individuals from the East.
104
Given that y_i is the credit card balance, what can β_1 be interpreted as?
The difference in the average balance between people from the South versus the East.
105
Given that y_i is the credit card balance, what can β_2 be interpreted as?
The difference in the average balance between those from the West versus the East.
106
Why is there always one fewer dummy variable than the number of levels for a categorical variable?
Because one category serves as the 'baseline' (or reference), and the remaining categories each get a dummy variable (1/0). Having a dummy for every category would cause perfect multicollinearity.
107
How can we test whether a categorical variable with multiple levels (e.g., region) is related to the response?
We perform an F-test of the joint hypothesis that all corresponding dummy coefficients (e.g., β₁ = β₂ = 0) are zero. This test does not depend on which category is chosen as the baseline.
108
What is the additive assumption in linear models?
It states that each predictor's effect on the response is independent of the other predictors, so the impact of one predictor does not change based on the value of another predictor.
109
What is an interaction term in MLR?
An additional predictor created by multiplying two predictors (e.g., X₁ × X₂), allowing the effect of one predictor to depend on the level of another.
110
What does it mean for an interaction term to be significant? In other words, how do you interpret a significant interaction term?
If an interaction is significant, it means the relationship between one predictor and Y changes depending on the value of another predictor.
111
What is a 'main effect' in a regression model?
It's the direct effect of a single predictor on the response, not accounting for interaction terms with other predictors.
112
What is the hierarchical principle in regression?
It states that if an interaction (or higher-order) term is included in a model, then the corresponding lower-order (main) effects should also be included, even if they appear statistically insignificant.
113
Why follow the hierarchical principle?
Because including an interaction X₁×X₂ without the main effects X₁ and X₂ confounds the interpretation. The interaction term can absorb the baseline effect of X₁ or X₂, and its coefficient becomes misleading. Keeping main effects clarifies the unique impact of the interaction.
114
What happens when we add an interaction between a qualitative variable (e.g., student status) and a quantitative variable (e.g., income)?
It allows each group (e.g., students vs. non-students) to have not only its own intercept but also its own slope with respect to the quantitative variable, rather than forcing parallel lines.
115
What does the model look like for an interaction between one binary dummy (student) and a numeric predictor (income)? Response variable is balance.
balance = β₀ + β₁ × income + β₂ × student + β₃ × (income × student). For students: (β₀ + β₂) + (β₁ + β₃) × income; for non-students: β₀ + β₁ × income.
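A sketch of fitting this model in R, assuming the Credit data from the ISLR2 (or ISLR) package with columns Balance, Income, and Student.

```
library(ISLR2)                                       # assumes ISLR2 is installed
fit <- lm(Balance ~ Income * Student, data = Credit) # main effects + interaction
coef(fit)   # Income:StudentYes is the difference in income slopes for students
```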
116
What is the 'linearity assumption' in linear regression?
The change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj.
117
What is polynomial regression?
A way to capture non-linear relationships by including polynomial terms (e.g., X², X³) of a predictor in a linear model. The model remains 'linear' in parameters, but can curve with respect to X.
118
Why might we add a squared (X²) term to a regression?
If the data suggests a curved (non-linear) relationship between X and Y, including X² can significantly improve the fit by allowing the model to bend.
119
Does adding polynomial terms still produce a linear model?
Yes. Even though the predictors include X² or X³, the model is linear in the coefficients (β’s), so standard linear regression software can fit it.
120
What potential pitfall arises from adding too many polynomial terms?
The model can become overly 'wiggly' and may overfit the data, adding complexity without a genuine improvement in predictive or explanatory power.
121
What are the most common problems when fitting a linear regression model?
1. Non-linearity of the response-predictor relationships 2. Correlation of error terms 3. Non-constant variance of error terms 4. Outliers 5. High-leverage points 6. Collinearity
122
# Most common problems when fitting linear regression model What is the non-linearity issue?
Linear regression assumes a straight-line relationship between predictors and the response. If the true relationship is curved (non-linear), the model’s predictions and inferences can be inaccurate.
123
What is a residual plot?
* For a simple linear regression model, a plot of the residuals (eᵢ = yᵢ − ŷᵢ) versus the predictor xᵢ. * For a multiple regression model, a plot of the residuals versus the predicted/fitted values ŷᵢ.
124
# Most common problems when fitting linear regression model How can we detect non-linearity in a regression model?
By examining residual plots. If a clear pattern (e.g., a U-shape) appears in the residuals versus fitted values, that suggests the linear model is missing a non-linear component.
125
# Most common problems when fitting linear regression model What steps can be taken if we detect non-linearity?
One simple approach is to include polynomial (e.g., X²) or other transformations (e.g., log X, √X) of the predictors. More advanced non-linear models can also be used.
126
# Most common problems when fitting linear regression model What does 'correlation of error terms' in linear regression mean?
It means the residuals are not independent—there is some systematic relationship among them, often seen in time series or clustered data.
127
# Most common problems when fitting linear regression model Why is correlated error structure a problem?
Because many standard tests (e.g., t-tests, F-tests) assume independent errors. Correlated errors cause the estimated standard errors to understate the true ones, so confidence intervals are too narrow and p-values too small.
128
# Most common problems when fitting linear regression model How can we detect correlated errors?
By plotting residuals against time or their lagged values, or using specific tests like the Durbin–Watson test for autocorrelation.
129
# Most common problems when fitting linear regression model What is meant by 'non-constant variance' of error terms?
Also called heteroscedasticity, it occurs when the spread (variance) of the residuals changes for different fitted values, violating the usual linear model assumption that Var(εᵢ) = σ².
130
# Most common problems when fitting linear regression model How can we detect heteroscedasticity?
By examining residual plots: if residuals increase or decrease systematically with fitted values (e.g., funnel shape), that suggests non-constant variance.
131
# Most common problems when fitting linear regression model What are common remedies for heteroscedasticity?
Transform the response using a concave function like log(Y) or √Y, or use weighted least squares, which gives lower weight to observations with higher variance.
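A sketch of weighted least squares with lm() on simulated heteroscedastic data; the inverse-variance weights (1/x²) are an assumption tied to how the noise is generated here.

```
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)      # error SD grows with x
fit_wls <- lm(y ~ x, weights = 1 / x^2)  # weights proportional to 1/Var(eps_i)
coef(fit_wls)
```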
132
# Most common problems when fitting linear regression model Why is non-constant variance problematic?
Standard errors, confidence intervals, and p-values from ordinary least squares become unreliable if the variance of the errors is not constant.
133
# Most common problems when fitting linear regression model What is an outlier in linear regression?
A data point whose observed value is far from the value predicted by the model, resulting in a large residual compared to other observations.
134
# Most common problems when fitting linear regression model Why can an outlier be problematic even if it doesn't dramatically change the slope?
Because a single extreme point can inflate the Residual Standard Error (RSE) and affect confidence intervals and p-values, potentially distorting inferences about the model.
135
# Most common problems when fitting linear regression model How do we identify outliers?
By examining residual plots or studentized residuals. A studentized residual (residual divided by its estimated standard error) greater than about ±3 is often considered outlying.
136
# Most common problems when fitting linear regression model What are possible actions if an outlier is identified?
1) Check for data-entry errors or measurement anomalies. 2) Remove it if it’s clearly erroneous. 3) Keep it if it’s a valid data point and consider whether it indicates missing predictors or model mis-specification.
137
# Most common problems when fitting linear regression model What are high-leverage points in linear regression?
Observations whose predictor values (X’s) are unusual or far from the bulk of the data. They can have a large influence on the fitted model, even if their residuals aren’t large.
138
# Most common problems when fitting linear regression model Why are high-leverage points potentially problematic?
Because they can disproportionately affect the regression coefficients. A single high-leverage observation can pull the fitted line or plane toward itself, distorting results.
139
# Most common problems when fitting linear regression model Are high-leverage points always outliers?
No. A high-leverage point can have a small residual if the model is forced to pass near it. Conversely, an outlier has a large residual but might not have unusual X-values.
140
# Most common problems when fitting linear regression model How can we detect high-leverage points?
By calculating leverage scores. Observations with hᵢ significantly larger than the average leverage (p+1)/n are considered high leverage.
141
# Most common problems when fitting linear regression model Equation: Leverage statistic (hᵢ) for simple linear regression
hᵢ = 1/n + ( (xᵢ - x̄)² / Σᵢ(xᵢ - x̄)² )
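A sketch (simulated, hypothetical data) checking the simple-regression leverage formula against R's hatvalues().

```
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)
fit <- lm(y ~ x)

h_manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit)), h_manual)   # TRUE: formula matches hatvalues()
```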
142
# Most common problems when fitting linear regression model Why is it important to check both residuals and leverage?
Because outliers are identified via large residuals, while high-leverage points have unusual predictor values. A data point can be high leverage, an outlier, both, or neither.
143
# Most common problems when fitting linear regression model What is collinearity (multicollinearity) in linear regression?
It refers to predictors that are highly correlated with each other, making it hard to determine their individual effects on the response.
144
# Most common problems when fitting linear regression model Why is collinearity problematic?
Because it inflates the standard errors of the coefficient estimates, potentially making significant predictors appear insignificant and leading to unstable estimates.
145
# Most common problems when fitting linear regression model How can we detect collinearity?
By examining the correlation matrix among predictors or by calculating Variance Inflation Factors (VIF). Large VIF values (e.g., > 5 or 10) suggest serious multicollinearity.
146
# Most common problems when fitting linear regression model What is the Variance Inflation Factor (VIF)?
A measure of how much the variance of a coefficient is inflated due to collinearity with other predictors.
147
# Most common problems when fitting linear regression model Equation: Variance Inflation Factor (VIF)
VIFᵢ = 1 / (1 - Rᵢ²), where Rᵢ² is the R² from regressing predictor i on the other predictors.
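A sketch (simulated, deliberately correlated predictors) computing a VIF directly from the formula; card 177 shows the car::vif() shortcut.

```
set.seed(1)
x2 <- rnorm(100); x3 <- rnorm(100)
x1 <- 0.8 * x2 + 0.3 * x3 + rnorm(100, sd = 0.5)   # x1 correlated with x2, x3

r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared   # R^2 of x1 regressed on the others
1 / (1 - r2_1)                                # VIF for x1
```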
148
# Most common problems when fitting linear regression model What strategies can address collinearity?
Drop one of the problematic (highly correlated) predictors, or combine the collinear variables into a single predictor (e.g., an average or other composite).
149
# Most common problems when fitting linear regression model Does collinearity always ruin the model?
Not necessarily. The model can still predict well, but interpreting individual coefficients becomes difficult if their estimates have large standard errors due to high collinearity.
150
What method(s) can you use to answer the question: 'Is there a relationship between sales and advertising budget?'
Fit a multiple regression model of sales on the advertising media and use the overall F-test of H₀: all slope coefficients equal zero to check whether at least one medium is related to sales.
151
What method(s) can you use to answer the question: 'How strong is the relationship (between sales and advertising budget)?'
Look at the Residual Standard Error (RSE) to gauge the average prediction error, and the R² statistic to see what fraction of the variance in sales is explained by the advertising budget. A lower RSE and higher R² both indicate a stronger relationship.
152
What method(s) can you use to answer the question: 'Which media are associated with sales?'
Fit a multiple linear regression model including all media as predictors, then check each predictor’s t-statistic and p-value. Predictors with low p-values are significantly related to sales.
153
What method(s) can you use to answer the question: 'How large is the association between each medium and sales?'
Construct confidence intervals for each medium’s regression coefficient (βᵢ) in a multiple linear regression model. The size and position of these intervals relative to zero indicate how large (and significant) each medium’s effect is.
154
What method(s) can you use to answer the question: 'How accurately can we predict future sales?'
Use the fitted regression model to generate either a confidence interval for the mean response (if predicting the average) or a prediction interval (if predicting an individual outcome). Prediction intervals are wider because they account for the irreducible error term.
155
What method(s) can you use to answer the question: 'Is the relationship linear?'
Create and inspect residual plots to see if there is a systematic pattern (indicating non-linearity). If a pattern emerges, consider adding polynomial or transformed predictors to handle non-linear effects.
156
What method(s) can you use to answer the question: 'Is there synergy among the advertising media?'
Include an interaction term (e.g., TV × radio) in a multiple regression model, then check if the coefficient (and its p-value) is significant. A significant interaction term suggests synergy among the media.
157
What is the difference between parametric and non-parametric methods in regression?
Parametric methods (like linear regression) assume a functional form for f(X), with a fixed number of parameters. Non-parametric methods (like K-Nearest Neighbors) do not assume a specific form.
158
What is K-Nearest Neighbors (KNN) regression?
A non-parametric technique that predicts a new observation’s response by averaging the responses of its K closest training points in predictor space.
159
How is the prediction f̂(x₀) computed in KNN regression?
f̂(x₀) = (1/K) Σ_{xᵢ ∈ N₀} yᵢ, where N₀ is the set of the K training observations closest to x₀.
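A minimal, purely illustrative sketch of KNN regression for one predictor and a single query point x0 (not an optimized or package implementation; the data are simulated).

```
knn_predict <- function(x, y, x0, K) {
  nbrs <- order(abs(x - x0))[1:K]   # indices of the K training points closest to x0
  mean(y[nbrs])                     # average their responses
}

set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)
knn_predict(x, y, x0 = 0.5, K = 5)
```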
160
How does K affect the bias-variance trade-off in KNN?
A small K yields more flexible fits (low bias, but high variance). A large K yields smoother fits (higher bias, but lower variance), diluting local idiosyncrasies.
161
In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression?
The parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f.
162
When is it helpful to use KNN rather than a linear model?
When the true relationship is highly non-linear or too complex for a simple parametric form. KNN can adapt more flexibly to such data if enough observations are available.
163
What is the main advantage of KNN regression over linear regression?
It can capture complex, non-linear relationships without specifying a model form. Linear regression may miss these if the linear (or polynomial) form is too restrictive.
164
What are 2 disadvantages of KNN regression compared to linear regression?
* KNN can underperform in higher-dimensional problems due to the curse of dimensionality. As the number of predictors grows, data become sparse and points end up far from each other, making it hard to find truly 'nearby neighbors.' * KNN provides less interpretability — there are no explicit coefficients to explain predictor effects.
165
How do you fit a linear model in R with `medv` as the response and `lstat` as the predictor using the Boston data?
`lm.fit <- lm(medv ~ lstat, data = Boston)`
166
How do you extract the coefficients of a linear regression model in R?
`coef(lm.fit)`
167
How do you calculate confidence intervals for the coefficients of a linear regression model in R?
`confint(lm.fit)`
168
How do you produce confidence intervals for new data in R using a linear regression model?
`predict(lm.fit, newdata, interval = 'confidence')`
169
How do you produce prediction intervals for new data in R using a linear regression model?
`predict(lm.fit, newdata, interval = 'prediction')`
170
How do you plot the data with the linear model fit in R for `lstat` vs. `medv`?
```
plot(lstat, medv)
abline(lm.fit)
```
171
How do you display the standard diagnostic plots for a linear model in R?
```
par(mfrow = c(2, 2))
plot(lm.fit)
```
172
How do you plot residuals vs. fitted values in R?
`plot(predict(lm.fit), residuals(lm.fit))`
173
How do you plot studentized residuals vs. predicted values in R?
`plot(predict(lm.fit), rstudent(lm.fit))`
174
How do you plot leverage statistics in R?
`plot(hatvalues(lm.fit))`
175
How do you fit a linear model in R with `medv` as the response and `lstat` and `age` as predictors using the Boston data?
`lm.fit <- lm(medv ~ lstat + age, data = Boston)`
176
How do you fit a linear model in R with `medv` as the response and all other variables as predictors using the Boston data?
`lm.fit <- lm(medv ~ ., data = Boston)`
177
How do you calculate the variance inflation factor (VIF) in R?
```
library(car)
vif(lm.fit)
```
178
How do you fit a linear model in R with `medv` as the response and all other variables except `age` as predictors using the Boston data?
`lm.fit1 <- lm(medv ~ . - age, data = Boston)`
179
How do you modify an existing R model using the `update()` function?
Use `update()` with a new formula that references the old formula. For example, `update(lm.fit, ~ . - age)` removes the `age` predictor while keeping all other terms.
180
# Example interaction between `lstat` and `age` What does the colon (:) syntax do for interactions in R?
Using `lstat:age` includes only the interaction term between `lstat` and `age` (no main effects).
181
# Example interaction between `lstat` and `age` What does the star (*) syntax do for interactions in R?
Using `lstat * age` expands to `lstat + age + lstat:age`, meaning it includes both main effects and the interaction.
182
How do you fit a linear model in R with `medv` as the response and `lstat` as the predictor, with a quadratic `lstat` term, using the Boston data?
`lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)`
183
What is the purpose of the `anova()` function when comparing nested linear models in R?
It performs a hypothesis test to see if the more complex model significantly improves the fit compared to the simpler (nested) model.
184
How do you compare two nested models with `anova()` in R?
Call `anova(model1, model2)` where `model1` is the simpler model and `model2` is the extended model. The function returns an F-statistic and p-value for the comparison.
185
How do you include higher-order polynomial terms in a linear model in R without manually specifying each power?
Use the `poly()` function. For example: `lm(y ~ poly(x, 5))` fits a 5th-order polynomial in x.
186
What is the difference between `poly(x, 3)` and `poly(x, 3, raw = TRUE)`?
`poly(x, 3)` uses orthogonal polynomials (less correlation, more stable estimates), while `raw = TRUE` produces raw powers of x (x, x², x³). Both yield the same fitted values but have different coefficient estimates.
187
How do you fit a linear model in R with `medv` as the response and the logarithm of `rm` as the predictor using the Boston data?
`lm.fit <- lm(medv ~ log(rm), data = Boston)`
188
How does R handle qualitative variables in a linear regression model by default?
R automatically creates dummy variables for each factor level (except the baseline), allowing regression coefficients to compare each category to the baseline.
189
What does the `contrasts()` function do in R?
It shows (and can set) the coding scheme for factor variables (i.e., how factor levels map to dummy variables). For example, `contrasts(ShelveLoc)` displays the dummy coding.