Linear Regression Flashcards

1
Q

What does ‘strength of a relationship’ in regression refer to?

A

It is an indication of how well one can predict the response variable (e.g., sales) from the predictor (e.g., advertising budget). A strong relationship implies high predictive accuracy, whereas a weak relationship implies a prediction only slightly better than random guessing.

2
Q

Definition: Simple Linear Regression

A

A linear model with one predictor (X) used to predict an outcome (Y).

3
Q

What does the symbol ‘≈’ mean in a regression/statistical context?

A

It can be read as ‘is approximately modeled as,’ indicating an approximate relationship rather than an exact equality.

4
Q

sales ≈ β0 + β1 × TV

What does β0 represent in this equation?

A

Intercept

5
Q

Definition: Intercept (β0)

A

Represents the predicted value of Y when X=0.

6
Q

sales ≈ β0 + β1 × TV

What does β1 represent in this equation?

A

Slope

7
Q

Definition: Slope (β1)

A

Represents the average change in Y for a one-unit increase in X.

8
Q

sales ≈ β0 + β1 × TV

What are terms used to refer to β0 and β1 collectively?

A
  • Coefficients
  • Parameters
9
Q

Ordinary Least Squares (OLS) Estimation

A

A method to estimate β0 and β1 by minimizing the sum of squared residuals.

10
Q

Residual (ε)

A

eᵢ = yᵢ − ŷᵢ

The difference between an observed value (Y) and the model’s fitted value (Ŷ).

11
Q

Residual sum of squares (RSS) equation

Include simple form and full form

A

Simple form: RSS = e₁² + e₂² + ⋯ + eₙ², where eᵢ = yᵢ − ŷᵢ. Full form: RSS = Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ)².
12
Q

Equation for slope (β1)

A

β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
13
Q

Equation for intercept (β0)

A

β̂₀ = ȳ − β̂₁x̄
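
A minimal R sketch (not part of the original deck) illustrating the formulas in cards 11–13; the variables x and y are simulated, hypothetical data.

```
# Compute the least squares estimates by hand, then compare with lm().
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))  # should match the manual estimates
```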
14
Q

Definition: least squares coefficient estimates

A

They are the intercept and slope estimates chosen to minimize the sum of squared residuals (differences between observed and predicted values), providing the best linear fit to the data under the least squares criterion.

15
Q

Best-Fit Line

A

The fitted linear function Ŷ = β̂₀ + β̂₁X that minimizes the sum of squared residuals.

16
Q

Interpretation of β1 in Linear Regression

Hint: Consider MLR as well as SLR

A

Indicates how much Y is expected to change when X increases by one unit, holding other factors constant (if any).

17
Q

Assumption: Linearity

A

Y is assumed to be linearly related to X.

18
Q

Assumption: Independence of Errors

A

The residuals are assumed to be uncorrelated with one another.

19
Q

Assumption: Exogeneity

A

The error term ε is assumed to be independent of (uncorrelated with) the predictor X, so that E(ε | X) = 0.

20
Q

Assumption: Homoscedasticity

A

The variance of residuals is constant across all values of X.

21
Q

Assumption: Normality of Errors

A

Residuals are assumed to follow a normal distribution (especially important for inference).

22
Q

Definition: Population regression line

A

The population regression line is the true (but typically unknown) underlying linear relationship between X and Y.

23
Q

Definition: least squares line

A

The least squares line is our estimated linear relationship based on a specific sample of data.

24
Q

What is the distinction between the least squares line and population regression line?

A

Different samples yield slightly different least squares lines, but the population regression line remains fixed (and unobserved).

25
What is an unbiased estimator in statistics?
An unbiased estimator is one whose expected value equals the true parameter across many samples, meaning it does not systematically over- or under-estimate the parameter.
26
Are least squares estimates unbiased?
Yes. If we repeatedly draw different samples and compute the least squares estimates, the average of those estimates will equal the true coefficients. Hence they do not systematically over- or under-estimate the true parameters.
27
What is the standard error of an estimator?
It is a measure of the estimator’s variability—how far the estimator (e.g., a sample mean or a regression coefficient) is likely to deviate from the true parameter value on average.
28
Formula for the variance of the sample mean μ̂
Var(μ̂) = SE(μ̂)² = σ² / n
29
# Var(μ̂) = SE(μ̂)² = σ² / n What are the conditions under which this formula for the variance of the sample mean μ̂ holds?
The n observations must be independent and identically distributed (i.i.d.) with finite variance σ².
30
# Var(μ̂) = SE(μ̂)² = σ² / n What does the variance equation for the sample mean μ̂ tell us?
The variability of the sample mean decreases as sample size grows
31
Formula for Var(β̂₀) in simple linear regression
SE(β̂₀)² = σ² * [ 1/n + ( x̄² / Σᵢ (xᵢ - x̄)² ) ]
32
Formula for Var(β̂₁) in simple linear regression
SE(β̂₁)² = σ² / Σᵢ (xᵢ - x̄)²
33
# SE(β̂₁)² = σ² / Σᵢ (xᵢ - x̄)² What does the variance equation for β̂₁ in simple linear regression tell us?
SE(β̂₁) is smaller when the xᵢ are more spread out; intuitively we have more leverage to estimate a slope when this is the case
34
What is a confidence interval?
It is a range of values that, with a specified level of confidence (e.g. 95%), is expected to contain the true (but unknown) parameter.
35
What is the approximate 95% confidence interval for β₁ in simple linear regression? How is it approximate?
β̂₁ ± 2 · SE(β̂₁). (Strictly speaking, we use the t-distribution quantile with n−2 degrees of freedom, but 2 is a close approximation.)
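A quick sketch (simulated, hypothetical data) comparing the ±2·SE approximation with the exact t-based interval that confint() reports.

```
set.seed(1)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
c(est - 2 * se, est + 2 * se)   # approximate 95% CI for beta_1
confint(fit)["x", ]             # exact interval using the t quantile (n - 2 df)
```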
36
What is the null hypothesis (H₀) when testing for a relationship between X and Y?
H₀: β₁ = 0 (no relationship between X and Y).
37
What is the alternative hypothesis (Hₐ) when testing for a relationship between X and Y?
Hₐ: β₁ ≠ 0 (some relationship between X and Y).
38
What distribution does the test statistic, (β̂₁ - 0) / SE(β̂₁), follow when testing for a relationship between X and Y?
t-distribution with n - 2 degrees of freedom.
39
How do we compute the t-statistic for β₁ when testing for a relationship between X and Y?
t = (β̂₁ - 0) / SE(β̂₁), which measures how many standard deviations β̂₁ is away from zero.
40
What does the p-value represent in the context of testing for a relationship between X and Y?
It is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A small p-value suggests that β₁ ≠ 0.
41
How do we typically decide to reject H₀?
If the p-value is below a chosen significance level (e.g., 0.05), we reject H₀ and conclude there is likely a relationship between X and Y.
42
Using Linear Models for Inference
We can make statements about the relationship between X and Y (e.g., whether β₁ ≠ 0) based on statistical tests.
43
Definition: Residual Standard Error (RSE)
An estimate of the standard deviation of the error terms in a regression model, measuring how far observed values typically deviate from the true regression line.
44
Formula: RSE | Simple linear regression
RSE = sqrt( (1/(n - 2)) * Σ( yᵢ - ŷᵢ )² ), where n is the number of observations.
45
Definition: Total Sum of Squares (TSS)
Represents the total variability in the response variable Y before regression.
46
Formula: TSS
TSS = Σ( yᵢ - ȳ )²
47
# Include equation Definition: Residual Sum of Squares (RSS)
RSS = Σ( yᵢ - ŷᵢ )². It measures the variability in Y left unexplained by the regression model.
48
Definition: R² Statistic
R² measures the proportion of variability in Y that is explained by the model; it always lies between 0 and 1.
49
Formula: R²
R² = 1 - (RSS / TSS). It compares unexplained variability to total variability in the data.
50
Why might we use R² instead of RSE?
R² is a scale-free measure of the proportion of variance in the response explained by the model, always lying between 0 and 1. RSE, in contrast, is on the scale of Y and can be harder to interpret across different contexts.
51
What is considered a 'good' R²?
It depends on the context and field of application. In some physical sciences, values near 1 might be realistic. In many social or biological settings, much lower R² values (e.g. 0.1 or 0.2) may still be considered informative.
52
Definition: Correlation between X and Y | Include a verbal explanation of how it's computed
A measure of the linear relationship between X and Y, computed as the covariance of X and Y divided by the product of their standard deviations.
53
Formula: Correlation between X and Y
Cor(X,Y) = (∑(xᵢ - x̄)(yᵢ - ȳ)) / √[∑(xᵢ - x̄)² * ∑(yᵢ - ȳ)²]
54
Relationship: R² and correlation in simple linear regression
In a simple linear regression with one predictor, R² equals the square of the correlation between X and Y.
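A small illustrative check (simulated, hypothetical data) that R² equals the squared correlation in simple linear regression.

```
set.seed(1)
x <- rnorm(100)
y <- 2 - x + rnorm(100)
fit <- lm(y ~ x)

summary(fit)$r.squared   # R^2 from the fitted model
cor(x, y)^2              # squared correlation; identical in SLR
```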
55
Definition: F-Statistic (for regression)
A ratio of explained variance to unexplained variance, used to test whether at least one predictor is significantly related to the response.
56
What is Multiple Linear Regression (MLR)?
A statistical technique for modeling the relationship between one response (dependent) variable and multiple predictor (independent) variables, using a linear function.
57
What is the general MLR model equation?
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε, where Y is the response, Xᵢ are predictors, βᵢ are unknown coefficients, and ε is the error term.
58
How are the coefficients in MLR typically estimated?
By minimizing the Residual Sum of Squares (RSS) = Σ(yᵢ - ŷᵢ)², where ŷᵢ is the model’s predicted value for observation i.
59
# Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε What does βⱼ represent in MLR?
βⱼ represents the average change in the response Y for a one-unit change in Xⱼ, holding all other predictors constant.
60
Why might multiple predictors be used instead of just one?
Including additional relevant predictors often improves predictions and reveals more nuanced relationships, controlling for the effects of other variables.
61
Why might a predictor appear significant when analyzed alone but not in a multiple regression?
Because in simple regression we do not control for the effects of other predictors. Once we include additional variables, the apparent significance can disappear if the predictor’s effect was actually due to correlation with those other predictors.
62
What are some important questions we may seek to answer with MLR? | 4 questions
1. Is at least one of the predictors X₁, X₂, ..., Xₚ useful in predicting the response? 2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? 3. How well does the model fit the data? 4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
63
Definition: RSE in MLR
An estimate of the standard deviation of the error terms, measuring how far observed values typically deviate from the fitted regression hyperplane. It quantifies the average unexplained variability per observation.
64
Formula: RSE in MLR
RSE = √[ RSS / (n - p - 1) ], where p is the number of predictors and n is the sample size
65
Definition: Multiple R²
It is the proportion of variability in the response Y explained by the model. It ranges from 0 to 1.
66
Formula: Multiple R²
R² = 1 - (RSS / TSS)
67
Why might we use Adjusted R² instead of R² in MLR?
R² can increase by simply adding more predictors, even if they are only marginally useful. Adjusted R² penalizes for extra predictors, preventing misleadingly high R² values.
68
Formula: Adjusted R²
Adjusted R² = 1 - [ (RSS / (n - p - 1)) / (TSS / (n - 1)) ]
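A sketch (simulated data; the predictors in X are hypothetical) computing adjusted R² from the formula and comparing it with the value summary() reports.

```
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] + rnorm(n)
fit <- lm(y ~ X)

rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
1 - (rss / (n - p - 1)) / (tss / (n - 1))   # adjusted R^2 from the formula
summary(fit)$adj.r.squared                  # value reported by summary()
```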
69
What is the overall F-test in MLR?
A hypothesis test checking whether at least one of the predictors has a non-zero coefficient. H₀: all βⱼ = 0 vs. Hₐ: at least one βⱼ ≠ 0.
70
# F = [ (TSS - RSS)/p ] / [ RSS/(n - p - 1) ] What does F-statistic in MLR tell you?
A large F suggests that the model with predictors explains significantly more variance than a model with no predictors.
71
Formula: F-statistic in MLR
F = [ (TSS - RSS)/p ] / [ RSS/(n - p - 1) ]
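A sketch (simulated, hypothetical data) computing the overall F-statistic from TSS and RSS and checking it against summary().

```
set.seed(1)
n <- 100; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
((tss - rss) / p) / (rss / (n - p - 1))   # F-statistic from the formula
summary(fit)$fstatistic["value"]          # F-statistic reported by summary()
```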
72
What distribution does the F-statistic follow in multiple linear regression?
Under the classical assumptions (normal errors, etc.), the F-statistic follows an F-distribution with p and n−p−1 degrees of freedom.
73
The F-statistic in multiple linear regression follows an F distribution under the assumption that the errors ε_i have a normal distribution. Does this still hold if the errors are not perfectly normal?
Yes, if the sample size n is sufficiently large, the F-statistic is approximately F-distributed due to asymptotic robustness, even if the errors deviate from normality.
74
What is a partial F-test in multiple linear regression?
It compares a 'full' model (with all predictors) to a 'reduced' model (omitting a subset of q predictors), determining whether those q predictors significantly improve the fit.
75
What is the formula for the partial F-statistic?
F = [ (RSS₀ - RSS) / q ] / [ RSS / (n - p - 1) ], where RSS₀ is the RSS of the reduced model, RSS is the RSS of the full model, and p is the total number of predictors in the full model.
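A sketch (simulated data; x1, x2, x3 are hypothetical predictors) of a partial F-test via anova(), comparing a reduced model to the full model.

```
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)      # omits q = 2 predictors (x2 and x3)
anova(reduced, full)       # reports the partial F-statistic and its p-value
```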
76
How do we interpret the partial F-test result?
A large F (with a small p-value) indicates that dropping the q predictors increases the residual error enough to conclude those predictors matter. If F is near 1, there's little evidence that the omitted predictors improve the model.
77
What do the individual t-tests check in MLR?
They test whether each coefficient βⱼ is significantly different from zero, holding the other predictors constant.
78
What does p-value mean in the context of each predictor’s t-test?
It is the probability, under the null hypothesis (βⱼ = 0), of observing a test statistic at least as extreme as the one computed from the data.
79
Why do we look at the overall F-statistic rather than just individual t-tests?
Because with many predictors (p large), some t-tests may be significant by chance (false positives). The F-statistic adjusts for the number of predictors, so the probability of incorrectly rejecting H₀ remains at the chosen significance level (e.g. 5%), regardless of how many predictors there are.
80
When can the usual F-statistic not be used in multiple linear regression?
If the number of predictors (p) exceeds the number of observations (n), we cannot fit the model using ordinary least squares — there are not enough degrees of freedom — and thus we cannot perform the usual F-test. Specialized high-dimensional methods are required instead.
81
What are the three classical approaches for variable selection in MLR?
1. Forward selection 2. Backward selection 3. Mixed selection
82
What is forward selection?
A stepwise approach that starts with the null model (only an intercept), then adds one predictor at a time — whichever reduces RSS the most — until a stopping criterion is reached.
83
What is backward selection?
A stepwise approach that begins with all predictors in the model and removes the predictor with the largest p-value at each step, continuing until a stopping criterion is met.
84
What is mixed (stepwise) selection?
A combination of forward and backward selection. Start with no predictors, adding the best one at a time (like forward), but also remove any predictors that have become insignificant (like backward), iterating until no more improvements can be made.
85
What does 'controlling for other variables' mean?
In MLR, the coefficient βⱼ reflects the effect of Xⱼ on Y after accounting for (holding constant) all other included predictors.
86
How do we interpret a negative coefficient for a predictor in MLR?
It indicates that, after holding other predictors constant, increases in that predictor are associated with a decrease in the response.
87
When do we reject the null hypothesis in the overall F-test?
If the F-statistic is sufficiently large (or the corresponding p-value is sufficiently small), indicating that at least one predictor is significantly related to Y.
88
What is the difference between the overall F-test and individual t-tests?
The F-test checks if any predictor is relevant, while t-tests check if each specific predictor’s coefficient differs from zero, given the others in the model.
89
What two primary metrics are used to assess model fit in multiple linear regression?
The Residual Standard Error (RSE) and R² (proportion of variance explained).
90
Why does R² always increase (or stay the same) when new predictors are added?
Because adding predictors can only reduce (or leave unchanged) the Residual Sum of Squares (RSS), thereby increasing R²—even if those predictors are only weakly related to the response.
91
How can adding a predictor sometimes increase RSE even though RSS decreases?
RSE = √[RSS / (n - p - 1)]. Adding a predictor lowers RSS but also lowers the denominator n - p - 1; if the drop in RSS is too small to offset the smaller denominator, the ratio RSS / (n - p - 1), and hence RSE, increases.
92
What does the 3D plot of TV, radio, and sales suggest?
It indicates a non-linear pattern in the residuals, implying that a simple linear model may underestimate sales in certain regions (e.g., where budgets are split), suggesting possible interaction or synergy between TV and radio advertising.
93
What is meant by a 'synergy' or 'interaction effect' between predictors?
An effect in which combining multiple predictors (e.g., TV and radio advertising) yields a greater (or different) impact on the response than the sum of their individual effects alone.
94
What are the three main sources of uncertainty in multiple regression predictions?
1) Uncertainty in the coefficient estimates (reducible error). 2) Model bias if the linear form is not exactly correct. 3) Irreducible error due to random variation in the outcome.
95
How does a confidence interval differ from a prediction interval in MLR?
A confidence interval targets the average response for given predictor values, while a prediction interval encompasses the possible range for a single future observation. Prediction intervals are always wider.
96
Why do we call the linear model an approximation of reality?
Real relationships can be more complex or nonlinear. The chosen linear form introduces 'model bias' if it doesn't capture all the underlying structure.
97
Why is the prediction interval wider than the confidence interval?
Because the prediction interval includes both uncertainty in estimating the mean response (reducible error) and the additional variability of an individual outcome (irreducible error).
98
How do confidence intervals for βⱼ in MLR differ from simple linear regression?
They use the same logic (estimate ± critical value × SE), but the SE takes into account correlations among predictors and the degrees of freedom for MLR.
99
What is multicollinearity in MLR?
A situation where two or more predictors are highly correlated, making it difficult to distinguish their individual effects on the response.
100
Why is multicollinearity problematic?
It inflates the variance of the coefficient estimates, leading to unstable estimates and wider confidence intervals (less precision).
101
What is the role of qualitative (categorical) predictors in MLR?
They are included via dummy (indicator) variables that take on values 0/1, allowing the model to estimate different intercepts for each category.
102
What is one-hot encoding?
A method for handling qualitative (categorical) predictors by creating separate indicator (dummy) variables for each category, each taking values of 0 or 1.
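A tiny sketch of R's default dummy coding for a factor; the region values here are hypothetical.

```
region <- factor(c("East", "South", "West", "East"))
model.matrix(~ region)   # one dummy per non-baseline level; East is the baseline
```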
103
Given that y_i is the credit card balance for an individual and region is dummy-coded with East as the baseline, what can β_0 be interpreted as?
The average credit card balance for individuals from the East.
104
Given that y_i is the credit card balance, what can β_1 be interpreted as?
The difference in the average balance between people from the South versus the East.
105
Given that y_i is the credit card balance, what can β_2 be interpreted as?
The difference in the average balance between those from the West versus the East.
106
Why is there always one fewer dummy variable than the number of levels for a categorical variable?
Because one category serves as the 'baseline' (or reference), and the remaining categories each get a dummy variable (1/0). Having a dummy for every category would cause perfect multicollinearity.
107
How can we test whether a categorical variable with multiple levels (e.g., region) is related to the response?
We perform an F-test of the joint hypothesis that all corresponding dummy coefficients (e.g., β₁ = β₂ = 0) are zero. This test does not depend on which category is chosen as the baseline.
108
What is the additive assumption in linear models?
It states that each predictor's effect on the response is independent of the other predictors, so the impact of one predictor does not change based on the value of another predictor.
109
What is an interaction term in MLR?
An additional predictor created by multiplying two predictors (e.g., X₁ × X₂), allowing the effect of one predictor to depend on the level of another.
110
What does it mean for an interaction term to be significant? In other words, how do you interpret a significant interaction term?
If an interaction is significant, it means the relationship between one predictor and Y changes depending on the value of another predictor.
111
What is a 'main effect' in a regression model?
It's the direct effect of a single predictor on the response, not accounting for interaction terms with other predictors.
112
What is the hierarchical principle in regression?
It states that if an interaction (or higher-order) term is included in a model, then the corresponding lower-order (main) effects should also be included, even if they appear statistically insignificant.
113
Why follow the hierarchical principle?
Because including an interaction X₁×X₂ without the main effects X₁ and X₂ confounds the interpretation. The interaction term can absorb the baseline effect of X₁ or X₂, and its coefficient becomes misleading. Keeping main effects clarifies the unique impact of the interaction.
114
What happens when we add an interaction between a qualitative variable (e.g., student status) and a quantitative variable (e.g., income)?
It allows each group (e.g., students vs. non-students) to have not only its own intercept but also its own slope with respect to the quantitative variable, rather than forcing parallel lines.
115
What does the model look like for an interaction between one binary dummy (student) and a numeric predictor (income)? Response variable is balance.
balance = β₀ + β₁ × income + β₂ × student + β₃ × (income × student). For students: (β₀ + β₂) + (β₁ + β₃) × income; for non-students: β₀ + β₁ × income.
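A sketch of fitting this model in R, assuming the Credit data from the ISLR2 (or ISLR) package with columns Balance, Income, and Student.

```
library(ISLR2)                                       # assumes ISLR2 is installed
fit <- lm(Balance ~ Income * Student, data = Credit) # main effects + interaction
coef(fit)   # Income:StudentYes is the difference in income slopes for students
```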
116
What is the 'linearity assumption' in linear regression?
The change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj.
117
What is polynomial regression?
A way to capture non-linear relationships by including polynomial terms (e.g., X², X³) of a predictor in a linear model. The model remains 'linear' in parameters, but can curve with respect to X.
118
Why might we add a squared (X²) term to a regression?
If the data suggests a curved (non-linear) relationship between X and Y, including X² can significantly improve the fit by allowing the model to bend.
119
Does adding polynomial terms still produce a linear model?
Yes. Even though the predictors include X² or X³, the model is linear in the coefficients (β’s), so standard linear regression software can fit it.
120
What potential pitfall arises from adding too many polynomial terms?
The model can become overly 'wiggly' and may overfit the data, adding complexity without a genuine improvement in predictive or explanatory power.
121
What are the most common problems when fitting a linear regression model?
1. Non-linearity of the response-predictor relationships 2. Correlation of error terms 3. Non-constant variance of error terms 4. Outliers 5. High-leverage points 6. Collinearity
122
# Most common problems when fitting linear regression model What is the non-linearity issue?
Linear regression assumes a straight-line relationship between predictors and the response. If the true relationship is curved (non-linear), the model’s predictions and inferences can be inaccurate.
123
What is a residual plot?
* For a simple linear regression model, a plot of the residuals (eᵢ = yᵢ − ŷᵢ) versus the predictor xᵢ. * For a multiple regression model, a plot of the residuals versus the predicted/fitted values ŷᵢ.
124
# Most common problems when fitting linear regression model How can we detect non-linearity in a regression model?
By examining residual plots. If a clear pattern (e.g., a U-shape) appears in the residuals versus fitted values, that suggests the linear model is missing a non-linear component.
125
# Most common problems when fitting linear regression model What steps can be taken if we detect non-linearity?
One simple approach is to include polynomial (e.g., X²) or other transformations (e.g., log X, √X) of the predictors. More advanced non-linear models can also be used.
126
# Most common problems when fitting linear regression model What does 'correlation of error terms' in linear regression mean?
It means the residuals are not independent—there is some systematic relationship among them, often seen in time series or clustered data.
127
# Most common problems when fitting linear regression model Why is correlated error structure a problem?
Because many standard tests (e.g., t-tests, F-tests) assume independent errors. Correlated errors cause the estimated standard errors to understate the true ones, so confidence intervals are too narrow and p-values too small.
128
# Most common problems when fitting linear regression model How can we detect correlated errors?
By plotting residuals against time or their lagged values, or using specific tests like the Durbin–Watson test for autocorrelation.
129
# Most common problems when fitting linear regression model What is meant by 'non-constant variance' of error terms?
Also called heteroscedasticity, it occurs when the spread (variance) of the residuals changes for different fitted values, violating the usual linear model assumption that Var(εᵢ) = σ².
130
# Most common problems when fitting linear regression model How can we detect heteroscedasticity?
By examining residual plots: if residuals increase or decrease systematically with fitted values (e.g., funnel shape), that suggests non-constant variance.
131
# Most common problems when fitting linear regression model What are common remedies for heteroscedasticity?
Transform the response using a concave function like log(Y) or √Y, or use weighted least squares, which gives lower weight to observations with higher variance.
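A sketch of weighted least squares with lm() on simulated heteroscedastic data; the inverse-variance weights (1/x²) are an assumption tied to how the noise is generated here.

```
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)      # error SD grows with x
fit_wls <- lm(y ~ x, weights = 1 / x^2)  # weights proportional to 1/Var(eps_i)
coef(fit_wls)
```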
132
# Most common problems when fitting linear regression model Why is non-constant variance problematic?
Standard errors, confidence intervals, and p-values from ordinary least squares become unreliable if the variance of the errors is not constant.
133
# Most common problems when fitting linear regression model What is an outlier in linear regression?
A data point whose observed value is far from the value predicted by the model, resulting in a large residual compared to other observations.
134
# Most common problems when fitting linear regression model Why can an outlier be problematic even if it doesn't dramatically change the slope?
Because a single extreme point can inflate the Residual Standard Error (RSE) and affect confidence intervals and p-values, potentially distorting inferences about the model.
135
# Most common problems when fitting linear regression model How do we identify outliers?
By examining residual plots or studentized residuals. A studentized residual (residual divided by its estimated standard error) greater than about ±3 is often considered outlying.
136
# Most common problems when fitting linear regression model What are possible actions if an outlier is identified?
1) Check for data-entry errors or measurement anomalies. 2) Remove it if it’s clearly erroneous. 3) Keep it if it’s a valid data point and consider whether it indicates missing predictors or model mis-specification.
137
# Most common problems when fitting linear regression model What are high-leverage points in linear regression?
Observations whose predictor values (X’s) are unusual or far from the bulk of the data. They can have a large influence on the fitted model, even if their residuals aren’t large.
138
# Most common problems when fitting linear regression model Why are high-leverage points potentially problematic?
Because they can disproportionately affect the regression coefficients. A single high-leverage observation can pull the fitted line or plane toward itself, distorting results.
139
# Most common problems when fitting linear regression model Are high-leverage points always outliers?
No. A high-leverage point can have a small residual if the model is forced to pass near it. Conversely, an outlier has a large residual but might not have unusual X-values.
140
# Most common problems when fitting linear regression model How can we detect high-leverage points?
By calculating leverage scores. Observations with hᵢ significantly larger than the average leverage (p+1)/n are considered high leverage.
141
# Most common problems when fitting linear regression model Equation: Leverage statistic (hᵢ) for simple linear regression
hᵢ = 1/n + ( (xᵢ - x̄)² / Σᵢ(xᵢ - x̄)² )
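A sketch (simulated, hypothetical data) checking the simple-regression leverage formula against R's hatvalues().

```
set.seed(1)
x <- rnorm(30)
y <- x + rnorm(30)
fit <- lm(y ~ x)

h_manual <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit)), h_manual)   # TRUE: formula matches hatvalues()
```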
142
# Most common problems when fitting linear regression model Why is it important to check both residuals and leverage?
Because outliers are identified via large residuals, while high-leverage points have unusual predictor values. A data point can be high leverage, an outlier, both, or neither.
143
# Most common problems when fitting linear regression model What is collinearity (multicollinearity) in linear regression?
It refers to predictors that are highly correlated with each other, making it hard to determine their individual effects on the response.
144
# Most common problems when fitting linear regression model Why is collinearity problematic?
Because it inflates the standard errors of the coefficient estimates, potentially making significant predictors appear insignificant and leading to unstable estimates.
145
# Most common problems when fitting linear regression model How can we detect collinearity?
By examining the correlation matrix among predictors or by calculating Variance Inflation Factors (VIF). Large VIF values (e.g., > 5 or 10) suggest serious multicollinearity.
146
# Most common problems when fitting linear regression model What is the Variance Inflation Factor (VIF)?
A measure of how much the variance of a coefficient is inflated due to collinearity with other predictors.
147
# Most common problems when fitting linear regression model Equation: Variance Inflation Factor (VIF)
VIFᵢ = 1 / (1 - Rᵢ²), where Rᵢ² is the R² from regressing predictor i on the other predictors.
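A sketch (simulated, deliberately correlated predictors) computing a VIF directly from the formula; card 177 shows the car::vif() shortcut.

```
set.seed(1)
x2 <- rnorm(100); x3 <- rnorm(100)
x1 <- 0.8 * x2 + 0.3 * x3 + rnorm(100, sd = 0.5)   # x1 correlated with x2, x3

r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared   # R^2 of x1 regressed on the others
1 / (1 - r2_1)                                # VIF for x1
```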
148
# Most common problems when fitting linear regression model What strategies can address collinearity?
Drop one of the problematic (highly correlated) predictors, or combine the collinear variables into a single predictor (e.g., an average or other composite).
149
# Most common problems when fitting linear regression model Does collinearity always ruin the model?
Not necessarily. The model can still predict well, but interpreting individual coefficients becomes difficult if their estimates have large standard errors due to high collinearity.
150
What method(s) can you use to answer the question: 'Is there a relationship between sales and advertising budget?'
Fit a multiple regression model of sales on the advertising media and use the overall F-test of H₀: all slope coefficients equal zero to check whether at least one medium is related to sales.
151
What method(s) can you use to answer the question: 'How strong is the relationship (between sales and advertising budget)?'
Look at the Residual Standard Error (RSE) to gauge the average prediction error, and the R² statistic to see what fraction of the variance in sales is explained by the advertising budget. A lower RSE and higher R² both indicate a stronger relationship.
152
What method(s) can you use to answer the question: 'Which media are associated with sales?'
Fit a multiple linear regression model including all media as predictors, then check each predictor’s t-statistic and p-value. Predictors with low p-values are significantly related to sales.
153
What method(s) can you use to answer the question: 'How large is the association between each medium and sales?'
Construct confidence intervals for each medium’s regression coefficient (βᵢ) in a multiple linear regression model. The size and position of these intervals relative to zero indicate how large (and significant) each medium’s effect is.
154
What method(s) can you use to answer the question: 'How accurately can we predict future sales?'
Use the fitted regression model to generate either a confidence interval for the mean response (if predicting the average) or a prediction interval (if predicting an individual outcome). Prediction intervals are wider because they account for the irreducible error term.
155
What method(s) can you use to answer the question: 'Is the relationship linear?'
Create and inspect residual plots to see if there is a systematic pattern (indicating non-linearity). If a pattern emerges, consider adding polynomial or transformed predictors to handle non-linear effects.
156
What method(s) can you use to answer the question: 'Is there synergy among the advertising media?'
Include an interaction term (e.g., TV × radio) in a multiple regression model, then check if the coefficient (and its p-value) is significant. A significant interaction term suggests synergy among the media.
157
What is the difference between parametric and non-parametric methods in regression?
Parametric methods (like linear regression) assume a functional form for f(X), with a fixed number of parameters. Non-parametric methods (like K-Nearest Neighbors) do not assume a specific form.
158
What is K-Nearest Neighbors (KNN) regression?
A non-parametric technique that predicts a new observation’s response by averaging the responses of its K closest training points in predictor space.
159
How is the prediction f̂(x₀) computed in KNN regression?
f̂(x₀) = (1/K) Σ_{xᵢ ∈ N₀} yᵢ, where N₀ is the set of the K training observations closest to x₀.
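A minimal, purely illustrative sketch of KNN regression for one predictor and a single query point x0 (not an optimized or package implementation; the data are simulated).

```
knn_predict <- function(x, y, x0, K) {
  nbrs <- order(abs(x - x0))[1:K]   # indices of the K training points closest to x0
  mean(y[nbrs])                     # average their responses
}

set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)
knn_predict(x, y, x0 = 0.5, K = 5)
```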
160
How does K affect the bias-variance trade-off in KNN?
A small K yields more flexible fits (low bias, but high variance). A large K yields smoother fits (higher bias, but lower variance), diluting local idiosyncrasies.
161
In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression?
The parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f.
162
When is it helpful to use KNN rather than a linear model?
When the true relationship is highly non-linear or too complex for a simple parametric form. KNN can adapt more flexibly to such data if enough observations are available.
163
What is the main advantage of KNN regression over linear regression?
It can capture complex, non-linear relationships without specifying a model form. Linear regression may miss these if the linear (or polynomial) form is too restrictive.
164
What are 2 disadvantages of KNN regression compared to linear regression?
* KNN can underperform in higher-dimensional problems due to the curse of dimensionality. As the number of predictors grows, data become sparse and points end up far from each other, making it hard to find truly 'nearby neighbors.' * KNN provides less interpretability — there are no explicit coefficients to explain predictor effects.
165
How do you fit a linear model in R with `medv` as the response and `lstat` as the predictor using the Boston data?
`lm.fit <- lm(medv ~ lstat, data = Boston)`
166
How do you extract the coefficients of a linear regression model in R?
`coef(lm.fit)`
167
How do you calculate confidence intervals for the coefficients of a linear regression model in R?
`confint(lm.fit)`
168
How do you produce confidence intervals for new data in R using a linear regression model?
`predict(lm.fit, newdata, interval = 'confidence')`
169
How do you produce prediction intervals for new data in R using a linear regression model?
`predict(lm.fit, newdata, interval = 'prediction')`
170
How do you plot the data with the linear model fit in R for `lstat` vs. `medv`?
```
plot(lstat, medv)
abline(lm.fit)
```
171
How do you display the standard diagnostic plots for a linear model in R?
```
par(mfrow = c(2, 2))
plot(lm.fit)
```
172
How do you plot residuals vs. fitted values in R?
`plot(predict(lm.fit), residuals(lm.fit))`
173
How do you plot studentized residuals vs. predicted values in R?
`plot(predict(lm.fit), rstudent(lm.fit))`
174
How do you plot leverage statistics in R?
`plot(hatvalues(lm.fit))`
175
How do you fit a linear model in R with `medv` as the response and `lstat` and `age` as predictors using the Boston data?
`lm.fit <- lm(medv ~ lstat + age, data = Boston)`
176
How do you fit a linear model in R with `medv` as the response and all other variables as predictors using the Boston data?
`lm.fit <- lm(medv ~ ., data = Boston)`
177
How do you calculate the variance inflation factor (VIF) in R?
```
library(car)
vif(lm.fit)
```
178
How do you fit a linear model in R with `medv` as the response and all other variables except `age` as predictors using the Boston data?
`lm.fit1 <- lm(medv ~ . - age, data = Boston)`
179
How do you modify an existing R model using the `update()` function?
Use `update()` with a new formula that references the old formula. For example, `update(lm.fit, ~ . - age)` removes the `age` predictor while keeping all other terms.
180
# Example interaction between `lstat` and `age` What does the colon (:) syntax do for interactions in R?
Using `lstat:age` includes only the interaction term between `lstat` and `age` (no main effects).
181
# Example interaction between `lstat` and `age` What does the star (*) syntax do for interactions in R?
Using `lstat * age` expands to `lstat + age + lstat:age`, meaning it includes both main effects and the interaction.
182
How do you fit a linear model in R with `medv` as the response and `lstat` as the predictor, with a quadratic `lstat` term, using the Boston data?
`lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)`
183
What is the purpose of the `anova()` function when comparing nested linear models in R?
It performs a hypothesis test to see if the more complex model significantly improves the fit compared to the simpler (nested) model.
184
How do you compare two nested models with `anova()` in R?
Call `anova(model1, model2)` where `model1` is the simpler model and `model2` is the extended model. The function returns an F-statistic and p-value for the comparison.
185
How do you include higher-order polynomial terms in a linear model in R without manually specifying each power?
Use the `poly()` function. For example: `lm(y ~ poly(x, 5))` fits a 5th-order polynomial in x.
186
What is the difference between `poly(x, 3)` and `poly(x, 3, raw = TRUE)`?
`poly(x, 3)` uses orthogonal polynomials (less correlation, more stable estimates), while `raw = TRUE` produces raw powers of x (x, x², x³). Both yield the same fitted values but have different coefficient estimates.
187
How do you fit a linear model in R with `medv` as the response and the logarithm of `rm` as the predictor using the Boston data?
`lm.fit <- lm(medv ~ log(rm), data = Boston)`
188
How does R handle qualitative variables in a linear regression model by default?
R automatically creates dummy variables for each factor level (except the baseline), allowing regression coefficients to compare each category to the baseline.
189
What does the `contrasts()` function do in R?
It shows (and can set) the coding scheme for factor variables (i.e., how factor levels map to dummy variables). For example, `contrasts(ShelveLoc)` displays the dummy coding.