Reading 1: Multiple Regression Flashcards
What are the columns included in an ANOVA table?
What do they measure/show?
- source of variation
- degrees of freedom: The number of independent values that can vary in the calculation.
- sum of squares: Measures the total variation in the data.
- mean square: The average variation.
What is an ANOVA table used to calculate?
F-test and R^2
What are the degrees of freedom for Regression, error and total in the ANOVA table?
- regression = k
- error = n-k-1
- total = n-1
How do you calculate Regression Sum of Squares and what is its significance and usage?
= explained variation
Significance: SSR measures the variation explained by the regression model. It indicates how much of the total variation is accounted for by the model’s predictions.
**Usage:** A higher SSR suggests that the model is effective in explaining the variability in the data. It is used to assess the model’s explanatory power.
Y estimated vs. Y mean: SSR = Σ(Y estimated - Y mean)^2
How do you calculate Error Sum of Squares and what is its significance and usage?
= unexplained variation
Significance: SSE measures the variation that is not explained by the regression model. It represents the residual or unexplained variability in the data.
Usage: A lower SSE indicates that the model’s predictions are closer to the actual data points, suggesting a better fit. It is used to evaluate the model’s accuracy.
Y actual vs. Y estimated: SSE = Σ(Y actual - Y estimated)^2
How do you calculate Total Sum of Squares and what is its significance and usage?
= explained variation + unexplained variation
SST = SSR + SSE
**Significance:** SST represents the total variation in the observed data. It serves as a baseline measure of how much the actual data points deviate from the overall mean (sum of squared differences).
**Usage:** SST is used to quantify the total variability in the dataset before any model is applied.
Y actual vs. Y mean: SST = Σ(Y actual - Y mean)^2
How is the Mean Square calculated?
Sum of squares divided by degrees of freedom: MSR = SSR/k, MSE = SSE/(n - k - 1)
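A minimal Python sketch of the sum-of-squares and mean-square calculations above, using made-up numbers purely for illustration (the data and fitted values are assumptions, not from the reading):

```python
import numpy as np

# Hypothetical data: n = 6 observations, k = 2 independent variables
y_actual = np.array([10.0, 12.0, 15.0, 11.0, 14.0, 16.0])
y_hat = np.array([10.5, 11.5, 14.5, 11.5, 14.0, 16.0])  # fitted values from some regression
n, k = len(y_actual), 2
y_bar = y_actual.mean()

# Sums of squares
ssr = np.sum((y_hat - y_bar) ** 2)     # explained: Y estimated vs Y mean
sse = np.sum((y_actual - y_hat) ** 2)  # unexplained: Y actual vs Y estimated
sst = np.sum((y_actual - y_bar) ** 2)  # total: Y actual vs Y mean

# Mean squares = sum of squares / degrees of freedom
msr = ssr / k            # regression df = k
mse = sse / (n - k - 1)  # error df = n - k - 1

# SSR + SSE equals SST exactly only when y_hat comes from an OLS fit with an intercept
print(ssr, sse, sst, msr, mse)
```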
What is the formula to calculate R2 directly from the ANOVA table and what does it show?
R2 measures the % of total variation in the Y (dependent) variable explained by the X (independent) variables
= explained variation/total variation
or
= (total variation - unexplained variation)/total variation
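A quick numerical check of both forms, assuming illustrative ANOVA-table values (SSR = 80, SSE = 20):

```python
ssr, sse = 80.0, 20.0        # illustrative explained / unexplained variation
sst = ssr + sse              # total variation = 100

r2_direct = ssr / sst        # explained / total
r2_alt = (sst - sse) / sst   # (total - unexplained) / total
print(r2_direct, r2_alt)     # both print 0.8: X explains 80% of the variation in Y
```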
What does an R2 of 0.25 mean?
X explains 25% of the variation in Y
What is the purpose of an adjusted R2 and how is it calculated?
Adjusted R2 applies a penalty factor to reflect the quality of added variables.
Too many explanatory X variables run the risk of trying to overexplain the data (explaining randomness, not true patterns) = poor forecasting.
formula: adjusted R2 = 1 - (total df/unexplained df) x (1 - R2) = 1 - [(n - 1)/(n - k - 1)] x (1 - R2)
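A small sketch of that formula, with n, k and R2 chosen purely for illustration:

```python
n, k, r2 = 50, 5, 0.60  # illustrative sample size, number of X variables, and R^2

# adjusted R^2 = 1 - (total df / error df) * (1 - R^2)
adj_r2 = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
print(adj_r2)           # ~0.555, lower than R^2, reflecting the penalty for the added variables
```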
Is higher or lower better for the following:
* R2
* AIC
* BIC
- r2 = higher
- AIC and BIC = lower
What does AIC help to evaluate/when is it best used?
What is the formula? And what effect does k have?
AIC is preferred if the purpose is prediction, i.e. the goal is to have a better forecast.
Formula: AIC = n x ln(SSE/n) + 2(k + 1)
If k increases, AIC increases (a penalty of 2 per added parameter).
What does BIC help to evaluate/when is it best used?
What is the formula? And what effect does k have compared to AIC?
BIC is preferred if goodness of fit is the goal.
Formula: BIC = n x ln(SSE/n) + ln(n) x (k + 1)
It imposes a higher penalty for overfitting: if k increases, BIC increases more than AIC (because ln(n) > 2 once n > 7).
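A sketch comparing the two criteria under the formulas above, with sample size and SSE chosen only for illustration:

```python
import numpy as np

n, sse = 100, 250.0  # illustrative sample size and error sum of squares

def aic(n, sse, k):
    return n * np.log(sse / n) + 2 * (k + 1)

def bic(n, sse, k):
    return n * np.log(sse / n) + np.log(n) * (k + 1)

# Adding one more X variable (k: 4 -> 5) with SSE unchanged: both criteria rise,
# but BIC rises by ln(100) ~ 4.6 versus 2 for AIC, i.e. a stiffer overfitting penalty
for k in (4, 5):
    print(k, aic(n, sse, k), bic(n, sse, k))
```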
What is the purpose of the F-statistic in nested models?
To determine if the simpler (nested) model is significantly different from the more complex (full) model.
How do you calculate the F-statistic for nested models?
What are the degrees of freedom for the F statistic nested model?
F = [(SSE restricted - SSE unrestricted)/q] / [SSE unrestricted/(n - k - 1)]
numerator df = q = number of variables excluded in the restricted model
denominator df = n - k - 1
k = number of independent variables in the full model
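A sketch of the nested-model F-test, assuming illustrative SSE values for the restricted and full models and a 5% significance level:

```python
from scipy.stats import f

n, k, q = 60, 5, 2  # k X variables in the full model; q of them dropped in the restricted model
sse_restricted, sse_full = 140.0, 120.0  # illustrative error sums of squares

# F = [(SSE_restricted - SSE_full) / q] / [SSE_full / (n - k - 1)]
f_stat = ((sse_restricted - sse_full) / q) / (sse_full / (n - k - 1))

# Critical value at the 5% level with df = (q, n - k - 1)
f_crit = f.ppf(0.95, q, n - k - 1)
print(f_stat, f_crit)  # reject H0 if f_stat > f_crit: the q extra variables improve the model
```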
What are the hypotheses for the F-statistic in nested models?
Null Hypothesis (H₀): The coefficients of the removed predictors are zero.
i.e. useless
Alternative Hypothesis (H₁): At least one of the removed predictors has a non-zero coefficient.
i.e. not useless; at least one of them is pulling its weight in explaining the variation in Y
What is the conclusion if F statistic > critical value?
Reject null, test is statistically significant.
Full model provides a significantly better fit than the nested model.
The relative decrease in SSE due to the inclusion of q additional variables is statistically justified, i.e. they improve the model.
What is the purpose of the F-statistic in assessing overall model fit?
To compare the fit of the regression model to a model with no predictors i.e. no slope coefficients.
How do you calculate the F-statistic for overall model fit?
F = MSR/MSE (calculated from the unrestricted model), where:
MSR = SSR/k
MSE = SSE/(n - k - 1)
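A short sketch of the overall-fit F-statistic using illustrative ANOVA-table values:

```python
ssr, sse = 80.0, 20.0  # illustrative explained / unexplained variation
n, k = 40, 3           # illustrative sample size and number of X variables

msr = ssr / k             # mean square regression
mse = sse / (n - k - 1)   # mean square error
f_stat = msr / mse        # compare with the critical F value with df = (k, n - k - 1)
print(f_stat)             # = 48.0 here; a large F suggests the X variables jointly explain variation in Y
```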
What are the hypotheses for the F-statistic in overall model fit?
Null Hypothesis (H₀): The model with no predictors fits the data as well as the regression model.
i.e. the slope coefficients on all X variables in the unrestricted model = 0
Alternative Hypothesis (H₁): The regression model provides a better fit than the model with no predictors.
What does a significant F-statistic indicate in overall model fit?
It indicates that the regression model explains a substantial portion of the variance in the response variable.
What are 4 model misspecifications?
- omitting a variable that should be included
- failing to transform a variable when needed (e.g. to achieve linearity)
- inappropriate scaling of the variable
- incorrectly pooling data