Reading 1: Multiple Regression Flashcards

1
Q

What are the columns included in an ANOVA table?

What do they measure/show?

A
  • source of variation
  • degrees of freedom: The number of independent values that can vary in the calculation.
  • sum of squares: The variation attributed to each source (regression, error, total).
  • mean square: The sum of squares divided by its degrees of freedom (the average variation).
2
Q

What is an ANOVA table used to calculate?

A

F-test and R^2

3
Q

What are the degrees of freedom for Regression, error and total in the ANOVA table?

A
  • regression = k
  • error = n-k-1
  • total = n-1
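
A quick worked check of these degrees of freedom (the values n = 50 and k = 3 are assumed purely for illustration):

```python
# Hypothetical regression: n = 50 observations, k = 3 independent variables
n, k = 50, 3
df_regression = k           # 3
df_error = n - k - 1        # 46
df_total = n - 1            # 49
assert df_regression + df_error == df_total
```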
4
Q

How do you calculate Regression Sum of Squares and what is its significance and usage?

A

= explained variation

Significance: SSR measures the variation explained by the regression model. It indicates how much of the total variation is accounted for by the model’s predictions.

**Usage:** A higher SSR suggests that the model is effective in explaining the variability in the data. It is used to assess the model’s explanatory power.

Compares Y estimated vs. Y mean.

5
Q

How do you calculate Error Sum of Squares and what is its significance and usage?

A

= unexplained variation

Significance: SSE measures the variation that is not explained by the regression model. It represents the residual or unexplained variability in the data.
Usage: A lower SSE indicates that the model’s predictions are closer to the actual data points, suggesting a better fit. It is used to evaluate the model’s accuracy.

Compares Y actual vs. Y estimated.

6
Q

How do you calculate Total Sum of Squares and what is its significance and usage?

A

= explained variation + unexplained variation
SST = SSR + SSE

**Significance:** SST represents the total variation in the observed data. It serves as a baseline measure of how much the actual data points deviate from the overall mean (sum of squared differences).

**Usage:** SST is used to quantify the total variability in the dataset before any model is applied.

Compares Y actual vs. Y mean.
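
A minimal numpy sketch tying the three sums of squares together (the data are made up, and the model is fit by OLS so that SST = SSR + SSE holds):

```python
import numpy as np

# Hypothetical data: one independent variable, fit by OLS with an intercept
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])     # add intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
y_hat = X @ beta
y_bar = y.mean()

ssr = np.sum((y_hat - y_bar) ** 2)  # explained: Y estimated vs. Y mean
sse = np.sum((y - y_hat) ** 2)      # unexplained: Y actual vs. Y estimated
sst = np.sum((y - y_bar) ** 2)      # total: Y actual vs. Y mean

print(round(ssr, 4), round(sse, 4), round(sst, 4))  # ssr + sse equals sst (up to rounding)
```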

7
Q

How is the Mean Square calculated?

A

Sum of squares divided by degrees of freedom

9
Q

What is the formula to calculate R2 directly from the ANOVA table and what does it show?

A

R2 measures the % of total variation in the Y variable (dependent) explained by the X variables (independent).

= explained variation / total variation
or
= (total variation − unexplained variation) / total variation

10
Q

What does an R2 of 0.25 mean?

A

X explains 25% of the variation in Y

11
Q

What is the purpose of an adjusted R2 and how is it calculated?

A

Adjusted R2 applies a penalty factor to reflect the quality of added variables.
Too many explanatory X variables run the risk of over-explaining the data (explaining randomness rather than true patterns), which leads to poor forecasting.

Formula: adjusted R2 = 1 − [(total df / unexplained df) × (1 − R2)] = 1 − [((n − 1) / (n − k − 1)) × (1 − R2)]
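
A quick worked example (n = 50, k = 3 and R2 = 0.80 are assumed for illustration):

```python
# Hypothetical values: n = 50 observations, k = 3 independent variables, R^2 = 0.80
n, k, r2 = 50, 3, 0.80

adj_r2 = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
print(round(adj_r2, 3))  # 0.787 -> the penalty for extra regressors pulls it below R^2
```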

12
Q

Is higher or lower better for the following:
* R2
* AIC
* BIC

A
  • r2 = higher
  • AIC and BIC = lower
13
Q

What does AIC help to evaluate/when is it best used?

What is the formula? And what effect does k have?

A

AIC is best used if the purpose is prediction, i.e. the goal is to have a better forecast.

Formula: AIC = n × ln(SSE / n) + 2(k + 1)

Holding SSE constant, if k increases, AIC increases (the penalty term 2(k + 1) grows with the number of regressors).

14
Q

What does BIC help to evaluate/when is it best used?

What is the formula? And what effect does k have compared to AIC?

A

BIC is preferred if goodness of fit is the goal.

Formula: BIC = n × ln(SSE / n) + ln(n) × (k + 1)

BIC imposes a higher penalty for overfitting than AIC (ln(n) > 2 once n is larger than about 7), so if k increases, BIC increases by more than AIC.
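
A short sketch of both criteria using the sums-of-squares forms above (the sample values for n, SSE and k are assumptions for illustration):

```python
import math

def aic(n, sse, k):
    # AIC = n * ln(SSE/n) + 2(k + 1)
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(n, sse, k):
    # BIC = n * ln(SSE/n) + ln(n) * (k + 1)
    return n * math.log(sse / n) + math.log(n) * (k + 1)

n = 50
print(round(aic(n, sse=20.0, k=3), 2), round(bic(n, sse=20.0, k=3), 2))
# Adding a 4th regressor that barely lowers SSE raises both criteria,
# and BIC rises by more because its per-variable penalty is larger.
print(round(aic(n, sse=19.8, k=4), 2), round(bic(n, sse=19.8, k=4), 2))
```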

15
Q

What is the purpose of the F-statistic in nested models?

A

To determine if the simpler (nested) model is significantly different from the more complex (full) model.

16
Q

How do you calculate the F-statistic for nested models?

A
F = [(SSE restricted − SSE unrestricted) / q] ÷ [SSE unrestricted / (n − k − 1)]

The denominator is the MSE of the unrestricted (full) model: MSE = SSE unrestricted / (n − k − 1).
17
Q

What are the degrees of freedom for the F statistic nested model?

A

numerator df = q = number of excluded variables in the restricted model

denominator df = n − k − 1
where k = number of independent variables in the full model
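
A small worked example of the nested-model F-test (all values below are hypothetical):

```python
# Hypothetical nested-model comparison: the full model has k = 5 regressors,
# the restricted model drops q = 2 of them; n = 60 observations.
n, k, q = 60, 5, 2
sse_restricted, sse_unrestricted = 130.0, 100.0   # assumed sums of squared errors

numerator = (sse_restricted - sse_unrestricted) / q   # df = q
denominator = sse_unrestricted / (n - k - 1)          # df = n - k - 1
f_stat = numerator / denominator
print(round(f_stat, 2))  # 8.1 -> compare with the F critical value at (q, n - k - 1) df
```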

18
Q

What are the hypotheses for the F-statistic in nested models?

A

Null Hypothesis (H₀): The coefficients of the removed predictors are zero.
i.e. useless

Alternative Hypothesis (H₁): At least one of the removed predictors has a non-zero coefficient.
i.e. not useless: at least one of those variables is pulling its weight in explaining the variation in Y

19
Q

What is the conclusion if F statistic > critical value?

A

Reject null, test is statistically significant.

Full model provides a significantly better fit than the nested model.

The relative decrease in SSE due to the inclusion of the q additional variables is statistically justified, i.e. they improve the model.

20
Q

What is the purpose of the F-statistic in assessing overall model fit?

A

To compare the fit of the regression model to a model with no predictors i.e. no slope coefficients.

20
Q

How do you calculate the F-statistic for overall model fit?

A

F = MSR / MSE, calculated from the unrestricted model, where:

MSR = SSR / k
MSE = SSE / (n − k − 1)
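
A quick worked example (the ANOVA values n = 50, k = 3, SSR = 80 and SSE = 20 are assumed for illustration):

```python
# Hypothetical ANOVA values: n = 50, k = 3, SSR = 80, SSE = 20
n, k = 50, 3
ssr, sse = 80.0, 20.0

msr = ssr / k              # mean square regression
mse = sse / (n - k - 1)    # mean square error
f_stat = msr / mse
print(round(f_stat, 2))    # 61.33 -> compare with the F critical value at (k, n - k - 1) df
```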

21
Q

What are the hypotheses for the F-statistic in overall model fit?

A

Null Hypothesis (H₀): The model with no predictors fits the data as well as the regression model.
i.e. the slope coefficients on all X variables in the unrestricted model = 0

Alternative Hypothesis (H₁): The regression model provides a better fit than the model with no predictors.

21
Q

What does a significant F-statistic indicate in overall model fit?

A

It indicates that the regression model explains a substantial portion of the variance in the response variable.

22
Q

What are 4 model misspecifications?

A
  1. omitting a variable that should be included
  2. failing to transform a variable (e.g. for linearity)
  3. inappropriate scaling of a variable
  4. incorrectly pooling data
23
Q

Why might a variable need to be transformed for linearity? What assumptions may be violated?

A

A variable might need to be transformed to ensure the relationship between the predictor and the response variable is linear, e.g. converting market cap to the log of market cap.

Potential violation: heteroskedasticity in the residuals.

Explanation: Transforming variables (e.g. using logarithms or square roots) can help linearize relationships, making the model more accurate and easier to interpret. Non-linear relationships can lead to poor model fit and misleading results.
23
Q

What is the consequence of omitting a variable that should be included in a model? What assumptions may be violated?

A

Omitting a variable can lead to model misspecification, resulting in biased and inconsistent estimates.

Potential violations: serial correlation or heteroskedasticity in the residuals.

Explanation: When a relevant variable is omitted, the model fails to account for its effect, which can distort the relationships between the included variables and the response variable. This can lead to incorrect conclusions and predictions.
23
Q

What is the impact of inappropriate scaling of a variable? What assumptions may be violated?

A

Inappropriate scaling can affect the model's accuracy and interpretability, e.g. using the number of free-float shares rather than the proportion.

Potential violations: heteroskedasticity/multicollinearity.

Explanation: Variables should be scaled appropriately to ensure they contribute correctly to the model. Incorrect scaling can lead to a disproportionate influence of certain variables, skewing the results and making the model less reliable.
24
Q

What does incorrectly pooling data mean? What assumptions may be violated?

A

Incorrectly pooling data refers to combining data from different regimes or contexts without accounting for their differences, e.g. the difference between pre- and post-COVID/GFC periods.

Potential violations: serial correlation or heteroskedasticity in the residuals.

Explanation: Pooling data from different regimes can lead to misleading results, as the underlying relationships may differ across contexts. It is important to account for these differences to ensure the model accurately reflects the data.
24
Q

What is heteroskedasticity? How many types are there?

A

Heteroskedasticity occurs when the variance of the errors in a regression model is not constant.

There are two types: conditional and unconditional. Conditional heteroskedasticity is the problematic type because the error variance is related to the independent variables.
25
Q

What is the effect of heteroskedasticity on regression output? + the effect on financial data

A

T- and F-stats (hypothesis tests and confidence intervals) become unreliable. Slope coefficient estimates are not affected; however, the standard errors become unreliable.

For financial data, the standard errors are most likely understated and the t-stats inflated (too high, causing Type I errors).

Explanation: When heteroskedasticity is present, the ordinary least squares (OLS) estimates remain unbiased, but they are no longer efficient.
26
Q

What does heteroskedasticity look like on a graph?

A

The residuals do not form a constant band around zero: their spread widens (fans out) or narrows as the independent variable or fitted value increases, rather than being randomly scattered with constant variance.
27
Q

How do we detect heteroskedasticity?

A
  • Scatter diagram: plot the residuals against each independent variable and against time. Residuals should be randomly distributed around the X variable; if, for example, the error term gets larger as the variable gets larger, heteroskedasticity is present.
  • Breusch-Pagan test: regress the squared residuals on the X variables and test the resulting R2 (do the independent variables explain a significant part of the variation in the squared residuals?).
  • H0 = no heteroskedasticity.
  • Chi-square test: BP = n × R2 of the residual regression (with k df).
  • If BP > critical value, reject the null and conclude you have a problem.
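
A minimal numpy sketch of the Breusch-Pagan statistic (the breusch_pagan helper and the simulated data are illustrative assumptions, not from the reading):

```python
import numpy as np

def breusch_pagan(residuals, X):
    """BP = n * R^2 from regressing the squared residuals on the X variables."""
    n = len(residuals)
    Z = np.column_stack([np.ones(n), X])           # intercept + independent variables
    e2 = residuals ** 2
    beta, *_ = np.linalg.lstsq(Z, e2, rcond=None)  # auxiliary OLS regression
    fitted = Z @ beta
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2

# Hypothetical residuals from a regression with k = 2 regressors,
# generated so that the error variance grows with the first regressor.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
residuals = rng.normal(size=100) * np.exp(0.8 * X[:, 0])

bp = breusch_pagan(residuals, X)
print(round(bp, 2))  # compare with the chi-square critical value with k = 2 df (about 5.99 at 5%);
                     # BP above the critical value -> reject H0 of no heteroskedasticity
```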
28
Q

What is the effect of serial correlation on regression output?

A

T- and F-stats (hypothesis tests and confidence intervals) become unreliable: estimates are inefficient and standard errors are biased. Slope coefficient estimates are not affected; however, the standard errors become unreliable.

  • Positive serial correlation: standard error is too low (t-stat too high).
  • Negative serial correlation: standard error is too high (t-stat too low).

Explanation: When serial correlation is present, the ordinary least squares (OLS) estimates remain unbiased, but they are no longer efficient. This means that the standard errors of the coefficients are incorrect, leading to unreliable hypothesis tests and confidence intervals.
28
Q

What is serial correlation?

A

Serial correlation, also known as autocorrelation, occurs when the residuals (errors) in a regression model are correlated across observations, violating the assumption of independent errors. e.g. if the residual in the current period is positive, the probability that the residual in the next period is also positive is greater than 50%.
29
Q

How do we detect serial correlation?

A

Serial correlation can be detected using graphical methods (e.g. residual/scatter plots) and statistical tests (e.g. Durbin-Watson test, Breusch-Godfrey test).

  • Durbin-Watson: tests one lag.
  • Breusch-Godfrey: tests several lags; the residuals are used as the Y variable and regressed against the initial regressors plus lagged residuals. F distribution with p (numerator) and n − p − k − 1 (denominator) degrees of freedom.
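
A hand-rolled sketch of the Durbin-Watson statistic (the AR(1) residual series below is simulated purely for illustration):

```python
import numpy as np

def durbin_watson(residuals):
    """DW for one lag: near 2 suggests no serial correlation, well below 2 suggests
    positive serial correlation, well above 2 suggests negative serial correlation."""
    e = np.asarray(residuals)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual series with positive serial correlation (AR(1), rho = 0.7)
rng = np.random.default_rng(1)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal()

print(round(durbin_watson(e), 2))  # well below 2 -> evidence of positive serial correlation
```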
30
Q

How do you correct for serial correlation/heteroskedasticity?

A

Use robust standard errors:
  • Newey-West corrected standard errors for serial correlation
  • White-corrected standard errors for conditional heteroskedasticity
30
Q

What is multicollinearity?

A

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to isolate their individual effects on the response variable. e.g. if 4 friends are pushing a car when it breaks down, who is doing most of the work?

Explanation: When predictors are highly correlated, it becomes challenging to determine the unique contribution of each predictor to the response variable, leading to issues in the regression analysis.
31
Q

What is the effect of multicollinearity on regression output?

A

Multicollinearity can lead to inflated standard errors and unreliable coefficient estimates. It reduces t-stats and increases the chance of Type II errors: the X variables seem less valuable because they are sharing credit with other variables, so the t-stats are artificially small and the variables look falsely unimportant.

Explanation: High correlation among predictors can cause instability in the coefficient estimates, making them sensitive to small changes in the model. This results in large standard errors and unreliable hypothesis tests.
31
Q

How do we detect multicollinearity?

A
  • A significant F-stat (low p-value, < 0.05) and a high R2, but all individual t-stats/p-values insignificant (p-value > 0.05).
  • High correlation between X variables (k = 2 case only).
  • High Variance Inflation Factor (VIF), where VIF = 1/(1 − R2) from regressing that X variable on the others:
    VIF = 1: no correlation
    VIF > 5: further investigation
    VIF > 10: SERIOUS multicollinearity that needs correction
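
A minimal numpy sketch of the VIF calculation (the vif helper and the simulated regressors are illustrative assumptions):

```python
import numpy as np

def vif(X, j):
    """VIF for regressor j: regress X_j on the other X variables, then 1 / (1 - R^2)."""
    n = X.shape[0]
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(n), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# Hypothetical regressors where x2 is nearly a copy of x1 -> severe multicollinearity
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 show VIF well above 10; x3 stays near 1
```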
32
Q

How do we correct for multicollinearity?

A
  • Remove one or more regression variables.
  • Use a different proxy for one of the variables, e.g. liquidity can be proxied by the bid-ask spread instead of free float.
  • Increase the sample size, which is more statistically robust.
33
Q

What two types of observations can influence regression results?

A
  • High-leverage point: an observation with an extreme value of an independent (X) variable.
  • Outlier: an observation with an extreme value of the dependent (Y) variable.
34
Q

What is leverage? When is an observation considered influential?

A

Leverage is a standardised measure of the distance of observation j from the mean and takes on a value between 0 and 1.

An observation is potentially influential if its leverage is greater than 3 × (k + 1)/n, where k is the number of independent variables.
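
A quick worked example of the leverage rule of thumb (k = 3 and n = 50 are assumed for illustration):

```python
# Hypothetical regression: k = 3 independent variables, n = 50 observations
k, n = 3, 50
threshold = 3 * (k + 1) / n
print(threshold)  # 0.24 -> an observation with leverage above 0.24 is potentially influential
```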
35
Q

What are studentised residuals? How do they work?

A

A measure for identifying outliers.

Delete observation j and estimate the regression model using the remaining n − 1 observations; estimate y-hat and e_j, then calculate the studentised residual for each observation in the dataset.

The critical value acts as a ceiling: if the absolute value of the studentised residual is greater than the t critical value, REJECT (the sign does not matter; it is a two-tailed t-test). A rejected observation is an outlier.

Degrees of freedom for the critical value: n − k − 2.
36
Q

What are Dummy Variables?

A

Purpose: They allow categorical variables (like gender, region, or type) to be included in regression models.

Representation: Each category is represented by a binary variable (0 or 1).
37
Q

How many dummy variables are used and why?

A

For a categorical variable with n categories, use n − 1 dummy variables, to avoid multicollinearity.
38
Q

What is an intercept dummy? How does it work?

A

Purpose: Adjusts the intercept of the regression model for different categories of a categorical variable. D equals either 0 or 1: if 0, the whole term = 0; if 1, the whole term = b1.

How it works: Each dummy variable shifts the intercept of the regression line for its respective category. The coefficients of intercept dummy variables represent the difference in the intercept for each category compared to the reference group.
39
Q

What is a slope dummy? How does it work?

A

Purpose: Adjusts the slope of the regression model for different categories of a categorical variable. DX captures the change in the slope on account of the dummy variable.

How it works: Each dummy variable interacts with a continuous predictor to change the slope of the regression line for its respective category. The coefficients of slope dummy variables represent the difference in the slope for each category compared to the reference group.
40
Q

What is a logistic regression model?

A

A logistic regression model is a statistical method used to model the relationship between a binary dependent variable (e.g. failure/success or increase/decrease) and one or more independent variables.

It estimates the probability (via the log odds) of an event based on the logistic distribution.
41
Q

What is the formula to convert the probability of an event to odds?

A

odds = p / (1 − p), where p is the probability of the event. The logit model works with the natural log of these odds (the log-odds).
42
Q

How do you calculate the probability once you have the estimated Y variable?

A

The estimated Y is the log-odds, so p = e^Y / (1 + e^Y), equivalently p = 1 / (1 + e^(−Y)).
43
Q

How should you interpret the slope coefficients for logit models?

A

The coefficients (beta) represent the change in the log-odds of the outcome for a one-unit increase in the X variable.
44
Q

Interpret this model predicting the probability of passing an exam based on study hours and attendance:

A
  • Intercept (b0 = -2): the log-odds of passing the exam when study hours and attendance are zero.
  • Study hours (b1 = 0.05): for each additional hour of study, the log-odds of passing the exam increase by 0.05. Odds ratio: e^0.05 ≈ 1.051, so each additional hour of study multiplies the odds of passing by approximately 1.051.
  • Attendance (b2 = 0.3): for each additional unit of attendance, the log-odds of passing the exam increase by 0.3. Odds ratio: e^0.3 ≈ 1.35, so each additional unit of attendance multiplies the odds of passing by approximately 1.35.
  • Positive coefficient: indicates an increase in the log-odds (and thus the odds) of the outcome. Negative coefficient: indicates a decrease in the log-odds (and thus the odds) of the outcome.
45
Q

What is pseudo R2 used for?

A

To evaluate competing models with the same dependent variable. A higher value = a better fit.
46
Q

How do you work out the probability given the coefficient of the intercept?

A

Raising e to the coefficient converts the log-odds to odds; odds / (1 + odds) then converts the odds to a probability.
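
A small sketch of these conversions (the log-odds value 0.4 is assumed purely for illustration):

```python
import math

# Hypothetical logit output: estimated log-odds (Y hat) = 0.4
log_odds = 0.4

odds = math.exp(log_odds)   # log-odds -> odds
prob = odds / (1 + odds)    # odds -> probability (equivalently 1 / (1 + e^(-log_odds)))
print(round(odds, 3), round(prob, 3))  # 1.492 0.599

# Going the other way: probability -> odds -> log-odds
p = 0.599
print(round(p / (1 - p), 3), round(math.log(p / (1 - p)), 3))  # 1.494 0.401
```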
47
Q

For logit regression, when do you reject the null?

A

When the p-value is smaller than the significance level (alpha). This means the coefficient is statistically significant.
48
Q

What is a likelihood ratio? How does it work?

A

A likelihood ratio is a statistical measure used to compare the goodness of fit between two models. In the context of regression, it helps determine whether a more complex model significantly improves the fit of the data compared to a simpler model.

LR = −2 × (log likelihood of the restricted model − log likelihood of the unrestricted model)

Chi-square distribution with q df, where q = the number of variables omitted in the restricted model; one-tailed test.

Reject the null if the chi-square statistic > critical value: the omitted variables are not useless, their coefficients are far from 0 and they do add explanatory power.

The log-likelihood metric is negative; higher values (closer to zero) = a better-fitting model.
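
A tiny worked example of the test (the log-likelihood values and q are assumed for illustration):

```python
# Hypothetical log-likelihoods: restricted model (q = 2 variables dropped) vs. unrestricted model
ll_restricted, ll_unrestricted = -112.0, -105.0
q = 2

lr = -2 * (ll_restricted - ll_unrestricted)   # = 14.0
# The chi-square critical value with q = 2 df at the 5% level is about 5.99;
# 14.0 > 5.99, so reject H0: the dropped variables add explanatory power.
print(lr)
```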