Reading 1: Multiple Regression Flashcards

1
Q

What are the columns included in an ANOVA table?

What do they measure/show?

A
  • source of variation
  • degrees of freedom: The number of independent values that can vary in the calculation.
  • sum of squares: The variation attributed to each source (regression, error, total).
  • mean square: The sum of squares divided by its degrees of freedom (the average variation).
2
Q

What is an ANOVA table used to calculate?

A

F-test and R^2

3
Q

What are the degrees of freedom for Regression, error and total in the ANOVA table?

A
  • regression = k
  • error = n-k-1
  • total = n-1
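
A quick worked check of these degrees of freedom (the values n = 50 and k = 3 are assumed purely for illustration):

```python
# Hypothetical regression: n = 50 observations, k = 3 independent variables
n, k = 50, 3
df_regression = k           # 3
df_error = n - k - 1        # 46
df_total = n - 1            # 49
assert df_regression + df_error == df_total
```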
4
Q

How do you calculate Regression Sum of Squares and what is its significance and usage?

A

= explained variation

Significance: SSR measures the variation explained by the regression model. It indicates how much of the total variation is accounted for by the model’s predictions.

**Usage:** A higher SSR suggests that the model is effective in explaining the variability in the data. It is used to assess the model’s explanatory power.

Compares Y estimated vs. Y mean.

5
Q

How do you calculate Error Sum of Squares and what is its significance and usage?

A

= unexplained variation

Significance: SSE measures the variation that is not explained by the regression model. It represents the residual or unexplained variability in the data.
Usage: A lower SSE indicates that the model’s predictions are closer to the actual data points, suggesting a better fit. It is used to evaluate the model’s accuracy.

Compares Y actual vs. Y estimated.

6
Q

How do you calculate Total Sum of Squares and what is its significance and usage?

A

= explained variation + unexplained variation
SST = SSR + SSE

**Significance:** SST represents the total variation in the observed data. It serves as a baseline measure of how much the actual data points deviate from the overall mean (sum of squared differences).

**Usage:** SST is used to quantify the total variability in the dataset before any model is applied.

Compares Y actual vs. Y mean.
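
A minimal numpy sketch tying the three sums of squares together (the data are made up, and the model is fit by OLS so that SST = SSR + SSE holds):

```python
import numpy as np

# Hypothetical data: one independent variable, fit by OLS with an intercept
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])     # add intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
y_hat = X @ beta
y_bar = y.mean()

ssr = np.sum((y_hat - y_bar) ** 2)  # explained: Y estimated vs. Y mean
sse = np.sum((y - y_hat) ** 2)      # unexplained: Y actual vs. Y estimated
sst = np.sum((y - y_bar) ** 2)      # total: Y actual vs. Y mean

print(round(ssr, 4), round(sse, 4), round(sst, 4))  # ssr + sse equals sst (up to rounding)
```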

7
Q

How is the Mean Square calculated?

A

Sum of squares divided by degrees of freedom

9
Q

What is the formula to calculate R2 directly from the ANOVA table and what does it show?

A

R2 measures the % of total variation in the Y variable (dependent) explained by the X variables (independent).

= explained variation / total variation
or
= (total variation − unexplained variation) / total variation

10
Q

What does an R2 of 0.25 mean?

A

X explains 25% of the variation in Y

11
Q

What is the purpose of an adjusted R2 and how is it calculated?

A

Adjusted R2 applies a penalty factor to reflect the quality of added variables.
Too many explanatory X variables run the risk of over-explaining the data (explaining randomness rather than true patterns), which leads to poor forecasting.

Formula: adjusted R2 = 1 − [(total df / unexplained df) × (1 − R2)] = 1 − [((n − 1) / (n − k − 1)) × (1 − R2)]
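
A quick worked example (n = 50, k = 3 and R2 = 0.80 are assumed for illustration):

```python
# Hypothetical values: n = 50 observations, k = 3 independent variables, R^2 = 0.80
n, k, r2 = 50, 3, 0.80

adj_r2 = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
print(round(adj_r2, 3))  # 0.787 -> the penalty for extra regressors pulls it below R^2
```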

12
Q

Is higher or lower better for the following:
* R2
* AIC
* BIC

A
  • r2 = higher
  • AIC and BIC = lower
13
Q

What does AIC help to evaluate/when is it best used?

What is the formula? And what effect does k have?

A

AIC is best used if the purpose is prediction, i.e. the goal is to have a better forecast.

Formula: AIC = n × ln(SSE / n) + 2(k + 1)

Holding SSE constant, if k increases, AIC increases (the penalty term 2(k + 1) grows with the number of regressors).

14
Q

What does BIC help to evaluate/when is it best used?

What is the formula? And what effect does k have compared to AIC?

A

BIC is preferred if goodness of fit is the goal.

Formula: BIC = n × ln(SSE / n) + ln(n) × (k + 1)

BIC imposes a higher penalty for overfitting than AIC (ln(n) > 2 once n is larger than about 7), so if k increases, BIC increases by more than AIC.
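
A short sketch of both criteria using the sums-of-squares forms above (the sample values for n, SSE and k are assumptions for illustration):

```python
import math

def aic(n, sse, k):
    # AIC = n * ln(SSE/n) + 2(k + 1)
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(n, sse, k):
    # BIC = n * ln(SSE/n) + ln(n) * (k + 1)
    return n * math.log(sse / n) + math.log(n) * (k + 1)

n = 50
print(round(aic(n, sse=20.0, k=3), 2), round(bic(n, sse=20.0, k=3), 2))
# Adding a 4th regressor that barely lowers SSE raises both criteria,
# and BIC rises by more because its per-variable penalty is larger.
print(round(aic(n, sse=19.8, k=4), 2), round(bic(n, sse=19.8, k=4), 2))
```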

15
Q

What is the purpose of the F-statistic in nested models?

A

To determine if the simpler (nested) model is significantly different from the more complex (full) model.

16
Q

How do you calculate the F-statistic for nested models?

A
F = [(SSE restricted − SSE unrestricted) / q] ÷ [SSE unrestricted / (n − k − 1)]

The denominator is the MSE of the unrestricted (full) model: MSE = SSE unrestricted / (n − k − 1).
17
Q

What are the degrees of freedom for the F statistic nested model?

A

numerator df = q = number of excluded variables in the restricted model

denominator df = n − k − 1
where k = number of independent variables in the full model
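
A small worked example of the nested-model F-test (all values below are hypothetical):

```python
# Hypothetical nested-model comparison: the full model has k = 5 regressors,
# the restricted model drops q = 2 of them; n = 60 observations.
n, k, q = 60, 5, 2
sse_restricted, sse_unrestricted = 130.0, 100.0   # assumed sums of squared errors

numerator = (sse_restricted - sse_unrestricted) / q   # df = q
denominator = sse_unrestricted / (n - k - 1)          # df = n - k - 1
f_stat = numerator / denominator
print(round(f_stat, 2))  # 8.1 -> compare with the F critical value at (q, n - k - 1) df
```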

18
Q

What are the hypotheses for the F-statistic in nested models?

A

Null Hypothesis (H₀): The coefficients of the removed predictors are zero.
i.e. useless

Alternative Hypothesis (H₁): At least one of the removed predictors has a non-zero coefficient.
i.e. not useless: at least one of those variables is pulling its weight in explaining the variation in Y

19
Q

What is the conclusion if F statistic > critical value?

A

Reject null, test is statistically significant.

Full model provides a significantly better fit than the nested model.

The relative decrease in SSE due to the inclusion of the q additional variables is statistically justified, i.e. they improve the model.

20
Q

What is the purpose of the F-statistic in assessing overall model fit?

A

To compare the fit of the regression model to a model with no predictors i.e. no slope coefficients.

20
Q

How do you calculate the F-statistic for overall model fit?

A

F = MSR / MSE, calculated from the unrestricted model, where:

MSR = SSR / k
MSE = SSE / (n − k − 1)
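
A quick worked example (the ANOVA values n = 50, k = 3, SSR = 80 and SSE = 20 are assumed for illustration):

```python
# Hypothetical ANOVA values: n = 50, k = 3, SSR = 80, SSE = 20
n, k = 50, 3
ssr, sse = 80.0, 20.0

msr = ssr / k              # mean square regression
mse = sse / (n - k - 1)    # mean square error
f_stat = msr / mse
print(round(f_stat, 2))    # 61.33 -> compare with the F critical value at (k, n - k - 1) df
```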

21
Q

What are the hypotheses for the F-statistic in overall model fit?

A

Null Hypothesis (H₀): The model with no predictors fits the data as well as the regression model.
i.e. the slope coefficients on all X variables in the unrestricted model = 0

Alternative Hypothesis (H₁): The regression model provides a better fit than the model with no predictors.

21
Q

What does a significant F-statistic indicate in overall model fit?

A

It indicates that the regression model explains a substantial portion of the variance in the response variable.

22
Q

What are 4 model misspecifications?

A
  1. omitting a variable that should be included
  2. failing to transform a variable (e.g. for linearity)
  3. inappropriate scaling of a variable
  4. incorrectly pooling data
23
Q

Why might a variable need to be transformed for linearity? What assumptions may be violated?

A

A variable might need to be transformed to ensure the relationship between the predictor and the response variable is linear, e.g. converting market cap to the log of market cap.

Potential violation: heteroskedasticity in the residuals.

Explanation: Transforming variables (e.g. using logarithms or square roots) can help linearize relationships, making the model more accurate and easier to interpret. Non-linear relationships can lead to poor model fit and misleading results.
23
Q

What is the consequence of omitting a variable that should be included in a model? What assumptions may be violated?

A

Omitting a variable can lead to model misspecification, resulting in biased and inconsistent estimates.

Potential violations: serial correlation or heteroskedasticity in the residuals.

Explanation: When a relevant variable is omitted, the model fails to account for its effect, which can distort the relationships between the included variables and the response variable. This can lead to incorrect conclusions and predictions.
23
Q

What is the impact of inappropriate scaling of a variable? What assumptions may be violated?

A

Inappropriate scaling can affect the model's accuracy and interpretability, e.g. using the number of free-float shares rather than the proportion.

Potential violations: heteroskedasticity/multicollinearity.

Explanation: Variables should be scaled appropriately to ensure they contribute correctly to the model. Incorrect scaling can lead to a disproportionate influence of certain variables, skewing the results and making the model less reliable.
24
Q

What does incorrectly pooling data mean? What assumptions may be violated?

A

Incorrectly pooling data refers to combining data from different regimes or contexts without accounting for their differences, e.g. the difference between pre- and post-COVID/GFC periods.

Potential violations: serial correlation or heteroskedasticity in the residuals.

Explanation: Pooling data from different regimes can lead to misleading results, as the underlying relationships may differ across contexts. It is important to account for these differences to ensure the model accurately reflects the data.
24
Q

What is heteroskedasticity? How many types are there?

A

Heteroskedasticity occurs when the variance of the errors in a regression model is not constant.

There are two types: conditional and unconditional. Conditional heteroskedasticity is the problematic type because the error variance is related to the independent variables.
25
Q

What is the effect of heteroskedasticity on regression output? + the effect on financial data

A

T- and F-stats (hypothesis tests and confidence intervals) become unreliable. Slope coefficient estimates are not affected; however, the standard errors become unreliable.

For financial data, the standard errors are most likely understated and the t-stats inflated (too high, causing Type I errors).

Explanation: When heteroskedasticity is present, the ordinary least squares (OLS) estimates remain unbiased, but they are no longer efficient.
26
Q

What does heteroskedasticity look like on a graph?

A

The residuals do not form a constant band around zero: their spread widens (fans out) or narrows as the independent variable or fitted value increases, rather than being randomly scattered with constant variance.
27
Q

How do we detect heteroskedasticity?

A
  • Scatter diagram: plot the residuals against each independent variable and against time. Residuals should be randomly distributed around the X variable; if, for example, the error term gets larger as the variable gets larger, heteroskedasticity is present.
  • Breusch-Pagan test: regress the squared residuals on the X variables and test the resulting R2 (do the independent variables explain a significant part of the variation in the squared residuals?).
  • H0 = no heteroskedasticity.
  • Chi-square test: BP = n × R2 of the residual regression (with k df).
  • If BP > critical value, reject the null and conclude you have a problem.
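
A minimal numpy sketch of the Breusch-Pagan statistic (the breusch_pagan helper and the simulated data are illustrative assumptions, not from the reading):

```python
import numpy as np

def breusch_pagan(residuals, X):
    """BP = n * R^2 from regressing the squared residuals on the X variables."""
    n = len(residuals)
    Z = np.column_stack([np.ones(n), X])           # intercept + independent variables
    e2 = residuals ** 2
    beta, *_ = np.linalg.lstsq(Z, e2, rcond=None)  # auxiliary OLS regression
    fitted = Z @ beta
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    return n * r2

# Hypothetical residuals from a regression with k = 2 regressors,
# generated so that the error variance grows with the first regressor.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
residuals = rng.normal(size=100) * np.exp(0.8 * X[:, 0])

bp = breusch_pagan(residuals, X)
print(round(bp, 2))  # compare with the chi-square critical value with k = 2 df (about 5.99 at 5%);
                     # BP above the critical value -> reject H0 of no heteroskedasticity
```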
28
Q

What is the effect of serial correlation on regression output?

A

T- and F-stats (hypothesis tests and confidence intervals) become unreliable: estimates are inefficient and standard errors are biased. Slope coefficient estimates are not affected; however, the standard errors become unreliable.

  • Positive serial correlation: standard error is too low (t-stat too high).
  • Negative serial correlation: standard error is too high (t-stat too low).

Explanation: When serial correlation is present, the ordinary least squares (OLS) estimates remain unbiased, but they are no longer efficient. This means that the standard errors of the coefficients are incorrect, leading to unreliable hypothesis tests and confidence intervals.
28
Q

What is serial correlation?

A

Serial correlation, also known as autocorrelation, occurs when the residuals (errors) in a regression model are correlated across observations, violating the assumption of independent errors. e.g. if the residual in the current period is positive, the probability that the residual in the next period is also positive is greater than 50%.
29
Q

How do we detect serial correlation?

A

Serial correlation can be detected using graphical methods (e.g. residual/scatter plots) and statistical tests (e.g. Durbin-Watson test, Breusch-Godfrey test).

  • Durbin-Watson: tests one lag.
  • Breusch-Godfrey: tests several lags; the residuals are used as the Y variable and regressed against the initial regressors plus lagged residuals. F distribution with p (numerator) and n − p − k − 1 (denominator) degrees of freedom.
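
A hand-rolled sketch of the Durbin-Watson statistic (the AR(1) residual series below is simulated purely for illustration):

```python
import numpy as np

def durbin_watson(residuals):
    """DW for one lag: near 2 suggests no serial correlation, well below 2 suggests
    positive serial correlation, well above 2 suggests negative serial correlation."""
    e = np.asarray(residuals)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual series with positive serial correlation (AR(1), rho = 0.7)
rng = np.random.default_rng(1)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal()

print(round(durbin_watson(e), 2))  # well below 2 -> evidence of positive serial correlation
```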
30
Q

How do you correct for serial correlation/heteroskedasticity?

A

Use robust standard errors:
  • Newey-West corrected standard errors for serial correlation
  • White-corrected standard errors for conditional heteroskedasticity
30
Q

What is multicollinearity?

A

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to isolate their individual effects on the response variable. e.g. if 4 friends are pushing a car when it breaks down, who is doing most of the work?

Explanation: When predictors are highly correlated, it becomes challenging to determine the unique contribution of each predictor to the response variable, leading to issues in the regression analysis.
31
Q

What is the effect of multicollinearity on regression output?

A

Multicollinearity can lead to inflated standard errors and unreliable coefficient estimates. It reduces t-stats and increases the chance of Type II errors: the X variables seem less valuable because they are sharing credit with other variables, so the t-stats are artificially small and the variables look falsely unimportant.

Explanation: High correlation among predictors can cause instability in the coefficient estimates, making them sensitive to small changes in the model. This results in large standard errors and unreliable hypothesis tests.
31
Q

How do we detect multicollinearity?

A
  • A significant F-stat (low p-value, < 0.05) and a high R2, but all individual t-stats/p-values insignificant (p-value > 0.05).
  • High correlation between X variables (k = 2 case only).
  • High Variance Inflation Factor (VIF), where VIF = 1/(1 − R2) from regressing that X variable on the others:
    VIF = 1: no correlation
    VIF > 5: further investigation
    VIF > 10: SERIOUS multicollinearity that needs correction
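
A minimal numpy sketch of the VIF calculation (the vif helper and the simulated regressors are illustrative assumptions):

```python
import numpy as np

def vif(X, j):
    """VIF for regressor j: regress X_j on the other X variables, then 1 / (1 - R^2)."""
    n = X.shape[0]
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(n), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# Hypothetical regressors where x2 is nearly a copy of x1 -> severe multicollinearity
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 show VIF well above 10; x3 stays near 1
```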
32
Q

How do we correct for multicollinearity?

A
  • Remove one or more regression variables.
  • Use a different proxy for one of the variables, e.g. liquidity can be proxied by the bid-ask spread instead of free float.
  • Increase the sample size, which is more statistically robust.
33
Q

What two types of observations can influence regression results?

A
  • High-leverage point: an observation with an extreme value of an independent (X) variable.
  • Outlier: an observation with an extreme value of the dependent (Y) variable.
34
Q

What is leverage? When is an observation considered influential?

A

Leverage is a standardised measure of the distance of observation j from the mean and takes on a value between 0 and 1.

An observation is potentially influential if its leverage is greater than 3 × (k + 1)/n, where k is the number of independent variables.
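
A quick worked example of the leverage rule of thumb (k = 3 and n = 50 are assumed for illustration):

```python
# Hypothetical regression: k = 3 independent variables, n = 50 observations
k, n = 3, 50
threshold = 3 * (k + 1) / n
print(threshold)  # 0.24 -> an observation with leverage above 0.24 is potentially influential
```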
35
Q

What are studentised residuals? How do they work?

A

A measure for identifying outliers.

Delete observation j and estimate the regression model using the remaining n − 1 observations; estimate y-hat and e_j, then calculate the studentised residual for each observation in the dataset.

The critical value acts as a ceiling: if the absolute value of the studentised residual is greater than the t critical value, REJECT (the sign does not matter; it is a two-tailed t-test). A rejected observation is an outlier.

Degrees of freedom for the critical value: n − k − 2.
36
Q

What are Dummy Variables?

A

Purpose: They allow categorical variables (like gender, region, or type) to be included in regression models.

Representation: Each category is represented by a binary variable (0 or 1).
37
Q

How many dummy variables are used and why?

A

For a categorical variable with n categories, use n − 1 dummy variables, to avoid multicollinearity.
38
Q

What is an intercept dummy? How does it work?

A

Purpose: Adjusts the intercept of the regression model for different categories of a categorical variable. D equals either 0 or 1: if 0, the whole term = 0; if 1, the whole term = b1.

How it works: Each dummy variable shifts the intercept of the regression line for its respective category. The coefficients of intercept dummy variables represent the difference in the intercept for each category compared to the reference group.
39
Q

What is a slope dummy? How does it work?

A

Purpose: Adjusts the slope of the regression model for different categories of a categorical variable. DX captures the change in the slope on account of the dummy variable.

How it works: Each dummy variable interacts with a continuous predictor to change the slope of the regression line for its respective category. The coefficients of slope dummy variables represent the difference in the slope for each category compared to the reference group.
40
Q

What is a logistic regression model?

A

A logistic regression model is a statistical method used to model the relationship between a binary dependent variable (e.g. failure/success or increase/decrease) and one or more independent variables.

It estimates the probability (via the log odds) of an event based on the logistic distribution.
41
Q

What is the formula to convert the probability of an event to odds?

A

odds = p / (1 − p), where p is the probability of the event. The logit model works with the natural log of these odds (the log-odds).
42
Q

How do you calculate the probability once you have the estimated Y variable?

A

The estimated Y is the log-odds, so p = e^Y / (1 + e^Y), equivalently p = 1 / (1 + e^(−Y)).
43
Q

How should you interpret the slope coefficients for logit models?

A

The coefficients (beta) represent the change in the log-odds of the outcome for a one-unit increase in the X variable.
44
Q

Interpret this model predicting the probability of passing an exam based on study hours and attendance:

A
  • Intercept (b0 = -2): the log-odds of passing the exam when study hours and attendance are zero.
  • Study hours (b1 = 0.05): for each additional hour of study, the log-odds of passing the exam increase by 0.05. Odds ratio: e^0.05 ≈ 1.051, so each additional hour of study multiplies the odds of passing by approximately 1.051.
  • Attendance (b2 = 0.3): for each additional unit of attendance, the log-odds of passing the exam increase by 0.3. Odds ratio: e^0.3 ≈ 1.35, so each additional unit of attendance multiplies the odds of passing by approximately 1.35.
  • Positive coefficient: indicates an increase in the log-odds (and thus the odds) of the outcome. Negative coefficient: indicates a decrease in the log-odds (and thus the odds) of the outcome.
45
Q

What is pseudo R2 used for?

A

To evaluate competing models with the same dependent variable. A higher value = a better fit.
46
Q

How do you work out the probability given the coefficient of the intercept?

A

Raising e to the coefficient converts the log-odds to odds; odds / (1 + odds) then converts the odds to a probability.
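
A small sketch of these conversions (the log-odds value 0.4 is assumed purely for illustration):

```python
import math

# Hypothetical logit output: estimated log-odds (Y hat) = 0.4
log_odds = 0.4

odds = math.exp(log_odds)   # log-odds -> odds
prob = odds / (1 + odds)    # odds -> probability (equivalently 1 / (1 + e^(-log_odds)))
print(round(odds, 3), round(prob, 3))  # 1.492 0.599

# Going the other way: probability -> odds -> log-odds
p = 0.599
print(round(p / (1 - p), 3), round(math.log(p / (1 - p)), 3))  # 1.494 0.401
```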
47
Q

For logit regression, when do you reject the null?

A

When the p-value is smaller than the significance level (alpha). This means the coefficient is statistically significant.
48
Q

What is a likelihood ratio? How does it work?

A

A likelihood ratio is a statistical measure used to compare the goodness of fit between two models. In the context of regression, it helps determine whether a more complex model significantly improves the fit of the data compared to a simpler model.

LR = −2 × (log likelihood of the restricted model − log likelihood of the unrestricted model)

Chi-square distribution with q df, where q = the number of variables omitted in the restricted model; one-tailed test.

Reject the null if the chi-square statistic > critical value: the omitted variables are not useless, their coefficients are far from 0 and they do add explanatory power.

The log-likelihood metric is negative; higher values (closer to zero) = a better-fitting model.
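
A tiny worked example of the test (the log-likelihood values and q are assumed for illustration):

```python
# Hypothetical log-likelihoods: restricted model (q = 2 variables dropped) vs. unrestricted model
ll_restricted, ll_unrestricted = -112.0, -105.0
q = 2

lr = -2 * (ll_restricted - ll_unrestricted)   # = 14.0
# The chi-square critical value with q = 2 df at the 5% level is about 5.99;
# 14.0 > 5.99, so reject H0: the dropped variables add explanatory power.
print(lr)
```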