Quantitative Methods Flashcards
1.1 Basics of Multiple Regression and Underlying Assumptions
– describe the types of investment problems addressed by multiple linear regression and the regression process
– formulate a multiple linear regression model, describe the relation between the dependent variable and several independent variables, and interpret estimated regression coefficients
– explain the assumptions underlying a multiple linear regression model and interpret residual plots indicating potential violations of these assumptions
Uses of Multiple Linear Regression
Multiple linear regression is a statistical method used to analyze relationships between a dependent variable (explained variable) and two or more independent variables (explanatory variables). This method is often employed in financial analysis, such as examining the impact of GDP growth, inflation, and interest rates on real estate returns.
1- Nature of Variables:
– The dependent variable (Y) is the outcome being studied, such as rate of return or bankruptcy status.
– The independent variables (X) are the predictors or explanatory factors influencing the dependent variable.
2- Continuous vs. Discrete Dependent Variables:
– If the dependent variable is continuous (e.g., rate of return), standard multiple linear regression is appropriate.
– For discrete outcomes (e.g., bankrupt vs. not bankrupt), logistic regression is used instead.
– Independent variables can be either continuous (e.g., inflation rates) or discrete (e.g., dummy variables).
3- Forecasting Future Values:
– Regression models are built to forecast future values of the dependent variable based on the independent variables.
– This involves an iterative process of testing, refining, and optimizing the model.
– A robust model must satisfy the assumptions of multiple regression, ensuring it provides a statistically significant explanation of the dependent variable.
4- Model Validation:
– A good model exhibits an acceptable goodness of fit, meaning it explains the variation in the dependent variable effectively.
– Models must undergo out-of-sample testing to verify predictive accuracy and robustness in real-world scenarios.
5- Practical Applications:
– Regression models are widely used in finance, economics, and business to understand complex relationships and make informed decisions.
– For example, they can assess the drivers of asset performance, predict financial distress, or evaluate economic impacts on industries.
Key Steps in the Regression Process:
Define the Relationship:
– Begin by determining the variation of the dependent variable (Y) and how it is influenced by the independent variables (X).
Determine the Type of Regression:
– If the dependent variable is continuous (e.g., rates of return): Use multiple linear regression.
– If the dependent variable is discrete (e.g., bankrupt/not bankrupt): Use logistic regression.
Estimate the Regression Model:
– Build the model based on the selected independent and dependent variables.
Analyze Residuals:
– Residuals (errors) should follow normal distribution patterns.
– If they don’t, adjust the model to improve fit.
Check Regression Assumptions:
– Ensure assumptions like linearity, independence of residuals, and homoscedasticity (constant variance of residuals) are satisfied.
– If not, return to model adjustment.
Examine Goodness of Fit:
– Use measures like R^2 and adjusted R^2 to evaluate how well the independent variables explain the variation in the dependent variable.
Check Model Significance:
– Use hypothesis testing (e.g., p-values, F-tests) to assess the overall significance of the model.
Validate the Model:
– Determine if this model is the best possible fit among alternatives by comparing performance across different validation metrics.
Use the Model:
– If all criteria are met, use the model for analysis and prediction.
Iterative Nature:
If at any step the assumptions or fit are not satisfactory, analysts refine the model by adjusting variables, transforming data, or exploring alternative methodologies. This ensures the final model is both robust and accurate for predictive or explanatory purposes.
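Illustration (not from the curriculum): a minimal Python sketch of this workflow using statsmodels, with simulated data and hypothetical variable names (gdp, inf, rates, returns). It fits the model and then exposes the quantities the later steps examine: coefficients, residuals, goodness of fit, and overall significance.

```python
import numpy as np
import statsmodels.api as sm

# Simulated, hypothetical data: three explanatory variables and a return series
rng = np.random.default_rng(42)
n = 60
gdp, inf, rates = rng.normal(size=(3, n))
returns = 0.5 + 1.2 * gdp - 0.8 * inf + rng.normal(size=n)

X = sm.add_constant(np.column_stack([gdp, inf, rates]))  # adds the intercept column
model = sm.OLS(returns, X).fit()                          # estimate the regression

print(model.params)                        # b0, b1, b2, b3
print(model.rsquared, model.rsquared_adj)  # goodness of fit
print(model.f_pvalue)                      # overall model significance (F-test)
residuals = model.resid                    # used for the assumption checks that follow
```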
Basics of Multiple Linear Regression
1- Objective:
– The purpose of multiple linear regression analysis is to explain the variation in the dependent variable (Y), referred to as the sum of squares total (SST).
2- SST Formula:
SST = Σ_{i=1}^{n} (Yi - Y_bar)^2
– Where:
— Yi: Observed value of the dependent variable.
— Y_bar: Mean of the observed dependent variable.
3- General Regression Model:
Yi = b0 + b1X1i + b2X2i + … + bkXki + ei
– Where:
— Yi: Dependent variable for observation i.
— b0: Intercept, representing the expected value of Yi when all X values are zero.
— b1, b2, …, bk: Slope coefficients, which quantify the effect of a one-unit change in the corresponding X variable on Y, while holding other variables constant.
— X1i, X2i, …, Xki: Values of the independent variables for the i-th observation.
— ei: Error term for observation i, representing random factors not captured by the model.
4- Key Features:
– A model with k partial slope coefficients will include a total of k+1 regression coefficients (including the intercept).
– The intercept (b0) and slope coefficients (b1, b2, …, bk) together describe the relationship between the independent variables (X) and the dependent variable (Y).
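To make the notation concrete, here is a small sketch with made-up numbers: it builds a prediction from an intercept and two partial slope coefficients (k = 2, so k + 1 = 3 coefficients in total) and computes SST from a handful of hypothetical observations.

```python
import numpy as np

# Hypothetical estimated coefficients (k = 2 independent variables, k + 1 = 3 coefficients)
b0, b1, b2 = 2.0, 1.5, -0.7

# Prediction for one observation; b1 is the effect of a one-unit change in X1,
# holding X2 constant (and vice versa for b2)
X1, X2 = 3.0, 4.0
y_hat = b0 + b1 * X1 + b2 * X2

# SST: total variation of the observed dependent variable around its mean
Y = np.array([4.1, 3.8, 5.2, 6.0, 4.9])   # hypothetical observed values
SST = np.sum((Y - Y.mean()) ** 2)
print(y_hat, SST)
```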
Assumptions for Valid Statistical Inference in Multiple Regression
1- Linearity:
– The relationship between the dependent variable (Y) and the independent variables is linear.
2- Homoskedasticity:
– The variance of the residuals (e_i) is constant for all observations.
3- Independence of Observations:
– The pairs (X, Y) are independent of each other.
– Residuals are uncorrelated across observations.
4- Normality:
– Residuals (e_i) are normally distributed.
5- Independence of the Independent Variables:
– 5a. Independent variables are not random.
– 5b. There is no exact linear relationship (no multicollinearity) between any of the independent variables.
[Quiz - Regression Assumptions and Scatterplot Diagnostics]
1- Overview of Regression Assumptions and Visual Testing Using Scatterplots
– In multiple linear regression, verifying that the five key assumptions are met is essential to ensure the validity of the model.
– Specific scatterplots are used to test each assumption by plotting relevant variables or residuals.
2- Assumption 1: Linearity
– Variables to plot: — Y-axis: Residuals
— X-axis: Predicted values of the dependent variable
3- Assumption 2: Normality of Residuals
– Visualization methods: — Histogram of residuals
— Q-Q plot of residuals
4- Assumption 3: Homoskedasticity (Constant Variance of Residuals)
– Variables to plot: — Y-axis: Residuals
— X-axis: Predicted values of the dependent variable
5- Assumption 4: Independence of Observations (No Serial Correlation)
– Variables to plot (for each independent variable): — Y-axis: Residuals
— X-axis: Observed values of the independent variable (e.g., GDP, INF)
6- Assumption 5: Independence of the Independent Variables (No Multicollinearity)
– Visualization methods: — Correlation matrix of independent variables
— Pairwise scatterplots: —- Y-axis: One independent variable
—- X-axis: Another independent variable
Assessing Violations of Regression Assumptions (Graphical Approach)
1- Linearity:
– Respected: The scatterplot of the dependent variable against each independent variable shows a clear linear trend (straight-line pattern).
– Not Respected: The scatterplot shows a non-linear pattern (curved or irregular trends), indicating the need for transformations or additional variables to capture the relationship.
2- Homoskedasticity: Dots randomly below and above the 0 line (no patterns)
– Respected: A plot of residuals against predicted values shows points evenly scattered around a horizontal line, with no discernible pattern or clusters.
– Not Respected: The plot shows a cone-shaped or fan-shaped pattern, indicating that the variance of residuals increases or decreases systematically (heteroskedasticity).
3- Independence of Observations: Dots randomly below and above the 0 line (no patterns)
– Respected: Residuals plotted against independent variables or observation order show a flat trendline with no clustering or patterns.
– Not Respected: The plot reveals systematic trends, cycles, or clustering in residuals, suggesting autocorrelation or dependence between observations.
4- Normality:
– Respected: A Q-Q plot shows residuals closely aligned along the straight diagonal line, indicating they follow a normal distribution.
– Not Respected: Residuals deviate significantly from the diagonal line, particularly at the tails, indicating a non-normal distribution (e.g., “fat-tailed” or skewed residuals).
5- Independence of Independent Variables (Multicollinearity):
– Respected: Pairs plots between independent variables show no clear clustering or trendlines, suggesting low correlation.
– Not Respected: Pairs plots show strong clustering or clear linear relationships between independent variables, indicating multicollinearity that could distort regression results.
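These diagnostics can be produced with a few standard plots; the sketch below (simulated data, hypothetical names) shows residuals vs. predicted values (linearity and homoskedasticity), a Q-Q plot of residuals (normality), and a correlation matrix of the independent variables (multicollinearity).

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated, hypothetical data and a fitted model
rng = np.random.default_rng(0)
X_vars = rng.normal(size=(60, 2))
y = 1.0 + X_vars @ np.array([0.8, -0.5]) + rng.normal(size=60)
model = sm.OLS(y, sm.add_constant(X_vars)).fit()

# Linearity / homoskedasticity: residuals vs. predicted values
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey")               # points should scatter randomly around zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")

# Normality: Q-Q plot of residuals against the 45-degree line
sm.qqplot(model.resid, line="45")

# Multicollinearity: correlation matrix of the independent variables
print(np.corrcoef(X_vars, rowvar=False))
plt.show()
```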
Outliers can have a very large impact on a regression, but they are not necessarily bad.
1.2 Evaluating Regression Model Fit and Interpreting Model Results
– evaluate how well a multiple regression model explains the dependent variable by analyzing ANOVA table results and measures of goodness of fit
– formulate hypotheses on the significance of two or more coefficients in a multiple regression model and interpret the results of the joint hypothesis tests
– calculate and interpret a predicted value for the dependent variable, given the estimated regression model and assumed values for the independent variables
Coefficient of Determination (R^2)
Definition:
– R^2 measures the proportion of the variation in the dependent variable (Y) that is explained by the independent variables in a regression model.
– It reflects how well the regression line fits the data points.
Formula:
– R^2 = (Sum of Squares Regression) / (Sum of Squares Total)
Key Notes:
– R^2 can also be computed by squaring the Multiple R value provided by regression software.
– It is a common measure for assessing the goodness of fit in regression models but has limitations in multiple linear regression:
Limitations of R^2 in Multiple Linear Regression:
– It does not indicate whether the coefficients are statistically significant.
– It fails to reveal biases in the estimated coefficients or predictions.
– It does not reflect the model’s overall quality.
– High-quality models can have low R^2 values, while low-quality models can have high R^2 values, depending on context.
Adjusted R^2
Definition:
– Adjusted R^2 accounts for the degrees of freedom in a regression model, addressing the key problem of R^2: its tendency to increase when additional independent variables are included, even if they lack explanatory power.
– It helps prevent overfitting by penalizing models for unnecessary complexity.
Formula:
– Adjusted R^2 = 1 - [(Sum of Squares Error) / (n - k - 1)] ÷ [(Sum of Squares Total) / (n - 1)]
– Alternate Formula: Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] × (1 - R^2)
Key Implications:
– Adjusted R^2 is always less than or equal to R^2 because it penalizes models with more independent variables.
– Unlike R^2, Adjusted R^2 can be negative when the model explains very little of the variation in the dependent variable.
– Adjusted R^2 increases only if the added independent variable improves the model’s explanatory power significantly.
Relationship with t-statistic:
– If the absolute value of the new coefficient’s t-statistic is greater than 1, Adjusted R^2 will increase.
– If the absolute value of the new coefficient’s t-statistic is less than 1, Adjusted R^2 will decrease.
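A quick numerical check (made-up SSE, SST, n, and k) showing that the two formulas above are algebraically equivalent and that the result cannot exceed R^2:

```python
# Hypothetical inputs for illustration
SSE, SST = 42.0, 100.0
n, k = 50, 4

r2 = 1 - SSE / SST

adj_r2_form1 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))   # main formula
adj_r2_form2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)        # alternate formula

assert abs(adj_r2_form1 - adj_r2_form2) < 1e-12   # identical results
print(r2, adj_r2_form1)                            # adjusted R^2 <= R^2
```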
[Quiz - Impact of Adding Variables on R² and Adjusted R²]
1- Overview of the Concept
– In regression analysis, adding an independent variable generally increases R² because R² measures the proportion of variance in the dependent variable explained by the model.
– However, adjusted R² adjusts for the number of independent variables and only increases if the new variable improves explanatory power more than it reduces degrees of freedom.
– Therefore, adjusted R² is the preferred measure when comparing models with different numbers of predictors.
2- Application to the Case
– Model 2 introduces a new independent variable: NPL (nonperforming loans).
– According to the correlation matrix, NPL has a slightly negative correlation with EMR (dependent variable), but not a strong one.
– Since R² typically increases (or stays the same) with any additional variable, R² will most likely have increased in Model 2.
– However, Milner concludes that Model 2 fits the data worse than Model 1, indicating that the adjusted R² has decreased.
3- Interpretation
– The adjusted R² statistic declined due to the inclusion of a weakly explanatory variable (NPL) and the corresponding loss of degrees of freedom.
– This decline reflects that Model 2’s added variable does not provide enough explanatory power to justify its cost in terms of model complexity.
Key Takeaways
– Adding a variable cannot decrease R²; R² rises unless the new variable adds no incremental explanatory power at all.
– Adjusted R² accounts for changes in degrees of freedom and penalizes model complexity.
– If a new variable does not meaningfully improve explanatory power, adjusted R² will decrease, even if R² increases.
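A simulated illustration of these takeaways (hypothetical data and names, not the quiz's actual figures): adding a pure-noise regressor cannot lower R², but it usually lowers adjusted R².

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: dependent variable driven by two genuine factors
rng = np.random.default_rng(7)
n = 40
x1, x2 = rng.normal(size=(2, n))
y = 0.4 + 0.9 * x1 + 0.5 * x2 + rng.normal(size=n)

noise = rng.normal(size=n)   # a weak variable with no real explanatory power (analogous to NPL)

model_1 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
model_2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, noise]))).fit()

print(model_1.rsquared, model_2.rsquared)            # R^2 does not decrease
print(model_1.rsquared_adj, model_2.rsquared_adj)    # adjusted R^2 typically falls
```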
Analysis of Variance (ANOVA)
Purpose:
– ANOVA is a statistical method used in regression analysis to break down the variation in a dependent variable into two components: explained and unexplained variance.
Key Components:
– Sum of Squares Total (SST): The total variation in the dependent variable.
– Sum of Squares Regression (SSR): The portion of the variation explained by the regression model.
– Sum of Squares Error (SSE): The residual variation not explained by the model.
Relationship: SST = SSR + SSE.
Application:
– The data in an ANOVA table can be used to calculate R^2 and Adjusted R^2 for a regression model.
– Example: If SST = 136.428, SSR = 53.204, and SSE = 83.224, these values can be plugged into the formula for Adjusted R^2 to verify its accuracy.
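A sketch of that check in Python; n and k are not stated in the excerpt, so the values below are assumed purely for illustration.

```python
# ANOVA table values from the example above
SST, SSR, SSE = 136.428, 53.204, 83.224
assert abs(SST - (SSR + SSE)) < 1e-9                 # SST = SSR + SSE

n, k = 40, 3   # assumed sample size and number of independent variables (not given above)

r2 = SSR / SST                                        # proportion of variation explained
adj_r2 = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))    # degrees-of-freedom adjustment
print(r2, adj_r2)
```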
Limitations of Adjusted R^2:
– Unlike R^2, Adjusted R^2 cannot be interpreted as the proportion of variance explained.
– Adjusted R^2 does not indicate the significance of regression coefficients or the presence of bias.
– Both R^2 and Adjusted R^2 are limited in assessing a model’s overall fit.
What is Parsimonious?
A parsimonious model is one that includes as few independent variables as possible to adequately explain the variance of the dependent variable.
Measures of Parsimony in Regression Models
– A high-quality multiple linear regression model is parsimonious, meaning it includes as few independent variables as possible to adequately explain the variance of the dependent variable.
– Two key measures of parsimony are:
Akaike’s Information Criterion (AIC).
Schwarz’s Bayesian Information Criterion (SBC) (also referred to as the Bayesian Information Criterion, or BIC).
AIC and SBC Formulas:
AIC Formula:
– AIC = n * ln(SSE/n) + 2 * (k + 1)
SBC Formula:
– SBC = n * ln(SSE/n) + ln(n) * (k + 1)
Explanation of Components:
– n: Number of observations.
– k: Number of independent variables.
– SSE: Sum of squared errors.
Key Notes:
– Both AIC and SBC penalize for additional independent variables to discourage overfitting.
– These measures differ in the penalty term:
– AIC applies a penalty of 2 * (k + 1).
– SBC applies a penalty of ln(n) * (k + 1).
– Lower scores are better for both measures, as they indicate a better model fit relative to complexity.
Mathematical Differences and Practical Implications:
– SBC is more conservative than AIC because ln(n) grows larger than 2 for datasets with more than 7 observations. This means SBC imposes stricter penalties for adding variables.
– These scores are meaningless in isolation and should instead be used to compare models as independent variables are added, removed, or replaced.
Application of AIC and SBC:
– AIC is the preferred measure when a model is meant for predictive purposes.
– SBC is better suited for assessing a model’s goodness of fit for descriptive purposes.
Applications and Key Insights
These measures help compare models as independent variables are added, removed, or replaced.
Example:
– Model A has the lowest SBC score, indicating it is the most parsimonious model for fit.
– Model B has the lowest AIC score, suggesting it is best for forecasting.
Important Note: AIC and SBC scores are relative and should not be interpreted in isolation. They are used to compare models within the same dataset.
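A sketch of both calculations with assumed inputs (the SSE figures reuse the nested-model example later in this reading purely as placeholders). Note that statistical packages such as statsmodels report AIC/BIC using a log-likelihood convention that differs from these formulas by a constant, so only compare scores computed the same way.

```python
import numpy as np

def aic(n, k, sse):
    """AIC as defined above: n * ln(SSE/n) + 2 * (k + 1)."""
    return n * np.log(sse / n) + 2 * (k + 1)

def sbc(n, k, sse):
    """SBC (BIC) as defined above: n * ln(SSE/n) + ln(n) * (k + 1)."""
    return n * np.log(sse / n) + np.log(n) * (k + 1)

# Hypothetical comparison of two candidate models fit to the same data set
n = 40
print(aic(n, k=3, sse=83.224), sbc(n, k=3, sse=83.224))   # smaller model
print(aic(n, k=5, sse=81.012), sbc(n, k=5, sse=81.012))   # larger model
# Lower is better for both; SBC penalizes the extra variables more heavily since ln(40) > 2
```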
t-Test for Individual Coefficients
In regression analysis, a t-test is used to evaluate the statistical significance of individual slope coefficients in a multiple regression model. The goal is to determine if a given independent variable has a meaningful impact on the dependent variable.
Null and Alternative Hypotheses:
– To assess whether a slope coefficient is statistically significant, analysts test the following hypotheses:
– Null hypothesis (H₀): bᵢ = Bᵢ
(The slope coefficient is equal to the hypothesized value.)
– Alternative hypothesis (Hₐ): bᵢ ≠ Bᵢ
(The slope coefficient differs from the hypothesized value.)
– Default Hypothesis:
– Most often, Bᵢ = 0 is tested, which means the independent variable has no effect on the dependent variable.
t-Statistic Formula:
– The t-statistic is calculated using:
t = (bᵢ - Bᵢ) / s₍bᵢ₎
– Where:
– bᵢ = Estimated value of the slope coefficient.
– Bᵢ = Hypothesized value of the slope coefficient.
– s₍bᵢ₎ = Standard error of the slope coefficient.
[Quiz - Identifying the Highest t-Statistic Among Regression Variables]
1- Overview of the Concept
– The t-statistic is used to test the null hypothesis that a regression coefficient is equal to zero (i.e., that the variable has no effect). A higher t-statistic suggests stronger evidence that the variable is statistically significant in explaining the dependent variable.
2- Formula
– Name of formula: t-statistic for a regression coefficient.
– Formula: “t = (b̂ − b) ÷ sb”
– Where:
— b̂: Estimated slope coefficient.
— b: Hypothesized slope (typically 0).
— sb: Standard error of the estimated coefficient.
3- Application to Model 1
– Using the formula, the t-statistics are calculated as:
— For GDP: t = (3.2967 − 0) ÷ 1.4872 ≈ 2.2167
— For CPI: t = (2.2796 − 0) ÷ 0.6904 ≈ 3.3019
— For DCP: t = (0.1180 − 0) ÷ 0.0705 ≈ 1.6738
4- Interpretation
– CPI has the highest t-statistic (≈ 3.3019), indicating that among the three independent variables in Model 1, it is the most statistically significant predictor of equity market return (EMR).
Key Takeaways
– A high t-statistic means a variable is more likely to be a statistically significant predictor.
– In this case, inflation (CPI) is the strongest explanatory variable in the model.
– This suggests that inflation has the most reliable relationship with equity market returns, given the sample data.
Testing the t-Statistic
Comparison with Critical Value:
– The calculated t-statistic is compared with a critical value based on the desired level of significance (α) and degrees of freedom (df).
– Degrees of freedom = n - k - 1, where:
– n = Number of observations.
– k = Number of independent variables.
p-Value Approach:
– Statistical software often calculates a p-value, which indicates the lowest level of significance at which the null hypothesis can be rejected.
– For example:
– If the p-value is 0.03, the null hypothesis can be rejected at a 5% significance level but not at a 1% level.
Interpreting Results:
If the t-statistic’s absolute value exceeds the critical value or the p-value is smaller than the chosen significance level (e.g., α = 0.05):
– Reject H₀: The coefficient is statistically significant, suggesting the independent variable has an effect on the dependent variable.
If the t-statistic’s absolute value does not exceed the critical value or the p-value is larger than α:
– Fail to Reject H₀: The coefficient is not statistically significant.
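A sketch of this decision rule with scipy, using the GDP coefficient from the quiz above as input; the sample size and number of variables are assumed for illustration.

```python
from scipy import stats

# Hypothetical inputs: estimated coefficient, hypothesized value, standard error
b_hat, b_hyp, se = 3.2967, 0.0, 1.4872
n, k = 40, 3                      # assumed for illustration
df = n - k - 1

t_stat = (b_hat - b_hyp) / se                  # ≈ 2.217

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)        # two-tailed critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)      # two-tailed p-value

reject_h0 = abs(t_stat) > t_crit               # equivalently: p_value < alpha
print(t_stat, t_crit, p_value, reject_h0)
```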
F-Test for Joint Hypotheses
The F-test is used to evaluate whether groups of independent variables in a regression model collectively explain the variation of a dependent variable. Instead of testing each independent variable separately, it tests their combined explanatory power to ensure that they are collectively meaningful.
1- Concept Overview:
Purpose: To determine if adding a group of independent variables significantly improves the explanatory power of the regression model.
2- Comparison of Models:
Unrestricted Model: Includes all independent variables being tested.
Restricted Model: Excludes the variables being tested for joint significance.
These models are referred to as nested models because the restricted model is essentially a subset of the unrestricted model.
Hypotheses:
Null Hypothesis (H₀): The additional variables (SNPT and LEND in the example below) do not add explanatory power:
b_SNPT = b_LEND = 0
Alternative Hypothesis (Hₐ): At least one of the additional variables has a statistically significant impact:
b_SNPT ≠ 0 and/or b_LEND ≠ 0
F-Statistic Formula:
The F-statistic is calculated using:
F = [(SSE_R - SSE_U) / q] ÷ [SSE_U / (n - k - 1)]
Where:
SSE_R: Sum of squared errors for the restricted model.
SSE_U: Sum of squared errors for the unrestricted model.
q: Number of restrictions (variables excluded from the restricted model).
n: Number of observations.
k: Number of independent variables in the unrestricted model.
Steps to Perform the F-Test:
1- Compute the F-Statistic:
Calculate the difference in SSE between the restricted and unrestricted models.
Adjust for the number of restrictions (q) and the degrees of freedom in the unrestricted model.
2- Compare with Critical Value:
Find the critical F-value from the F-distribution table based on the significance level (e.g., 5%) and degrees of freedom (numerator: q, denominator: n - k - 1).
3- Decision:
If F > critical value, reject H₀. The additional variables collectively add explanatory power.
If F ≤ critical value, fail to reject H₀. The additional variables do not significantly improve the model.
Example Calculation:
Given data:
SSE_R = 83.224 (Restricted Model B)
SSE_U = 81.012 (Unrestricted Model D)
q = 2 (Two additional variables: SNPT and LEND)
n = 40 (Observations)
k = 5 (Independent variables in unrestricted model)
Compute the F-statistic:
F = [(83.224 - 81.012) / 2] ÷ [81.012 / (40 - 5 - 1)]
F = 0.464
Compare with critical value:
At 5% significance level, critical F-value = 3.276 (for q = 2 and df = 34).
Since F = 0.464 < 3.276, fail to reject H₀.
Conclusion:
The F-test shows that the additional variables (SNPT and LEND) do not significantly improve the explanatory power of the model. Thus, a more parsimonious model (Model B) may be preferred.
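The same nested-model test, sketched in Python with the figures from the example; scipy supplies the critical value. With fitted statsmodels results, a comparable test is also available via the unrestricted result's compare_f_test method.

```python
from scipy import stats

# Values from the example above
sse_r, sse_u = 83.224, 81.012     # restricted (Model B) and unrestricted (Model D) SSE
q = 2                             # number of restrictions (SNPT and LEND excluded)
n, k = 40, 5                      # observations; independent variables in the unrestricted model

df_denom = n - k - 1              # 34
f_stat = ((sse_r - sse_u) / q) / (sse_u / df_denom)   # ≈ 0.464

f_crit = stats.f.ppf(0.95, dfn=q, dfd=df_denom)       # ≈ 3.28 at the 5% level
print(f_stat, f_crit, f_stat > f_crit)                # False -> fail to reject H0
```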
General Linear F-Test
The General Linear F-Test is used to assess the overall significance of an entire regression model. Also known as the goodness-of-fit test, it evaluates the null hypothesis that none of the slope coefficients are statistically different from zero. This test determines whether the independent variables, collectively, explain a significant proportion of the variance in the dependent variable.
Formula for the F-Statistic:
F = Mean Square Regression (MSR) ÷ Mean Square Error (MSE)
MSR is the mean square regression (MSR = SSR ÷ k), which measures the explained variation.
MSE is the mean square error (MSE = SSE ÷ (n - k - 1)), which measures the unexplained variation.
or, equivalently: F = [SSR ÷ k] ÷ [SSE ÷ (n - k - 1)]