CFA L2 Quant Flashcards

1
Q

True or false: With financial instruments, we can typically use a one-factor linear regression model?

A

False, typically we need a multiple regression model.

2
Q

Multiple regression model

A

Regression models that allow us to see the effects of multiple independent variables on one dependent variable.
Ex: Can the 10-year growth in the S&P 500 (dependent variable, Y) be explained by the trailing dividend payout ratio of the index's stocks (independent variable 1, X1) and the yield curve slope (independent variable 2, X2)?

3
Q

What are the uses of multiple regression models?

A

- Identify relationships between variables.
- Forecast variables (ex: forecast CFs or the probability of default).
- Test existing theories.

4
Q

Residual (ε)

A

The difference between the observed Y value and the predicted Y value (ŷ).
ε = Y - ŷ = Y - (b0 + b1X1 + b2X2 + … + bnXn)

5
Q

P-value

A

The smallest level of significance at which the null hypothesis can be rejected. If the p-value is less than the significance level (α), the null hypothesis is rejected; if it is greater, we fail to reject it.

6
Q

If the significance level is 5% and the p-value is .06, do we reject the null hypothesis?

A

No, we fail to reject the null hypothesis.

7
Q

Q-Q plot

A

A plot used to compare a variable's distribution to a normal distribution. The residuals should lie along a diagonal line if they follow a normal distribution.

8
Q

True or false: For a standard normal distribution, only 5% of the observations should be beyond -2 standard deviations of 0?

A

False; only 5% of the observations should be beyond -1.65 standard deviations (roughly 2.5% of observations lie beyond -2 standard deviations).

9
Q

Coefficient of determination (R^2)

A

The percentage of the total variation in the dependent variable explained by the independent variables.
R^2 = SSR / SST OR (SST - SSE) / SST
Ex: An R^2 of 0.63 means that the model explains 63% of the variation in the dependent variable.

10
Q

Akaike’s information criterion (AIC)

A

Looks at multiple regression models and determines which has the best forecast.
Lower values indicate a better model.
Higher k values result in higher values of the criterion.
Calculation: AIC = n * ln(SSE/n) + 2(k+1)

11
Q

Schwarz’s Bayesian information criteria (BIC)

A

Looks at multiple regression models and determines which has the better goodness of fit.
Lower values indicate a better model.
Higher k values result in higher values of the criterion.
BIC imposes a higher penalty for overfitting than AIC.
Calculation: BIC = n * ln(SSE/n) + ln(n)*(k+1)
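
A minimal Python sketch of the AIC and BIC formulas from these two cards; the n, k, and SSE values are hypothetical, chosen only to show the comparison.

```python
import numpy as np

def aic(sse, n, k):
    """Akaike's information criterion: n*ln(SSE/n) + 2(k+1)."""
    return n * np.log(sse / n) + 2 * (k + 1)

def bic(sse, n, k):
    """Schwarz's Bayesian information criterion: n*ln(SSE/n) + ln(n)*(k+1)."""
    return n * np.log(sse / n) + np.log(n) * (k + 1)

# Hypothetical comparison: model A (k=3, SSE=120) vs. model B (k=5, SSE=110), n=60.
for label, sse, k in [("A", 120, 3), ("B", 110, 5)]:
    print(label, round(aic(sse, 60, k), 2), round(bic(sse, 60, k), 2))
```

Note how BIC's ln(n)*(k+1) term penalizes model B's extra variables more heavily than AIC's 2(k+1) term.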

12
Q

Joint F-Test

A

Measures how well a set of independent variables, as a group, explains the variation in the dependent variable. Put simply, it tests overall model significance.
Calculation: F = [ (SSE_restricted - SSE_unrestricted) / q ] / [ SSE_unrestricted / (n - k - 1) ]
q = # of variables excluded in the restricted model
Decision rule: reject the null hypothesis if F-stat > F critical value.
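
A minimal sketch of the joint F-stat calculation on this card; the SSE values, q, n, and k below are hypothetical.

```python
def joint_f_stat(sse_restricted, sse_unrestricted, q, n, k):
    """F = [(SSE_r - SSE_u) / q] / [SSE_u / (n - k - 1)]."""
    return ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))

# Hypothetical: dropping 2 of 4 variables raises SSE from 90 to 105 over n = 50 observations.
f_stat = joint_f_stat(sse_restricted=105, sse_unrestricted=90, q=2, n=50, k=4)
print(f_stat)  # reject H0 (excluded coefficients jointly zero) if this exceeds the F critical value
```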

13
Q

True or false: We could also use individual t-tests to evaluate which variables are significant?

A

True, but the F-test provides a more meaningful evaluation since there is likely some amount of correlation among independent variables.

14
Q

True or false: The F-test will tell us if at least one of the slope coefficients in a multiple regression model is statistically different from 0?

A

TRUE

15
Q

True or false: When testing the hypothesis that all the regression coefficients are simultaneously equal to 0, the F-test is always a two tailed test?

A

False, when testing the hypothesis that all the regression coefficients are simultaneously equal to 0, the F-test is always a one tailed test.

16
Q

True or false: We can use the regression equation to make predictions about the dependent variable based on forecasted values of the independent variable?

A

True, we can make predictions.

17
Q

Predicting the dependent variable from forecasted values of the independent variable:

A

ŷ = estimated intercept + (estimated slope coefficient for X1 × forecasted X1) + (estimated slope coefficient for X2 × forecasted X2) + …

18
Q

Functional form misspecifications (A regression suffers from misspecification of the functional form when the functional form of the estimated regression model differs from the functional form of the population regression function):

A

- Omission of important independent variables: may lead to biased and inconsistent regression parameters AND serial correlation or heteroskedasticity in the residuals.
- Inappropriate variable form (ex: you may need to take the natural log of a variable): may lead to heteroskedasticity in the residuals. This can happen if there is no linear relationship between the independent & dependent variables.
- Inappropriate variable scaling (ex: common-size financial statements): may lead to heteroskedasticity in the residuals or multicollinearity.
- Data improperly pooled: may lead to heteroskedasticity or serial correlation in the residuals.

19
Q

Heteroskedasticity

A

When the variance of the residuals is not constant across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample.

20
Q

Unconditional heteroskedasticity

A

When the heteroskedasticity is not related to the level of the independent variables, meaning the error variance does not change systematically with the values of the independent variables.
Although it's a violation of our assumptions, it is usually not a big problem.

21
Q

Conditional heteroskedasticity

A

Heteroskedasticity that is related to the level of the independent variables. Creates significant problems for statistical inference if not corrected properly.

22
Q

Effects of conditional heteroskedasticity

A

When conditional heteroskedasticity causes the error variance to be underestimated (the typical case):
- The standard errors of the coefficients become unreliable estimates by being underestimated. This leads to t-stats that are too large too often, and thus the null is rejected too often, a.k.a. type I error.
- For the F-test (MSR/MSE), MSE is underestimated, and therefore the F-stat is often too large, again leading to the null being rejected too often (type I error).
When the error variance is instead overestimated, the same errors occur in the opposite direction.

23
Q

How to detect conditional heteroskedasticity

A

There are two methods of detection: examining scatter plots of the residuals and using the Breusch-Pagan chi-square test.
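
A sketch of the second method using statsmodels' Breusch-Pagan implementation; the data are simulated (the error spread is made to grow with x, i.e., conditional heteroskedasticity is planted deliberately).

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data: residual spread grows with x -> conditional heteroskedasticity.
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, x)   # error standard deviation increases with x
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, X)
print(bp_stat, bp_pvalue)            # small p-value -> reject H0 of no conditional heteroskedasticity
```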

24
Q

Serial correlation/autocorrelation

A

When residuals are correlated with each other.
Poses serious problems when using time-series data.

25
Q

Positive serial correlation

A

When a positive residual in one time period increases the probability of observing a positive residual in the next time period. This type of correlation typically results in coefficient standard errors that are too small, causing t-stats or F-stats to be too large, which leads to type I errors.

26
Q

Effect of serial correlation on model parameters

A

If the regression model includes a lagged value of the dependent variable as an independent variable, serial correlation causes the estimates of the slope coefficients to be inconsistent. If there is no such lag, the estimates of the slope coefficients remain consistent.

27
Q

How to detect serial correlation?

A

First, we can use a scatter plot, which will reveal very dramatic cases. We can also use the Durbin-Watson (DW) statistic or a Breusch-Godfrey (BG) test. The DW statistic detects serial correlation at a single lag, whereas the BG test detects serial correlation at multiple lags.

28
Q

Breusch-Godfrey (BG) Test

A

The BG test regresses the residuals against the original set of independent variables, plus one or more additional variables representing lagged residuals.
Calculation: ε_t = a0 + a1X1t + a2X2t + … + p1ε_(t-1) + … + pnε_(t-n) + u_t
The null under the BG test is that there is no serial correlation (i.e., p1 = … = pn = 0).
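
A sketch of both detection tests using statsmodels on simulated data; the AR(1) errors are planted so that both tests should flag serial correlation.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# Simulated regression whose errors follow an AR(1) process (positive serial correlation).
rng = np.random.default_rng(1)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.7 * e[t - 1] + rng.normal()
x = rng.normal(size=200)
res = sm.OLS(1 + 0.5 * x + e, sm.add_constant(x)).fit()

print(durbin_watson(res.resid))                   # far below 2 -> positive serial correlation
bg_stat, bg_pvalue, _, _ = acorr_breusch_godfrey(res, nlags=4)
print(bg_pvalue)                                  # small p-value -> correlation at up to 4 lags
```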

29
Q

How to detect multicollinearity?

A

The most easily observable sign is when t-tests indicate none of the individual coefficients is significantly different from zero, but the F-test indicates that at least one coefficient is statistically significant and the R^2 is high. The independent variables are so highly correlated with each other that their individual effects are washed out, even though as a group they explain the variation in the dependent variable.
More formally, we compute a variance inflation factor (VIF) for each of the independent variables.

30
Q

Variance inflation factor (VIF)

A

Estimates how much multicollinearity has inflated the variance of an estimated regression coefficient. We start by regressing one of the independent variables (making it the dependent variable) against the remaining independent variables; Rj^2 is the R^2 of that regression.
VIF = 1 / (1 - Rj^2)
A VIF of 1 indicates the variable is not correlated with the other independent variables.
VIF values > 5 indicate the need for further investigation.
VIF values > 10 indicate high correlation.
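
A sketch using statsmodels' VIF helper, which implements the 1/(1 - Rj^2) calculation on this card; the design matrix is simulated with one deliberately redundant variable.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated design matrix where x2 is nearly a copy of x1 (multicollinearity).
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)    # highly correlated with x1
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF_j = 1 / (1 - Rj^2), where Rj^2 comes from regressing variable j on the others.
for j in range(1, X.shape[1]):               # skip the constant column
    print(f"x{j}: VIF = {variance_inflation_factor(X, j):.1f}")
```

Expect x1 and x2 to show very large VIFs (well above 10) and x3 to sit near 1.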

31
Q

How to correct multicollinearity?

A

The most common method to correct for multicollinearity is to omit one or more of the highly correlated independent variables. You can also use a proxy for one of the variables or increase the sample size.

32
Q

True or false: The coefficient on a variable in a multiple regression is the amount of return attributable to the variable?

A

TRUE

33
Q

True or false: Using actual instead of expected inflation will improve model specification?

A

False, using actual instead of expected inflation is likely to result in model misspecification.

34
Q

Leverage (in statistics)

A

This is a way of identifying extreme observations in the independent variables: a measure of the distance between the ith observation of an independent variable and the variable's sample mean. Leverage values lie between 0 and 1; the closer to 1, the greater the distance. If an observation's leverage is higher than three times the average leverage, 3*((k+1)/n), it is considered potentially influential.

35
Q

Studentized residuals

A

A way of identifying outliers. The studentized residual is the # of standard deviations a data point lies from the regression line. There are four main steps:
1. Estimate the regression model using the original sample, then delete one observation and re-estimate the regression. Repeat sequentially, deleting a different observation each time.
2. Compare the actual Y value of each deleted observation to its predicted value: ei = Y - ŷ.
3. Divide the residual from step 2 by its standard deviation to get the studentized residual: t = ei / s.
4. Compare the studentized residuals to critical values in a t-table with n - k - 2 df. Points that fall in the rejection region are termed outliers and are potentially influential.

36
Q

True or false: All outliers and high-leverage points are influential on the regression?

A

FALSE

37
Q

Cook’s Distance

A

A composite metric for evaluating whether a high-leverage point and/or outlier is influential. Cook's distance measures how much the estimated values of the regression change if a given observation is deleted from the sample.
Calculation: Di = [ ei^2 / ((k+1) * MSE) ] * [ hi / (1 - hi)^2 ]
hi = leverage value for the ith observation
ei = residual for the ith observation
Values > √(k/n) indicate the observation is highly likely to be an influential data point.
Generally, values > 1 indicate a highly influential point, whereas values > 0.5 indicate the need for further investigation.
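
A sketch using statsmodels' built-in Cook's distance (which implements the Di formula above) on simulated data with one planted high-leverage outlier.

```python
import numpy as np
import statsmodels.api as sm

# Simulated regression with one planted high-leverage outlier at index 0.
rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 4.0, -6.0
res = sm.OLS(y, sm.add_constant(x)).fit()

d, _ = res.get_influence().cooks_distance
k, n = 1, len(x)
print(np.where(d > np.sqrt(k / n))[0])   # observations above the sqrt(k/n) threshold
```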

38
Q

Dummy variables

A

Binary variables with only two options. When assigning numerical values, they can only be 0 and 1.
Always use (n - 1) dummy variables to avoid multicollinearity (i.e., 3 dummy variables for 4 quarters in a year).
Ex: True/false
Ex 2: On/off

39
Q

Dummy variables example:

A

EPS for four quarters: EPS = 1.25 + 0.75Q1 - 0.20Q2 + 0.10Q3
Question 1: What is the predicted EPS for Q4?
Answer 1: EPS = 1.25 + 0.75(0) - 0.20(0) + 0.10(0) = 1.25 (the omitted quarter shows up as the intercept)
Question 2: What is the predicted EPS for Q1?
Answer 2: EPS = 1.25 + 0.75(1) - 0.20(0) + 0.10(0) = 2.00
Question 3: What is the predicted EPS for Q1 of next year?
Answer 3: EPS = 1.25 + 0.75(1) - 0.20(0) + 0.10(0) = 2.00
This simple model uses average EPS for a given quarter over the past ten years as the forecast of EPS for that quarter of the following year.

40
Q

Logistic regression (logit) model

A

Estimates the probability of a DISCRETE binary outcome occurring.
Logit models assume the residuals have a logistic distribution, similar to a normal distribution but with fatter tails.
Logit models are nonlinear.
Calculation: ln(p / (1 - p)) = b0 + b1X1 + b2X2 + … + ε
The intercept is an estimate of the log odds when all independent variables equal zero.
The change in log odds when one of the independent variables changes depends on the curvature of the function.
Odds = e^ŷ
Probability = odds / (1 + odds) = 1 / (1 + e^(-ŷ))
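
A minimal sketch of converting a fitted logit model's log odds into odds and a probability; the intercept and slope below are hypothetical.

```python
import numpy as np

def logit_probability(yhat):
    """Convert predicted log odds (yhat = b0 + b1*X1 + ...) into a probability."""
    return 1 / (1 + np.exp(-yhat))

# Hypothetical fitted coefficients: intercept -2.0, one feature with slope 0.8, X1 = 3.
yhat = -2.0 + 0.8 * 3.0
print(np.exp(yhat))             # odds = e^yhat ≈ 1.49
print(logit_probability(yhat))  # probability ≈ 0.60
```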

41
Q

Likelihood ratio (LR) test

A

Similar to the joint F-test but for logit models: measures the goodness of fit of a logit model.
Calculation: LR = -2 * (log likelihood of the restricted model - log likelihood of the unrestricted model)
Recall, the restricted model has fewer independent variables.
Log-likelihood values are negative; values closer to 0 indicate a better-fitting model.
The LR test statistic follows a chi-square distribution.

42
Q

Time-series data

A

A set of observations taken periodically (most often at equal intervals) at different points in time.
A key feature of a time series is that new data can be added w/o affecting the existing data.
Trends can be found by plotting the observations on a graph.

43
Q

Linear trend

A

1/2 broad types of trend models. A time-series trend that can be graphed using a straight line, with time as the independent variable. A downward-sloping line indicates a negative trend, and vice versa for a positive trend.
Simplest form: Yt = b0 + b1(t) + ε

44
Q

Log-linear trend model

A

1/2 broad types of trend models. This is used to model positive or negative exponential growth. Recall, exponential growth is growth at some constant rate (positive or negative) and plots as a convex curve.
Simplest form: Yt = e^(b0 + b1(t)), where b1 is the constant rate of growth.
Rather than trying to fit the nonlinear data with a linear (straight line) regression, we take the natural log of both sides and transform it into a linear trend line called the log-linear model. This increases the predictive ability of the model.
Form: ln(Yt) = b0 + b1(t) + ε

45
Q

How to determine if a linear or log-linear trend model should be used?

A

Plot the data. A linear trend model may be used if the data points are equally distributed above and below the regression line (ex: inflation data is usually modeled with a linear trend model). If the data plots with a curved shape, use a log-linear trend model (ex: financial data such as stock indices and stock prices are often modeled with log-linear trend models).
If there is serial correlation, use an autoregressive model instead.

46
Q

True or false: For a time series model without serial correlation, the DW statistic should be approximately equal to 0?

A

False, for a time series model without serial correlation, the DW statistic should be approximately equal to 2. A DW that significantly differs from 2 suggests that the residuals are correlated.

47
Q

Autoregressive (AR) model

A

A time-series model that regresses the dependent variable against one or more lagged values of itself. Ex: a regression of a firm's sales against its sales in the previous month. Past values are used to predict the current value of the variable.
The DW test stat cannot be used to test for serial correlation in an AR model.
Simplest form: Xt = b0 + b1*X_(t-1) + … + bp*X_(t-p) + ε
Xt = value of the time series at time t
X_(t-1) = value of the time series at time t-1

48
Q

Covariance stationary

A

An AR model is covariance stationary if:
- Constant and finite expected value: the expected value is constant over time.
- Constant and finite variance: the volatility around the time series' mean is constant over time.
- The covariance between any two observations an equal distance apart is constant.

49
Q

True or false: A nonstationary time series can still produce meaningful results sometimes?

A

False, we need the time series to be covariance stationary.

50
Q

T-stat for residual autocorrelations in AR model:

A

t = (correlation of the residuals with the kth lagged residual) ÷ (1 ÷ √n)
Standard error = 1 ÷ √n
df = n - 2
n = # of observations

51
Q

Mean reversion

A

When a time-series has a tendency to move towards its mean. In other words, the dependent variable has a tendency to decline when the current value is above the mean and rise when the current value is below the mean. If a time series is at its mean reverting level, the model predicts the next value of the time series will be the same as its current value.

52
Q

Mean reverting level calculation

A

Xt = b0 ÷ (1 - b1)
The model will not be covariance stationary if b1 = 1.
If Xt > the mean-reverting level, the model predicts that X_(t+1) will be lower than Xt, and vice versa.
All covariance stationary time series have a finite mean-reverting level.
As forecasts become more distant, the forecast value moves closer to the mean-reverting level.
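
A minimal sketch of the b0 / (1 - b1) calculation; the AR(1) coefficients are hypothetical.

```python
def mean_reverting_level(b0, b1):
    """Mean-reverting level of an AR(1) model Xt = b0 + b1*X_(t-1); requires b1 != 1."""
    return b0 / (1 - b1)

# Hypothetical AR(1): Xt = 1.2 + 0.6*X_(t-1) -> level = 1.2 / 0.4 = 3.0.
# If Xt is above 3.0, the model predicts a decline toward 3.0, and vice versa.
print(mean_reverting_level(1.2, 0.6))
```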

53
Q

In-sample

A

Data that was used to develop a regression model.

54
Q

True or false: Financial and economic time series inherently exhibit some form of instability or nonstationarity.

A

True. Since financial/economic conditions are dynamic, the coefficients in one period may differ from those in another period. Models estimated over shorter time periods are usually more stable for this reason. When selecting a time-series sample, analysts should understand regulatory changes, changes to the economic environment, etc.; if there have been large changes, the model may not be accurate.

55
Q

True or false: There is a trade-off between statistical reliability in the long run and statistical stability in the short run?

A

True. A longer time period gives more statistical reliability, while a shorter time period gives more stable coefficient estimates.

56
Q

Random walk

A

When, in an AR model, the value of the dependent variable in one period is equal to the value of the series in the previous period plus a random error term.
Form: Xt = X_(t-1) + ε
(b0 = 0, b1 = 1)

57
Q

Random walk with a drift

A

The same concept as a random walk, but the intercept term is not equal to zero. Thus, the time series is expected to increase/decrease by the intercept (drift) term plus the error term.
Form: Xt = b0 + X_(t-1) + ε
(b1 = 1)

58
Q

True or false: A random walk with or w/o a drift is NOT covariance stationary?

A

True, random walks will always have a unit root which makes them not covariance stationary.

59
Q

Dickey-Fuller Test

A

A test we use on an AR model to determine if there's a unit root.
Start with Xt = b0 + b1*X_(t-1) + ε and subtract X_(t-1) from both sides:
Xt - X_(t-1) = b0 + (b1 - 1)*X_(t-1) + ε
Then test whether the new coefficient g = (b1 - 1) equals 0 using a modified t-test.
The null hypothesis is g = 0 (i.e., b1 = 1). If we fail to reject the null, the time series has a unit root and is nonstationary.
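
A sketch using statsmodels' augmented Dickey-Fuller test on simulated data: one planted random walk (unit root) and one stationary AR(1) for contrast.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Simulated series: a random walk (unit root) vs. a stationary AR(1).
rng = np.random.default_rng(4)
random_walk = np.cumsum(rng.normal(size=300))
stationary = np.zeros(300)
for t in range(1, 300):
    stationary[t] = 0.5 * stationary[t - 1] + rng.normal()

for name, series in [("random walk", random_walk), ("AR(1)", stationary)]:
    adf_stat, pvalue, *_ = adfuller(series)
    # H0: unit root. Expect to fail to reject for the random walk only.
    print(name, round(adf_stat, 2), round(pvalue, 3))
```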

60
Q

True or false: The Dickey-Fuller test uses the standard T distribution to find the critical values?

A

False, it has its own distribution to calculate the critical values.

61
Q

First differencing calculation

A

If the original time series has a unit root, then Xt - X_(t-1) = ε.
We create a new dependent variable by first differencing: Yt = Xt - X_(t-1), i.e., Yt = ε.
Stated in the form of an AR model: Yt = b0 + b1*Y_(t-1) + ε, with b0 = b1 = 0.
The differenced series is covariance stationary and can be modeled with an AR model.

62
Q

Seasonality

A

A characteristic of a time series in which the data experiences regular and predictable changes that recur every calendar year.
If seasonality is present, we MUST adjust the AR model in order for it to be correctly specified.

63
Q

How to correct for seasonality?

A

We add an additional lag of the dependent variable to the original model as another independent variable. The lag will be X_(t-4) in a quarterly model or X_(t-12) in a monthly model.
Calculation: ln(Xt) = b0 + b1 * ln(X_(t-1)) + b2 * ln(X_(t-4)) + ε

64
Q

True or false: for a T-test with seasonality, the null hypothesis is that there is seasonality?

A

False. H0: the seasonal lag coefficient = 0 (no seasonality present); Ha: the coefficient ≠ 0 (seasonality present).

65
Q

Autoregressive conditional heteroskedasticity (ARCH)

A

When the variance of the residuals in one period is dependent on the variance of the residuals in a previous period in an AR model. When ARCH exists, the standard errors of the coefficients and the hypothesis tests are invalid.

66
Q

How to predict future variance of errors in a time series model?

A

After we run an ARCH model, if we determine that a1 is significant, the future variance of the errors can be predicted using:
σ̂²_(t+1) = â0 + â1 * ε̂t²
We cannot predict future variance if a1 is not significant.

67
Q

Multiple time series

A

When more than one time series is run at the same time.
Ex: Yt = b0 + b1 * Xt + εt ↠ Yt and Xt are two different time series.
Either or both of these time series could be subject to nonstationarity.

68
Q

How to test for nonstationarity in a multiple time series model?

A

Run separate DF tests on each time series. If either time series is nonstationary, the coefficients will be unreliable.

69
Q

How to test whether two times series are cointegrated?

A

Regress one variable on the other: Yt = b0 + b1 * Xt + ε
Yt = value of time series 'Y' at time t; Xt = value of time series 'X' at time t.
Then the residuals are tested for a unit root using the Dickey-Fuller test with critical values calculated by Engle and Granger (the DF-EG test). If the test rejects the null hypothesis (H0: no cointegration), we conclude the error terms are covariance stationary and there is cointegration.

70
Q

Structural change

A

A significant shift in the plotted data at a point in time that essentially divides the data into two or more distinct patterns. If structural change is present, you must run two different models: one incorporating data before the date of the change and one using data after it.

71
Q

Machine learning (ML)

A

Filters useful info from substantial amounts of data by learning from known examples to find a pattern in the data. Machine learning acts without human intervention.

72
Q

Supervised learning

A

1/3 types of ML. We teach the model on known examples, then have it predict new instances. Supervised learning uses labeled data (data where the target variable is defined); it is used when the training data contains the ground truth, i.e., the target variable. Multiple regression is an example of supervised learning. Regression and classification are the two most common tasks: if the target variable is continuous, use a regression model; if it is categorical or ordinal, use a classification model, whose output groups observations.

73
Q

Deep learning networks (DLNs)

A

1/3 types of ML, used for complex tasks such as image recognition, natural language processing, etc. Deep learning is based on neural networks: a DLN is a type of NN with many hidden layers (at least two, but often more than 20). Deep learning is a self-teaching system.

74
Q

Reinforced learning algorithms

A

Algorithms in which an agent seeks to maximize a reward subject to constraints. RL does not rely on labeled data; these programs learn from their own prediction errors.

75
Q

Generalization

A

The extent to which a ML program is able to make out-of-sample predictions.

76
Q

Overfitting for ML

A

When a model fits the training data too closely, often because the data set has a large number of features (independent variables). Overfitting decreases the accuracy of out-of-sample forecasts: the training sample will have a high R^2, while the test sample will have a low R^2.

77
Q

True or false: Under supervised learning, a training sample is used to train an ML algorithm and a separate test sample is used to evaluate the model's ability to accurately predict new data?

A

TRUE

78
Q

How to measure the ability an ML program generalizes?

A

Create three nonoverlapping data sets:
- Training sample: in-sample data used to train the ML algorithm.
- Validation sample: out-of-sample data used to tune the training model.
- Test sample: out-of-sample data used to evaluate the final model.
A model that generalizes well should have a high R^2 on both in-sample and out-of-sample data.

79
Q

Bias errors

A

This is the in-sample error resulting from models with a poor fit.
Occurs when there is underfitting.

80
Q

Variance error

A

This is the out-of-sample error resulting from overfitted models that do not generalize well: the extent to which the ML model's results change in response to validation and test sample data.
Associated with overfitting.
Increases with model complexity.
Nonlinear models tend to have high variance error.

81
Q

Base error

A

This is the out-of-sample error resulting from residual errors due to random noise- just randomness in the data. Because it reflects randomness, base error cannot be reduced by making the model more complex.

82
Q

Learning curve

A

Plots the accuracy rate in the test sample versus the size of the training sample. A ML model that generalizes well will show an improving accuracy rate as the sample size increases. The in-sample and out-of-sample error rates should converge toward the desired level as the sample size increases.

83
Q

In-sample accuracy rate calculation vs out-of-sample accuracy rate calculation vs base accuracy rate calculation

A

In-sample accuracy rate = 1 - bias error rate
Out-of-sample accuracy rate = 1 - variance error rate
Base accuracy rate = 1 - base error rate

84
Q

True or false: ML models with high bias error will not see the accuracy rates converge?

A

False, the accuracy rates will converge just far below the desired level.

85
Q

True or false: Models with high variance errors will see the accuracy rates of the in-sample data and out-of-sample data converge below the desired level?

A

False, only the in-sample accuracy rate will converge towards the desired level.

86
Q

How to minimize the effects of overfitting with an ML program?

A

Reduce model complexity and use cross-validation.

87
Q

Cross validation

A

An estimate of the out-of-sample error rate taken directly from the validation sample.

88
Q

Complexity reduction

A

A penalty imposed to exclude features that do not meaningfully contribute to out-of-sample prediction accuracy.

89
Q

Underfitting

A

When the ML algorithm fails to identify an actual relationship; occurs when the model is oversimplified.
R^2 will be low for both in-sample and out-of-sample data.
High bias error.
Linear functions are susceptible to underfitting.

90
Q

K-fold cross validation

A

A method for alleviating the holdout sample problem (the training set being reduced too much); the process also eliminates sampling bias. There are four steps:
1. Shuffle the data randomly.
2. Divide the data into k equal sub-samples.
3. Use k - 1 sub-samples as training samples, with the remaining sub-sample as the validation sample.
4. Repeat the process k times, rotating the validation sub-sample. The average of the k validation errors is taken as a reasonable estimate of the model's out-of-sample error.
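
A sketch of the four steps using scikit-learn's KFold on simulated data; the data set, model, and k = 5 are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical data set with 3 features.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1-2: shuffle, split into k folds
errors = []
for train_idx, val_idx in kf.split(X):                 # step 3: k-1 folds train, 1 fold validates
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print(np.mean(errors))   # step 4: average validation error estimates out-of-sample error
```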

91
Q

Least absolute shrinkage and selection operator (LASSO)

A

This is a popular penalized regression model. LASSO attempts to minimize the SSE plus a penalty equal to the sum of the absolute values of the slope coefficients. The penalty increases with the number of features, so there is a tradeoff between reducing SSE (by adding independent variables) and the penalty imposed. Investment analysts use LASSO to build parsimonious (few predictor variables) models.

92
Q

Regularization

A

A type of penalized regression. Forces the beta coefficients of nonperforming features towards zero. Regularization can be applied to non-linear models.

93
Q

Support Vector Machine (SVM)

A

A common supervised ML algorithm often used for text data. The model assumes the data is linearly separable; An SVM is a linear classification algorithm. An SVM attempts to find the optimal hyperplane that separates two sets of data (classes) by the max amount using n features.

94
Q

Soft margin classification

A

Handles misclassified observations in the training data in an SVM.

95
Q

K-nearest neighbor (KNN)

A

A common supervised ML algorithm. A new observation is classified by finding the k nearest (most similar) observations in the existing data set. If k = 5, the new observation is assigned to the category of the majority of its 5 nearest neighbors.

96
Q

Ensemble learning

A

A common supervised ML algorithm that combines the predictions from multiple models rather than a single model. The different models cancel out noise, resulting in a lower average error rate. There are two types of ensemble methods: aggregation of heterogeneous learners and aggregation of homogeneous learners. Ensemble learning typically produces more stable and accurate results than single models. It aims to decrease variance (bagging), decrease bias (boosting), and improve predictions (stacking).

97
Q

Aggregation of heterogeneous learners

A

Different algorithms are combined together through a voting classifier and each algorithm gets a vote. The answer with the most votes is the model we go with.

98
Q

Aggregations of homogeneous learners.

A

The same algorithm is used but on different training data. The different training data used by the same model can be derived through bootstrap resampling (a.k.a bagging).

99
Q

Random Forest

A

A common supervised ML algorithm: a variation of a classification tree in which a large # of classification trees are trained using bagged data from the same data set. A random subset of features is used in creating each tree, so every tree is different. This process mitigates overfitting and reduces noise from errors. A drawback of random forests is that the transparency of CART is lost.
Random forests can INCREASE the signal-to-noise ratio.

100
Q

Principal component analysis (PCA)

A

A common unsupervised ML algorithm. Problems w/ too much noise arise when there are excessive amts of features (high dimensionality). PCA seeks to reduce this excess noise by discarding excess features: it transforms the features' covariance matrix to reduce highly correlated features into a smaller # of uncorrelated composite features, called eigenvectors, which are linear combinations of the original features. Each eigenvector has an eigenvalue: the proportion of total variance in the data set explained by that eigenvector. The end product is an algorithm with lower dimensionality, which makes the model easier to train and interpret.

101
Q

Scree plot

A

A plot that shows the proportion of total variance explained by each of the principal components.

102
Q

Clustering

A

A common unsupervised ML algorithm. Clustering is the process of grouping observations into categories based on similar attributes (a.k.a cohesion). The two most common types of clustering are: K-means clustering and hierarchical clustering.

103
Q

Cohesion

A

Grouping observations into categories based on the observations’ similarities.

104
Q

Hierarchical clustering

A

1/2 main types of clustering that builds a hierarchy of clusters without any predefined # of clusters.

105
Q

Agglomerative clustering/ Bottom-up clustering

A

1/2 types of hierarchical clusters. This starts with one observation as its own cluster and then adds other similar observations to that group, thus forming another nonoverlapping cluster. In the end, all observations are merged into a single cluster.

106
Q

Neural networks (NNs)

A

Made up of layers of neurons. The first layer is the input layer (node layer), which receives the input. The final layer is the output layer. In between exists hidden layers. Neurons of each layer are connected to neurons of the next layer through channels. There may be multiple hidden layers. The multiple layers allow the NN to model complex nonlinear functions. NNs are an adaptive system that computers use to learn from their mistakes and improve continuously. A group of ML algorithms applied to problems with significant nonlinearity.

107
Q

Divisive clustering/ top-down clustering

A

1/2 types of hierarchical clusters. The algorithm starts with one giant cluster, and then it partitions that cluster into smaller and smaller clusters. In the end, each cluster contains only one observation.

108
Q

Summation operator

A

Each neuron contains a summation operator, which multiplies the inputs it receives by weights and sums them into a weighted average, then passes the result to the activation function. The activation function then generates an output value from that input.

109
Q

Backwards propagation

A

This is how the machine learns from its errors: the weights applied by the summation operators are adjusted as the algorithm learns from its prediction errors.

110
Q

Steps in a supervised/ traditional ML model:

A

1. Conceptualization of the problem
2. Data collection
3. Data preparation and wrangling: cleaning the data set and preparing it for the model.
4. Data exploration: feature selection and data analysis; evaluating the data set and determining the most appropriate way to configure it for model training.
5. Model training: determining which ML algorithm to use, using a training data set, and tuning the model.

111
Q

Steps in a unsupervised/ textual ML model:

A

1. Text problem formulation
2. Text curation: ensuring the quality of the data, for example by adjusting for bad or missing data.
3. Text preparation and wrangling
4. Text exploration
5. Model training

112
Q

Data cleansing

A

Reducing errors in raw data. Common errors include:
- Missing values
- Invalid values
- Inaccurate values
- Non-uniform values
- Duplicate observations

113
Q

Data wrangling

A

Prepping data for model use, including transforming and scaling. Data transformations include:
- Extraction
- Aggregation: consolidating two variables into one (using appropriate weighting)
- Filtration: removing irrelevant observations
- Selection: removing features not needed for processing
- Conversion of data of diverse types

114
Q

README Files

A

Contain info about how, what, and where the data is stored. Helps ensure validity.

115
Q

Metadata

A

Data that describes other data by providing info about one or more aspects of the data. Essentially a summary.

116
Q

Winsorization

A

A way researchers handle outliers. Instead of entirely excluding outliers, they substitute reasonable values in for them.

117
Q

Trimming

A

One way researchers exclude outliers: a trimmed mean excludes a certain portion of the highest and lowest values. For example, excluding the lowest 1% and highest 1% of all values.

118
Q

Normalization

A

1/2 common types of scaling. Scales variable values to between 0 and 1.
Sensitive to outliers.
Use this when trying to understand where a value lies within the data set.
Calculation: (Xi - Xminimum) ÷ (Xmaximum - Xminimum)
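
A minimal sketch of the min-max calculation above; the feature values are hypothetical, with one outlier planted to show the sensitivity this card mentions.

```python
import numpy as np

def normalize(x):
    """Min-max scaling: (Xi - Xmin) / (Xmax - Xmin), mapping every value into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical feature with one outlier: the outlier maps to 1.0 and squeezes
# the remaining values near 0, which is why normalization is outlier-sensitive.
x = np.array([10.0, 12.0, 11.0, 13.0, 50.0])
print(normalize(x))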

119
Q

Cleansed text is normalized using these steps:

A

1. Lowercasing. Ex: Dog ↠ dog
2. Removal of stop words: very common, unimportant words. Ex: the, is, and, etc.
3. Stemming: combining similar words into a single word. Ex: integrate, integration, integrating ↠ a common stem
4. Lemmatization: returning the base form of a word. Ex: saw ↠ see
5. Bag-of-words (BOW): the result of steps 1-4; all the collected words (tokens) are gathered w/o regard to order. If order doesn't matter, we can stop here.
6. N-grams: if ordering is important, we can create a two-gram that looks for two specific words that go together, a three-gram for three words, and so on.
7. Organize the BOW and n-grams into a document term matrix (DTM).

120
Q

Token (in text wrangling)

A

A word

121
Q

Black box approach to ML

A

ML models that give you a result without explaining how they get to their decision.

122
Q

True or false: In feature selection, we try to ONLY include the features that contribute to the model’s out-of-sample predictive power.

A

TRUE

123
Q

Feature extraction

A

When a feature is created from the data set.
Ex: creating a value for age using date-of-birth data.

124
Q

One-hot encoding (OHE)

A

A type of feature engineering: the process used to convert a categorical feature into dummy variables.

125
Q

Techniques of feature selection:

A

- Term frequency: the # of times a token appears in the dataset.
- Document frequency: the # of documents a token appears in ÷ the total # of documents.
- Chi-square test: ranks tokens by their usefulness to a certain class of info. Tokens with higher chi-square test stats occur more frequently in that class.
- Mutual information: a numerical value indicating the contribution of a token to a specific class. A token that appears mostly in one class has a value close to 1, whereas a token that appears a lot in all classes has a value close to 0.

126
Q

Techniques of feature engineering:

A

- Numbers: tokens of a standard length are converted into new tokens. Ex: 4-digit numbers converted into '#4'.
- N-grams
- Named entity recognition (NER): assign tokens a NER tag based on their context. Ex: Europe ↠ place; Google ↠ website.
- Parts of speech (POS): assign tokens a POS tag based on their language structure. Ex: Google ↠ NNP (proper noun); 2000 ↠ CD (cardinal #).

127
Q

Procedures before model training:

A

The researcher must define the objective(s) of the data analysis, identify useful data points, and conceptualize the model. Once an ML algorithm/method is selected, the researcher should specify the hyperparameters.

128
Q

Common model fitting errors:

A

- Small training samples.
- A low # of features in the model, which can lead to underfitting because the model doesn't have enough info to find patterns.
Feature selection is important to mitigate both underfitting and overfitting; feature engineering can reduce underfitting.

129
Q

Three tasks of model training:

A

1. Method selection: choosing the right ML algorithm considering supervised/unsupervised learning, the type of data, and the size of the data.
2. Performance evaluation
3. Tuning

130
Q

What type of ML algorithm do we use for text, numerical, and image data:

A

Text: SVMs and generalized linear models (GLMs)
Numerical: regression trees, CART methods, and classification methods
Image: neural networks and deep learning networks

131
Q

Techniques to measure model performance:

A

- Error analysis: errors in classification problems can be false positives (type I errors) or false negatives (type II errors). We build confusion matrices to tally them.
- Receiver operating characteristic (ROC)
- Root mean squared error (RMSE)

132
Q

Precision metric

A

A way to evaluate the fit of an ML algorithm: the ratio of true positives to all predicted positives. Use the precision metric when the cost of a type I error (false positive) is large.
Calculation: true positives ÷ (true positives + false positives)

133
Q

Recall metric/ true positive rate

A

A way to evaluate the fit of an ML algorithm: the ratio of true positives to all actual positives. Use when the cost of a type II error (false negative) is large.
Calculation: true positives ÷ (true positives + false negatives)

134
Q

F1 score

A

A way to evaluate the fit of an ML algorithm: the harmonic mean of precision and recall.
The higher, the better.
More appropriate than the model accuracy metric when there are class imbalances.
Calculation: (2 × precision × recall) ÷ (precision + recall)
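
A minimal sketch tying together the precision, recall, and F1 formulas from the last three cards; the confusion-matrix counts are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
p, r = precision(40, 10), recall(40, 20)
print(p, r, f1_score(p, r))   # 0.8, 0.667, ~0.727
```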

135
Q

Receiver operating characteristic (ROC)

A

A curve that plots the tradeoff between false positives and true positives. The true positive rate (recall) is plotted on the y-axis; the false positive rate is plotted on the x-axis. The area under the curve (AUC) ranges from 0 to 1: the closer to 1, the higher the model's predictive accuracy. An AUC of 0 means the model is never right; an AUC of 0.5 means it is right 50% of the time, i.e., just guessing. The more convex the curve, the higher its AUC.

136
Q

True or false: There is a tradeoff between bias error (associated with underfitting) and variance error (associated with overfitting)?

A

TRUE

137
Q

Fitting curve

A

A graph that plots error on the y-axis (in-sample/training sample error and out-of-sample/cross-validation sample error) against model complexity on the x-axis. The graph shows two curves: one for training error and one for cross-validation prediction error.

138
Q

Ceiling analysis

A

An evaluation and tuning of each component of the model.
Applied to complex models.

139
Q

What is the primary limitation of trend models?

A

The primary limitation of trend models is that they are not useful if the residuals exhibit serial correlation.

140
Q

True or false: The KNN is a parametric test?

A

False, it’s non-parametric: it makes no assumptions regrading the distribution of the data.

141
Q

What are LASSO models and regularization used for?

A

LASSO models are used to build parsimonious models and regularization is used for nonlinear models.

142
Q

What are SVMs used for?

A

Generates binary classifications, such as: classifying debt issuers into likely-to-default versus not-likely-to-default issuers, stocks-to-short versus not-to-short, and even classifying text (from news articles or company press releases) as positive or negative.

143
Q

What are KNNs used for?

A

Predicting bankruptcy, assigning a bond to a ratings class, predicting stock prices, and creating customized indices.

144
Q

What are CARTs used for?

A

Fraud detection in financial statements and selecting stocks/bonds.

145
Q

What are random forests used for?

A

Factor-based asset allocation and prediction models for the success of an IPO.

146
Q

True or false: NNs have an input layer node that consists of a summation operator and an activation function?

A

False, the hidden layer nodes (not the input layer nodes) each consist of a summation operator and an activation function; these nodes are where learning takes place.

147
Q

True or false: The coefficient on each dummy tells us about the difference in earnings per share between the respective quarter and the omitted quarter?

A

TRUE

148
Q

True or false: The F-statistic enables us to make conclusions about how several independent variables affect a dependent variable?

A

False, it only allows us to reject the hypothesis that all regression coefficients are zero and accept the hypothesis that at least one isn't.

149
Q

True or false: Serial correlation affects the consistency of regression coefficients?

A

FALSE (unless the model uses a lagged value of the dependent variable as an independent variable; see the earlier card on serial correlation and model parameters).