Chapter 5 - CLRM assumptions and testing Flashcards

1
Q

Recall the assumptions for classical linear regression model CLRM

A

1) The residuals have expected value zero: E[u_t] = 0
2) The residuals have constant variance: var(u_t) = sigma^2 < infinity (homoscedasticity)
3) The residuals are uncorrelated with one another: cov(u_i, u_j) = 0 for i != j (no autocorrelation)
4) The residuals are uncorrelated with the explanatory variables: cov(u_t, x_t) = 0
5) The residuals are normally distributed: u_t ~ N(0, sigma^2)

2
Q

Considering the assumptions, what do we want to understand about them?

A

We need to detect violations of the assumptions.

We need to understand what the consequences of a violation are for the CLRM estimates and the inference based on them

We need to know some classic cases that violate the assumptions

3
Q

What are the main outcomes of using a model even though the assumptions are violated?

A

1) The coefficient estimates can be wrong (biased)
2) The standard errors can be wrong
3) The distributions assumed for the test statistics are inappropriate

In short, both the estimates and any inference drawn from them can be unreliable.

4
Q

what do we call testing that concerns itself with checking the validity of the assumptions?

A

diagnostic test

5
Q

What alternative approaches do we have for carrying out such tests?

A

1) Lagrange multiplier (LM)
2) Wald

6
Q

briefly elaborate on LM testing

A

in this context, the LM test statistic follows a chi-squared distribution with degrees of freedom equal to the number of restrictions placed on the model

7
Q

elaborate briefly on Wald testing

A

The Wald version follows an F distribution with (m, T - k) degrees of freedom, where m is the number of restrictions, T the number of observations and k the number of regressors.

8
Q

what can we say about LM and Wald?

A

Asymptotically they are equivalent, but their results can differ in small samples.

9
Q

elaborate on the first assumption

A

E[u_t] = 0

This will never be violated as long as we include a constant term (intercept) in the regression.

However, if we force the line to go through the origin, we can get badly misleading results.

Basically, the slope has to compensate for the poor position of the line: it can end up nowhere near the true slope, but it is placed like that because it minimises the residual sum of squares given the missing intercept.

The fitted regression line can then end up being a worse fit than a simple horizontal line at the sample mean of y, which shows up as a negative R^2.

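Below is a minimal numeric sketch of this point in Python (the data-generating process, seed and variable names are illustrative assumptions, not from the card): when the true intercept is far from zero and we force the line through the origin, the fitted line can explain less than the sample mean does, so R^2 goes negative.

```python
# Sketch: regression forced through the origin fitting worse than the mean of y.
import numpy as np

rng = np.random.default_rng(8)
T = 100
x = rng.uniform(0, 5, T)
y = 10.0 - 2.0 * x + rng.normal(scale=0.5, size=T)   # true intercept (10) far from 0

b = (x @ y) / (x @ x)                 # OLS slope with no intercept: b = sum(x*y) / sum(x^2)
rss = np.sum((y - b * x) ** 2)        # residual sum of squares of the origin-constrained fit
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares around the sample mean
print(f"slope = {b:.2f}, R^2 = {1 - rss / tss:.2f}")  # R^2 comes out negative
```
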
10
Q

elaborate on detecting heteroscedasticity

A

It is difficult to use graphical methods, because one rarely knows the form the heteroscedasticity takes.

However, there are statistical tests.

Goldfeld-Quandt: split the total sample into two subsamples. The regression model is estimated separately on each subsample, and the residual variance s^2 of each is computed using the usual formula for the sample variance of a regression. The null hypothesis is that the two variances are equal.
The test statistic is the ratio of the two sample variances (conventionally the larger over the smaller). It is F-distributed.

The weakness of Goldfeld-Quandt is that it requires a sensible choice of split point. Typically the split is placed at a known structural event.
One can also remove a central portion of the observations around the split point to make the contrast between the two subsamples more evident.

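A minimal sketch of the Goldfeld-Quandt mechanics in Python (the simulated data, the even split and the variable names are illustrative assumptions, not part of the card):

```python
# Sketch: Goldfeld-Quandt test computed by hand with numpy/scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, k = 200, 2                        # observations and regressors (incl. intercept)
x = np.sort(rng.uniform(0, 10, T))
X = np.column_stack([np.ones(T), x])
u = rng.normal(scale=0.5 + 0.3 * x)  # error variance grows with x -> heteroscedastic
y = X @ np.array([1.0, 2.0]) + u

def resid_variance(y_sub, X_sub):
    """s^2 = RSS / (T - k) from an OLS fit on a subsample."""
    b, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
    resid = y_sub - X_sub @ b
    return resid @ resid / (len(y_sub) - X_sub.shape[1])

half = T // 2                         # split point; in practice chosen from a structural event
s2_1 = resid_variance(y[:half], X[:half])
s2_2 = resid_variance(y[half:], X[half:])

GQ = max(s2_1, s2_2) / min(s2_1, s2_2)        # ratio of residual variances, larger on top
df = half - k
p_value = 1 - stats.f.cdf(GQ, df, df)
print(f"GQ = {GQ:.2f}, p = {p_value:.4f}")    # small p -> reject equal variances
```
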
10
Q

what do we mean by heteroscedasticity?

A

Any situation in which the variance of the residuals is not constant. For instance, if the residuals grow in magnitude as one of the explanatory variables increases, the average error can still be zero while the error variance differs systematically with the value of that variable.

10
Q

what term is used to describe assumption 2?

A

Homoscedasticity

11
Q

elaborate on White’s test

A

White's test is a test for heteroskedasticity.

It is useful because it makes no assumptions about the form the heteroskedasticity takes.

Assume we have a regular linear regression.
We want to test whether var(u_t) = sigma^2 is constant.

Estimate the model. Get the residuals.

Then we create an auxiliary regression in which the squared residuals are the dependent variable. As independent variables we include the original regressors, their squares, their cross products, etc. The goal is to see whether movements in the (squared) residuals can be explained by the explanatory variables.

We could then use the F-test approach, but this requires estimating additional regressions. The LM approach is typically easier.

The LM approach to White's test exploits the fact that if one or more of the coefficients in the auxiliary regression are statistically significant, this shows up in that regression's R^2: it will be larger than the R^2 we would get if no coefficient were significant.

We obtain R^2 from the auxiliary regression and multiply it by the number of observations T. The statistic T*R^2 is chi-squared distributed with m degrees of freedom under the null that all the auxiliary slope coefficients are jointly zero.

So we want to see a low value of the statistic, because that indicates there is no evidence that the auxiliary R^2 is large, i.e. no evidence of heteroskedasticity.

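A minimal sketch of the LM (T*R^2) version of White's test (the data-generating process and variable names below are illustrative assumptions, not part of the card):

```python
# Sketch of White's test via the LM route, using numpy/scipy only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T = 300
x1 = rng.normal(size=T)
x2 = rng.normal(size=T)
X = np.column_stack([np.ones(T), x1, x2])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=1 + np.abs(x1), size=T)

# Step 1: original regression, keep the residuals.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ b

# Step 2: auxiliary regression of u^2 on levels, squares and cross products.
Z = np.column_stack([np.ones(T), x1, x2, x1**2, x2**2, x1 * x2])
g, *_ = np.linalg.lstsq(Z, u**2, rcond=None)
fitted = Z @ g
R2 = 1 - np.sum((u**2 - fitted) ** 2) / np.sum((u**2 - np.mean(u**2)) ** 2)

# Step 3: LM statistic T*R^2 ~ chi^2(m), m = auxiliary regressors excluding the constant.
m = Z.shape[1] - 1
LM = T * R2
p_value = 1 - stats.chi2.cdf(LM, m)
print(f"LM = {LM:.2f}, p = {p_value:.4f}")   # small p -> evidence of heteroskedasticity
```
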
12
Q

What happens to OLS if there is heteroskedasticity present?

A

The OLS estimators remain unbiased, but they are no longer efficient (no longer BLUE), and the usual standard errors are wrong, so inference can be misleading.

13
Q

what is assumption 3?

A

Assume no autocorrelation between residuals

14
Q

what tests do we have for autocorrelation in CLRM?

A

1) Durbin-Watson
2) Breusch-Godfrey

15
Q

elaborate on Durbin Watson

A

DW is a test for first-order autocorrelation only, i.e. it tests only the first lag.

The idea of Durbin-Watson is to use the residuals to form an auxiliary regression and check whether its coefficient is statistically significant. The null hypothesis is that the coefficient is 0.

u_t = rho * u_(t-1) + v_t

We are only testing whether rho = 0.

In practice we do not actually have to run this regression; the statistic can be computed directly from the residuals of the original regression, since DW = sum (u_t - u_(t-1))^2 / sum u_t^2, which is approximately 2(1 - rho_hat).

DW does not follow a standard statistical distribution. Instead, the decision is based on regions defined by upper and lower critical values.

DW only tests whether consecutive errors are related, which is rather limited.

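A minimal sketch of the DW statistic computed directly from a residual series (the AR(1)-style simulated residuals below are an illustrative assumption):

```python
# Sketch: Durbin-Watson statistic from residuals.
import numpy as np

rng = np.random.default_rng(2)
T = 250
u = np.zeros(T)
for t in range(1, T):                 # residuals with positive first-order autocorrelation
    u[t] = 0.7 * u[t - 1] + rng.normal()

# DW = sum (u_t - u_{t-1})^2 / sum u_t^2, roughly 2 * (1 - rho_hat)
dw = np.sum(np.diff(u) ** 2) / np.sum(u ** 2)
print(f"DW = {dw:.2f}")               # ~2 -> no autocorrelation, ~0 -> positive, ~4 -> negative
# statsmodels.stats.stattools.durbin_watson(u) should give the same number.
```
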
16
Q

DW is a test to see whether consecutive errors are related. Can we do better?

A

Yes, with Breusch-Godfrey.

In theory we could also use the DW idea but replace the single lag with lags of many orders; this is not practical, however.

Breusch-Godfrey is a joint test for autocorrelation at several lags at once. The steps are as follows (a sketch in code follows the list):

1) Estimate the model by OLS as usual and obtain the residuals.
2) Use the residuals to build an auxiliary regression. The dependent variable is the residual, and the explanatory variables are the lagged residuals, plus the intercept and the original explanatory variables from the regular regression; including the original regressors removes any dependence of the residuals on them.
3) Obtain R^2 from this auxiliary regression.
4) The statistic (T - r)*R^2, where T is the number of observations and r the order of lags tested, is chi-squared distributed with r degrees of freedom under the null of no autocorrelation.

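A minimal sketch of the Breusch-Godfrey steps above (the data, the lag order r = 3 and all variable names are illustrative assumptions):

```python
# Sketch: Breusch-Godfrey test with r lags, using numpy/scipy only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
T, r = 300, 3
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + rng.normal()          # autocorrelated errors
y = 1.0 + 2.0 * x + u

# Step 1: original OLS, keep residuals.
X = np.column_stack([np.ones(T), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
uhat = y - X @ b

# Step 2: auxiliary regression of u_t on intercept, original regressors and r lagged residuals.
lags = np.column_stack([uhat[r - j - 1:T - j - 1] for j in range(r)])  # uhat_{t-1} .. uhat_{t-r}
Z = np.column_stack([X[r:], lags])
v = uhat[r:]
g, *_ = np.linalg.lstsq(Z, v, rcond=None)
R2 = 1 - np.sum((v - Z @ g) ** 2) / np.sum((v - v.mean()) ** 2)

# Step 3: (T - r) * R^2 ~ chi^2(r) under the null of no autocorrelation.
BG = (T - r) * R2
p_value = 1 - stats.chi2.cdf(BG, r)
print(f"BG = {BG:.2f}, p = {p_value:.4f}")
```
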
17
Q

why do we multiply R^2 by (T - r) in Breusch-Godfrey, and not just by T as in White's test for heteroskedasticity?

A

Because when we construct lags up to order r, the first r observations are lost, so only T - r observations are actually available for the auxiliary regression.

18
Q

how do we deal with testing for normality of residuals?

A

The Bera-Jarque (BJ) test.

19
Q

elaborate on the Bera-Jarque test

A

It exploits the fact that the normal distribution is fully characterised by its first two moments: its skewness is always 0 and its kurtosis is always 3.

We define excess kurtosis as kurtosis minus 3. The test then tests the joint hypothesis that the skewness and the excess kurtosis of the residuals are both 0. If we cannot reject this, normality is assumed. An extreme value of the test statistic indicates that the observed skewness and kurtosis would be highly unlikely if the residuals really were normally distributed, which we take as evidence against normality.

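A minimal sketch of the Bera-Jarque statistic (the fat-tailed simulated residuals are an illustrative assumption):

```python
# Sketch: Bera-Jarque statistic computed from a residual series.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
u = rng.standard_t(df=4, size=500)        # fat-tailed residuals -> excess kurtosis

T = len(u)
S = stats.skew(u)                         # sample skewness
K = stats.kurtosis(u, fisher=False)       # sample kurtosis (normal value is 3)

# BJ = T/6 * (S^2 + (K - 3)^2 / 4), chi-squared with 2 degrees of freedom under normality.
BJ = T / 6 * (S**2 + (K - 3) ** 2 / 4)
p_value = 1 - stats.chi2.cdf(BJ, 2)
print(f"BJ = {BJ:.2f}, p = {p_value:.4f}")   # small p -> reject normality
# scipy.stats.jarque_bera(u) computes an equivalent statistic.
```
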
20
Q

what happens if we find that the residuals are not normally distributed?

A

It is not straightforward to know what to do.

If the sample is very large, non-normality matters little and we can use the model without much worry.

Sometimes, taking logs of the variables can help.

Other times, a few extreme outliers that are not really representative of the general pattern will cause the test to reject; removing these outliers can help the model.

In general, removal of outliers is dangerous. Perhaps the observation is not actually an outlier; we may simply lack data in that region. We usually say that removal is only justifiable if we have some evidence suggesting that the event was a genuine one-off.

21
Q

how can we remove outliers?

A

We can effectively remove an outlier by adding a binary (dummy) variable that equals 1 only for the outlying observation and 0 otherwise; this forces that observation's residual to zero.

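A minimal sketch of the dummy-variable trick (the data set and the position of the outlier are illustrative assumptions):

```python
# Sketch: "removing" an outlier with an observation-specific dummy variable.
import numpy as np

rng = np.random.default_rng(5)
T = 100
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=T)
y[40] += 15.0                                # inject one large outlier

dummy = np.zeros(T)
dummy[40] = 1.0                              # 1 only for the outlying observation

X = np.column_stack([np.ones(T), x, dummy])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
print("coefficients:", np.round(b, 3))                # dummy coefficient absorbs the outlier
print("residual at outlier:", round(resid[40], 10))   # forced to (numerically) zero
```
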
22
Q

elaborate on multicollinearity

A

It concerns an implicit assumption made when using OLS:

the explanatory variables are assumed not to be correlated with one another.

Multicollinearity refers to cases where there is a high degree of correlation between the explanatory variables.

23
Q

how do we know if variables are truly orthogonal to each other?

A

Run the regression, then remove a variable and run it again. If the remaining coefficients do not change from their first estimates, this indicates orthogonality.

24
Q

what happens when there is perfect multicollinearity?

A

1) There is an exact linear relationship between two or more variables.
2) The model cannot be estimated, because the X matrix is linearly dependent and (X'X) cannot be inverted. We would only have enough information to estimate as many parameters as there are linearly independent columns of X, not all of them.

25
Q

how do we measure/test for multicollinearity?

A

It is difficult. In practice we mainly investigate by inspecting the correlation matrix of the explanatory variables; high values indicate that multicollinearity may be present.

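A minimal sketch of that inspection (the simulated regressors and their names are illustrative assumptions):

```python
# Sketch: checking the correlation matrix of the regressors for multicollinearity.
import numpy as np

rng = np.random.default_rng(6)
T = 500
x1 = rng.normal(size=T)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=T)   # nearly collinear with x1
x3 = rng.normal(size=T)

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)          # pairwise correlations of the regressors
print(np.round(corr, 2))                     # an off-diagonal value close to 1 is a warning sign
```
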
26
Q

what happens if we have multicollinearity present but ignore it?

A

We might get a good R^2 value, but the standard errors will be high as well. This means the individual coefficients are likely not statistically significant. It arises from the difficulty of observing each variable's independent contribution.

27
Q

how do we deal with multicollinearity?

A

Transform the data into a better set of variables, for example by using principal component analysis (PCA).

28
Q

what does it mean when we say that "under these circumstances, the estimate of the coefficient will be biased"?

A

It simply means that the expected value of the estimator is not equal to the true value. So, in the context of linear regression coefficients, if the estimates are biased we are systematically missing the true values.

29
Q

what happens if we exclude an important variable from our model?

A

There are two ways this can go.
1) The excluded variable is correlated with some of the variables that remain. The coefficients of those remaining variables will then be biased, because they have to account for the missing variable's effect.
2) If the excluded variable is completely uncorrelated with the included ones, then only the constant term (the intercept) will be biased.

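A minimal sketch of case 1, omitted-variable bias (the data-generating process is an illustrative assumption): dropping x2, which is correlated with x1, pushes the bias into x1's coefficient.

```python
# Sketch: omitted-variable bias when the dropped regressor is correlated with a kept one.
import numpy as np

rng = np.random.default_rng(9)
T = 5000
x1 = rng.normal(size=T)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=T)   # x2 correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=T)

def ols(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

full = ols(y, np.column_stack([np.ones(T), x1, x2]))
short = ols(y, np.column_stack([np.ones(T), x1]))    # x2 omitted
print("full model coefficients :", np.round(full, 2))   # close to [1, 2, 3]
print("omitting x2             :", np.round(short, 2))  # x1 coefficient biased upward
```
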
30
Q

recall the formula for sample variance of regression

A

s^2 = RSS / (T - k)

31
Q

recall the formula for standard error

A

The variance of a coefficient estimator is given by var(b) = s^2 (X'X)^(-1). Note that this gives a full matrix; the variances of the individual coefficients are along the main diagonal, and their square roots are the standard errors.

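A minimal sketch of these two formulas in numpy (the simulated data set is an illustrative assumption):

```python
# Sketch: coefficient standard errors from var(b) = s^2 (X'X)^(-1).
import numpy as np

rng = np.random.default_rng(7)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=T)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (T - k)                 # sample variance of the regression
cov_b = s2 * np.linalg.inv(X.T @ X)          # full covariance matrix of the estimator
std_err = np.sqrt(np.diag(cov_b))            # standard errors on the main diagonal
print(np.round(b, 3), np.round(std_err, 3))
```
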
32
Q

what is important about Hendry's strategy?

A

The model must be consistent both with the data and with the theory. The specific-to-general approach does not really use theory in the correct way, and represents more of a "try things out" strategy.

33
Q

what is the encompassing principle?

A

A model should be able to explain everything that a rival model can. As a result, the chosen model should not be a subset of a better model.

34
Q

why general-to-specific?

A

I believe it is mostly about grounding the model in valid theory: include all the variables we believe are important, test for statistical validity, and then impose restrictions guided by the encompassing principle. The other approach is more path dependent and naturally inclined to test many more models, which risks falling into the trap where statistical adequacy is passed by chance.

35
Q

elaborate on the easiest testing for heteroskedasticity

A

The classic Goldfeld-Quandt test.

We split the sample into two subsamples based on a specific ordering of the data, typically time or the value of some other variable. The regression is then estimated separately on each subsample, and we find the residuals of both. The residuals are used to compute the residual variance for each subsample, which is the regression variance:

s^2 = u'u / (T - k)

Since the residuals are assumed normally distributed, the scaled residual sum of squares is chi-squared distributed; the degrees of freedom are T - k rather than T because k parameters have been estimated. We then take the ratio of the two regression sample variances to get an F-distributed statistic:

GQ = s_1^2 / s_2^2

The null is that the two variances are equal.

NOTE: it requires a good understanding of where to place the break point between the two subsamples.

36
Q

Elaborate on the second way to test for heteroskedasticity

A

White's test. It is sometimes considered better than Goldfeld-Quandt because it does not rely on break-point information.

Run the regular regression and find the residuals. Use the squared residuals as the dependent variable in a new regression, where the independent variables are the original regressors together with their squares and cross products. The goal is to capture relationships between the residuals' magnitude and the explanatory variables.

NB: the new regression predicts squared residuals, not the residuals themselves. There is a crucial reason for this: what we really want to predict with the new regression is the variance of the residuals. We want to see whether the variance of the residuals can be explained to any degree by the explanatory variables, their squares and cross products. Using the assumption that the expected value of the residual is 0, the variance formula reduces to var(u_t) = E[u_t^2] - (E[u_t])^2 = E[u_t^2], so the squared residual stands in for the variance.

We then have two options:
1) F-test (Wald)
2) Chi-squared (LM)

If we use the F-test, we need two regressions. One is the auxiliary regression just described for the variance of the residuals; the other is a regression containing only the constant (intercept) term. The idea is that if there is no heteroskedastic pattern, only the constant term provides any explanatory power and there is no real difference between the two models. We therefore test them against each other, with the constant-only model as the restricted model, making use of the fact that the residual sum of squares is chi-squared distributed with T - k degrees of freedom.

It is often easier to use the chi-squared (LM) variant. This involves taking R^2 from the auxiliary regression, multiplying it by the number of observations, and using the fact that T*R^2 is chi-squared distributed with m degrees of freedom, where m is the number of regressors in the auxiliary regression. The point is that R^2 should be low, because the auxiliary regression should not be able to explain anything beyond the average level.

37
Q

what happens if we ignore heteroskedasticity?

A

The OLS estimators are still unbiased, but they are no longer efficient.

38
Q

what does DW test for?

A

First-order autocorrelation in the residuals.

39
Q

elaborate on DW

A

The idea is to test for first-order autocorrelation in the residuals.

Consider a case where each residual is either +1 or -1. If the pattern is literally 1, -1, 1, -1, ..., then the squared difference between each residual and its first lag is 4, 4, 4, ... If the pattern is instead 1, 1, 1, 1, -1, -1, -1, the squared differences are 0, 0, 0, 4, 0, 0, ... And if we remove the hard break and say that each residual is very close to the previous one, the squared differences are all very small.

So there are basically two cases: 1) the next residual is repeatedly far from the current one (negative autocorrelation), and 2) the next residual is always close to the current one (positive autocorrelation). Both represent high autocorrelation; the no-autocorrelation scenario looks more like white noise.

The DW test statistic is

DW = ∑(u_t - u_(t-1))^2 / ∑ u_t^2

DW does not follow a standard distribution; instead we look up critical values based on the number of observations and the number of regressors.

NB: if DW lies within [d_U, 4 - d_U] we do not reject the null of no autocorrelation; if DW < d_L or DW > 4 - d_L we reject; values in the remaining regions are inconclusive.

40
Q

elaborate on BG

A

Breusch-Godfrey: the joint test for autocorrelation at several lags described in the earlier card.

41
Q

elaborate on what happens if we ignore autocorrelation

A

If autocorrelation is present and we ignore it, our estimators are still unbiased. However, as with heteroskedasticity, they are no longer efficient (no longer the best available linear estimators).

42
Q

elaborate on OLS and violating assumptions in general

A

The idea is that OLS will always produce a model, but not necessarily a good model. Under a certain set of assumptions, the OLS estimator has certain properties, which we remember with the acronym BLUE: Best Linear Unbiased Estimator, where "best" means efficient (minimum variance among linear unbiased estimators). All this means is that if the data satisfy the assumptions, then OLS gives us an estimator that is BLUE. This says nothing about performance; the model could still be poor. In a white-noise case the model would be poor, but we would not be able to find a linear estimator that is better.

When assumptions are violated, certain things happen. With autocorrelation, for instance, there are structures in the data that a linear estimator might not be able to capture, so OLS can no longer claim to be "best". In the specific autocorrelation case of two linear segments, I suppose we could, given that we know this structure, define a dummy variable that effectively adds a second intercept. However, OLS is about not knowing anything about the data while still providing certain guarantees; by adding the dummy we would (I believe) recover the BLUE case, but only by altering the specification, so it is not really the same setting.

Regarding the assumption of constant variance: if the variance increases with larger values of one explanatory variable, we can get a pattern that looks like Brownian motion with drift. The OLS line would still capture the average trend and remain unbiased, but it would not capture the pattern in the best way, because there is structure in the variance which, if exploited, would give better estimates.

43
Q

elaborate on multicollinearity

A

An implicit assumption made when using OLS is that the explanatory variables are uncorrelated with one another. In practice there is usually some degree of correlation, but this is not really an issue unless the correlation is large.

If we have perfect multicollinearity, the system is simply not solvable: we cannot invert the (X'X) matrix because it does not have full rank, as the columns are not all linearly independent of each other.

If the problem is severe but not perfect, we speak of "near multicollinearity". This is a big problem.

44
Q

reflect on X'X

A

We have a data matrix X. Premultiplying X by its own transpose gives X'X, whose (i, j) element is the inner product of columns i and j of X. If the columns of X are mean-centred (zero mean for each column), then multiplying X'X by 1/n gives the sample covariance matrix of the regressors.

45
Q

what is the main issue of multicollinearity?

A

Large standard errors. These can make it difficult to perform meaningful tests on the individual coefficients. They arise from the difficulty of attributing the contribution to the dependent variable to each individual (correlated) variable.

46