u

the variation in yi that is not captured or explained by xi → this could include both unsystematic predictors of yi (e.g., a job application randomly landing on the top or bottom of a stack of other applications) and systematic determinants of yi (e.g., years of prior experience) that are omitted from the model

PRF vs. SRF

PRF: E(y_i) = β0 + β1·x_i

SRF: ŷ_i = β̂0 + β̂1·x_i

Errors and residuals for SRF

Notice that the SRF equation contains no residual term û_i because ŷ_i, by definition, lies on the regression line (the same logic applies to u_i and the PRF)

The estimates of the errors, which are called the residuals, are the differences between observed values of yi and the predicted values yˆi :

û_i = y_i − ŷ_i = y_i − β̂0 − β̂1·x_i
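The slope, intercept, and residual formulas above can be checked numerically. A minimal sketch with made-up data (all numbers hypothetical); note that with an intercept, the OLS residuals sum to zero and are uncorrelated with x:

```python
import numpy as np

# Toy sample (hypothetical numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS formulas for the simple regression of y on x
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residuals: u_hat_i = y_i - y_hat_i
y_hat = beta0_hat + beta1_hat * x
u_hat = y - y_hat
```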

OLS

OLS is the most commonly used estimator in the social sciences (to find beta)

OLS will be our workhorse estimator in this course

OLS obtains estimates of the “true” population parameters β0 and β1, which we typically do not observe

The logic of the OLS estimation procedure: choose β̂0 and β̂1 that minimize the sum of squared residuals: Σ û_i²

Why minimize the sum of squared residuals, instead of the sum of residuals or the absolute value of residuals?

If we use the sum of residuals, then residuals with different signs but similar magnitudes will cancel each other out

Minimizing the sum of absolute values is a viable alternative but does not yield closed-form formulas for the resulting estimators

Relationship between PRF and SRF through residuals

y_i = ŷ_i + û_i

SST

Total Sum of Squares (SST): measure of the sample variation in y: SST = Σ(y_i − ȳ)²

SSE

Explained Sum of Squares (SSE): measure of the part of the variation in y explained by x: SSE = Σ(ŷ_i − ȳ)²

SSR

Residual Sum of Squares (SSR): the part of the variation in y left unexplained by x: SSR = Σ û_i²
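The decomposition SST = SSE + SSR and R² = SSE/SST can be verified directly. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical sample; fit simple OLS, then decompose the variation in y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.1, 3.9, 6.0, 6.8, 9.2])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation in y
sse = np.sum((y_hat - y.mean()) ** 2)   # part explained by x
ssr = np.sum((y - y_hat) ** 2)          # unexplained part
r2 = sse / sst                          # R^2 = SSE/SST
```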

R^2 and magnitude of relationship between y and x

As a measure of correlation, R2 should not be confused with the magnitude of the relationship between a DV and IV

You can have a bivariate relationship that has a high R2 (i.e., high correlation), but that has a slope that is close to 0

You can also have a bivariate relationship with a low R2 (i.e., low correlation), but that has a slope that is high in magnitude
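Both points can be illustrated by a quick simulation (seed and data-generating parameters are arbitrary): a near-deterministic relationship with a tiny slope yields a high R², while a steep but noisy relationship yields a low R²:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 100, 200)

def fit(x, y):
    """Return (slope, R^2) from a simple OLS fit of y on x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    return b1, 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Near-deterministic but nearly flat: high R^2, slope close to 0
slope_a, r2_a = fit(x, 0.001 * x + rng.normal(0, 0.001, x.size))

# Steep but very noisy: large slope, low R^2
slope_b, r2_b = fit(x, 5.0 * x + rng.normal(0, 500.0, x.size))
```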

What changes when you transform a regressor?

Bottom line: if we transform a regressor, then only the slope coefficient for that regressor is transformed

What happens if the relationship between wage and education is non-linear?

These patterns can be nicely modeled by re-defining the dependent and/or independent variables as natural logarithms

The linear regression model must be linear in the parameters, but not necessarily linear in the variables, so logging y or x doesn’t violate our requirement of a linear relationship between the dependent variable and its determinants

Linear regression models?

y = β0 + β1x + u
log(y) = β0 + β1x + u
log(y) = β0 + β1log(x) + u
y = log(β0 + β1x + u)
e^y = β0 + β1√x + u
y = β0 + β1x1/(1 + β2x2) + u

y = β0 + β1x + u → Yes

log(y) = β0 + β1x + u → Yes

log(y) = β0 + β1log(x) + u → Yes

y = log(β0 + β1x + u) → Yes

e^y = β0 + β1√x + u → Yes

y = β0 + β1x1/(1 + β2x2) + u → No

If we exponentiate both sides of y = log(β0 + β1x + u), we get e^y = β0 + β1x + u, which is linear in the parameters

wage = β0 + β1educ + u: interpretation?

1 additional year of education is associated with an increase in wages of β1 units

wage = β0 + β1log(educ) + u: interpretation?

1% increase in education is associated with an increase in wages of β1/100 units

decreasing returns

log(wage) = β0 + β1educ + u: interpretation?

1 additional year of education is associated with a (100 ∗ β1 )% increase in wages

increasing returns

log(wage) = β0 + β1log(educ) + u: interpretation?

1% increase in education is associated with a β1% increase in wages
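The "100·β1 %" reading of the log-level model is an approximation; the exact implied change is 100·(e^β1 − 1)%. A sketch with an assumed (hypothetical) coefficient of 0.08:

```python
import math

# Hypothetical log-level fit: log(wage) = b0 + b1*educ, with b1 = 0.08
b1 = 0.08

approx_pct = 100 * b1                 # the "100*b1 %" reading: about 8%
exact_pct = 100 * (math.exp(b1) - 1)  # exact implied change: about 8.33%
```

The two readings diverge as β1 grows, so the approximation is best for small coefficients.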

Assumption 1

Linearity in the parameters

the population model can be non-linear in the variables but must be linear in the parameters

Assumption 2

Random Sampling

individual observations are independently and identically distributed (i.e., observations are randomly selected from a population such that each observation has the same probability of being selected, independent of which other observations were selected)

Assumption 3

Sample Variation in the Explanatory Variable

the sample standard deviation in xi must be greater than 0 (need some variance in order to get an estimate)

Assumption 4

Zero Conditional Mean

E(u | X) = 0

if it holds, then the error term u is uncorrelated with the regressor X

this assumption is usually the biggest area of concern in empirical analysis

Assumption 5

Homoskedasticity assumption

Var(u | X) = sigma^2

the variance of the unobservable error term, conditional on x, is assumed to be constant

Var(u) is independent of x

If assumptions 1-4 hold….

the OLS estimator is unbiased, meaning that on average, E(beta-hat) = beta

If assumptions 1-5 hold

if 1 - 5 hold, then we can derive a formula for the variance of the coefficient estimates, Var(beta-1-hat)

What drives the variance of the OLS slope estimate? What makes it more precise?

the lower the variation in the errors,

or the greater the variation in the independent variable,

or the greater the sample size (related, because sample variation increases with sample size),

the more precise the OLS estimates are, on average

Standard errors measure… relationship with precision

precision or efficiency of the estimate beta1-hat

se-hat (beta1-hat) is lower (i.e. more precise) when:

the residuals are small

the variation of the independent variable is large

the number of observations is large

Motivation for multiple regression analysis

1) controlling for other factors:

even if you are primarily interested in estimating one parameter, including others in the regression controls for potentially confounding factors, so the zero conditional mean assumption is more likely to hold

2) better predictions:

more independent variables can explain more of the variation in y, meaning a potentially higher R^2

3) Estimating non-linear relationships:

by including higher order terms of a variable, we can allow for a more flexible, non-linear functional form between the dependent variable and an independent variable of interest

4) Testing joint hypotheses on parameters:

can test whether multiple independent variables are jointly statistically significant

How does OLS make estimates?

minimizes the sum of the squared residuals: OLS picks the combination of all the betas that gives the lowest sum of squared residuals

How do you isolate the variation that is unique to x3?

regress x3 on all the other regressors and obtain the residuals; ê contains the variation in x3 not explained by the other regressors in the initial population model, effectively holding all else constant

then conduct a bivariate regression of y on ê
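This partialling-out logic (the Frisch-Waugh-Lovell result) can be verified numerically: the slope from regressing y on the residual ê equals the multiple-regression coefficient on x3. A simulated sketch (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # regressors are correlated
x3 = 0.3 * x1 - 0.2 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.7 * x3 + rng.normal(size=n)

# Full multiple regression of y on (1, x1, x2, x3)
X = np.column_stack([np.ones(n), x1, x2, x3])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Partial out x1 and x2 from x3; e_hat is x3's unique variation
W = np.column_stack([np.ones(n), x1, x2])
gamma = np.linalg.lstsq(W, x3, rcond=None)[0]
e_hat = x3 - W @ gamma

# Bivariate regression of y on e_hat recovers the multiple-regression slope
b3_fwl = np.sum(e_hat * y) / np.sum(e_hat ** 2)
```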

beta-k-hat represents what, in terms of slope?

the partial association between y and xk, holding x1, x2, …, xk-1 equal

beta-k-hat is the slope of the multidimensional plane along the xk direction (i.e., the expected change in y when xk increases by one unit, holding all other x's constant)

What does R^2 NOT tell you

a high R^2 doesn’t mean that any of the regressors are a true cause of the dependent variable

it also does not mean that any of the coefficients are unbiased

Estimating non-linear relationships

by including higher order terms of an independent variable we can allow for a non-linear, or “more flexible functional form” between the dependent variable and an explanatory factor

include x^2 as a regressor, take the partial derivative, gives you the total effect of x in two parts (linear and non-linear)

if the first and second terms are substantively and significantly different from 0, then we have a situation where the sign and magnitude of the effect on wages can vary as x changes, i.e., a non-linear relationship

“marginal return to x”: there is no ceteris paribus interpretation of the individual parameters here; we must choose a given level of x and then describe the trade-off

“for an individual with ten years of experience, accumulating an additional year of experience is expected to increase his/her hourly wage by $0.18”
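A sketch of the marginal-return calculation, with coefficients chosen (hypothetically, not estimates) to reproduce the $0.18 figure above: the derivative of b0 + b1·exper + b2·exper² with respect to exper is b1 + 2·b2·exper:

```python
# Hypothetical quadratic wage model: wage = b0 + b1*exper + b2*exper^2
# (coefficients chosen to reproduce the $0.18 example; not estimates)
b1, b2 = 0.30, -0.006

def marginal_return(exper):
    # derivative of the quadratic: d(wage)/d(exper) = b1 + 2*b2*exper
    return b1 + 2 * b2 * exper

at_10 = marginal_return(10)       # marginal return at 10 years of experience
turning_point = -b1 / (2 * b2)    # experience level where the return is zero
```

With b2 < 0 the marginal return shrinks as experience accumulates and turns negative past the turning point.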

When can you interpret coefficients as causal?

when assumptions 1-4 hold, but usually ZCM fails, but it is more likely to hold with multiple regression

R-j^2, what is it? What does a high value mean?

it is the R^2 from regressing x-j on all of the other independent variables

a high R-j^2 is often the result of multicollinearity: high but not perfect correlation between two or more regressors

this leads to imprecise estimates

When adding x3 to the regression model, what happens to var-hat( beta-1-hat)?

two countervailing channels:

it will almost certainly reduce sigma-hat^2 (the estimated error variance from the squared residuals), which will make the estimate more precise; the size of this reduction depends on the extent to which x3 predicts y

by adding x3 we also introduce some correlation, and perhaps multicollinearity, between x1 and x3 and/or x2 and x3, which works against a more precise estimate of beta-1-hat; this depends on how correlated the regressors are

Gauss Markov Thm holds when?

Assumptions 1-5 hold, this means that the beta-hats are the best linear unbiased estimators (BLUE)

B-P tests evaluates.. and null hypothesis

whether the regressors jointly and significantly predict the variation in the squared residuals.

H0: the error variance doesn’t depend on the regressors: gamma-1 = gamma-2 = … = 0

White test vs. B-P, if you reject the null in one…

The White test is harder to pass because it tests for both linear and non-linear relationships between u^2 and all the x-j's. You are therefore more likely to reject the null with the White test, so if you reject with the B-P test, you will typically also reject with the White test.

OVB is equal to..

the product of beta-2-hat and gamma-1-hat, where beta-2-hat is the estimated slope coefficient for x2 in the true model and gamma-1-hat is the slope coefficient of a bivariate regression of x2 on x1; hence, if gamma-1-hat does not equal 0, there is a partial association between x2 and x1

OVB positive or negative?

                   corr(x1, x2) > 0    corr(x1, x2) < 0
beta-2-hat > 0     + bias              - bias
beta-2-hat < 0     - bias              + bias

+ bias = the estimate of beta-1 overestimates the true beta-1

usually will end up causing a bias problem for all coefficients

bias-efficiency tradeoff, which is worse when?

bias is worse if you are making causal inferences (this is generally more important)

if you are making predictive inferences then imprecision (higher standard error) is worse

calculating OVB

(beta-hat for the newly included regressor) * (gamma-hat for the original, biased variable from the auxiliary regression “regress new old”)

or

beta-hat-educ (before including IQ) - beta-hat-educ (after including IQ as a regressor) = OVB
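The OVB identity can be confirmed by simulation: the short-regression slope equals the long-regression slope plus beta-2-hat * gamma-1-hat, and with corr(x1, x2) > 0 and beta-2 > 0 the bias is positive, matching the sign table above. A sketch (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)                   # corr(x1, x2) > 0
y = 1.0 + 1.5 * x1 + 2.0 * x2 + rng.normal(size=n)   # beta2 > 0, so + bias

def slope(x, z):
    """Bivariate OLS slope of z on x."""
    return np.sum((x - x.mean()) * (z - z.mean())) / np.sum((x - x.mean()) ** 2)

# "Long" regression includes both regressors
X = np.column_stack([np.ones(n), x1, x2])
_, b1_hat, b2_hat = np.linalg.lstsq(X, y, rcond=None)[0]

gamma1_hat = slope(x1, x2)     # auxiliary regression of omitted x2 on x1
b1_short = slope(x1, y)        # "short" regression omits x2

ovb = b2_hat * gamma1_hat      # OVB = beta-2-hat * gamma-1-hat
```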

how does sigma^2 relate to u?

sigma^2 = Var(u) = E(u^2) = the average of the (unobserved) squared errors

therefore, we estimate it with

sigma-hat^2 = SSR/(n - k - 1) = the average of the (observed) squared residuals, adjusted for the k + 1 estimated parameters

the Gauss Markov Assumptions guarantee that the error term u….

Assumptions 1-5

error term u, conditional on x, has a mean of 0 and a variance of sigma^2

beta-hat has a mean of beta and a variance of var(beta-hat)

Assumption 6

beta-hat ~ normal (beta, var(beta-hat)) = beta-hat is normally distributed

normal sampling distribution of beta-hat for any sample size, even small samples

we can characterize the standardized beta-hat as a t-distributed random variable, and can therefore use the t-stat to evaluate hypotheses

Error term is normally distributed.

If we had a small sample, we would additionally require the error term to be normally distributed. However, by virtue of the large size of the sample studied in this exercise, the OLS estimators have approximately normal sampling distributions even if the error term is not normal. This follows from the Central Limit Theorem

alpha

the probability that the H0 is mistakenly rejected due to sampling error

the probability of rejecting H0 by chance when, in reality, the null hypothesis is true

can you change your H1 based on your results?

NO, this is considered to bias inferences and goes against the spirit of hypothesis testing

describing a statistically significant t-test result

“the association between x and y is (positive/negative) and statistically significant at the 5% level, ceteris paribus”

“we reject the null hypothesis at 5% level (after stating H0)”

statistical significance vs. economic significance

economic significance tells us about the magnitude of the coefficient relative to the sample mean of the dependent variable

ex. if the coefficient of years of education on annual wages is $5,000 and the average wage is $30,000, we would say that the effect represents $5,000/$30,000 = 16.7% of the average dependent variable

in words what does p-value of 0.02 mean?

“under the null hypothesis, the probability of obtaining a test statistic at least as extreme as the one observed is 2%”

CI formula

at 95% confidence: beta-hat ± c(97.5) * se-hat(beta-hat), where c(97.5) is the 97.5th percentile critical value of the t distribution

CI meaning and if only working with one sample..

if random samples were obtained over and over again and CIs were calculated each time, then the (unknown) beta would lie in the respective CIs of 95% of these samples

we often only work with one sample, so we do not know for sure whether our (random) sample is one of the 95% of samples where the interval estimate contains beta
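A sketch of the interval computation, using the large-sample normal critical value 1.96 and hypothetical estimates:

```python
# Hypothetical estimate and standard error from some regression output
beta_hat, se = 0.542, 0.121

# Large-sample 97.5th percentile critical value (standard normal)
c = 1.96

ci_low, ci_high = beta_hat - c * se, beta_hat + c * se
```

In small samples, c would instead come from the t distribution with n - k - 1 degrees of freedom.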

F test conclusion in words

“the regressors are jointly statistically significant at the alpha level”

partial relationship using OVB formula

beta-1-tilde = beta-1-hat + beta-2-hat * gamma-1-hat

where x1 = included variable, x2 = omitted variable, and gamma-1-hat is the slope from regressing x2 on x1 (the partial association between x1 and x2)

Assumption 6 and large samples

normality of the error term; it is unlikely to always hold, so we relax this assumption in large samples

therefore if sample size is sufficiently large, you don’t need assumption 6 to run hypothesis testing

Consistency of the OLS Estimators

consistency describes how the distribution of an estimator changes as the number of sample observations from a population increases

property in which the distribution of an estimator “converges,” or becomes more concentrated around the true population parameter, as the sample size increases, even if the OLS estimator is biased

as sample size increases, E(beta-hat) becomes increasingly close to beta, and the distribution of beta-hat becomes increasingly narrow

as the sample size goes to infinity, the distribution of the estimator collapses to a single point: the true beta

consistency and/or/vs. unbiased

theoretically possible to have an unbiased but inconsistent estimator

more commonly, we have a biased but consistent estimator, as sample size increases the mean of the distribution of beta-hat converges on beta

Law of Large Numbers: it can be shown that if assumptions 1-4 hold, beta-hat is not only an unbiased estimator of beta, but also a consistent estimator

heteroskedasticity means..

that the variance of the error term systematically varies with, or depends on, the explanatory variables, or any combination or function thereof

Breusch-Pagan Test definition and steps

evaluates whether u^2 is linearly associated with (x1, x2…., xk)

1) obtain the OLS residuals û from the original regression

2) compute u-hat^2

3) regress u-hat^2 on (x1, x2,…, xk)

u-hat^2 = gamma0 + gamma1*x1 + gamma2*x2 + … + gammak*xk + v

get the R^2 from this auxiliary regression

4) conduct an F-test for the joint significance of gamma1, gamma2, …, gammak
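The four steps can be sketched directly. The data below are simulated with an error variance that grows with x1, so heteroskedasticity is present by construction; the F statistic uses F = (R²/k)/((1 − R²)/(n − k − 1)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated data with heteroskedasticity built in: sd(u) grows with x1
x1 = rng.uniform(1, 5, size=n)
x2 = rng.uniform(1, 5, size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(scale=x1, size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Steps 1-2: OLS residuals from the original regression, squared
beta = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat_sq = (y - X @ beta) ** 2

# Step 3: auxiliary regression of u_hat^2 on the regressors; get its R^2
g = np.linalg.lstsq(X, u_hat_sq, rcond=None)[0]
resid_aux = u_hat_sq - X @ g
r2 = 1 - np.sum(resid_aux ** 2) / np.sum((u_hat_sq - u_hat_sq.mean()) ** 2)

# Step 4: F statistic for joint significance of the k = 2 slopes
k = 2
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
```

With this data-generating process the F statistic should far exceed conventional critical values, so the test rejects homoskedasticity.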

White test tests..

whether u^2 is systematically and jointly related to the regressors, their squares, and their interactions

problem with white test and short cut

it can eat up a lot of DF when we have many independent variables

short cut: regress u-hat^2 on the fitted values and their squares: u-hat^2 = delta0 + delta1*y-hat + delta2*y-hat^2 + v

H0: delta1 = delta2 = 0

intuition behind robust standard errors

it is very often the case that the variance of the error term u depends on the regressors, but we don’t know exactly what form the heteroskedasticity takes; we remedy this situation with robust standard errors

Robust standard errors, what changes?

standard error changes, so test stats change

coefficient estimates will not change, R^2 will not change

don’t use if homoskedastic
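A sketch of the sandwich formula behind robust standard errors, (X'X)⁻¹ [X' diag(û²) X] (X'X)⁻¹, on simulated heteroskedastic data; note that the coefficient estimates are computed once and are unaffected by the choice of standard errors:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(1, 5, size=n)
y = 2.0 + 1.0 * x + rng.normal(scale=x, size=n)   # heteroskedastic errors

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y          # coefficients: same under either SE choice
u_hat = y - X @ beta

# Conventional variance estimate: sigma_hat^2 * (X'X)^-1
sigma2 = np.sum(u_hat ** 2) / (n - 2)
se_conv = np.sqrt(np.diag(sigma2 * XtX_inv))

# Robust (White) sandwich: (X'X)^-1 [X' diag(u_hat^2) X] (X'X)^-1
meat = X.T @ (X * (u_hat ** 2)[:, None])
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```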

what is meant by “the estimated coefficient of educ now has a partial association interpretation”?

the coefficient of educ represents the association between wages and years of educ after partialling out the shared variation between educ and the other regressors. We can now identify how wages vary with education while holding the other covariates constant. This feature of multivariate regression allows us to adopt a ceteris paribus interpretation of educ.

Explain, in words, why the Breusch-Pagan test is a test of the homoskedasticity assumption.

Homoskedasticity is the assumption that the variance of the error term is independent of the model regressors. The Breusch-Pagan test evaluates the credibility of this assumption by evaluating the relationship between the squared residuals (in lieu of the squared errors, which we do not observe with sample data) and the model regressors. If the model regressors are strong joint predictors of the squared residuals, this casts doubt on the assumption that the squared errors are independent of the regressors.

If you had run the regression using heteroskedastic robust standard errors, would the estimated coefficients have been different? Why or why not?

The unbiasedness of OLS does not depend on the credibility of the homoskedasticity assumption. Consequently, if we had used heteroskedastic robust standard errors, the estimated coefficients would not have been different. However, the robust standard error estimates clearly would have differed

lwage = γ0 + γ1educ + γ2exper + γ3exper² + γ4urban + γ5motheduc + γ6fatheduc + γ7nc + γ8west + γ9south + γ10educ·nc + γ11educ·west + γ12educ·south + ε

γ11 and γ1 measure …

γ11 measures the difference between the returns to education in the west and in the northeast (the omitted region)

γ1 measures the return to education in the northeast only

LDVs are…

LDVs are restricted variables – they can take on only a few values (e.g., a few integers), have a restricted range (e.g., non-negative values), or be binary outcomes

Problems with LPM

1) LPM predicts probabilities less than 0 (0%) and greater than 1 (100%)

2) LPM not particularly good at “fitting” binary outcomes

• LPM assumes that parents’ education will have the same effect at high levels (when a child is already likely to be placed in a good school) as at moderate levels → LPM doesn’t “bend” to fit data

3) LPM is heteroskedastic

This can be seen by the distribution of the residuals around the regression line

In fact, by construction, LPM always violates the homoskedasticity assumption

Unadjusted standard errors will be biased and, consequently, statistical inference will not be valid

Advantages of LPM/Solutions to Issues

Transparent and easy to interpret

1) LPM predicts probabilities less than 0 (0%) and greater than 1 (100%)

• As long as Assumptions 1 through 4 hold, the estimators are still unbiased

• While predicted probabilities less than 0 and greater than 1 are clearly problematic, LPM works well for values close to average

2) LPM not particularly good at “fitting” binary outcomes

• You can always model non-linear associations by including higher-order terms or specifying regressors as logs

3) LPM is heteroskedastic

•Heteroskedastic robust standard errors are one way of retrieving unbiased standard errors

Functional Form Misspecification

The problem is that if the particular specification you estimate does not capture the appropriate functional form for the relationship in which you’re interested, the zero conditional mean assumption will be violated

Functional Form Misspecification, how do we know whether one specification is better than another?

Adjusted R² (also written as R̄²)

Davidson-MacKinnon Test

F-Test for Evaluating Nested Models

Regression Specification Error Test (RESET)

Adjusted R2

A more rigorous way of evaluating the predictive power of a particular specification; it penalizes the inclusion of additional variables in a model

Adj R2 =1− [SSR/(n − k − 1)]/[SST/(n−1)]

In small samples, the adjusted R² allows us to compare the predictive power of different specifications

However, the adjusted R² is not particularly useful in large samples

◦ The difference between R² and adjusted R² is indistinguishable when n is very large

Also note that the adjusted R² does not help us all that much in conducting hypothesis tests, particularly joint significance tests

◦ The F-test of joint significance uses the R², not the adjusted R²

only use the Adjusted R² for comparing nested models and non-nested models that have the same outcome variable
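A sketch of the adjusted R² computation with hypothetical sums of squares, showing the penalty relative to the plain R²:

```python
# Hypothetical sums of squares and dimensions
ssr, sst = 300.0, 1000.0
n, k = 100, 4

r2 = 1 - ssr / sst                                   # plain R^2
adj_r2 = 1 - (ssr / (n - k - 1)) / (sst / (n - 1))   # penalized version
```

Because the SSR is divided by n - k - 1, adding a regressor that barely reduces the SSR can lower the adjusted R² even though the plain R² never falls.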

The Davidson-MacKinnon Test, when to use it and intuition behind it

The Davidson-MacKinnon Test can be used to evaluate nested and non-nested models with the same dependent variable

The intuition behind the Davidson-MacKinnon Test: if a particular specification is appropriate, then the fitted values of some alternative specification should not be significant predictors of the outcome variable

The Davidson-MacKinnon Test, steps

(1) y = β0 + β1x1 + β2x2 + u

(2) y = β0 + β1log(x1) + β2log(x2) + u

◦ Estimate Equation (2), the alternative model, and compute its fitted values ŷ(2)

◦ Estimate Equation (1) and add ŷ(2) as a regressor: y = β0 + β1x1 + β2x2 + δŷ(2) + u

Our null hypothesis is H0: δ = 0

◦ Now conduct a t-test on δ

• If we fail to reject H0 → then there is evidence that Equation (1) is correctly specified

• If we reject H0 → then there is evidence that Equation (1) is misspecified

There is a problem with this test, however → we might reject both tests or fail to reject both tests

In that case referring to the adjusted R2 can be a good idea

RESET, when to use and intuition behind it

Advantages and Disadvantages

Regression Specification Error Test (RESET)

To evaluate the specification of any two nested or non-nested models, we can also implement the Regression Specification Error Test (RESET)

Suppose we have a standard population model

y = β0 + β1x1 + β2x2 + … + βkxk + u

The intuition behind RESET is that: if a model is properly specified, then non-linear functions of the regressors (i.e., higher order and interaction terms) should not be statistically significant predictors of the dependent variable

Limitation of RESET: No clear guideline of how to proceed if we reject H0

Advantages of RESET: We can conduct RESETs for a set of non-nested models and keep whichever model does not reject H0

RESET, steps

We can therefore conduct the RESET by:

1) Generating ŷ² and ŷ³ from the model we want to evaluate, and then plugging them back into the model:

y = β0 + β1x1 + β2x2 + … + βkxk + δ1ŷ² + δ2ŷ³ + u

2) Conduct an F-test with the null hypothesis

H0: δ1 = δ2 = 0

H1: H0 does not hold

If we fail to reject H0 → our original model captured all non-linear relationships between the dependent and independent variables, and our model is therefore correctly specified

If we reject H0 → There are non-linearities that we haven’t accounted for, and our model is therefore misspecified
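The RESET steps can be sketched as follows. The data are simulated with a true quadratic relationship, so the misspecified linear model should produce a large F statistic (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0, 3, size=n)
y = 1.0 + 2.0 * x + 1.5 * x ** 2 + rng.normal(size=n)  # truth is quadratic

def ols(X, y):
    """Return (coefficients, sum of squared residuals)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, np.sum((y - X @ b) ** 2)

# Step 1: fit the (misspecified) linear model and form y_hat^2, y_hat^3
X_r = np.column_stack([np.ones(n), x])
b_r, ssr_r = ols(X_r, y)
y_hat = X_r @ b_r
X_ur = np.column_stack([X_r, y_hat ** 2, y_hat ** 3])
_, ssr_ur = ols(X_ur, y)

# Step 2: F-test of H0: delta1 = delta2 = 0 (q = 2 restrictions)
q, k_ur = 2, 3
f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k_ur - 1))
```

Here ŷ is linear in x, so ŷ² and ŷ³ pick up the omitted quadratic term and the F statistic rejects the linear specification.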

Bad Control

We know that omitted variables is often a threat to the zero conditional mean assumption

But! Adding more regressors might not always be a good idea. It is possible to have a bad control problem

This is a problem in which one of your controls is an outcome of another regressor

This can complicate the ceteris paribus interpretation of multiple regression

Ex. can’t include both alcohol consumption and alcohol tax in the same model because the tax directly affects consumption; the ceteris paribus condition breaks down: we can’t hold alcohol consumption constant if taxes change

Measurement Error, types and consequences for OLS when in the dependent variable

• Most variables that we analyze in empirical research have been measured imprecisely for a variety of reasons

◦ Mis-coding

◦ Inappropriate measures of specific concepts

◦ Limited or uneven capacity to measure social, political, economic phenomena

◦ Variables that represent averages of more complex, granular information

Consequences for OLS:

the OLS estimates are less precisely estimated

This is because our error term ε contains both u and the measurement error e0, so that Var(ε) = Var(u) + Var(e0) > Var(u)

As a result, the error variance is higher when we have measurement error in the dependent variable

The higher the variance of the measurement error → the higher the estimated error variance → Higher standard errors for all of the OLS estimates → Lower statistical significance for all of the OLS estimates

Measurement Error in the Independent Variables


- So, when we use the mismeasured independent variable educ, then the ZCM Assumption breaks down
- This situation is called Classical Errors-in-Variables (CEV)

• Because the ZCM Assumption is not satisfied, OLS is biased and inconsistent

- It can be shown that the direction of this bias is predictable
- Specifically, classical-errors-in-variables gives rise to attenuation bias, which means that the estimate βˆj is biased toward 0 (i.e., towards finding a practically smaller association or effect)
- With CEV, the OLS coefficients are also imprecise

This is not just a problem for the mis-measured regressor

If the mis-measured variable is cross-correlated with the other regressors, which is very likely, then all of the other regressors will be biased toward 0

What can we do about measurement error?


- Data cleaning (i.e., check the data and re-code any identified mistakes)
- Instrumental variables estimation

If the data is missing at random…

◦ Then the random sampling assumption still holds

◦ OLS still unbiased and consistent

◦ But (randomly) missing observations reduce the sample size available for a regression → less precise estimates

Instrumental Variables, Conditions

Suppose that we are interested in identifying the effect of x1 on y, but an omitted variable has resulted in a biased OLS estimate βˆ1

- Then an IV estimator, named z, is a variable that does not show up in Equation 3, but must relate to it in two different ways

1) The instrumental variable z must be “exogenous” to the outcome y

Cov(z,u) = 0: The instrument is uncorrelated with the error term (“Instrument Exogeneity Assumption”)

• This condition means that z is “as good as random” to the outcome variable

We can’t directly test assumption 1: We don’t observe u so we cannot definitively confirm or deny whether it is uncorrelated with a proposed instrument

Researchers have to make a good argument for why an instrument is exogenous, backed up by quantitative tests of the exogeneity assumption (which we will talk about in the next lecture)

2) The instrumental variable z must be good at predicting variation in the independent variable x1

Cov(z,x) ≠ 0: The instrument is correlated with the independent variable for which we want to instrument (education in the previous example) (“Instrument Relevance Assumption”)

• By “correlated,” we mean that the instrument is a substantively and statistically significant predictor of the instrumented variable

the more correlated, the better

We can directly test assumption 2: if we regress the instrumented variable x on the instrument z (while controlling for the other regressors in the main population model), then we can directly observe the partial association between x and z

The direction of the relationship should make sense

Interpret βˆnjpost (DID)

The estimated coefficient βˆnjpost represents the change in the number of employees per restaurant in New Jersey relative to the change in the number of employees in Pennsylvania.

Why DID is better (NJ and Penn ex.)

This is because it accounts for pre-existing differences between New Jersey and Pennsylvania that did not change over the course of the policy process—for instance, class composition or geography. In the previous model, we were not able to differentiate between differences in employment between New Jersey and Pennsylvania that were attributable to pre-existing differences and those attributable to the policy change enacted in New Jersey. With the difference-in-differences model, however, we can estimate how employment outcomes changed in New Jersey relative to Pennsylvania over time, giving us a more reliable estimate of the impact of the policy change.
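The DID estimate is just a double difference of group means. A sketch with hypothetical employment numbers (not actual study estimates):

```python
# Hypothetical group means: employees per restaurant, before/after the policy
nj_pre, nj_post = 20.4, 21.0   # treated group: New Jersey
pa_pre, pa_post = 23.3, 21.2   # comparison group: Pennsylvania

# DID: change in NJ minus change in PA
did = (nj_post - nj_pre) - (pa_post - pa_pre)
```

Subtracting the control group's change nets out the time trend common to both states; the remaining difference is attributed to the policy.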

Advantages and Disadvantages of Probit

The advantages of the Probit model (and maximum likelihood estimation models in general) relative to LPM are several. First, Probit models always predict probabilities bounded by 0% and 100%, while the LPM can predict probabilities outside of this range. Second, the Probit model automatically captures non-linear effects because it estimates changes in the probabilities implied by (linear) changes in the z-scores under a standard normal probability distribution. Third, the Probit model intrinsically accounts for heteroskedasticity, which means that estimated standard errors are always unbiased.

The disadvantages of the Probit model relative to LPM are that Probit estimates do not have a straightforward interpretation (they correspond with z-scores under a standard normal probability distribution), and that they rely on the assumption of a normal distribution of the error term, an assumption that we do not need to make under OLS when the sample size is asymptotically large. One consideration of the LPM is that, while it does not always predict probabilities bounded by 0% and 100%, it can reliably capture the ceteris paribus effect of a particular regressor on a binary outcome for observations with values close to the average for the regressor of interest

RESET Intuition and how you test

The intuition behind the RESET is that if a model is properly specified, then non-linear functions of the regressors (i.e., higher order and interaction terms) should not be statistically significant predictors of the dependent variable. We can obtain particular non-linear transformations of the covariates without expending a large number of degrees of freedom by computing and including yˆ2 and yˆ3.

Under the RESET, the null hypothesis is that the non-linear transformations of the regressors yˆ2 and yˆ3 are not systematically associated with the outcome variable.

F test of R^2 in restricted and unrestricted models: F = [(R^2_ur - R^2_r)/q] / [(1 - R^2_ur)/(n - k_ur - 1)], where q is the number of restrictions
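A sketch of the computation with hypothetical R² values (q denotes the number of restrictions being tested):

```python
# Hypothetical R^2 values from the unrestricted and restricted models
r2_ur, r2_r = 0.40, 0.35
n, k_ur, q = 200, 5, 2   # q = number of restrictions tested

# F statistic from the R^2 form of the joint significance test
f_stat = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k_ur - 1))
```

The statistic is then compared against the F(q, n - k_ur - 1) critical value at the chosen significance level.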

Intuition of Over ID

Under the test of overidentifying restrictions, the dependent variable is the residual from the IV 2SLS estimation. This residual captures the endogenous variation in lwage—that is, the variation in logged wages not explained by the exogenous predictors (by assumption) from the second stage equation. Consequently, if the instrumental variables do not significantly predict the IV 2SLS residuals, after controlling for the exogenous controls from the structural equation, then this provides evidence that the instruments are indeed exogenous

Why SE higher in IV vs. OLS

IV 2SLS strategy relies on a portion of the total variation in the endogenous variable that is predicted by the instrumental variable, whereas OLS employs the total variation in the endogenous variable

libcrd as valid instrument for educ?

For libcrd14 to be a valid instrument for educ, it must satisfy the instrument exogeneity and instrument relevance assumptions. For the instrument exogeneity assumption to hold, owning a library card at the age of 14 must be “as if” random to wage outcomes. That is, individuals with different income levels, educational achievements, and other relevant characteristics are equally likely to have held a library card at the age of 14. Yet another way of saying this is that library card ownership should not exhibit any direct effect on wages after controlling for the covariates in the structural equation.

For the instrument relevance assumption to hold, possessing a library card at adolescence needs to exhibit a statistically and substantively significant association with subsequent educational achieve- ment. If both of these assumptions hold, it follows that library card possession influences wage outcomes only through its effect on educational attainment.