the variation in yi that is not captured or explained by xi → this could include both unsystematic predictors of yi (e.g., a job application randomly landing on the top or bottom of a stack of other applications) and systematic determinants of yi (e.g., years of prior experience) that are omitted from the model
PRF vs. SRF
PRF: E(yi | xi) = β0 + β1xi
SRF: yˆi = βˆ0 + βˆ1xi
Errors and residuals for SRF
Notice that there is no residual term uˆi in the SRF because yˆi, by definition, is the point on the regression line (the same logic applies to the PRF, which contains no error term ui)
The estimates of the errors, which are called the residuals, are the differences between observed values of yi and the predicted values yˆi :
uˆi = yi − yˆi = yi − βˆ0 − βˆ1xi
OLS is the most commonly used estimator of the β parameters in the social sciences
OLS will be our workhorse estimator in this course
OLS obtains estimates of the “true” population parameters β0 and β1, which we typically do not observe
The logic of the OLS estimation procedure: choose the βˆ0 and βˆ1 that minimize the sum of squared residuals: Σ uˆi^2
Why minimize the sum of squared residuals, instead of the sum of residuals or the absolute value of residuals?
If we use the sum of residuals, then residuals with different signs but similar magnitudes will cancel each other out
Minimizing the sum of absolute values is a viable alternative, but it does not generate closed-form formulas for the resulting estimators
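A minimal sketch of the OLS logic for the bivariate case, using a small made-up data set (the numbers are hypothetical, not from the course). It applies the standard closed-form formulas and checks that nudging either estimate raises the sum of squared residuals.

```python
import numpy as np

# Hypothetical toy data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for y = beta0 + beta1 * x + u
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Fitted values and residuals
y_hat = beta0_hat + beta1_hat * x
u_hat = y - y_hat
ssr = np.sum(u_hat ** 2)


def ssr_at(b0, b1):
    """Sum of squared residuals for an arbitrary candidate line."""
    return np.sum((y - b0 - b1 * x) ** 2)


print("beta0_hat:", beta0_hat, "beta1_hat:", beta1_hat)
print("SSR at OLS estimates:", ssr)
print("SSR after shifting the intercept:", ssr_at(beta0_hat + 0.1, beta1_hat))
print("Sum of residuals (cancels out):", u_hat.sum())
```

The last line illustrates why the plain sum of residuals is not a useful criterion: with an intercept in the model, the OLS residuals sum to zero.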
Relationship between PRF and SRF through residuals
yi = yˆi + uˆi
Total Sum of Squares (SST): Measure of sample variation in y
Explained Sum of Squares (SSE): Measure of the part of the variation in y explained by x
Residual Sum of Squares (SSR): Part of the variation in y unexplained by x
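For reference, the three sums of squares can be written out as follows. This uses the standard textbook definitions, with ȳ denoting the sample mean of y; the decomposition on the last line holds for OLS with an intercept.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
SSE &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
SSR &= \sum_{i=1}^{n} \hat{u}_i^2 \\
SST &= SSE + SSR, \qquad R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
\end{aligned}
```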
R^2 and magnitude of relationship between y and x
As a measure of correlation, R2 should not be confused with the magnitude of the relationship between a DV and IV
You can have a bivariate relationship that has a high R2 (i.e., high correlation), but that has a slope that is close to 0
You can also have a bivariate relationship with a low R2 (i.e., low correlation), but that has a slope that is high in magnitude
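A small simulation sketch of the two cases above. The data-generating numbers (slopes of 0.01 and 5, and the noise levels) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 10, n)


def slope_and_r2(x, y):
    """Bivariate OLS slope and R^2."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return b1, r2


# Case 1: tiny slope, almost no noise -> high R^2, slope close to 0
y_tight = 0.01 * x + rng.normal(0, 0.001, n)
# Case 2: large slope, very noisy -> low R^2, slope large in magnitude
y_noisy = 5.0 * x + rng.normal(0, 100, n)

slope_tight, r2_tight = slope_and_r2(x, y_tight)
slope_noisy, r2_noisy = slope_and_r2(x, y_noisy)

print("tight: slope", slope_tight, "R^2", r2_tight)
print("noisy: slope", slope_noisy, "R^2", r2_noisy)
```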
What changes when you transform a regressor?
Bottom line: if we transform a regressor, then only the slope coefficient for that regressor is transformed
What happens if the relationship between wage and education is non-linear?
These patterns can be nicely modeled by re-defining the dependent and/or independent variables as natural logarithms
The linear regression model must be linear in the parameters, but not necessarily linear in the variables, so logging Y or X does not violate our requirement of a linear relationship between the dependent variable and its determinants
Linear regression models?
y = β0 + β1x + u;  log(y) = β0 + β1x + u;  log(y) = β0 + β1log(x) + u;  y = log(β0 + β1x + u);  e^y = β0 + β1√x + u;  y = β0 + (β1x1)/(1 + β2x2) + u
y = β0 + β1x + u → Yes
log(y) = β0 + β1x + u → Yes
log(y) = β0 + β1log(x) + u → Yes
y = log(β0 + β1x + u) → Yes (see note)
e^y = β0 + β1√x + u → Yes
y = β0 + (β1x1)/(1 + β2x2) + u → No (β2 enters non-linearly)
Note: if we exponentiate both sides of y = log(β0 + β1x + u), we get e^y = β0 + β1x + u, which is linear in the parameters
wage = β0 + β1educ + u
1 additional year of education is associated with an increase in wages of β1 units
wage = β0 + β1log(educ) + u
1% increase in education is associated with an increase in wages of β1/100 units
log(wage)= β0 + β1educ + u
1 additional year of education is associated with a (100 ∗ β1 )% increase in wages
log(wage)= β0 + β1log(educ) + u
1% increase in education is associated with a β1% increase in wages
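The (100 ∗ β1)% reading of the log-level model is an approximation that works well for small coefficients. A quick sketch comparing it with the exact percentage change, 100 ∗ (e^β1 − 1), for a hypothetical coefficient of 0.08 (a made-up value, not an estimate from the course):

```python
import math

beta1 = 0.08  # hypothetical coefficient on educ in a log(wage) regression

# Approximate interpretation used in the notes
approx_pct = 100 * beta1

# Exact percentage change in wage implied by a one-unit increase in educ
exact_pct = 100 * (math.exp(beta1) - 1)

print("approximate % change:", approx_pct)
print("exact % change:", exact_pct)
```

The gap between the two grows as the coefficient gets larger, so the approximation is best reserved for small β1.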
Linearity in the parameters
the population model can be non-linear in the variables but must be linear in the parameters
Random Sampling
individual observations are independently and identically distributed (i.e., observations are randomly selected from a population such that each observation has the same probability of being selected, independent of which other observations were selected.)
Sample Variation in the explanatory Variable
the sample standard deviation in xi must be greater than 0 (need some variance in order to get an estimate)
Zero Conditional Mean
E(u | X) = 0
if it holds, then the error term u is uncorrelated with the regressor X
this assumption is usually the biggest area of concern in empirical analysis
Homoskedasticity
Var(u | X) = sigma^2
the variance of the unobservable error term, conditional on x, is assumed to be constant
Var(u) is independent of x
If assumptions 1-4 hold….
the OLS estimator is unbiased: E(beta-hat) = beta, meaning that the estimates equal the true parameter on average across repeated samples
If assumptions 1-5 hold
if 1-5 hold, then we can derive a formula for the variance of the coefficient estimates, Var(beta1-hat)
What drives the variance of the OLS slope estimate? What makes it more precise?
the lower the variation in the errors
or the greater the variation in the independent variable
or the greater the sample size (relatedly, because total sample variation in x increases with sample size)
then the more precise the OLS estimates, on average
Standard errors measure… relationship with precision
precision or efficiency of the estimate beta1-hat
se-hat (beta1-hat) is lower (i.e. more precise) when:
the residuals are small
the variation of the independent variable is large
the number of observations is large
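These three conditions can be read directly off the textbook variance formula for the bivariate slope under Assumptions 1-5, written here with SST_x for the total sample variation in x:

```latex
\operatorname{Var}(\hat{\beta}_1 \mid x) = \frac{\sigma^2}{SST_x},
\qquad SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad \widehat{se}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_x}},
\qquad \hat{\sigma}^2 = \frac{SSR}{n - 2}
```

A smaller error variance shrinks the numerator, while more variation in x or more observations raise SST_x in the denominator.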
Motivation for multiple regression analysis
1) controlling for other factors:
even if you are primarily interested in estimating one parameter, including other regressors will control for potentially confounding factors, so the zero conditional mean assumption is more likely to hold
2) better predictions:
more independent variables can explain more of the variation in y, meaning a potentially higher R^2
3) Estimating non-linear relationships:
by including higher order terms of a variable, we can allow for a more flexible, non-linear functional form between the dependent variable and an independent variable of interest
4) Testing joint hypotheses on parameters:
can test whether multiple independent variables are jointly statistically significant
How does OLS make estimates?
OLS minimizes the sum of squared residuals: it picks the combination of all the beta-hats that gives the lowest sum of squared residuals
How do you isolate the variation that is unique to x3?
regress x3 on all the other regressors and obtain the residuals e-hat; e-hat contains the variation in x3 not explained by the other regressors from the initial population model, effectively holding all else constant
then we conduct a bivariate regression of y on e-hat; its slope equals the multiple regression coefficient on x3
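The two-step "partialling out" procedure can be checked numerically. The sketch below uses simulated data with made-up coefficients and a small helper named `ols` (a thin wrapper around `np.linalg.lstsq`, introduced here for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated regressors that are correlated with each other (arbitrary values)
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = 0.3 * x1 - 0.4 * x2 + rng.normal(size=n)
y = 1 + 2 * x1 - x2 + 0.7 * x3 + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients for the design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# Full multiple regression of y on x1, x2, x3
beta_full = ols(np.column_stack([const, x1, x2, x3]), y)

# Step 1: regress x3 on the other regressors and keep the residuals
X_others = np.column_stack([const, x1, x2])
e_hat = x3 - X_others @ ols(X_others, x3)

# Step 2: bivariate regression of y on e_hat (e_hat has mean zero)
beta3_partial = np.sum(e_hat * y) / np.sum(e_hat ** 2)

print("beta3 from the full regression:", beta_full[3])
print("beta3 from partialling out:   ", beta3_partial)
```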
betak-hat represents what, in terms of slope?
the partial association between y and xk, holding x1, x2, …, xk-1 constant
betak-hat is the slope of the multidimensional plane along the xk direction (i.e., the expected change in y when xk increases by one unit, holding all other x’s constant)
What does R^2 NOT tell you
a high R^2 doesn’t mean that any of the regressors is a true cause of the dependent variable
it also does not mean that any of the coefficients are unbiased
Estimating non-linear relationships
by including higher order terms of an independent variable we can allow for a non-linear, or “more flexible functional form” between the dependent variable and an explanatory factor
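include x^2 as a regressor and take the partial derivative with respect to x: the marginal effect of x is beta1 + 2*beta2*x, which has two parts (a constant part from the linear term and a part that changes with x from the squared term)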
if the first and second terms are substantively and significantly different from 0, then we have a situation where the sign and magnitude of the effect on wages can vary as x changes = non-linear relationship
“marginal return to x”, no ceteris paribus interpretation of individual parameters here, we must choose a given level of x and then describe the trade off
“for an individual with ten years of experience, accumulating an additional year of experience is expected to increase his/her hourly wage by $0.18”
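Written out for a wage equation that is quadratic in experience (the variable name exper is an assumption made here to match the example sentence), the marginal return depends on the level of experience at which it is evaluated:

```latex
wage = \beta_0 + \beta_1\, exper + \beta_2\, exper^2 + u
\quad\Longrightarrow\quad
\frac{\partial E(wage \mid exper)}{\partial exper} = \beta_1 + 2\beta_2\, exper
```

To describe the effect for a given individual, plug that person's level of experience into the right-hand side, using the estimated coefficients.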
When can you interpret coefficients as causal?
when assumptions 1-4 hold; in practice ZCM usually fails, but it is more likely to hold with multiple regression
R-j^2, what is it? What does a high value mean?
it is the R^2 from regressing x-j on all of the other independent variables
a high R-j^2 signals multicollinearity: high, but not perfect, correlation between two or more regressors
this leads to imprecise estimates of beta-j
When adding x3 to the regression model, what happens to var-hat( beta-1-hat)?
two countervailing channels:
adding x3 will almost certainly reduce sigma-hat^2 (the estimated error variance, based on the squared residuals), which makes the estimate more precise; the size of this reduction depends on the extent to which x3 predicts y
by adding x3 we also introduce some correlation, and perhaps multicollinearity, between x1 and x3 and/or x2 and x3, which works against a more precise estimate of beta-1 (it raises var-hat(beta-1-hat)); this depends on how correlated the regressors are
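Both channels appear in the standard variance formula for a slope coefficient in multiple regression under Assumptions 1-5, where SST_j is the total sample variation in x-j and R-j^2 is the R^2 from regressing x-j on the other regressors:

```latex
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j \,(1 - R_j^2)}
```

Adding a relevant regressor lowers σ^2 in the numerator, but if it is correlated with x-j it also raises R-j^2, which shrinks the denominator.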
Gauss Markov Thm holds when?
when Assumptions 1-5 hold; the theorem says that the OLS beta-hats are then the best linear unbiased estimators (BLUE)
B-P test evaluates.. and null hypothesis
whether the regressors jointly and significantly predict the variation in the squared residuals.
H0: the variance of the error term doesn’t depend on the regressors, i.e., in the regression of u-hat^2 on the regressors, gamma1 = gamma2 = … = gammak = 0
White test vs. B-P, if you reject the null in one…
The White test is more demanding because it tests for both linear and non-linear relationships between u^2 and all x-j. It can therefore detect forms of heteroskedasticity that the B-P test misses, so if you reject with the B-P test you will usually, though not necessarily, also reject with the White test.
OVB is equal to..
the product of beta-2-hat and gamma-1-hat, where beta-2-hat is the estimated slope coefficient for x2 in the true model and gamma-1-hat is the slope coefficient from a bivariate regression of x2 on x1; if gamma-1-hat does not equal 0, there is an association between x2 and x1
OVB positive or negative?
sign of OVB        corr(x1, x2) > 0    corr(x1, x2) < 0
beta-2-hat > 0     + bias              - bias
beta-2-hat < 0     - bias              + bias
+ bias = the beta-1 estimate overestimates the true beta-1 (- bias = it underestimates it)
usually will end up causing a bias problem for all coefficients
bias-efficiency tradeoff, which is worse when?
bias is worse if you are making causal inferences (this is generally more important)
if you are making predictive inferences then imprecision (higher standard error) is worse
OVB = (beta-hat on the newly included regressor) * (gamma-hat on the original, biased regressor from “regress new old”)
beta-hat-educ (with IQ omitted) - beta-hat-educ (after you include IQ as a regressor) = OVB
how does sigma^2 relate to u?
sigma^2 = Var(u) = E(u^2) = the average of the (unobserved) squared errors
sigma-hat^2 = SSR/(n - k - 1) = the average of the (observed) squared residuals (adjusted for the k + 1 estimated parameters)
the Gauss Markov Assumptions guarantee that the error term u….
error term u, conditional on x, has a mean of 0 and a variance of sigma^2
beta-hat has a mean of beta and a variance of var(beta-hat)
if the error term is also normally distributed (Assumption 6): beta-hat ~ Normal(beta, Var(beta-hat)), i.e., beta-hat is normally distributed
normal sampling distribution of beta-hat for any sample size, even small samples
the standardized estimate (beta-hat - beta)/se-hat(beta-hat) is then a t-distributed random variable, therefore we can use the t-stat to evaluate hypotheses
Error term is normally distributed.
If we had a small sample, we would additionally require the error term to be normally distributed. However, by virtue of the large size of the sample studied in this exercise, we can rely on the sampling distribution of the OLS estimators being approximately normal even if the error term is not. This follows from the Central Limit Theorem.
the probability that the H0 is mistakenly rejected due to sampling error
the probability of obtaining a statistically significant association by chance when in reality there is no association
can you change your H1 based on your results?
NO, this is considered to bias inferences and goes against the spirit of hypothesis testing
describing a statistically significant t-test result
“the association between x and y is (positive/negative) and statistically significant at the 5% level, ceteris paribus”
“we reject the null hypothesis at 5% level (after stating H0)”
statistical significance vs. economic significance
economic significance tells us about the magnitude of the coefficient relative to the sample mean of the dependent variable
ex. if the coefficient of years of education on annual wages is $5,000 and the average wage is $30,000, we would say that the effect represents $5,000/$30,000 = 16.7% of the average dependent variable
in words what does p-value of 0.02 mean?
“if the null hypothesis were true, the probability of obtaining a test statistic at least as extreme as the one we observed would be 2%”
at 95% confidence: beta-hat +/- c * se-hat(beta-hat), where c is the 97.5th percentile of the t distribution
CI meaning and if only working with one sample..
if random samples were obtained over and over again and CIs were calculated each time, then the (unknown) beta would lie in the respective CIs of 95% of these samples
we often only work with one sample, so we do not know for sure whether our (random) sample is one of the 95% of samples where the interval estimate contains beta
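The repeated-sampling reading of a confidence interval can be illustrated with a simulation sketch. The true parameters, sample size, and number of replications are arbitrary, and the critical value 1.96 is the large-sample (normal) approximation rather than the exact t value.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 1.0, 2.0   # "true" parameters, chosen for illustration
n, reps = 200, 1000
crit = 1.96               # normal approximation to the 97.5th percentile

covered = 0
for _ in range(reps):
    # Draw a fresh random sample from the same population
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)

    # Bivariate OLS slope and its standard error
    sst_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    sigma2_hat = np.sum(resid ** 2) / (n - 2)
    se_b1 = np.sqrt(sigma2_hat / sst_x)

    # Does this sample's interval contain the true beta1?
    if b1 - crit * se_b1 <= beta1 <= b1 + crit * se_b1:
        covered += 1

coverage = covered / reps
print("share of intervals containing the true beta1:", coverage)
```

The printed share should be close to 0.95, but any single interval either contains the true beta1 or it does not.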
F test conclusion in words
“the regressors are jointly statistically significant at the alpha level”
partial relationship using OVB formula
beta1-~ = beta1^ + beta2^ * gamma1^
where x1 = included variable, x2 = omitted variable, beta1-~ is the slope from the regression that omits x2, and gamma1^ is the slope from regressing x2 on x1
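The OVB formula is an exact in-sample identity, which a short sketch on simulated data can confirm. The coefficients are made up, and the helper `ols` is a thin wrapper around `np.linalg.lstsq` introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# x2 is positively correlated with x1 and has a positive effect on y
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 1.5 * x2 + rng.normal(size=n)


def ols(X, y):
    """Return OLS coefficients for the design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

b_long = ols(np.column_stack([const, x1, x2]), y)   # beta0^, beta1^, beta2^
b_short = ols(np.column_stack([const, x1]), y)      # regression omitting x2
g = ols(np.column_stack([const, x1]), x2)           # regress x2 on x1

ovb = b_long[2] * g[1]  # beta2^ * gamma1^

print("beta1 from short regression:", b_short[1])
print("beta1^ + beta2^ * gamma1^:  ", b_long[1] + ovb)
print("omitted variable bias:      ", ovb)
```

Here both beta2 and the correlation between x1 and x2 are positive, so the bias is positive and the short regression overestimates the coefficient on x1.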
Assumption 6 and large samples
normality of the error term, unlikely to always hold so we relax this assumption in large samples
therefore if sample size is sufficiently large, you don’t need assumption 6 to run hypothesis testing
Consistency of the OLS Estimators
consistency describes how the distribution of an estimator changes as the number of sample observations from a population increases.
property in which the distribution of an estimator “converges”, or becomes more concentrated, around the true population parameter as the sample size increases, even if the estimator is biased in finite samples.
as sample size increases, beta-hat becomes increasingly close to beta, and the distribution of beta-hat becomes increasingly narrow
when sample size goes to infinity, the distribution of the estimator collapses to a single point: the true beta
consistency and/or/vs. unbiased
theoretically possible to have an unbiased but inconsistent estimator
more commonly, we have a biased but consistent estimator, as sample size increases the mean of the distribution of beta-hat converges on beta
Law of Large Numbers: it can be shown that if assumptions 1-4 hold, beta-hat is not only an unbiased estimator of beta, but also a consistent estimator
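A simulation sketch of consistency: the sampling distribution of the OLS slope tightens around the true value as n grows. The true slope of 2, the sample sizes, and the number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(9)
beta1 = 2.0  # "true" slope, chosen for illustration


def simulate_slopes(n, reps=500):
    """Draw `reps` samples of size n and return the OLS slope from each."""
    slopes = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        y = 1 + beta1 * x + rng.normal(size=n)
        slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return slopes


slopes_small = simulate_slopes(25)
slopes_large = simulate_slopes(2500)

sd_small = slopes_small.std()
sd_large = slopes_large.std()

print("n = 25:   mean", slopes_small.mean(), "sd", sd_small)
print("n = 2500: mean", slopes_large.mean(), "sd", sd_large)
```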
Heteroskedasticity: the variance of the error term systematically varies with, or depends on, the explanatory variables, or any combination or function thereof
Breusch-Pagan Test definition and steps
evaluates whether u^2 is linearly associated with (x1, x2…., xk)
1) obtain the OLS residuals u-hat from the original regression
2) compute u-hat^2
3) regress u-hat^2 on (x1, x2, …, xk)
u-hat^2 = gamma0 + gamma1x1 + gamma2x2 + … + gammakxk + ε
get the R^2 from this regression of u-hat^2
4) conduct an F-test for the joint sig of gamma1, gamma2….
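The four steps can be sketched with NumPy on simulated data in which the error variance grows with x1 by construction. All numbers are made up, `ols_fit` is a helper introduced here for illustration, and the sketch stops at the F statistic rather than computing a p-value.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Simulated data: the spread of the error increases with x1
x1 = rng.uniform(1, 5, n)
x2 = rng.normal(size=n)
u = x1 * rng.normal(size=n)
y = 1 + 2 * x1 - x2 + u


def ols_fit(X, y):
    """Return OLS coefficients and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta


X = np.column_stack([np.ones(n), x1, x2])

# Steps 1-2: residuals from the original regression, then square them
_, u_hat = ols_fit(X, y)
u_hat_sq = u_hat ** 2

# Step 3: regress the squared residuals on the regressors and get the R^2
_, v_hat = ols_fit(X, u_hat_sq)
r2_aux = 1 - np.sum(v_hat ** 2) / np.sum((u_hat_sq - u_hat_sq.mean()) ** 2)

# Step 4: F statistic for the joint significance of gamma1, ..., gammak
k = X.shape[1] - 1
f_stat = (r2_aux / k) / ((1 - r2_aux) / (n - k - 1))

print("auxiliary R^2:", r2_aux)
print("F statistic:", f_stat)
```

The F statistic is then compared with the critical value of the F distribution with (k, n - k - 1) degrees of freedom; a large value leads to rejecting homoskedasticity.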
White test tests..
whether u^2 is systematically and jointly related to the regressors, their squares, and their interactions
problem with white test and short cut
it can eat up a lot of DF when we have many independent variables
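short cut: u-hat^2 = gamma0 + gamma1*y-hat + gamma2*y-hat^2 + v (the fitted values and their squares stand in for the regressors, their squares, and their interactions)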
H0: gamma1 = gamma2 = 0
intuition behind robust standard errors
it is very often the case that the variance of the error term u depends on the regressors, but we don’t know exactly what form the heteroskedasticity takes; robust standard errors remedy this situation without requiring us to know that form
Robust standard errors, what changes?
standard error changes, so test stats change
coefficient estimates will not change, R^2 will not change
not needed if the errors are homoskedastic (the usual standard errors are then valid)
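A sketch of what changes and what does not, using the basic White (HC0) sandwich formula on simulated heteroskedastic data. HC0 is one common variant and is used here as an assumption; statistical software often applies a small-sample correction on top of it.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Simulated data with error variance that depends on x
x = rng.uniform(1, 5, n)
y = 1 + 2 * x + x * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

# The coefficient estimates are the same whichever standard errors we report
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat

# Usual (homoskedasticity-based) standard errors
sigma2_hat = np.sum(u_hat ** 2) / (n - X.shape[1])
se_usual = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# Heteroskedasticity-robust (HC0) standard errors: (X'X)^-1 X' diag(u^2) X (X'X)^-1
meat = X.T @ (X * (u_hat ** 2)[:, None])
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", beta_hat)
print("usual se:    ", se_usual)
print("robust se:   ", se_robust)
```

Only the standard errors, and therefore the t statistics and confidence intervals, differ between the two approaches.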
what is meant by “the estimated coefficient of educ now has partial association interpretation”?
the coefficient of educ represents the association between cigs and years of educ after partialling out the shared variation between educ and the other regressors. We can now identify how the outcome (cigs) varies with education while holding the other covariates constant. This feature of multivariate regression allows us to adopt a ceteris paribus interpretation of the coefficient on educ.
Explain, in words, why the Breusch-Pagan test is a test of the homoskedasticity assumption.
Homoskedasticity is the assumption that the variance of the error term is independent of the model regressors. The Breusch-Pagan test evaluates the credibility of this assumption by evaluating the relationship between the squared residuals (in lieu of the squared errors, which we do not observe with sample data) and the model regressors. If the model regressors are strong joint predictors of the squared residuals, this would cast doubt on the assumption that the error variance is independent of the regressors.
If you had run the regression using heteroskedastic robust standard errors, would the estimated coefficients have been different? Why or why not?
The unbiasedness of OLS does not depend on the credibility of the homoskedasticity assumption. Consequently, if we had used heteroskedastic robust standard errors, the estimated coefficients would not have been different. However, the robust standard error estimates would generally have differed.
lwage = γ0 + γ1educ + γ2exper + γ3exper^2 + γ4urban + γ5motheduc + γ6fatheduc + γ7nc + γ8west + γ9south + γ10educ_nc + γ11educ_west + γ12educ_south + ε
γ11 measures the …
the difference between the return to education in the west and that in the northeast
γ1 measures the return to education in the northeast only
LDVs (limited dependent variables) are restricted variables: they can take on only a few values (e.g., a few integers), have a restricted range (e.g., non-negative values), or are binary outcomes
Problems with LPM
1) LPM predicts probabilities less than 0 (0%) and greater than 1 (100%)
2) LPM not particularly good at “fitting” binary outcomes
• LPM assumes that parents’ education will have the same effect at high levels (when a child is already likely to be placed in a good school) as at moderate levels → LPM doesn’t “bend” to fit data
3) LPM is heteroskedastic
This can be seen by the distribution of the residuals around the regression line
In fact, by construction, LPM always violates the homoskedasticity assumption
Unadjusted standard errors will be biased and, consequently, statistical inference will not be valid
Advantages of LPM/Solutions to Issues
Transparent and easy to interpret
1) LPM predicts probabilities less than 0 (0%) and greater than 1 (100%)
• As long as Assumptions 1 through 4 hold, the estimators are still unbiased
• While predicted probabilities less than 0 and greater than 1 are clearly problematic, LPM works well for values close to average
2) LPM not particularly good at “fitting” binary outcomes
• You can always model non-linear associations by including higher-order terms or specifying regressors as logs
3) LPM is heteroskedastic
•Heteroskedastic robust standard errors are one way of retrieving unbiased standard errors
Functional Form Misspecification
The problem is that if the particular specification you estimate does not capture the appropriate functional form for the relationship in which you’re interested, the zero conditional mean assumption will be violated
Functional Form Misspecification, how do we know whether one specification is better than another?
Adjusted R^2 (also written as R-bar^2)
F-Test for Evaluating Nested Models
Regression Specification Error Test (RESET)
A more rigorous way of evaluating the predictive power of a particular specification: the adjusted R^2 penalizes the inclusion of additional variables in a model
Adj R^2 = 1 − [SSR/(n − k − 1)]/[SST/(n − 1)]
In small samples, the adjusted R^2 allows us to compare the predictive power of different specifications
However, the adjusted R^2 is not particularly useful in large samples
◦ The difference between R^2 and adjusted R^2 is negligible when n is very large
Also note that the adjusted R^2 does not help us all that much in conducting hypothesis tests, particularly joint significance tests
◦ The F-test of joint significance uses the R^2, not the adjusted R^2
only use the adjusted R^2 for comparing nested models and non-nested models that have the same outcome variable
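A small sketch of the adjusted R^2 formula as a function, evaluated on hypothetical values of SSR, SST, n, and k (made up for illustration) to show how the penalty fades as n grows.

```python
def r_squared(ssr, sst):
    """Ordinary R^2."""
    return 1 - ssr / sst


def adjusted_r_squared(ssr, sst, n, k):
    """Adjusted R^2 with n observations and k regressors (excluding the intercept)."""
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))


# Hypothetical values: half of the variation in y is unexplained
ssr, sst, k = 50.0, 100.0, 2

r2 = r_squared(ssr, sst)
adj_small_n = adjusted_r_squared(ssr, sst, n=101, k=k)
adj_large_n = adjusted_r_squared(ssr, sst, n=100001, k=k)

print("R^2:", r2)
print("adjusted R^2, n = 101:   ", adj_small_n)
print("adjusted R^2, n = 100001:", adj_large_n)
```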
The Davidson-MacKinnon Test, when to use it and intuition behind it
The Davidson-MacKinnon Test can be used to evaluate nested and non-nested models with the same dependent variable
The intuition behind the Davidson-MacKinnon Test: if a particular specification is appropriate, then the fitted values of some alternative specification should not be significant predictors of the outcome variable
The Davidson-MacKinnon Test, steps
(1) y =β0 +β1x1 +β2x2 +u
(2) y = β0 + β1log(x1) + β2log(x2) + u
◦ Estimate Equation (2), the alternative model, and compute the fitted values yˆ(2)
◦ Estimate Equation (1) and add yˆ(2) as a regressor: y = β0 + β1x1 + β2x2 + δyˆ(2) + u
Our null hypothesis is H0: δ = 0
◦ Now conduct a t-test on δ
• If we fail to reject H0 → then there is evidence that Equation (1) is correctly specified
• If we reject H0 → then there is evidence that Equation (1) is misspecified
There is a problem with this test, however → when we run it in both directions, we might reject both models or fail to reject both models
In that case referring to the adjusted R2 can be a good idea
RESET, when to use and intuition behind it
Advantages and Disadvantages
Regression Specification Error Test (RESET)
To evaluate the specification of any two nested or non-nested models, we can also implement the Regression Specification Error Test (RESET)
Suppose we have a standard population model
y =β0 +β1x1 +β2x2 +…+βkxk +u
The intuition behind RESET is that: if a model is properly specified, then non-linear functions of the regressors (i.e., higher order and interaction terms) should not be statistically significant predictors of the dependent variable
Limitation of RESET: No clear guideline of how to proceed if we reject H0
Advantages of RESET: We can conduct RESETs for a set of non-nested models and keep whichever model does not reject H0
We can therefore conduct the RESET by:
1) Generating yˆ^2 and yˆ^3 (the squared and cubed fitted values) from the model we want to evaluate, and then plugging them back into the model:
y = β0 + β1x1 + β2x2 + … + βkxk + δ1yˆ^2 + δ2yˆ^3 + u
2) Conducting an F-test with the null hypothesis
H0: δ1 = δ2 = 0
H1: H0 does not hold
If we fail to reject H0 → there is no evidence of non-linear relationships between the dependent and independent variables that our original model missed, so we treat the model as correctly specified
If we reject H0 → There are non-linearities that we haven’t accounted for, and our model is therefore misspecified
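The two RESET steps can be sketched on simulated data where the true relationship is quadratic in x1 but the estimated model is linear. All coefficients are made up and `fit` is a helper introduced here for illustration. The F statistic uses the usual restricted vs. unrestricted SSR form.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300

# True model contains x1^2, which the estimated model will leave out
x1 = rng.uniform(0, 3, n)
x2 = rng.normal(size=n)
y = 1 + x1 + x1 ** 2 + 0.5 * x2 + rng.normal(size=n)


def fit(X, y):
    """Return fitted values and the sum of squared residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    return fitted, np.sum((y - fitted) ** 2)


# Restricted model: linear in x1 and x2
X_r = np.column_stack([np.ones(n), x1, x2])
y_hat, ssr_r = fit(X_r, y)

# Unrestricted model: add the squared and cubed fitted values
X_ur = np.column_stack([X_r, y_hat ** 2, y_hat ** 3])
_, ssr_ur = fit(X_ur, y)

# F statistic for H0: delta1 = delta2 = 0
q = 2                          # number of restrictions
df_ur = n - X_ur.shape[1]      # degrees of freedom of the unrestricted model
f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)

print("RESET F statistic:", f_stat)
```

Because the linear model misses the curvature in x1, the F statistic should come out large here, leading to a rejection of H0.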
We know that omitted variables are often a threat to the zero conditional mean assumption
But! Adding more regressors might not always be a good idea. It is possible to have a bad control problem
This is a problem in which one of your controls is an outcome of another regressor
This can complicate the ceteris paribus interpretation of multiple regression
Ex. we can’t include both alcohol consumption and alcohol tax in the same model because the tax directly affects consumption; the ceteris paribus condition breaks down: we can’t hold alcohol consumption constant if taxes change
Measurement Error, types and consequences for OLS when in the dependent variable
• Most variables that we analyze in empirical research have been measured imprecisely for a variety of reasons
◦ Inappropriate measures of specific concepts
◦ Limited or uneven capacity to measure social, political, economic phenomena
◦ Variables that represent averages of more complex, granular information
Consequences for OLS:
the OLS estimates are less precisely estimated
This is because our error term, ε, contains both u and the measurement error e0, so that (assuming the two are uncorrelated) Var(ε) = Var(u) + Var(e0) > Var(u)
As a result, the error variance is higher when we have measurement error in the dependent variable
The higher the variance of the measurement error → the higher the estimated error variance → Higher standard errors for all of the OLS estimates → Lower statistical significance for all of the OLS estimates
Measurement Error in the Independent Variables
- So, when we use the mismeasured independent variable educ, then the ZCM Assumption breaks down
- This situation is called Classical Errors-in-Variables (CEV)
• Because the ZCM Assumption is not satisfied, OLS is biased and inconsistent
- It can be shown that the direction of this bias is predictable
- Specifically, classical-errors-in-variables gives rise to attenuation bias, which means that the estimate βˆj is biased toward 0 (i.e., towards finding a practically smaller association or effect)
- With CEV, the OLS coefficients are also imprecise
This is not just a problem for the mis-measured regressor
If the mis-measured variable is correlated with the other regressors, which is very likely, then the coefficient estimates on the other regressors will be biased as well, although the direction of that bias is generally hard to predict
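Attenuation bias can be seen in a simulation sketch. With classical measurement error in a single regressor, the OLS slope shrinks toward 0 by the factor Var(x) / (Var(x) + Var(e)), which is the standard bivariate CEV result. The true slope of 2 and the unit variances are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
beta1 = 2.0  # "true" slope, chosen for illustration

x_true = rng.normal(0, 1, n)          # correctly measured regressor
e = rng.normal(0, 1, n)               # classical measurement error
x_obs = x_true + e                    # what we actually observe
y = 1 + beta1 * x_true + rng.normal(0, 1, n)


def slope(x, y):
    """Bivariate OLS slope."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)


b_true = slope(x_true, y)   # regression on the true regressor
b_obs = slope(x_obs, y)     # regression on the mismeasured regressor

# Theoretical attenuation: Var(x) / (Var(x) + Var(e)) = 1 / (1 + 1)
b_expected = beta1 * 1.0 / (1.0 + 1.0)

print("slope using true x:       ", b_true)
print("slope using mismeasured x:", b_obs)
print("attenuated value in theory:", b_expected)
```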
What can we do about measurement error?
- Data cleaning (i.e., check the data and re-code any identified mistakes)
- Instrumental variables estimation
If the data is missing at random…
◦ Then the random sampling assumption still holds
◦ OLS still unbiased and consistent
◦ But (randomly) missing observations reduce the sample size available for a regression → less precise estimates
Instrumental Variables, Conditions
Suppose that we are interested in identifying the effect of x1 on y, but an omitted variable has resulted in a biased OLS estimate βˆ1
- Then an instrumental variable (IV), named z, is a variable that does not show up in Equation 3, but must relate to it in two different ways
1) The instrumental variable z must be “exogenous” to the outcome y
Cov(z,u) = 0: The instrument is uncorrelated with the error term (“Instrument Exogeneity Assumption”)
• This condition means that z is “as good as random” to the outcome variable
We can’t directly test assumption 1: We don’t observe u so we cannot definitively confirm or deny whether it is uncorrelated with a proposed instrument
Researchers have to make a good argument for why an instrument is exogenous, backed up by quantitative tests of the exogeneity assumption (which we will talk about in the next lecture)
2) The instrumental variable z must be good at predicting variation in the independent variable x1
Cov(z,x) ≠ 0: The instrument is correlated with the independent variable for which we want to instrument (education in the previous example) (“Instrument Relevance Assumption”)
• By “correlated,” we mean that the instrument is a substantively and statistically significant predictor of the instrumented variable
the more correlated, the better
We can directly test assumption 2: if we regress the instrumented variable x on the instrument z (while controlling for the other regressors in the main population model), then we can directly observe the partial association between x and z
The direction of the relationship should make sense
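A simulation sketch of the two conditions at work. The instrument z is exogenous and relevant by construction, while an omitted variable (called `ability` here, a hypothetical name) biases OLS. The IV estimate comes from running the two stages by hand with NumPy.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20000

z = rng.normal(size=n)             # instrument: unrelated to ability and u
ability = rng.normal(size=n)       # omitted variable (hypothetical)
x = z + ability + rng.normal(size=n)
y = 1 + 2 * x + 3 * ability + rng.normal(size=n)   # true effect of x is 2


def ols(X, y):
    """Return OLS coefficients for the design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# OLS of y on x: biased because ability sits in the error term
b_ols = ols(np.column_stack([const, x]), y)[1]

# First stage: regress x on z (this is also the check of instrument relevance)
pi = ols(np.column_stack([const, z]), x)
x_hat = pi[0] + pi[1] * z

# Second stage: regress y on the predicted values of x
b_iv = ols(np.column_stack([const, x_hat]), y)[1]

print("first-stage slope on z:", pi[1])
print("OLS estimate:", b_ols)
print("IV estimate: ", b_iv)
```

Running the second stage by hand gives the right coefficient but not the right standard errors, so in practice a dedicated 2SLS routine should be used for inference.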
Interpret βˆnjpost (DID)
The estimated coefficient βˆnjpost represents the change over time in the number of employees per restaurant in New Jersey relative to the change over the same period in the number of employees per restaurant in Pennsylvania.
Why DID is better (NJ and Penn ex.)
This is because it accounts for pre-existing differences between New Jersey and Pennsylvania that did not change over the course of the policy process (for instance, class composition or geography). In the previous model, we were not able to tell whether differences in employment between New Jersey and Pennsylvania were attributable to pre-existing differences or to the policy change enacted in New Jersey. With the difference-in-differences model, however, we can estimate how employment outcomes changed in New Jersey relative to Pennsylvania over time, giving us a more reliable estimate of the impact of the policy change.
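A sketch showing that the difference-in-differences of the four group means equals the coefficient on the interaction term. The employment numbers below are invented for illustration and are not the actual New Jersey and Pennsylvania data.

```python
import numpy as np

# Hypothetical restaurant-level data (NOT the real NJ/PA figures)
nj = np.array([0, 0, 0, 0, 1, 1, 1, 1])       # 1 = New Jersey, 0 = Pennsylvania
post = np.array([0, 0, 1, 1, 0, 0, 1, 1])     # 1 = after the policy change
emp = np.array([23.0, 24.0, 21.0, 22.0, 20.0, 21.0, 21.0, 22.0])


def cell_mean(state, period):
    """Mean employment for one state-period cell."""
    return emp[(nj == state) & (post == period)].mean()


# DID from group means: change in NJ minus change in PA
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Same quantity from the regression emp = b0 + b1*nj + b2*post + b3*nj*post + u
X = np.column_stack([np.ones(len(emp)), nj, post, nj * post])
beta = np.linalg.lstsq(X, emp, rcond=None)[0]
beta_njpost = beta[3]

print("DID from means:      ", did_means)
print("interaction estimate:", beta_njpost)
```

The coefficient on nj alone picks up the pre-existing gap between the two states, which is exactly what the interaction term nets out.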
Advantages and Disadvantages of Probit
The advantages of the Probit model (and maximum likelihood estimation models in general) relative to LPM are several. First, Probit models always predict probabilities bounded by 0% and 100%, while the LPM can predict probabilities outside of this range. Second, the Probit model automatically captures non-linear effects because it estimates changes in the probabilities implied by (linear) changes in the z-scores under a standard normal probability distribution. Third, the Probit model builds the non-constant variance of a binary outcome into the likelihood, so heteroskedasticity of the kind that afflicts the LPM does not invalidate its standard errors (provided the model is correctly specified).
The disadvantages of the Probit model relative to LPM are that Probit estimates do not have a straightforward interpretation (they correspond with z-scores under a standard normal probability distribution), and that they rely on the assumption of a normal distribution of the error term, an assumption that we do not need to make under OLS when the sample size is large. One consideration of the LPM is that, while it does not always predict probabilities bounded by 0% and 100%, it can reliably capture the ceteris paribus effect of a particular regressor on a binary outcome for observations with values close to the average for the regressor of interest.
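A sketch contrasting the two models' predictions, using only the standard library. The coefficients are hypothetical and x can be thought of as years of parents' education. The Probit probability is Φ(g0 + g1 x), and its marginal effect is φ(g0 + g1 x) ∗ g1, which varies with x.

```python
import math


def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


# Hypothetical coefficients, for illustration only
g0, g1 = -2.0, 0.25    # Probit index: z = g0 + g1 * x
b0, b1 = 0.0, 0.08     # LPM: P(y = 1) = b0 + b1 * x


def probit_prob(x):
    return normal_cdf(g0 + g1 * x)


def probit_marginal_effect(x):
    return normal_pdf(g0 + g1 * x) * g1


def lpm_prob(x):
    return b0 + b1 * x


xs = [0, 8, 16]
for x in xs:
    print(
        "x =", x,
        "| LPM prob:", lpm_prob(x),
        "| Probit prob:", probit_prob(x),
        "| Probit marginal effect:", probit_marginal_effect(x),
    )
```

With these made-up numbers the LPM prediction at x = 16 exceeds 1, while the Probit prediction stays inside the unit interval and its marginal effect is largest near the middle of the range.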
RESET Intuition and how you test
The intuition behind the RESET is that if a model is properly specified, then non-linear functions of the regressors (i.e., higher order and interaction terms) should not be statistically significant predictors of the dependent variable. We can obtain particular non-linear transformations of the covariates without expending a large number of degrees of freedom by computing and including yˆ^2 and yˆ^3 (the squared and cubed fitted values).
Under the RESET, the null hypothesis is that the non-linear transformations of the regressors yˆ^2 and yˆ^3 are not systematically associated with the outcome variable.
F test of R^2 in restricted and unrestricted
Intuition of Over ID
Under the test of overidentifying restrictions, the dependent variable is the residual from the IV 2SLS estimation. This residual captures the variation in lwage (logged wages) that is not explained by the predictors in the second stage equation, which are exogenous by assumption. Consequently, if the instrumental variables are not statistically significant predictors of the IV 2SLS residuals, after controlling for the exogenous controls from the structural equation, then this would provide evidence that the instruments are indeed exogenous.
Why SE higher in IV vs. OLS
The IV 2SLS strategy relies only on the portion of the total variation in the endogenous variable that is predicted by the instrumental variable, whereas OLS employs the total variation in the endogenous variable; less usable variation in the regressor means larger standard errors
libcrd as valid instrument for educ?
For libcrd14 to be a valid instrument for educ, it must satisfy the instrument exogeneity and instrument relevance assumptions. For the instrument exogeneity assumption to hold, owning a library card at the age of 14 must be “as if” random to wage outcomes. That is, individuals who differ in the unobserved determinants of wages are equally likely to have held a library card at the age of 14. Yet another way of saying this is that library card ownership should not exhibit any direct effect on wages after controlling for the covariates in the structural equation.
For the instrument relevance assumption to hold, possessing a library card at adolescence needs to exhibit a statistically and substantively significant association with subsequent educational achievement. If both of these assumptions hold, it follows that library card possession influences wage outcomes only through its effect on educational attainment.