Empirical considerations Flashcards
(18 cards)
Why does correlation not imply causation?
We cannot control for all possible explanatory variables through an OLS regression. This problem is called omitted confounders, or omitted variable bias.
OLS regressions may produce precisely estimated parameters that are significantly different from 0 but could still be "wrong" - wrong in the sense that they fail to capture the true causal effect of the variable of interest.
Why is it important to control for all possible confounders rather than simply trust correlation?
It matters when we make policy, because we could design a policy using the wrong tool.
E.g., suppose we want to raise people's earnings. Based on a regression, we find that more years of education correlate with higher earnings, but we have excluded another possible explanatory variable - intelligence - which affects both years of education and earnings. We therefore likely estimate a larger effect of years of education than the true one, because part of it is driven by omitted variables such as intelligence. A policy designer might conclude that simply increasing years of education would raise earnings, but that could be a poor strategy if the goal is effective policy action.
What do we call omitted confounders?
It may be referred to as endogeneity issues or an identification problem.
Why is it so difficult to exclude all potential confounders?
In the absence of a credible source of exogenous variation, like in an experimental setting, it is very difficult (if not impossible!) to claim that you have been able to exclude all potential confounders from your analysis.
What are some of the common approaches in the quest for causal identification?
Instrumental Variables (IVs)
Natural experiments
Randomized controlled trials (RCTs)
What is an instrumental variable?
An instrumental variable (aka an instrument) induces changes in the explanatory variable of interest (e.g., years of schooling), but has no independent effect on the dependent variable (e.g., earnings).
- Intuitively, the researcher uses the variation in 𝑥 induced by variation in the instrument (𝑧) to measure the impact of 𝑥 on 𝑦.
- Recall what the problem is: we cannot be sure that the impact of 𝑥 on 𝑦 is causal because there may be other omitted factors that are correlated with both 𝑥 and 𝑦.
- Using a valid IV, one that induces variation in 𝑥 but has nothing to do with other omitted determinants of 𝑦, can be (intuitively) viewed as isolating some variation in 𝒙 that derives from changes in 𝒛 and can therefore be deemed independent of other factors driving 𝒚.
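The IV intuition above can be sketched in a small simulation (all numbers here are illustrative assumptions, not from the text): an omitted "ability" term confounds OLS, while the Wald/2SLS ratio uses only the variation in x induced by z and recovers the assumed true effect.

```python
# Minimal sketch: omitted confounder biases OLS; the instrument does not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
ability = rng.normal(size=n)                    # omitted confounder
z = rng.normal(size=n)                          # instrument: moves x, unrelated to ability
x = z + ability + rng.normal(size=n)            # e.g. years of schooling
y = 0.5 * x + ability + rng.normal(size=n)      # true causal effect of x on y is 0.5

# OLS of y on x picks up the confounder and overstates the effect.
C = np.cov(x, y)
beta_ols = C[0, 1] / C[0, 0]

# IV with a single instrument: the Wald ratio cov(z, y) / cov(z, x),
# i.e. only the variation in x induced by z is used.
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(round(beta_ols, 2))   # well above 0.5 (confounded)
print(round(beta_iv, 2))    # close to the true 0.5
```

With one instrument this ratio is numerically identical to two-stage least squares; the sketch only illustrates the isolation-of-variation argument, not a full estimation workflow.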
How can we challenge a valid IV?
Suppose we believe distance to school at age 5 to be a valid instrument for years of education in an earnings regression.
Assumption/intuition: being geographically close to a school as a child influences the likelihood of completing school, but proximity is due to random factors that do not directly influence earnings.
Such assumptions are often subject to challenges (e.g., what if people's geographic location has something to do with their ability/productivity, or their parents' wealth?).
The researcher's main job is to provide plausible arguments for the validity of the instrument.
What is a natural experiment?
Natural experiments occur when…
Certain individuals (or clusters of individuals) in the population of
interest are randomly exposed to a treatment due to events that are
outside the researcher’s control (and are therefore considered natural).
* By “randomly”, we mean exogenously, independently of their
characteristics or actions (i.e., allocation to treatment is naturally
random).
* By “treatment”, we may mean a sudden occurrence that may be linked
to changes in personal circumstances, economic conditions, policies.
Sudden policy changes are a classic candidate for this type of analysis.
Give an example of a natural experiment.
Natural experiments (Card and Krueger 1994)
* A pioneering example in this literature is the study by Card and
Krueger (1994) who compared the labour markets of two
neighbouring states in the US, only one of which had experienced a
hike in the Minimum Wage.
* The workers in one state were exogenously exposed to a policy
change (i.e., a treatment), which did not apply to those in the other
state (who, therefore, constituted a valid comparison group, as in a
controlled experiment).
One of the main difficulties related to this method is convincing
readers about the exogeneity of the change (e.g., what if policy
decisions were influenced by economic conditions that differed
across states? What if the change was anticipated, rather than
unexpected?)
What is a randomized controlled trial (RCT)?
- In an RCT, the researcher is (in principle) fully in control of the respondents' exposure to a given treatment.
- As in a medical trial, study participants are randomly split into a treatment and a control group.
- Comparing the two groups after treatment allows for clean causal identification of the effects, since the random assignment ensures that all the potential confounders are the same (on average) between the two groups.
- For example, in an RCT, my co-authors and I tested the impact of different policies to help young Ethiopian jobseekers find better jobs.
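The random-split-and-compare logic can be sketched with hypothetical data (the 5.0 treatment effect and the outcome scale are assumptions of the sketch, not results from any study):

```python
# Sketch: randomization balances confounders, so a simple
# difference in means estimates the causal effect.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
treated = rng.integers(0, 2, size=n).astype(bool)   # coin-flip assignment
confounder = rng.normal(loc=50, scale=10, size=n)   # e.g. baseline ability
outcome = confounder + 5.0 * treated + rng.normal(scale=5, size=n)

# The confounder is (on average) identical across the two groups,
# so it drops out of the comparison.
ate_hat = outcome[treated].mean() - outcome[~treated].mean()
print(round(ate_hat, 1))    # close to the assumed effect of 5
```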
What are the differences between a natural experiment and an RCT?
Unlike with a true experiment (RCT), where treatment and control groups are randomly and
explicitly chosen, the control and treatment groups in natural experiments arise from
the particular policy change.
What do we need in order to build an estimation method around a natural experiment?
In order to control for systematic differences between the
control and treatment groups, we need two years of data, one before the policy change and one after the change. Thus, our sample is usefully broken down into four groups:
the control group before the change, the control group after the change, the treatment
group before the change, and the treatment group after the change.
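The four-group breakdown above is the difference-in-differences estimator; a minimal sketch with simulated data (the group gap, time trend, and 2.0 policy effect are all assumed numbers):

```python
# Sketch of the four-group (difference-in-differences) comparison.
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
treat_state = rng.integers(0, 2, size=n).astype(bool)   # state hit by the policy
after = rng.integers(0, 2, size=n).astype(bool)         # observed after the change
y = (3.0 * treat_state                                  # permanent group gap
     + 1.5 * after                                      # common time trend
     + 2.0 * (treat_state & after)                      # true policy effect
     + rng.normal(size=n))

def cell(mask):
    """Mean outcome in one of the four groups."""
    return y[mask].mean()

# (treated after - treated before) - (control after - control before)
did = ((cell(treat_state & after) - cell(treat_state & ~after))
       - (cell(~treat_state & after) - cell(~treat_state & ~after)))
print(round(did, 1))    # group gap and time trend cancel, leaving ~2.0
```

The same estimate can equivalently be obtained as the coefficient on the interaction term in a regression of y on the treatment dummy, the after dummy, and their product.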
How do I test the validity of my IV-estimation?
- Relevance of the Instrument
Criticism: The instrument may be weak.
The instrument must be strongly correlated with the endogenous regressor. If not, IV estimates can be severely biased or imprecise.
Weak instruments can lead to large standard errors and misleading confidence intervals.
Check the first-stage F-statistic on the excluded instruments: as a rule of thumb, a value below 10 signals a weak instrument (Stock & Yogo, 2005), while a value above 10 suggests that weak-instrument bias is less of a concern.
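As a sketch of how the rule of thumb might be checked (the 0.3 instrument strength and sample size are assumptions; with a single instrument, the first-stage F equals the squared t-statistic on that instrument):

```python
# Sketch: first-stage regression of x on the instrument z,
# then the F-statistic for the excluded instrument.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(size=n)
x = 0.3 * z + rng.normal(size=n)        # endogenous regressor

Z = np.column_stack([np.ones(n), z])    # constant + instrument
coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
resid = x - Z @ coef
sigma2 = resid @ resid / (n - 2)        # residual variance
se_z = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])

F = (coef[1] / se_z) ** 2               # first-stage F for the single instrument
print(F > 10)                           # True: this instrument clears the threshold
```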
- Exogeneity (Exclusion Restriction)
Criticism: The instrument may not be truly exogenous.
The instrument must not be correlated with the error term in the second-stage equation.
If the instrument affects the outcome variable through any other channel besides the endogenous regressor, the exclusion restriction is violated.
This is the most difficult assumption to test directly, so you must often rely on theoretical justification and institutional knowledge—which is open to critique.
Instruments derived from policy changes, geography, or historical variables may be correlated with unobserved confounders.
For example, using distance to a school as an instrument for education might correlate with household income or neighborhood characteristics.
- Sample Size and Finite Sample Bias
Criticism: IV estimates are less efficient than OLS and sensitive to small samples.
Especially with weak instruments, IV estimates in small samples may be worse than biased OLS.
Confidence intervals can be wide, and standard errors can be unstable.
- Lack of Transparency / Weak Justification for Instrument Choice
Criticism: The choice of instrument is not clearly justified or seems ad hoc.
If the paper does not thoroughly argue why the instrument is valid (relevant and exogenous), the entire IV strategy is suspect.
Over-reliance on “plausibly exogenous” instruments without rigorous justification is a red flag.
What then makes an IV estimator strong?
Broadly, the opposite of the weaknesses listed above.
- Strong Institutional or Natural Experiment Setting
Argument: The instrument comes from a quasi-random or natural experiment setting.
Examples include lottery-based assignments, policy changes, draft lotteries, or randomized encouragement designs.
These contexts lend credibility to exogeneity, especially when the assignment is as-good-as-random.
- Replicability and External Support
Argument: The IV strategy has been replicated or supported in other studies or datasets.
Consistent findings across contexts or replication in other papers lend external validity.
Example: Multiple studies using similar instruments (e.g., quarter of birth for education) producing similar conclusions.
What makes a natural experiment weak?
A natural experiment can be a powerful identification strategy—but not all natural experiments are strong. A natural experiment becomes weak when the conditions necessary for credible causal inference are poorly satisfied. Here are key reasons a natural experiment might be weak:
- Lack of Random or As-Good-As-Random Assignment
Weakness: The treatment isn’t truly random or quasi-random.
The “assignment” mechanism may appear exogenous, but in practice, it correlates with unobserved factors (e.g., selection bias).
Example: A policy rolled out in specific regions first, but those regions were wealthier or had different trends already.
- Confounding Events
Weakness: Other events happened around the same time that could also explain the outcome.
If something else (e.g., a major economic shift, policy change, or shock) occurred simultaneously, you can’t isolate the effect.
Time effects get confounded with treatment effects.
- Non-Clear or Poorly Defined Treatment Group
Weakness: It’s unclear who is actually treated or affected by the intervention.
If the treatment group is ambiguous or overlaps with others, effects are muddled.
For instance, a policy change may not have been enforced uniformly or may have had fuzzy eligibility criteria.
- Non-Compliance or Spillovers
Weakness: Subjects do not follow their assigned “treatment”, or effects spill over to others.
Non-compliance weakens internal validity unless you address it with methods like IV.
Spillovers mean the control group may also be affected, contaminating estimates.
- Small Sample Size or Low Statistical Power
Weakness: The experiment doesn’t affect many observations or lacks power to detect effects.
A rare or minor natural experiment may not produce statistically significant results even if there is an effect.
- No Pre-Trends or Parallel Trends Validation
Weakness: You can’t rule out that trends were already diverging before the treatment.
Especially in difference-in-differences settings, you need evidence that pre-treatment trends were similar.
Without this, the treatment group may have just continued on its own trajectory.
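One common way to probe the parallel-trends concern is a placebo test: run the difference-in-differences on pre-treatment data only, pretending treatment started earlier. A minimal sketch (the group means and the period -2 placebo date are assumptions):

```python
# Placebo sketch: under parallel pre-trends, a fake treatment date
# inside the pre-period should yield a diff-in-diff estimate of ~0.
import numpy as np

periods = np.arange(-4, 3)                              # treatment at period 0
treated = 10.0 + 0.5 * periods + 2.0 * (periods >= 0)   # parallel trend + jump at 0
control = 8.0 + 0.5 * periods                           # same 0.5 slope

pre = periods < 0
late_pre = pre & (periods >= -2)                        # placebo "post" window
early_pre = pre & (periods < -2)

placebo_did = ((treated[late_pre].mean() - treated[early_pre].mean())
               - (control[late_pre].mean() - control[early_pre].mean()))
print(round(placebo_did, 6))    # 0.0: no divergence before the actual treatment
```

If this placebo estimate were far from zero, the groups were already diverging before the policy, and the main estimate would be suspect.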
What makes a natural experiment strong?
A strong natural experiment closely approximates a randomized controlled trial (RCT) in how treatment is assigned and allows for credible causal inference. Here are the key features that make a natural experiment strong:
- As-Good-As-Random Assignment
Strength: The treatment is assigned in a way that mimics true randomness.
The variation arises from external, plausibly exogenous forces (e.g., policy changes, natural disasters, lotteries).
Assignment must be independent of potential outcomes and unobserved confounders.
Example: a draft lottery assigning military service based on birthdates.
- Clear and Sharp Definition of Treatment
Strength: You can clearly identify who was treated and who wasn’t.
Treatment and control groups are well-defined and distinguishable.
A sharp cutoff (as in regression discontinuity) or a clearly timed intervention (in event studies or difference-in-differences) helps.
- No or Minimal Selection Bias
Strength: Subjects didn’t self-select into treatment based on unobservables.
Strong natural experiments limit participants’ ability to manipulate or sort into treatment.
Example: sudden, unexpected policy changes with no advance notice.
- Pre-Treatment Balance or Parallel Trends
Strength: Treated and control groups look similar before the intervention.
In difference-in-differences or event study designs, parallel trends in the pre-period support the counterfactual assumption.
In regression discontinuity, covariates are balanced around the cutoff.
- Robustness and Consistency of Results
Strength: Findings are consistent across specifications, subsamples, and robustness checks.
Results don’t change much with different functional forms, added controls, or alternative bandwidths (in RD).
Placebo tests or falsification exercises show no effects where none should be expected.
- Transparency and Institutional Plausibility
Strength: The design is well-explained and grounded in a clear institutional or policy context.
Authors provide a clear narrative of how the variation arose and justify assumptions with real-world logic.
If stakeholders didn’t anticipate or couldn’t manipulate the change, this strengthens causal claims.
What makes an RCT weak?
While Randomized Controlled Trials (RCTs) are often considered the gold standard in causal inference, they can still be weak or flawed depending on how they are designed, implemented, or interpreted.
- Non-Compliance / Attrition
Weakness: Participants don’t follow their assigned treatment, or drop out of the study.
Non-compliance (e.g., assigned to treatment but don’t take it) undermines intention-to-treat (ITT) interpretation.
High attrition—especially if differential between treatment and control—introduces selection bias.
Example: If sicker individuals drop out more in the control group, it biases health outcome comparisons.
- Small Sample Size / Low Statistical Power
Weakness: The study is too small to detect meaningful effects.
Increases the chance of Type II errors (failing to detect a true effect).
Even when statistically significant, small samples can lead to highly variable or unstable estimates (the “winner’s curse”).
- Implementation Issues / Contamination
Weakness: The treatment isn’t applied consistently, or the control group is inadvertently exposed.
Poorly implemented protocols reduce the reliability of estimated treatment effects.
Example: Control group accessing parts of the treatment due to spillovers or logistical mistakes.
- Lack of External Validity
Weakness: Results don’t generalize to other contexts or populations.
The RCT setting may be too controlled, unrepresentative, or artificial.
If participants are selected from a narrow group (e.g., college students, volunteers, or a single village), findings may not apply elsewhere.
- Short Time Horizon
Weakness: Effects are only measured in the short term.
Some treatments take time to materialize; early results might mislead about long-run effects.
Example: Educational interventions may not show outcomes until years later.
- Hawthorne Effect / Behavioral Changes
Weakness: Participants behave differently simply because they know they’re being studied.
This distorts natural behavior, making the RCT less representative of real-world conditions.
- Ethical Constraints
Weakness: Ethical limitations may constrain design or create selection bias.
Sometimes only certain people can ethically be included, or control groups must receive a minimum level of service—limiting contrast.
Example: You can’t deny lifesaving treatments, so the “control” still receives an intervention, diluting the estimated effect.
- Imperfect Randomization
Weakness: Random assignment fails due to design flaws or manipulation.
If baseline characteristics are imbalanced, it can undermine the claim of comparability.
True randomness is often assumed but may not be realized in small samples or poor protocols.
- Measurement Error or Subjectivity in Outcomes
Weakness: Outcomes are noisy or self-reported, leading to biased results.
Especially problematic for soft outcomes (e.g., happiness, self-reported health) subject to reporting bias.
Also an issue if outcome assessors aren’t blinded to treatment assignment.
What makes an RCT strong?
A Randomized Controlled Trial (RCT) is considered strong when it closely adheres to the principles of rigorous experimental design and minimizes threats to internal and external validity. Here’s when an RCT is strong and credible:
- Proper Randomization
Strength: Treatment is randomly assigned and well-implemented.
Ensures baseline comparability between treatment and control groups.
No selection bias or manipulation of assignment.
Randomization is often stratified or blocked to ensure balance on key variables.
- Low Attrition and High Compliance
Strength: Participants stick with their assigned treatment and remain in the study.
Minimizes non-compliance and dropout-related bias.
Attrition is low and balanced across groups.
Allows for both intention-to-treat (ITT) and treatment-on-treated (TOT) analyses.
- Sufficient Sample Size and Power
Strength: The study is adequately powered to detect meaningful effects.
Reduces chance of Type II errors.
Ensures effect estimates are precise and reliable.
Power calculations are done before the trial begins.
- Clear and Well-Defined Treatment
Strength: The intervention is specific, replicable, and consistently delivered.
Easy to distinguish between treatment and control conditions.
No contamination or partial treatment effects in control group.
Implementation fidelity is high (treatment was delivered as planned).
- Blinding (When Possible)
Strength: Participants and/or researchers are unaware of group assignments.
Reduces bias in behavior, reporting, and outcome assessment.
Especially important for subjective or qualitative outcomes (e.g., health, pain, mood).
- Objective and Reliable Outcome Measures
Strength: Outcomes are clearly defined, accurately measured, and not prone to manipulation.
Objective metrics (e.g., test scores, lab results, income) reduce measurement bias.
If self-reported outcomes are used, they are validated or cross-checked.
- Long Enough Follow-Up
Strength: The study captures both short- and long-term effects of the intervention.
Measures persistence of treatment effects over time.
Important in areas like education, health, and labor market outcomes.
- No Spillovers or Contamination
Strength: Control group remains unaffected by the treatment.
Ensures a clean counterfactual comparison.
Study design (e.g., cluster randomization, geographic separation) prevents treatment-control interference.
- High External Validity (When Relevant)
Strength: The population and context are representative or policy-relevant.
Participants are drawn from a population of practical interest (e.g., schools, clinics, regions).
Replicability across settings enhances generalizability.
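The pre-trial power calculation mentioned above can be sketched with the standard two-sample size formula (the 0.2-standard-deviation effect, 80% power, and 5% significance level are illustrative assumptions):

```python
# Sketch: required sample size per arm for a two-sample comparison,
# n = 2 * (z_{1-alpha/2} + z_power)^2 * (sd / effect)^2
import math
from statistics import NormalDist

def sample_size_per_arm(effect, sd, alpha=0.05, power=0.8):
    """Participants needed per arm to detect `effect` with the given power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)      # ~1.96 at the 5% level
    z_power = z.inv_cdf(power)              # ~0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * (sd / effect) ** 2)

# Detecting a 0.2 standard-deviation effect with 80% power at the 5% level:
print(sample_size_per_arm(effect=0.2, sd=1.0))   # 393 participants per arm
```

Note how the required sample grows with the inverse square of the effect size: halving the detectable effect roughly quadruples the sample needed.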