Exam 3 Flashcards
(14 cards)
Linear Regression Questions
We will be using the following dataset, df:
- country: country
- year: year
- democracy: a binary variable indicating whether the country is a democracy according to Boix, Miller, and Rosato (BMR). Values of 1 mean the country is a democracy, and values of 0 mean the country is an autocracy/dictatorship.
- nr: natural resources as a percent of GDP (World Bank 2023)
- gdp_growth: economic growth measured via percent changes in GDP (World Bank 2023)
- What code would I use in R to obtain summary statistics for this dataset, df, as I have done below?
summary(df)
- The output below presents a bivariate linear regression of gdp_growth on democracy. Fully interpret the regression. What do the results suggest?
Call:
lm(formula = gdp_growth ~ democracy, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-55.242  -1.895   0.068   2.123  81.923
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.9039     0.1512   32.43  < 2e-16 ***
democracy    -1.4337     0.1983   -7.23 6.34e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.93 on 2538 degrees of freedom
Multiple R-squared: 0.02018, Adjusted R-squared: 0.0198
F-statistic: 52.28 on 1 and 2538 DF, p-value: 6.343e-13
The bivariate regression above suggests that shifting from autocracy to democracy (a one-unit increase in the binary democracy measure) is associated with a 1.43 percentage point decrease in economic growth. The intercept indicates that average growth among autocracies is about 4.9%, so average growth among democracies is about 3.5%. In other words, the model is estimating the effect of a democratic transition on economic growth, although with observational data this is an association rather than a proven causal effect. We can be rather confident in that result because it is statistically significant (the p-value is far below .05). However, the R-squared value is rather low at .02, suggesting the regression explains only 2 percent of the variation in the dependent variable, economic growth. The adjusted R-squared is very similar to the regular R-squared, so the regular R-squared is not being inflated by the number of predictors. Finally, the p-value on the F-statistic is highly statistically significant, suggesting that our model is better than a null model with only the intercept. Thus, we seem to be on the right track here, but a model with only one variable is probably incomplete, as the low R-squared suggests.
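To make the interpretation of a binary predictor concrete, here is a minimal sketch on simulated data (the data frame `sim` and all of its numbers are invented for illustration and are not the course dataset). It shows that, with a single 0/1 predictor, the intercept is the mean of the outcome in the 0 group and the slope is the difference in group means.

```r
# Simulated stand-in for df; every number here is an assumption for illustration.
set.seed(1)
n <- 500
sim <- data.frame(democracy = rbinom(n, 1, 0.5))
sim$gdp_growth <- 5 - 1.5 * sim$democracy + rnorm(n, sd = 5)

fit <- lm(gdp_growth ~ democracy, data = sim)

# Mean growth within each regime type (names are "0" and "1").
group_means <- tapply(sim$gdp_growth, sim$democracy, mean)
diff_in_means <- unname(group_means["1"] - group_means["0"])

coef(fit)                   # intercept = mean for autocracies; slope = difference in means
diff_in_means               # matches the democracy coefficient
summary(fit)$r.squared      # share of the variation in growth that the model explains
```

Applied to the output above, the same logic gives roughly 4.90 for autocracies and 4.90 - 1.43 = 3.47 for democracies.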
Side note for the students: these results are very much consistent with the literature. That literature suggests that the process of democratization is usually associated with a decline in economic growth at the initial stages; however, after a few years, the outcome usually changes and, over the long run, democracy is associated with an increase in economic growth. See, for example, the work of Daron Acemoglu & James Robinson as well as Adam Przeworski. They suggest that the reason for the results is that democratic transitions involve a lot of turmoil within a regime, which hurts stability and, in turn, economic growth in the initial stages of democratization. Over the long run, democracy contributes to economic growth, notably because democracies are better at distributing public goods (e.g., education, health, clean air), which facilitates a better economic climate for growth.
- The regression below is a multivariate one that adds nr (natural resources) to the regression. Draw a labeled causal diagram that adequately depicts the purported relationship.
Call:
lm(formula = gdp_growth ~ democracy + nr, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-56.466  -1.804   0.082   2.196  79.581
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.177400   0.198805  21.013  < 2e-16 ***
democracy   -0.917369   0.217732  -4.213 2.60e-05 ***
nr           0.051907   0.009297   5.583 2.61e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.901 on 2537 degrees of freedom
Multiple R-squared: 0.03208, Adjusted R-squared: 0.03131
F-statistic: 42.04 on 2 and 2537 DF, p-value: < 2.2e-16
Z
↓ ↘
X → Y
In the causal diagram above:
* 𝑋 refers to democracy, which is the treatment/main independent variable of interest
* 𝑌 refers to economic/GDP growth, the outcome
* 𝑍 refers to natural resources, which is the (common-cause) confounder
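The role of the confounder in the diagram can be illustrated with a small simulation. This is a sketch under invented assumptions (the variables `x`, `y`, `z` and all effect sizes are hypothetical): Z affects both X and Y, so leaving Z out of the regression biases the coefficient on X, while including Z recovers the effect that was built into the simulation.

```r
set.seed(2)
n <- 5000
z <- rnorm(n)                          # confounder (stand-in for natural resources)
x <- rbinom(n, 1, plogis(-z))          # higher Z makes X = 1 less likely
y <- 1 - 1 * x + 0.5 * z + rnorm(n)    # simulated effect of X on Y is -1

short <- lm(y ~ x)       # omits Z, so the path X <- Z -> Y stays open
long  <- lm(y ~ x + z)   # adjusts for Z, closing that path

coef(short)["x"]   # more negative than -1: part of Z's influence is attributed to X
coef(long)["x"]    # close to the simulated effect of -1
```

The direction of the change mirrors the regressions above, where the democracy coefficient moves toward zero once nr is added.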
- Fully interpret the multivariate regression (above) using all of the statistics that we spoke about in class. Does this multivariate regression appear to be better than the bivariate one above? Why or why not?
Call:
lm(formula = gdp_growth ~ democracy + nr, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-56.466  -1.804   0.082   2.196  79.581
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.177400   0.198805  21.013  < 2e-16 ***
democracy   -0.917369   0.217732  -4.213 2.60e-05 ***
nr           0.051907   0.009297   5.583 2.61e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.901 on 2537 degrees of freedom
Multiple R-squared: 0.03208, Adjusted R-squared: 0.03131
F-statistic: 42.04 on 2 and 2537 DF, p-value: < 2.2e-16
In the above regression, a transition from autocracy to democracy (i.e., a 1-unit increase in democracy) is associated with a 0.92 percentage point decrease in economic growth, holding natural resources constant. Additionally, a 1 percentage point increase in natural resources as a share of GDP is associated with a 0.05 percentage point increase in economic growth, holding regime type constant. Both coefficients are statistically significant. The R-squared rises to 3.2%, suggesting that this multivariate regression is capturing a greater share of the variation in the outcome, GDP growth, than the bivariate regression (2%). The adjusted R-squared is very similar to the regular R-squared, too, and it also increases. However, the F-statistic in the multivariate model (42) is lower than that of the bivariate model (52). That drop alone does not show that natural resources is a poor addition: the overall F-statistic is averaged over the number of predictors, so it can fall when a variable is added even if that variable matters, and here nr is itself highly significant (t = 5.58). The more telling change is that the democracy coefficient shrinks from -1.43 to -0.92 once nr is included, which is what we would expect if 𝑍 (natural resources as a percent of GDP) is related to both 𝑋 (democracy) and 𝑌 (GDP growth). That is why we always want to know exactly what we are estimating, and causal diagrams are useful.
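One way to check whether an added variable improves a model is to compare the two nested models directly rather than comparing their overall F-statistics. The sketch below uses simulated data (the data frame `sim` and its coefficients are assumptions, not the course dataset) and base R's anova(), which reports a partial F-test for the added term.

```r
set.seed(3)
n <- 1000
sim <- data.frame(nr = runif(n, 0, 40))
sim$democracy  <- rbinom(n, 1, plogis(1 - 0.1 * sim$nr))
sim$gdp_growth <- 4 - 1 * sim$democracy + 0.05 * sim$nr + rnorm(n, sd = 5)

m1 <- lm(gdp_growth ~ democracy, data = sim)        # bivariate model
m2 <- lm(gdp_growth ~ democracy + nr, data = sim)   # adds nr

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison.
c(bivariate = summary(m1)$adj.r.squared,
  multivariate = summary(m2)$adj.r.squared)

# Partial F-test: does adding nr significantly reduce the residual sum of squares?
cmp <- anova(m1, m2)
cmp

# With a single added variable, the partial F equals the squared t value on that variable.
t_nr <- summary(m2)$coefficients["nr", "t value"]
c(partial_F = cmp$F[2], t_squared = t_nr^2)
```

By the same identity, the t value of 5.583 on nr in the output above corresponds to a partial F of about 31.2 for adding nr.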
- Provide the code necessary in R to estimate a linear probability model (LPM), testing whether natural resources lead to democracy.
lm(democracy ~ nr, data = df)
As we may discern, an LPM is just linear regression with a binary dependent variable.
- Why might the LPM be problematic, and why might you want to choose logistic regression over the LPM? Provide at least two reasons. (Hint: you may want to draw something.)
- The LPM forces predictions to be linear.
– Clearly, not everything is linear.
– Imposing linearity can artificially lower variance (as compared to bias), leading to artificially low p-values and false overconfidence in results.
- The LPM can estimate predicted probabilities for the dependent variable that are less than 0 or greater than 1.
– Recall: By definition, all valid probabilities must fall between 0 and 1.
- The LPM estimates predicted values of the dependent variable that are not zero or one, leading to heteroskedasticity (non-constant error variance) and the largest possible errors where the predicted values of the dependent variable are near 0.5.
– If errors are large when predictions are close to 0.5, we will have trouble classifying everything into a 0 or 1 bucket. For example, if the predicted probability of defaulting on a loan was 0.51, a loan officer wouldn’t be too sure whether the person would default. We would prefer a number like 0.9 (high probability of default) or 0.1 (low probability of default).
- The logistic distribution constrains probabilities to valid ones between 0 and 1.
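The out-of-range problem described above can be seen in a short simulation. This is a sketch on invented data (the data frame `sim`, the grid of nr values, and the coefficients are all assumptions): it fits an LPM and a logistic regression to the same binary outcome and compares their predicted probabilities.

```r
set.seed(4)
n <- 1000
sim <- data.frame(nr = runif(n, 0, 60))
sim$democracy <- rbinom(n, 1, plogis(1.1 - 0.1 * sim$nr))

lpm   <- lm(democracy ~ nr, data = sim)                                # linear probability model
logit <- glm(democracy ~ nr, family = binomial("logit"), data = sim)   # logistic regression

# Predict at a few values of nr, including high values.
grid <- data.frame(nr = c(0, 20, 40, 60, 80))
p_lpm   <- predict(lpm, newdata = grid)
p_logit <- predict(logit, newdata = grid, type = "response")

cbind(grid, p_lpm, p_logit)   # the LPM column dips below 0; the logit column stays within (0, 1)
```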
- Below is a stock logistic regression output. What can you easily interpret from it? Be specific and provide any interpretations, as appropriate.
Call:
glm(formula = democracy ~ nr, family = binomial("logit"), data = df)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.126456   0.058218   19.35   <2e-16 ***
nr          -0.106497   0.005942  -17.92   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3453.4 on 2539 degrees of freedom
Residual deviance: 2922.1 on 2538 degrees of freedom
AIC: 2926.1
Number of Fisher Scoring iterations: 5
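You can easily interpret the sign of the effect of nr, which is negative: more natural resources are associated with a lower likelihood of democracy. You can also easily interpret the p-value, which indicates that the relationship is statistically significant. It is difficult to interpret the size of the coefficient directly, because it is expressed in terms of log odds.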
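Although log odds are hard to read directly, they can be converted into more familiar quantities. The sketch below plugs in the two coefficients from the output above; the helper `p_at` and the nr values 0, 10, and 20 are hypothetical choices made only for illustration.

```r
b0 <- 1.126456    # intercept from the output above: log odds of democracy when nr = 0
b1 <- -0.106497   # coefficient on nr: change in log odds per one-unit increase in nr

# Exponentiating a logit coefficient gives an odds ratio.
odds_ratio <- exp(b1)
odds_ratio        # below 1, so each additional unit of nr lowers the odds of democracy

# plogis() is the inverse logit: it maps log odds to a probability.
p_at <- function(nr) plogis(b0 + b1 * nr)
probs <- p_at(c(0, 10, 20))
probs             # predicted probability of democracy falls as nr rises
```

The odds ratio is roughly 0.90, i.e., about a 10 percent reduction in the odds of democracy per additional percentage point of natural resources, and the predicted probability at nr = 0 is about 0.76.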
- The output below presents the marginal effects from the logistic regression. Fully interpret them.
 Term Estimate Std. Error     z Pr(>|z|)     S   2.5 %  97.5 %
   nr  -0.0209   0.000876 -23.9   <0.001 415.8 -0.0226 -0.0192
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Type: response
The marginal effects are easy to interpret. They suggest that a 1 percentage point increase in natural resources as a share of GDP (again, see the summary stats above for the percent interpretation) is associated with a 2 percentage point decline in the probability of democracy. As above, the effect is statistically significant, and the 95% confidence interval (-0.0226 to -0.0192) excludes zero.
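For intuition about where such a number comes from, an average marginal effect can be computed by hand. The sketch below uses simulated data (the data frame `sim` and its coefficients are assumptions) and base R only; the table above appears to come from a dedicated package, which is not used here. In a logit model, the marginal effect of a continuous predictor for one observation is the coefficient times p(1 - p), and the average marginal effect is the mean of that quantity over the sample.

```r
set.seed(5)
n <- 2000
sim <- data.frame(nr = runif(n, 0, 40))
sim$democracy <- rbinom(n, 1, plogis(1.1 - 0.1 * sim$nr))

logit <- glm(democracy ~ nr, family = binomial("logit"), data = sim)

p_hat <- fitted(logit)              # predicted probability for each observation
b_nr  <- unname(coef(logit)["nr"])  # log-odds coefficient on nr

# Analytic average marginal effect: mean of b * p * (1 - p).
ame <- mean(b_nr * p_hat * (1 - p_hat))

# Numerical check: nudge nr by a tiny amount and average the change in predictions.
eps  <- 1e-4
p_up <- predict(logit, newdata = transform(sim, nr = nr + eps), type = "response")
ame_numeric <- mean((p_up - p_hat) / eps)

c(analytic = ame, numeric = ame_numeric)
```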
- The table below presents a confusion matrix from the logistic regression. Explain what false positives and false negatives are in general, and identify them in this specific context.
    FALSE TRUE
  0   482  581
  1   180 1297
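In classification, a false positive is a case the model predicts to be positive (here, a democracy) that is actually negative (an autocracy), and a false negative is a case the model predicts to be negative (an autocracy) that is actually positive (a democracy). This parallels hypothesis testing, where a false positive (Type I error) is rejecting a null hypothesis that is actually true and a false negative (Type II error) is failing to reject a null hypothesis that is false. Reading the rows as the observed value of democracy and the columns as whether the model predicted democracy, the false positives are the 581 country-years that the model, based on their natural resources, predicted to be democracies but that were actually autocracies; the false negatives are the 180 country-years that the model predicted to be autocracies but that were actually democracies. The remaining cells are correct predictions: 482 true negatives and 1,297 true positives.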
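The cells of the confusion matrix can be turned into summary measures with a few lines of R. This sketch assumes the usual table(actual, predicted) layout, with rows as the observed value of democracy and columns as whether the model predicted democracy; that orientation is an assumption, since the output does not label its axes.

```r
# Rebuild the matrix above (filled column by column: FALSE column first, then TRUE).
cm <- matrix(c(482, 180, 581, 1297), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

tn <- cm["0", "FALSE"]   # autocracies correctly predicted as autocracies
fp <- cm["0", "TRUE"]    # autocracies wrongly predicted as democracies
fn <- cm["1", "FALSE"]   # democracies wrongly predicted as autocracies
tp <- cm["1", "TRUE"]    # democracies correctly predicted as democracies

accuracy    <- (tp + tn) / sum(cm)   # share of all predictions that are correct
sensitivity <- tp / (tp + fn)        # share of actual democracies that are caught
specificity <- tn / (tn + fp)        # share of actual autocracies that are caught

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Under that reading, the model is right about 70 percent of the time, and it is much better at identifying democracies than autocracies.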
- What is the fundamental problem of causal inference? Does randomization help overcome it? If so, how?
The fundamental problem of causal inference is that you can’t observe each unit in both its treatment and control states. For example, if I went to the movies at a certain time, we can’t observe a world in which I didn’t go to the movies at that specific time. Randomization does not let us observe both states for any single unit, but it helps overcome the problem at the group level: because assignment is random, potential confounders are balanced across the treatment and control groups on average (increasingly so as the sample size grows), so the control group stands in for the unobserved counterfactual and the difference in average outcomes can be attributed to the treatment.
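A simulation makes the logic visible, because in simulated data we can write down both potential outcomes for every unit, which is exactly what real data never allow. The sketch below is built on invented assumptions (the confounder `ability`, the treatment effect of 1, and all other numbers are hypothetical). It compares a setting where units select into treatment with one where treatment is assigned at random.

```r
set.seed(6)
n <- 10000
ability <- rnorm(n)               # confounder that the researcher does not observe
y0 <- 2 + ability + rnorm(n)      # potential outcome without treatment
y1 <- y0 + 1                      # potential outcome with treatment (simulated effect = 1)

# Self-selection: high-ability units are more likely to take the treatment.
d_self <- rbinom(n, 1, plogis(2 * ability))
y_self <- ifelse(d_self == 1, y1, y0)       # only one potential outcome is ever observed
naive  <- mean(y_self[d_self == 1]) - mean(y_self[d_self == 0])

# Randomization: treatment is assigned by coin flip, independent of ability.
d_rand <- rbinom(n, 1, 0.5)
y_rand <- ifelse(d_rand == 1, y1, y0)
experimental <- mean(y_rand[d_rand == 1]) - mean(y_rand[d_rand == 0])

c(naive = naive, experimental = experimental)   # only the experimental estimate is near 1
```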
- What is a field experiment? How is it different from a lab and a survey experiment? Also, provide some pros and cons of the different methods.
A field experiment randomizes a treatment in a real-world environment. Accordingly, field experiments have ecological validity, especially as compared to lab experiments and survey experiments, which tend to suffer from Hawthorne and demand effects (i.e., respondents changing their answers in response to what they think are the desired answers or the fact that they are under scrutiny). Survey experiments are easier to target at a larger share of the population, which often gives them higher external validity than lab and field experiments. The best type of experiment for controlling the environment is the lab experiment, but that control often comes at the cost of realism.
- What is a regression discontinuity design? Explain it fully in words and draw the relevant graph. Use an example to demonstrate your knowledge of
the method.
RD mainly relies on a continuous variable (e.g., score on a standardized test), often called the running variable, which deterministically assigns each observation into either treatment (e.g., getting a test score above the cutoff) or control (e.g., getting a test score below the cutoff). Assuming that the treatment-control cutoff is genuine, that units cannot precisely manipulate which side of the cutoff they land on (i.e., no sorting, so observations do not bunch on one side), and that all other potentially relevant covariates change smoothly across the cutoff, researchers can generally consider observations right on either side of the cutoff to be as-if randomly assigned. In such instances, the RD can be a powerful method for uncovering an Average Treatment Effect (ATE) at the cutoff, which the literature also calls a Local Average Treatment Effect (LATE).
To make RD more concrete, let’s continue with a test-related example from Hoekstra (2009). In this study, Hoekstra (2009) examines whether attending a flagship university
in an unnamed state affects the lifetime earnings of attendees. The example is a good one for RD because the university used an arbitrary threshold cutoff score for the SAT as a means of determining whether students could be admitted. In reality, the university considered other factors beyond SAT scores, but let’s just assume that the SAT score was the only factor determining admission for the sake of simplicity.
The graph presents Hoekstra’s (2009) main findings. As one can clearly see, the students who received admission to the flagship university earned significantly more money over their lifetime as compared to students who missed the SAT score cutoff. Directly at the arbitrary score cutoff, whether a student just made the cutoff (treated) or just missed it (control) is usually as good as randomly assigned. The reason is that maybe one student who exceeded the threshold had a good breakfast or a good night’s sleep before the test and thus performed just a little better. Alternatively, maybe the students who just missed the cutoff were unlucky with the questions that showed up and performed a little worse. In short, there are many reasons why students could have performed a little better or a little worse on either side of the cutoff. However, in all likelihood, the differences between the students who just barely made it and those who just missed it are probably minimal. Apparently, though, the effects of attending the flagship university in this particular state are not trivial, as such students earn a lot more money and perhaps are happier over the long run. Of course, money does not always buy happiness, but it often does.
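The estimation step behind an RD graph can be sketched in a few lines. The example below uses simulated data and is not Hoekstra’s data or specification: the running variable `score`, the bandwidth `h`, and the jump of 1 at the cutoff are all assumptions. It fits separate lines on each side of the cutoff within a narrow window and reads the treatment effect off the jump at the cutoff.

```r
set.seed(7)
n <- 4000
sim <- data.frame(score = runif(n, -100, 100))   # running variable, centered so the cutoff is 0
sim$admitted <- as.numeric(sim$score >= 0)       # sharp rule: treatment is determined by the cutoff
sim$earnings <- 10 + 0.02 * sim$score + 1 * sim$admitted + rnorm(n, sd = 1)  # simulated jump = 1

# Keep only observations within a bandwidth h of the cutoff.
h <- 25
near <- sim[abs(sim$score) <= h, ]

# Local linear regression: the interaction lets the slope differ on each side of the cutoff.
rd_fit <- lm(earnings ~ admitted * score, data = near)

# The coefficient on admitted is the estimated jump in earnings exactly at the cutoff.
rd_estimate <- unname(coef(rd_fit)["admitted"])
rd_estimate
```

A narrower bandwidth makes the as-if random comparison more credible but leaves fewer observations, so the estimate becomes noisier.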
- In the context of a study examining the impact of oil revenues on democratization, a case study examining the impact of oil revenues on highly democratic Norway would be an example of what type of case? Provide the name of the type of case and explain why that is a good case type for this issue.
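It would be a deviant case, because normally natural resource revenues do not contribute to democratization, yet Norway is both oil-rich and highly democratic. We discussed this in class when referring to Bueno de Mesquita. We can also see the general pattern in the logistic regression that we ran above, where natural resources are negatively associated with democracy. A deviant case is a good choice here because studying why the usual pattern fails can reveal omitted variables or conditions under which the general relationship does not hold.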