Exam 2 Flashcards

(22 cards)

1
Q
  1. Bayes Rule is an example of what type of probability?
A

Conditional probability.

2
Q
  1. Provide the formula for Bayes Rule in the case that event $A$ depends on event $B$.
A

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}$$

3
Q
  1. If events $A$ and $B$ are independent, what is the probability of observing both event $A$ and event $B$?
A

$P(A \text{ and } B) = P(A) \times P(B)$

4
Q
  1. Provide an example of two events that are independent and explain why they are independent.
A

Whether I am wearing green socks and whether the president decides to speak about the Middle East today are independent events. They are independent because one has no relation to the other's data-generating process.

5
Q
  1. Draw a Venn diagram of the two events that you described in the previous question. Then, draw a Venn Diagram of the events in a world where they were not independent.
A

ထ Mutually Exclusive Events
⚭ Not Mutually Exclusive Events
The president speaking about the Middle East today and me deciding whether to put on green socks are mutually exclusive events if and only if I do not own green socks and will never wear green socks. By contrast, the president frequently speaks about the Middle East, and it’s both hard to predict and impossible to control. The events would not be mutually exclusive if I owned green socks and there a was a non-zero probability that I would wear them.

6
Q
  1. A patient goes to see a doctor. The doctor performs a test with 99 percent reliability; that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick. If the patient tests positive, what are the chances that the patient is sick?
A
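
A worked sketch applying Bayes' Rule to the numbers in the question. The notation is assumed here for illustration and is not from the original card: $S$ means "the patient is sick" and $+$ means "the test is positive".

```latex
P(S \mid +)
  = \frac{P(+ \mid S)\,P(S)}{P(+ \mid S)\,P(S) + P(+ \mid S^c)\,P(S^c)}
  = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99}
  = \frac{0.0099}{0.0198}
  = 0.5
```

Under these numbers, a positive test implies a 50 percent chance that the patient is sick. The disease is rare, so the false positives among the many healthy people are as numerous as the true positives among the few sick people.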
7
Q
  1. How do Bayesian and Frequentist statistics differ? Explain in words and provide the relevant formulas for the probabilities that they are both calculating.
A

Frequentist statistics assumes a model (usually a normal distribution with 95% confidence intervals) and uses that model to calculate the probability of some data occurring: $P(\text{Data} \mid \text{Model})$.

By contrast, Bayesian statistics involves subjective probability with a prior and aims to calculate the probability of a model given some data: $P(\text{Model} \mid \text{Data})$.
The latter can be calculated through Bayes' Rule:
$$P(\text{Model} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Model})\,P(\text{Model})}{P(\text{Data})}$$

8
Q
  1. What is variance, and what is standard deviation? How do they relate to each other? Provide the verbal explanations and formulas.
A

Variance measures the spread of the data: it is the mean squared distance of the observations from the mean. Standard deviation is the square root of the variance. Thus, we cannot understand the standard deviation without the variance. Conceptually, the standard deviation (roughly) captures the average distance of the data from the mean. Below, $\mu$ is the population mean and $\bar{x}$ is the sample mean.
Population variance: $\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum_i (x_i - \bar{x})^2}{N - 1}$
Population sd: $\sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}$
Sample sd: $s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{N - 1}}$
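
The formulas above can be checked by hand in R. This is a minimal sketch on made-up numbers. It treats the same vector first as a sample and then as a whole population, so the only difference is the divisor.

```r
# Hypothetical data for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance and sd: divide by n - 1 (this is what var() and sd() compute)
sample_var <- sum((x - mean(x))^2) / (n - 1)
sample_sd  <- sqrt(sample_var)

# Population variance and sd: treat x as the full population and divide by n
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

c(sample_var = sample_var, builtin_var = var(x),
  pop_var = pop_var, pop_sd = pop_sd)
```

Base R has no built-in population variance, so the divide-by-n version has to be written out as shown.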

9
Q
  1. Why do we sample? What is the point?
A

The point of sampling is to learn about a population of interest. We are rarely, if ever, interested only in the particular sample that we create or analyze.

10
Q
  1. What are the basic elements of any ggplot in R? Provide the code and explain in words.
A

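# load libraries
library(tidyverse)
library(ggplot2)  # already attached by tidyverse; listed for clarity

# data and aesthetic mapping: which variables go on the x and y axes
ggplot(data, aes(x, y)) +
  # geometry layer: how the data are drawn (here, as points)
  geom_point() +
  # labels for the axes and the title
  labs(x = "X axis name",
       y = "Y axis name",
       title = "title here")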

11
Q
  1. In R, how would you obtain the variance of a variable called population in a data frame called df? Assume that the variable has missing values.
A

var(df$population, na.rm = TRUE)

12
Q
  1. What is a normal distribution? Comprehensively explain everything in
    words and draw anything that you need to draw.
A

A Normal (Gaussian) distribution is a theoretical distribution that looks like a bell curve (see below). The normal distribution is incredibly important because most of frequentist statistics assumes that data are normally distributed. When that is not the case, it is hard to trust most statistical estimates. Per the Central Limit Theorem, the distribution of sample means converges toward a Normal distribution as $N \to \infty$, that is, as the sample size ($N$) becomes infinitely large, even when the underlying data are not themselves normal. The Law of Large Numbers says something related: as $N \to \infty$, the sample mean, $\bar{X}$, converges toward the population mean, $\mu$.
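
The two theorems above can be illustrated with a small simulation. This is a sketch under assumed settings: the exponential population, the seed, and the sample sizes are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Draw many samples of size n from a skewed (exponential) population
# whose true mean is 1, and keep the mean of each sample
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_small <- sample_means(n = 2)
means_large <- sample_means(n = 200)

# Law of Large Numbers: with larger n, sample means sit closer to 1
# Central Limit Theorem: their distribution becomes more bell-shaped
c(mean_large = mean(means_large),
  sd_small   = sd(means_small),
  sd_large   = sd(means_large))
```

Calling `hist(means_small)` and `hist(means_large)` afterwards should show the first histogram still skewed and the second close to a bell curve.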

13
Q
  1. What is a t-ratio score? Provide the formula and explain in words how it differs from a z-score.
A

$t = \frac{\bar{X} - \mu}{s / \sqrt{N}}$
$z = \frac{\bar{X} - \mu}{\sigma}$
1. Given the above formulas, the t-test substitutes the standard error, $s_{\bar{X}} = s / \sqrt{N}$, for the standard deviation, $\sigma$.
2. Given the above formulas, the t-test takes into account the sample size, whereas the z-score does not.
3. The t-test involves a t-distribution, which is slightly different from a regular normal distribution, because the t-distribution depends on:
* sample size
* degrees of freedom ($N - 1$)
4. In practice, using the t-test will just change the statistic that we multiply in the margin of error to obtain our confidence interval.
* $CI = \bar{X} \pm MOE = \bar{X} \pm t \times s_{\bar{X}}$
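
The points above can be sketched in R. The data and the hypothesized mean `mu0` are made up for illustration. The critical value comes from `qt()` with $N - 1$ degrees of freedom instead of from the normal distribution.

```r
# Hypothetical sample; mu0 is an assumed null-hypothesis mean
x   <- c(5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 4.7, 5.3)
mu0 <- 5
n   <- length(x)

se     <- sd(x) / sqrt(n)       # standard error of the mean
t_stat <- (mean(x) - mu0) / se  # t-ratio

# 95% confidence interval using the t critical value with n - 1 df
t_crit <- qt(0.975, df = n - 1)
ci     <- mean(x) + c(-1, 1) * t_crit * se

# For comparison: the normal critical value, which ignores sample size
z_crit <- qnorm(0.975)

list(t_stat = t_stat, ci = ci, t_crit = t_crit, z_crit = z_crit)
```

The same statistic and interval should come out of `t.test(x, mu = mu0)`, which is the usual way to run this in practice.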

14
Q
  1. What is a cross-tab? Explain your answer with an appropriate example.
A

Crosstabs, also known as contingency tables, show how two categorical variables are distributed jointly. It is usually better to have at least one binary measure.
______________________________________________
              democracy
              0         1
----------------------------------------------
0             24.7%     75.1%
1             74.9%     24.9%
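
A table like the one above could be produced in R roughly as follows. This is a sketch on made-up data: `democracy` comes from the card, while `conflict` is an assumed second variable, since the original row variable is not named.

```r
# Hypothetical data: 'conflict' is an assumed variable name for illustration
df <- data.frame(
  democracy = c(0, 0, 0, 1, 1, 1, 1, 0),
  conflict  = c(1, 1, 0, 0, 0, 1, 0, 1)
)

# Counts for each combination of the two variables
counts <- table(conflict = df$conflict, democracy = df$democracy)

# Column percentages: within each value of democracy,
# the share of observations at each value of conflict
col_pct <- round(100 * prop.table(counts, margin = 2), 1)
col_pct
```

Using `margin = 1` instead would give row percentages. Which one to report depends on which variable is treated as the explanatory one.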

15
Q
  1. Explain the relationship between $X$ and $Y$ in the scatterplot below.
A

The above plot depicts a negative relationship between $X$ and $Y$. The relationship is negative because, as $X$ increases, $Y$ decreases.

16
Q
  1. The screenshot below is taken directly from our Li (2018) textbook. Recall Li (2018) compared economic growth in 1960 (growth.x) and economic growth in 1990 (growth.y). Interpret the output below.
A

The results above suggest there is a statistically significant difference between growth in 1960 and growth in 1990. Notably, the $p$-value is 0.008, which is less than 0.05. Furthermore, the 95% confidence interval does not include 0, which confirms the statistically significant difference. We thus can reject the null hypothesis that there is no difference between growth in 1960 and growth in 1990.

17
Q
  1. What code would I use in R to obtain summary statistics for this dataset, data, as I have done below?
A

summary(data)

18
Q
  1. The below presents a bivariate linear regression between the pre-primary enrollment rate and growth. Fully interpret the regression. What do the results suggest? Make sure to use all of the statistics available below that we spoke about in class to substantiate your interpretation of the results. Also, make sure to define all of the statistics that you use to prove that you know what you are explaining. Finally, provide some logic to describe why we are getting the results that you see below.
A

The above regression indicates that pre-primary enrollment has a negative effect on economic growth, which is measured in terms of GDP growth. More specifically, the preprimary_enroll_rate coefficient indicates that increasing pre-primary enrollment by 1% yields a 2% decrease in growth. We can interpret these numbers in terms of percents due to what we learned about the distributions of the respective variables from the summary() command and my introduction to the data (above). We can also be fairly confident in these results because the p-value is below 0.001, indicating that the preprimary_enroll_rate coefficient is highly statistically significant. Nevertheless, the $R^2$ is low at 0.04, suggesting that our model thus far only explains around 4% of the variation in our dependent variable, economic growth. The Adjusted $R^2$, which adjusts the $R^2$ for the number of independent variables in the model, is very similar to the regular $R^2$, so there does not appear to be a need to question our assessment thus far. Furthermore, the p-value associated with the $F$-statistic is well below 0.001, suggesting that our base model with only the preprimary_enroll_rate variable is much better than a null model without any predictors. Overall, we can reject the null hypothesis that there is no relationship between pre-primary enrollment and economic growth and be rather confident that the relationship is negative. However, before suggesting that the literature is wrong, we need to allow for the fact that this hypothesis could be overturned once we introduce fuller models that have higher $R^2$ statistics, suggesting that they capture more of the variation in economic growth.

19
Q
  1. The regression below is a multivariate one that adds debt_gdp to the regression, given that countries may not have money to invest in pre-primary education if they are in debt. Fully interpret the regression using all of the statistics below. Does this multivariate regression appear to be better than the bivariate one above? Why or why not?
A

Like the bivariate regression in the previous question, the regression suggests that preprimary_enroll_rate negatively contributes to gdp_growth. Why? First, the preprimary_enroll_rate coefficient is statistically significant with a p-value well below 0.05. Second, the preprimary_enroll_rate coefficient is negative. More specifically, a 1% increase in preprimary_enroll_rate is associated with a 2.23% decrease in economic growth. In terms of the overall variance in economic growth explained, which we capture with $R^2$, it is 0.08, which is around double the size of the $R^2$ in the bivariate regression. As with the bivariate regression, the Adjusted $R^2$ is very close to the regular $R^2$, so there do not appear to be any red flags in terms of poorly chosen independent variables in our model. The jump in the $F$-statistic in this multivariate regression to 34 from 44 in the bivariate regression also supports our decision to add the debt as a share of GDP variable to the model. Overall, the multivariate model is clearly better than the bivariate one.

20
Q
  1. The output below presents both the bivariate/pairwise correlation between pre-primary enrollment and debt as a share of GDP as well as the variance inflation factor (VIF) statistics for each coefficient. Does there appear to be a risk of multicollinearity in the regression above? Why or why not? Also, define multicollinearity and tell us whether it is generally a good or bad thing for our regression.
A

The correlation of 0.18 between the pre-primary enrollment rate and the debt variable is rather low, and the VIF statistics are much lower than 10, so there does not appear to be much of a risk of multicollinearity. Multicollinearity occurs when two or more independent variables are strongly correlated with each other and so explain the same variation in the dependent variable, which can result in unstable models. Given that we want stable models, not unstable ones, we generally want to avoid multicollinearity.
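
To make the VIF concrete, the sketch below computes it by hand on simulated data. The variable names mirror the card, but the values are made up and the seed is arbitrary. A VIF is $1 / (1 - R^2)$, where the $R^2$ comes from regressing one predictor on the other predictor(s).

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-ins for the two predictors (values are made up)
n <- 200
preprimary_enroll_rate <- rnorm(n)
debt_gdp <- 0.2 * preprimary_enroll_rate + rnorm(n)

# Pairwise correlation between the two predictors
r <- cor(preprimary_enroll_rate, debt_gdp)

# Auxiliary regression of one predictor on the other, then VIF = 1 / (1 - R^2)
aux_r2 <- summary(lm(preprimary_enroll_rate ~ debt_gdp))$r.squared
vif    <- 1 / (1 - aux_r2)

c(correlation = r, vif = vif)
```

With only two predictors, the auxiliary $R^2$ is just the squared pairwise correlation, so a low correlation implies a VIF close to 1. In practice a packaged function such as `car::vif()` on the fitted model is the more common route.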

21
Q
  1. Based on the results below, is a linear specification the right one for this model?
A

The above results suggest that, indeed, the linear specification is correct. First, the graph of the fitted values and the residuals is rather straight and is centered on 0. Second, the Ramsey RESET test does not show a statistically significant difference between the regular multivariate model and the multivariate_RESET model with the I(.fitted^2) term. Accordingly, we fail to reject the null hypothesis that the two models are the same.
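
The comparison described above can be sketched by hand in base R. The data are simulated and the predictor names `x1` and `x2` are placeholders. The original output is not shown, so this is not necessarily how the card's results were produced.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Simulated data with a truly linear relationship (names are illustrative)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 - 2 * x1 + 0.5 * x2 + rnorm(n)

# Base linear model
multivariate <- lm(y ~ x1 + x2)

# RESET-style augmented model: add the squared fitted values as a regressor
fitted_sq <- fitted(multivariate)^2
multivariate_RESET <- lm(y ~ x1 + x2 + fitted_sq)

# F-test comparing the nested models; a large p-value means we fail to
# reject the null hypothesis that the extra term adds nothing
reset_test <- anova(multivariate, multivariate_RESET)
p_value <- reset_test$`Pr(>F)`[2]
p_value
```

A p-value above 0.05 here would be read the same way as on the card: no evidence that the linear specification is missing a nonlinear term.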

22
Q

Bonus. How would you use R to merge two data frames, $x$ and $y$, such that (i) all of the values in $x$ would act as unique identifiers; and (ii) everything in both datasets would merge in even if there were misspellings? Assume that the unit of analysis identifying both $x$ and $y$ is the country-year. Call the final merged data frame "merged".

A

# load libraries
library(tidyverse)
library(tidylog)
library(countrycode)

# convert country names in x to Correlates of War numeric codes,
# so that spelling differences in names no longer matter
x$countrycode = countrycode::countrycode(
  sourcevar = x$country,
  origin = "country.name",
  destination = "cown",
  warn = TRUE)

# assign Serbia's code manually
x$countrycode[x$country == "Serbia"] = 345

# do the same for y
y$countrycode = countrycode::countrycode(
  sourcevar = y$country,
  origin = "country.name",
  destination = "cown",
  warn = TRUE)

y$countrycode[y$country == "Serbia"] = 345

# left join keeps every country-year in x
initial = left_join(x, y, by = c("countrycode", "year"))

# list the rows of x that found no match in y
checking = anti_join(x, y, by = c("countrycode", "year"))
table(checking$country)

# re-run the merge after correcting any issues
merged = left_join(x, y, by = c("countrycode", "year"))