Exam 2 Flashcards
(22 cards)
- Bayes Rule is an example of what type of probability?
Conditional probability.
- Provide the formula for Bayes Rule in the case that event A depends on event B.
P(A|B) = P(B|A) P(A) / P(B)
       = P(B|A) P(A) / [P(B|A) P(A) + P(B|not A) P(not A)]
- If events A and B are independent, what is the probability of observing both event A and event B?
P(A and B) = P(A) × P(B)
- Provide an example of two events that are independent and explain why they are independent.
Whether I am wearing green socks and whether the president decides to speak about the Middle East today are independent events. They are independent because one has no relation to the other's data-generating process.
- Draw a Venn diagram of the two events that you described in the previous question. Then, draw a Venn Diagram of the events in a world where they were not independent.
[Venn diagrams: (a) Mutually Exclusive Events; (b) Not Mutually Exclusive Events]
The president speaking about the Middle East today and my deciding whether to put on green socks are mutually exclusive events if and only if I do not own green socks and will never wear green socks. By contrast, the president frequently speaks about the Middle East, and that is both hard to predict and impossible to control. The events would not be mutually exclusive if I owned green socks and there was a non-zero probability that I would wear them.
- A patient goes to see a doctor. The doctor performs a test with 99 percent reliability; that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick. If the patient tests positive, what are the chances that the patient is sick?
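Working through the numbers with Bayes' Rule, sketched in R (the variable names are mine):
p_sick = 0.01               # prior: 1 percent of people are sick
p_pos_given_sick = 0.99     # 99 percent of sick people test positive
p_pos_given_healthy = 0.01  # 1 percent of healthy people test positive
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_pos_given_sick * p_sick / p_pos   # 0.5: a 50 percent chance the patient is sick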
- How do Bayesian and Frequentist statistics differ? Explain in words and provide the relevant formulas for the probabilities that they are both calculating.
Frequentist statistics assume a model (usually, a normal distribution with 95% confidence intervals) and use that model to calculate the probability of observing some data: P(Data|Model).
By contrast, Bayesian statistics involve subjective probability, starting from a prior, and aim to calculate the probability of a model given some data: P(Model|Data).
The latter can be precisely calculated through Bayes' Rule:
P(Model|Data) = P(Data|Model) P(Model) / P(Data)
- What is variance, and what is standard deviation? How do they relate to each other? Provide the verbal explanations and formulas.
Variance refers to the spread of the data: it captures the average squared distance of the data from the mean. Standard deviation is the square root of the variance, so we cannot understand the standard deviation without the variance. Conceptually, the standard deviation (roughly) captures the average distance of the data from the mean.
pop var: Σ(x_i - x̄)² / n
sample var: Σ(x_i - x̄)² / (n - 1)
pop sd: √[ Σ(x_i - x̄)² / n ]
sample sd: √[ Σ(x_i - x̄)² / (n - 1) ]
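A quick sketch in R of how these formulas map onto code (x is just an illustrative numeric vector):
x = c(2, 4, 4, 4, 5, 5, 7, 9)            # illustrative data
n = length(x)
sum((x - mean(x))^2) / n                 # population variance
sum((x - mean(x))^2) / (n - 1)           # sample variance
var(x)                                   # built-in sample variance
sqrt(sum((x - mean(x))^2) / (n - 1))     # sample standard deviation
sd(x)                                    # built-in sample standard deviation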
- Why do we sample? What is the point?
The point of sampling is to learn about a population of interest. We are rarely, if ever, interested only in the particular sample that we collect or analyze.
- What are the basic elements of any ggplot in R? Provide the code and explain in words.
# load libraries
library(tidyverse)
library(ggplot2)

ggplot(data, aes(x, y)) +
  geom_point() +
  labs(x = "X axis name",
       y = "Y axis name",
       title = "title here")

In words: ggplot() takes a data frame and an aes() mapping of variables to the axes, at least one geom (here, geom_point()) draws the data, and labs() adds the axis labels and title.
- In R, how would you obtain the variance of a variable called population in a data frame called df? Assume that the variable has missing values.
var(df$population, na.rm = TRUE)
- What is a normal distribution? Comprehensively explain everything in words and draw anything that you need to draw.
A Normal (Gaussian) distribution is a theoretical distribution that looks like a bell curve. The normal distribution is incredibly important because most of frequentist statistics assumes that data are normally distributed; when that is not the case, it is hard to believe most statistical estimates. Per the Central Limit Theorem, the distribution of the sample mean converges toward a Normal distribution as n → ∞, that is, as the sample size (n) becomes infinitely large. The Law of Large Numbers says something similar: as n → ∞, the sample mean, x̄, converges toward the population mean, μ.
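A small simulation sketch in R of these ideas (the Exponential distribution and the sample size of 50 are arbitrary choices):
set.seed(123)
# draw 10,000 samples of size 50 from a skewed (non-normal) distribution
sample_means = replicate(10000, mean(rexp(n = 50, rate = 1)))
mean(sample_means)    # close to the population mean of 1 (Law of Large Numbers)
hist(sample_means)    # roughly bell-shaped (Central Limit Theorem)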
- What is a t-ratio score? Provide the formula and explain in words how it differs from a z-score.
t = (x̄ - μ) / (s / √n)
z = (x̄ - μ) / σ
1. Given the above formulas, the t-test substitutes the standard error, s_x̄ = s / √n, for the standard deviation, σ.
2. Given the above formulas, the t-test takes into account the sample size, whereas the z-score does not.
3. The t-test involves a t-distribution, which is slightly different from a regular normal distribution, because the t-distribution depends on:
* sample size
* degrees of freedom (n - 1)
4. In practice, using the t-test just changes the statistic that we multiply in the margin of error to obtain our confidence interval.
* CI = x̄ ± MoE = x̄ ± t × s_x̄
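A sketch in R of how this t-based confidence interval can be computed by hand and checked against t.test() (x is just an illustrative vector):
x = c(4.1, 5.3, 6.0, 5.2, 4.8, 5.9, 6.4, 5.1)   # illustrative data
n = length(x)
se = sd(x) / sqrt(n)                  # standard error of the mean
t_crit = qt(0.975, df = n - 1)        # t value for a 95% confidence interval
mean(x) + c(-1, 1) * t_crit * se      # CI = x-bar +/- t * SE
t.test(x)$conf.int                    # the same interval from t.test()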
- What is a cross-tab? Explain your answer with an appropriate example.
Cross-tabs, also known as contingency tables, show the joint distribution of two categorical variables; it is usually better to have at least one binary measure. For example:
            democracy
               0        1
  0         24.7%    75.1%
  1         74.9%    24.9%
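In R, a cross-tab like this could be produced along these lines (the data frame and the second 0/1 variable, here called conflict, are made up for illustration and are not the data behind the table above):
# illustrative data; conflict is a hypothetical second binary variable
df_example = data.frame(democracy = c(0, 0, 1, 1, 0, 1),
                        conflict  = c(1, 0, 0, 1, 1, 0))
tab = table(df_example$conflict, df_example$democracy)
prop.table(tab, margin = 1) * 100   # row percentages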
- Explain the relationship between X and Y in the scatterplot below.
The above plot depicts a negative relationship between X and Y. The relationship is negative because, as X increases, Y decreases.
- The screenshot below is taken directly from our Li (2018) textbook. Recall Li (2018) compared economic growth in 1960 (growth.x) and economic growth in 1990 (growth.y). Interpret the output below.
The results above suggest there is a statistically significant difference between growth in 1960 and growth in 1990. Notably, the p-value is 0.008, which is less than 0.05. Furthermore, the 95% confidence interval does not run through 0, which confirms the statistically significant difference. We thus can reject the null hypothesis that there is no difference between growth in 1960 and growth in 1990.
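For reference, output like that would typically come from a call along these lines (growth_data is a placeholder name, and Li (2018) may have specified the test differently, for example as a paired test):
# growth_data stands in for the merged data frame containing both variables
t.test(growth_data$growth.x, growth_data$growth.y)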
- What code would I use in R to obtain summary statistics for this dataset, data, as I have done below?
summary(data)
- The output below presents a bivariate linear regression between the pre-primary enrollment rate and growth. Fully interpret the regression. What do the results suggest? Make sure to use all of the statistics available below that we spoke about in class to substantiate your interpretation of the results. Also, make sure to define all of the statistics that you use to prove that you know what you are explaining. Finally, provide some logic to describe why we are getting the results that you see below.
The above regression indicates that pre-primary enrollment has a negative effect on economic growth, which is measured in terms of GDP growth. More specifically, the preprimary_enroll_rate coefficient indicates that increasing pre-primary enrollment by 1% yields a 2% decrease in growth. We can interpret these numbers in terms of percents due to what we learned about the distributions of the respective variables from the summary() command and my introduction to the data (above). We can also be fairly confident in these results because the p-value is below 0.001, indicating that the preprimary_enroll_rate coefficient is highly statistically significant. Nevertheless, the R² is low at 0.04, suggesting that our model thus far only explains around 4% of the variation in our dependent variable, economic growth. The Adjusted R², which corrects for the bias in the calculation of R², is very similar to the regular R², so there does not appear to be a need to question our assessment thus far. Furthermore, the p-value associated with the F-statistic is well below 0.001, suggesting that our base model with only the preprimary_enroll_rate variable is much better than a null model with no predictors. Overall, we can reject the null hypothesis that there is no relationship between pre-primary enrollment and economic growth and be rather confident that the relationship is negative. However, before suggesting that the literature is wrong, we need to allow for the fact that this hypothesis could be overturned once we introduce fuller models that have higher R² statistics, suggesting that they capture more of the variation in economic growth.
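For reference, a bivariate regression like this one would typically be estimated along these lines (df is a placeholder name for the data frame):
# df is a placeholder for the dataset containing both variables
bivariate = lm(gdp_growth ~ preprimary_enroll_rate, data = df)
summary(bivariate)   # coefficients, p-values, R-squared, Adjusted R-squared, F-statistic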
- The regression below is a multivariate one that adds debt_gdp to the regression, given that countries may not have money to invest in pre-primary education if they are in debt. Fully interpret the regression using all of the statistics below. Does this multivariate regression appear to be better than the bivariate one above? Why or why not?
Like the bivariate regression in the previous question, this regression suggests that preprimary_enroll_rate negatively contributes to gdp_growth. Why? First, the preprimary_enroll_rate coefficient is statistically significant, with a p-value well below 0.05. Second, the preprimary_enroll_rate coefficient is negative. More specifically, a 1% increase in preprimary_enroll_rate is associated with a 2.23% decrease in economic growth. In terms of the overall variance in economic growth explained, which we capture with R², it is 0.08, around double the R² of the bivariate regression. As with the bivariate regression, the Adjusted R² is very close to the regular R², so there do not appear to be any red flags in terms of poorly chosen independent variables in our model. The F-statistic changes from 44 in the bivariate regression to 34 in this multivariate one, but it remains large, which also supports our decision to add the debt as a share of GDP variable to the model. Overall, the multivariate model is clearly better than the bivariate one.
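The multivariate model could be estimated the same way (df is again a placeholder):
multivariate = lm(gdp_growth ~ preprimary_enroll_rate + debt_gdp, data = df)
summary(multivariate)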
- The output below presents both the bivariate/pairwise correlation between pre-primary enrollment and debt as a share of GDP as well as the variance inflation factor (VIF) statistics for each coefficient. Does there appear to be a risk of multicollinearity in the regression above? Why or why not? Also, define multicollinearity and tell us whether it is generally a good or bad thing for our regression.
The correlation of 0.18 between the pre-primary enrollment rate and the debt variable is rather low, and the VIF statistics are much lower than 10, so there does not appear to be much of a risk of multicollinearity. The latter occurs when two or more independent variables are highly correlated and thus explain the same variation in the dependent variable, which can result in unstable models. Given that we want stable models, not unstable ones, we generally want to avoid multicollinearity.
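The statistics discussed here could be obtained roughly as follows (vif() is from the car package, one common choice; df and multivariate carry over from the sketches above):
library(car)   # provides vif()
cor(df$preprimary_enroll_rate, df$debt_gdp, use = "complete.obs")   # pairwise correlation
vif(multivariate)   # variance inflation factor for each coefficient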
- Based on the results below, is a linear specification the right one for this model?
The above results suggest that, indeed, the linear specification is correct. First, the plot of the fitted values against the residuals is rather flat and centered on 0. Second, the Ramsey RESET test does not show a statistically significant difference between the regular multivariate model and the multivariate_RESET model with the I(.fitted^2) term. Accordingly, we fail to reject the null hypothesis that the two models are the same.
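A rough sketch of how this check might be run in R, assuming the multivariate model and df from the sketches above and no missing values in the relevant variables; resettest() comes from the lmtest package, and the second part mirrors the I(.fitted^2) approach mentioned above:
library(lmtest)
resettest(multivariate, power = 2, type = "fitted")   # packaged RESET test

# by-hand version: add a squared-fitted-values term and compare the nested models
df$.fitted = fitted(multivariate)
multivariate_RESET = lm(gdp_growth ~ preprimary_enroll_rate + debt_gdp +
                          I(.fitted^2), data = df)
anova(multivariate, multivariate_RESET)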
Bonus. How would you use R to merge two data frames, x and y, such that (i) all of the values in x would act as unique identifiers; and (ii) everything in both datasets would merge in even if there were misspellings? Assume that the unit of analysis identifying both x and y is the country-year. Call the final merged data frame "merged".
# load libraries
library(tidyverse)
library(tidylog)
library(countrycode)

# convert country names in x to a standardized numeric code (COW code),
# so that misspelled country names do not prevent matches
x$countrycode = countrycode::countrycode(
  sourcevar = x$country,
  origin = "country.name",
  destination = "cown",
  warn = TRUE)
# manually fix any country that countrycode could not match (e.g., Serbia)
x$countrycode[x$country == "Serbia"] = 345

# do the same for y
y$countrycode = countrycode::countrycode(
  sourcevar = y$country,
  origin = "country.name",
  destination = "cown",
  warn = TRUE)
y$countrycode[y$country == "Serbia"] = 345

# merge, keeping all rows of x, then check which rows of x failed to match
initial = left_join(x, y, by = c("countrycode", "year"))
checking = anti_join(x, y, by = c("countrycode", "year"))
table(checking$country)

# re-run the merge after correcting any issues
merged = left_join(x, y, by = c("countrycode", "year"))