Quizzes Flashcards

Question 1

Q

Is it generally better to merge with country codes instead of country names? If so, why?

Answer

A

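Yes. Country codes are better for merging because country names can be spelled or formatted differently across data sources (for example, "United States" versus "USA"), so rows that should match may not. Country codes, in contrast, are standardized, so the same country gets the same key everywhere.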
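As a rough illustration, here is a minimal Python sketch (the course itself works in R). The two tiny datasets, their placeholder values, and the helper `merge_on` are all made up for this example; the point is only that a join on inconsistent names drops rows that a join on codes keeps.

```python
# Hypothetical example data: two sources that spell country names differently
# but share the same three-letter country code. Numeric values are placeholders.
gdp = [
    {"name": "United States", "iso3c": "USA", "gdp": 1.0},
    {"name": "South Korea", "iso3c": "KOR", "gdp": 2.0},
]
population = [
    {"name": "United States of America", "iso3c": "USA", "pop": 10},
    {"name": "Korea, Rep.", "iso3c": "KOR", "pop": 20},
]


def merge_on(left, right, key):
    """Inner join of two lists of dicts on a shared key (illustrative helper)."""
    lookup = {row[key]: row for row in right}
    # Keep only left rows whose key also appears on the right.
    return [{**lookup[row[key]], **row} for row in left if row[key] in lookup]


by_name = merge_on(gdp, population, "name")
by_code = merge_on(gdp, population, "iso3c")

print(len(by_name))  # rows matched when joining on the inconsistent names
print(len(by_code))  # rows matched when joining on the standardized codes
```

Joining on `name` silently loses both countries here, while joining on `iso3c` keeps them, which is the failure mode the answer describes.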

Question 2

Q

How do covariance and correlation relate to each other?

Answer

A

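They are related because the correlation is computed from the covariance:
\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \]

The steps are:
1) compute the mean of x and the mean of y
2) subtract the mean of x from each x value and the mean of y from each y value
3) multiply each pair of deviations
4) sum the products
5) divide by N - 1

Dividing the covariance by the product of the two standard deviations then gives the correlation:
\[ \mathrm{cor}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \]
Correlation is therefore a standardized covariance that always falls between -1 and 1.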
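The steps above can be sketched in a few lines of Python. The x and y values are made up purely to show the arithmetic, and the sample (N - 1) standard deviation is assumed for the denominator so that it matches the covariance formula.

```python
import math

# Hypothetical values, chosen only to illustrate the arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

# Step 1: mean of x and mean of y
mean_x = sum(x) / n
mean_y = sum(y) / n

# Steps 2-4: subtract the means, multiply the paired deviations, sum them up
cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Step 5: divide by N - 1 to get the sample covariance
cov_xy = cross_products / (n - 1)

# Sample standard deviations (also using N - 1)
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Correlation = covariance divided by the product of the standard deviations
cor_xy = cov_xy / (sd_x * sd_y)

print(cov_xy, cor_xy)
```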

Question 3

Q

A canonical example for learning linear regression is the relationship between income and education. In that relationship, which one is the dependent variable, and why?

Answer

A

Income is the dependent variable (Y, plotted on the y-axis) because income depends on your level of education: higher education leads to higher income. The data points fall on or around the regression line, and the residual is the distance between a data point and the line.
\[ Y_{income} = \alpha + \beta X_{education} + \varepsilon \]
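To make the intercept, slope, and residuals concrete, here is a minimal Python sketch of a bivariate least-squares fit. The education and income values are hypothetical, invented only for illustration; they are not data from the course.

```python
# Hypothetical data: years of education (X) and income in $1,000s (Y).
education = [10, 12, 14, 16, 18]
income = [30, 38, 41, 52, 59]

n = len(education)
mean_x = sum(education) / n
mean_y = sum(income) / n

# Least-squares slope: sum of cross-deviations over sum of squared X deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(education, income))
sxx = sum((x - mean_x) ** 2 for x in education)
beta = sxy / sxx

# The fitted line passes through the point of means, which pins down the intercept
alpha = mean_y - beta * mean_x

# Residual = observed income minus the income predicted by the line
residuals = [y - (alpha + beta * x) for x, y in zip(education, income)]

print(alpha, beta)
print(residuals)
```

Income sits on the left-hand side because it is the outcome being explained; swapping the two variables would fit a different line with a different interpretation.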

Question 4

Q

If I were to tell you that the price of going to the movies in 1950 was $1 and the price today is $15, why would the two prices not be comparable? Without providing any R code (i.e., explain your process in words), how would you make the prices comparable?

Answer

A

The two prices are not comparable because of inflation: a dollar in 1950 bought more than a dollar does today. You need to adjust the nominal prices for inflation to make them comparable. To do that:
1) get the GDP deflator from WDI
2) figure out how the US is coded in the data, since we want US dollars
3) keep only the US data and rename the deflator variable
4) merge the deflator into the original data
5) take the original value (revenue in the lecture example, the ticket price here), calculate (value / deflator) x 100, and compare the deflated values

Question 5

Q

In the videos/lecture notes, you were presented with the following linear regression, testing whether Daniel Craig being in the James Bond movie, the age of the film (as of 2024), and the log deflated revenue of the movie are associated with improved average audience scores from Rotten Tomatoes:
Interpret the regression, focusing on the coefficient estimates, p-values, R², Adjusted R², and the F-statistic.

Answer

A

The above output suggests that log_deflated_revenue is the most consistent predictor of Rotten Tomatoes audience scores. We can make that assessment because it has the lowest p-value of the three independent variables, the other two being age and star_Daniel_Craig.
We should also be careful to interpret the log_deflated_revenue variable correctly, and Table 5.1 from Quan Li’s textbook helps us toward that end. In this case, a 1% increase in revenue is associated with a 20 percentage point increase in Rotten Tomatoes audience score. The latter is on a 0-100 scale (see summary() above), which is why I explain the effect as such.

With respect to star_Daniel_Craig, note that the p-value increases from 0.057 in our bivariate regression above to 0.12 in this multivariate regression. That means that the star_Daniel_Craig variable is now very far from our 0.05 threshold for statistical significance. Of course, we should be careful not to interpret the 0.05 threshold in binary form, but the change in that p-value is still noteworthy. In any case, star_Daniel_Craig is still substantively large and positive, with a coefficient estimate of 0.20, suggesting that movies with Daniel Craig may increase Rotten Tomatoes audience scores by 20 percentage points. For its part, age has a p-value of 0.08 with a negligible coefficient of 0.004.

Now, let’s talk about model statistics. The R² statistic is 0.3956, suggesting that the three independent variables explain almost 40% of the variation in Rotten Tomatoes audience scores. That is actually quite a lot. The Adjusted R² statistic is 0.31. Essentially, when we account for the potential bias in how R² is calculated, the share of variation in audience scores explained by our three independent variables is not 0.3956 but 0.31. The latter is actually still quite high. Sometimes, the Adjusted R² statistic can even be negative, which is a major red flag regarding the quality of your model.

The F-statistic is 5.017 with a p-value of 0.008. Given that the latter is highly statistically significant, it means that the three variables we have selected are most certainly better than an intercept-only, null/empty model. That is what an F-statistic tells us. If we are unable to reject the null hypothesis in the F-test, then our choice of model is generally bad. Basically, we want to have the biggest F-statistic possible, and it should get bigger when the variables we add genuinely improve the model.
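For intuition on how these model statistics fit together, here is a small Python sketch of the standard formulas for Adjusted R² and the overall F-statistic. The R² of 0.3956 and the three predictors come from the answer above. The sample size is not stated anywhere in this card, so `n = 27` is purely an assumption, picked because it gives values close to the reported ones.

```python
r2 = 0.3956  # R-squared reported in the answer
k = 3        # number of independent variables in the model
n = 27       # ASSUMPTION: sample size is not given in the card

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic: explained variation per predictor relative to
# unexplained variation per residual degree of freedom
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(round(adj_r2, 4), round(f_stat, 3))
```

Under that assumed sample size, both numbers land close to the reported Adjusted R² of 0.31 and F-statistic of 5.017, which shows that all three statistics are summaries of the same fit.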

Question 6

Q

In the videos/lecture notes, you were presented with the following linear regression, testing whether Daniel Craig being in the James Bond movie improved average audience scores from Rotten Tomatoes:
Interpret the regression, focusing on the coefficient estimate, p-value, and R².

Answer

A

The coefficient suggests that having Daniel Craig in the movie increased Rotten Tomatoes audience scores by 0.158, or 15.8 percentage points. Although it just barely missed statistical significance (the p-value is just greater than 0.05), it is so close that we should interpret the result as having an effect. The R² also suggests that the Daniel Craig dummy variable explains almost 14 percent of the variation in the outcome, which is quite high. In short, having Daniel Craig in the film is associated with a higher Rotten Tomatoes score.
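One way to read a coefficient on a dummy variable: in a bivariate regression, it equals the difference between the two group means. The Python sketch below checks this with hypothetical scores on a 0-1 scale; the numbers are invented for illustration and are not the actual Rotten Tomatoes data.

```python
# Hypothetical audience scores (0-1 scale); star = 1 if Daniel Craig is in the film.
star = [0, 0, 0, 0, 1, 1]
score = [0.60, 0.70, 0.65, 0.65, 0.80, 0.82]

n = len(star)
mean_x = sum(star) / n
mean_y = sum(score) / n

# Least-squares slope and intercept for score ~ star
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(star, score))
sxx = sum((x - mean_x) ** 2 for x in star)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Difference in group means, computed directly
craig = [y for x, y in zip(star, score) if x == 1]
others = [y for x, y in zip(star, score) if x == 0]
mean_diff = sum(craig) / len(craig) - sum(others) / len(others)

# R-squared: share of the variation in score explained by the dummy
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(star, score))
ss_tot = sum((y - mean_y) ** 2 for y in score)
r2 = 1 - ss_res / ss_tot

print(slope, mean_diff, r2)
```

So a coefficient of 0.158 on the Daniel Craig dummy can be read as his films scoring 0.158 higher on average than the others in the sample.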