RM TEST 5 Flashcards

(9 cards)

1
Q

75. The data frame dt contains data on 1,491 Italian startups in their second year of existence (since their registration
in the Italian Business Register). Using regression, we will examine differences in the number of employees across
industries. The variable industry is coded according to the NACE classification into the following categories: Manuf
(manufacturing), ICT (information and communication technologies), ProfSciTech (professional, scientific and
technical activities), and Other (other). The output from R after estimation is as follows:
> coef(glm(employees ~ industry, data=dt, family=poisson))
        (Intercept)         industryICT industryProfSciTech       industryOther
            0.85845             0.00919            -0.46425            -0.09979
What is the predicted difference in the number of employees of startups in the industries of (i) ICT and (ii) professional,
scientific, and technical activities?
Hint: I would expect the answer in the form of a percentage difference. Students usually make mistakes when
working with exponentials and regression coefficients – think about the order in which you want to apply the
difference (or ratio) and the exponential. As a test that you have followed the correct procedure, try to calculate the
estimated mean value of the number of employees in both industries separately.

A

Relative to the baseline category (manufacturing), the exponentiated ICT coefficient is exp(0.00919) ≈ 1.0092, so ICT startups are predicted to have about 0.92% more employees than manufacturing startups (1.0092 - 1 = 0.0092). The exponentiated ProfSciTech coefficient is exp(-0.46425) ≈ 0.6286, so ProfSciTech startups are predicted to have about 37.14% fewer employees than manufacturing startups (1 - 0.6286 = 0.3714).

The question, however, asks for the difference between ICT and ProfSciTech, so take the difference of the two coefficients first and exponentiate afterwards: exp(0.00919 - (-0.46425)) = exp(0.47344) ≈ 1.6055. ICT startups are therefore predicted to have about 60.6% more employees than ProfSciTech startups (equivalently, ProfSciTech startups have 1 - exp(-0.47344) ≈ 37.7% fewer employees than ICT startups). As a check, the estimated mean number of employees is exp(0.85845 + 0.00919) ≈ 2.38 for ICT and exp(0.85845 - 0.46425) ≈ 1.48 for ProfSciTech, and 2.38 / 1.48 ≈ 1.61.
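
The check suggested in the hint can be reproduced with a minimal Python sketch. The coefficients are copied from the R output in the question; the variable names are illustrative only.

```python
import math

# Coefficients copied from the R output in the question (baseline: Manuf).
intercept = 0.85845
b_ict = 0.00919
b_pst = -0.46425

# Predicted mean number of employees in each industry.
mean_ict = math.exp(intercept + b_ict)
mean_pst = math.exp(intercept + b_pst)

# Ratio of the two means: difference of coefficients first, exponential second.
ratio = math.exp(b_ict - b_pst)

print(f"ICT mean: {mean_ict:.2f}")
print(f"ProfSciTech mean: {mean_pst:.2f}")
print(f"ICT vs ProfSciTech: {100 * (ratio - 1):.1f}% more employees")
```

The ratio computed from the coefficients equals the ratio of the two predicted means, which is the consistency test the hint asks for.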

2
Q
  1. The data frame dt contains data on the revenues of 1,491 Italian startups in their second year of existence (since their registration in the Italian Business Register). Approximately 16% of these revenues are zero – these firms have not yet started selling products or services. Using the Tobit model, we investigate the dependence of revenues on the industry in which the firm operates. The variable industry is coded according to the NACE classification into the following categories: Manuf (manufacturing), ICT (information and communication technologies), ProfSciTech (professional, scientific and technical activities), and Other (other); revenue is measured in thousands of euros. The output from R after estimation is as follows

What is the predicted probability that a startup in the ICT industry will not yet be selling products during its second
year of existence?

A

To find the predicted probability that a startup in the ICT industry is not yet selling products in its second year, we calculate the probability that latent revenue is non-positive, P(revenue* <= 0). With an intercept of 287 and an ICT coefficient of -91, the expected value of revenue* for an ICT startup is 287 - 91 = 196. The error term is normally distributed with a standard deviation (scale) of 100, so the z-score is (0 - 196) / 100 = -1.96. The standard normal CDF at -1.96 is about 0.025, so the predicted probability is approximately 2.5%.
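
A minimal Python sketch of this calculation, assuming the intercept, ICT coefficient, and scale quoted in the answer (the R output itself is not reproduced on the card):

```python
from statistics import NormalDist

# Values as quoted in the answer; the original R output is not shown.
intercept = 287.0
b_ict = -91.0
sigma = 100.0   # Tobit scale (standard deviation of the error term)

# Expected latent revenue for an ICT startup.
latent_mean = intercept + b_ict

# P(revenue* <= 0) under a normal error with standard deviation sigma.
p_zero = NormalDist().cdf((0 - latent_mean) / sigma)

print(f"P(no sales yet) = {p_zero:.3f}")
```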

3
Q
  1. The analysis presented below uses data on the revenues of 1,491 Italian startups in their second year of existence (since their registration in the Italian Business Register). Approximately 16% of these revenues are zero – these firms have not yet started selling products or services. Using regression, we investigate the dependence of revenues on the industry in which the firm operates. The variable industry is coded according to the NACE classification into the following categories: Manuf (manufacturing), ICT (information and communication technologies), ProfSciTech (professional, scientific and technical activities), and Other (other); revenue is measured in thousands of euros. The following output from stargazer compares the estimation results from the linear regression model and the Tobit model
A

The table reports an OLS linear regression (Model 1) and a Tobit model (Model 2). The coefficient on industryProfSciTech is interpreted differently in each:

  1. OLS: The coefficient (-189.00) is the estimated difference in average observed revenue (in thousands of euros) between firms in the ProfSciTech industry and firms in manufacturing (the baseline category).
  2. Tobit: The coefficient (-213.00) is the estimated difference in average latent (uncensored) revenue* between ProfSciTech and manufacturing firms. Because observed revenue is censored at zero, the effect on observed revenue is not the coefficient itself; it has to be obtained by calculating marginal effects.
4
Q
  1. Using probit, we estimated the probability of success as a function of respondents’ height (height, the only explanatory variable, measured in cm) using a data set where the average height is 180.3 cm. At this height, the predicted probability of success is exactly 50%. The estimated coefficient for the height variable is 0.10. Estimate the marginal effect of height for the average observation (MEM). Justify your approach.
A

MEM_j = f(x_bar * beta_hat) * beta_j_hat, where f is the standard normal density.
The predicted probability of success at the average height of 180.3 cm is 50%, so the linear index x_bar * beta_hat (intercept included) equals 0 at this point, because the cumulative distribution function (CDF) of the standard normal distribution is 0.5 at z = 0. The density is therefore at its maximum, f(0) = 1/sqrt(2*pi) ≈ 0.3989.

The marginal effect at the mean is then:

MEM_height = f(0) * 0.10
= 0.3989 * 0.10
= 0.03989
The marginal effect of height on the probability of success, evaluated at the average height of 180.3 cm, is approximately 0.03989. For an individual of average height, one additional cm of height is associated with an increase in the predicted probability of success of about 0.04, i.e. roughly 4 percentage points.
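
The same steps can be sketched in Python with the standard library. The index is recovered from the 50% predicted probability rather than from the (unreported) intercept; variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()

# The predicted probability at the mean height is 0.5, so the probit
# index at that point is the 0.5 quantile of the standard normal.
index_at_mean = std_normal.inv_cdf(0.5)

b_height = 0.10   # estimated coefficient from the question

# MEM = density at the index, times the coefficient.
mem_height = std_normal.pdf(index_at_mean) * b_height

print(f"MEM of height: {mem_height:.5f}")
```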

5
Q
  1. In logistic regression, we used the explanatory variables X = (1, x1, x2); the parameter estimate vector takes the form:
    beta_hat = (0.02, 0.03, -0.25)_transposed
    The predicted probability of success for a hypothetical observation with values of x1 and x2 at the level of sample means is 0.5. Estimate the marginal effect at the mean (MEM) for x1.
A

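MEM_j = f(Xβ) * (1 - f(Xβ)) * β_j, where f(Xβ) is the predicted probability (the logistic CDF) evaluated at the sample means, here 0.5.
So, the marginal effect of x1 at the mean is:

MEM_1 = 0.5 * (1 - 0.5) * 0.03 = 0.0075

At the sample means, a one-unit increase in x1 is associated with an increase in the predicted probability of success of about 0.75 percentage points.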
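
As a quick numerical check, a minimal Python sketch using the values given in the question (names are illustrative):

```python
# Predicted probability at the sample means, as stated in the question.
p_at_mean = 0.5

# Coefficient on x1 from beta_hat = (0.02, 0.03, -0.25).
b_x1 = 0.03

# For the logit model the density equals p * (1 - p).
mem_x1 = p_at_mean * (1 - p_at_mean) * b_x1

print(f"MEM of x1: {mem_x1:.4f}")
```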

6
Q

69. Using probit, we model the probability of success using continuous variables x1, ..., x9 on a data set of 191,000 observations. Our goal is to estimate the marginal effects at mean (MEM) and average marginal effects (AME) for all variables. For which statistic will the calculation take longer, for MEM or AME? Justify your answer.

A

The calculation of the Average Marginal Effects (AME) will take longer. For AME, the marginal effect of each variable is computed for every one of the 191,000 observations, which requires evaluating the normal density at each observation's linear index, and the results are then averaged.
In contrast, for Marginal Effects at the Mean (MEM), the sample means are computed once and the marginal effect of each variable is evaluated only at that single point. AME therefore involves far more computation and is more time-consuming.
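
The difference in workload can be seen in a small Python sketch. The coefficients and data below are simulated and purely hypothetical; only the structure of the two calculations matters.

```python
import random
from statistics import NormalDist, fmean

random.seed(0)
phi = NormalDist().pdf

# Hypothetical probit estimates and simulated data, for illustration only.
beta = [0.2, 0.5, -0.3]   # intercept, x1, x2
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(1000)]

def index(row):
    """Linear index x * beta for one observation."""
    return sum(b * x for b, x in zip(beta, row))

# MEM: a single density evaluation, at the vector of sample means.
x_bar = [fmean(col) for col in zip(*X)]
mem = [phi(index(x_bar)) * b for b in beta[1:]]

# AME: one density evaluation per observation, then an average.
avg_density = fmean(phi(index(row)) for row in X)
ame = [avg_density * b for b in beta[1:]]

print("MEM:", mem)
print("AME:", ame)
```

With n observations, the AME needs n density evaluations where the MEM needs one, which is why the AME is the slower statistic on 191,000 rows.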

7
Q
  1. When and why is the Delta method used? Outline the basic idea on which the Delta method is based.
A

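The Delta method is used when we need the variance (or standard error) of a nonlinear function of estimated parameters, for example a marginal effect or an exponentiated coefficient, because that variance does not follow directly from the variance of the estimator.

Basic idea (from the lecture slide):
This method is based on a first-order Taylor expansion of the function g around the true parameter value:
g(theta_hat) ≈ g(theta) + g'(theta) * (theta_hat - theta)

The function of the estimator is thus approximated by a linear function of the estimator. This allows us to express the variance of the function of the estimator in terms of the variance of the estimator itself: Var(g(theta_hat)) ≈ g'(theta)^2 * Var(theta_hat).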
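
A minimal Python sketch of the scalar case, assuming a hypothetical standard error (the deck reports none) and using a numerical derivative in place of an analytical one. The helper name delta_se is illustrative, not a library function.

```python
import math

def delta_se(g, theta_hat, se_theta, h=1e-6):
    """Delta-method standard error of g(theta_hat) for a scalar parameter."""
    # Central-difference approximation of the derivative g'(theta_hat).
    slope = (g(theta_hat + h) - g(theta_hat - h)) / (2 * h)
    return abs(slope) * se_theta

beta_hat = -0.46425   # ProfSciTech coefficient from the Poisson card
se_beta = 0.05        # hypothetical standard error, not from the source

# Standard error of the exponentiated coefficient exp(beta_hat).
se_ratio = delta_se(math.exp, beta_hat, se_beta)

print(f"SE of exp(beta_hat): {se_ratio:.4f}")
```

For g = exp the derivative is exp(beta_hat) itself, so the result should match exp(beta_hat) * se_beta.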

8
Q
  1. The relationship between an unobserved (latent) variable y* and an observed variable y has the form
A

Consider the probability that y = 1:
P(y = 1 | x) = P(y* > 0 | x) = P(xβ + u > 0 | x) = P(−u < xβ | x)

Since u | x ~ N(0, 1) and the standard normal distribution is symmetric around zero, −u is also standard normal, so:
P(y = 1 | x) = P(−u < xβ | x) = F(xβ),
where F(⋅) is the cumulative distribution function (CDF) of the standard normal distribution.

Similarly, the probability of y = 0 is:
P(y = 0 | x) = 1 − F(xβ)

Thus, the probit model is correctly specified: it implies the same conditional probabilities of y = 1 and y = 0 as the latent variable model.
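
The equivalence can also be checked by simulation. The sketch below assumes a hypothetical value of the linear index and draws the latent variable directly; it is an illustration, not part of the original answer.

```python
import random
from statistics import NormalDist

random.seed(1)

xb = 0.4      # hypothetical value of the linear index x * beta
n = 100_000   # number of simulated draws

# Latent model: y* = xb + u with u ~ N(0, 1); we observe y = 1 when y* > 0.
share_ones = sum(xb + random.gauss(0, 1) > 0 for _ in range(n)) / n

# Probit probability implied by the derivation: F(xb).
probit_prob = NormalDist().cdf(xb)

print(f"simulated share of y = 1: {share_ones:.3f}")
print(f"F(xb):                    {probit_prob:.3f}")
```

The simulated share of ones should be close to F(xb), up to sampling noise.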

9
Q
  1. Explain the difference between truncated and censored data.
A

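With censored data, every unit remains in the sample, but the dependent variable is only partially observed: values beyond a threshold are recorded as the threshold itself (for example, all non-positive latent revenues appear as 0), while the explanatory variables are still observed. With truncated data, units whose dependent variable lies beyond the threshold are missing from the sample altogether, so neither their outcome nor their explanatory variables are observed.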
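
A minimal Python sketch of the distinction, using made-up latent values and a threshold of zero (both hypothetical):

```python
# Hypothetical latent outcomes; the threshold of 0 is assumed for illustration.
latent = [-120.0, -5.0, 0.0, 40.0, 310.0]
threshold = 0.0

# Censoring: every unit stays in the sample, values below the threshold
# are recorded as the threshold itself.
censored = [max(y, threshold) for y in latent]

# Truncation: units at or below the threshold drop out of the sample entirely.
truncated = [y for y in latent if y > threshold]

print("censored: ", censored)
print("truncated:", truncated)
```

The censored sample keeps all five units, while the truncated sample keeps only the two with positive outcomes.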
