Regression Analysis Flashcards

1
Q

Do you know what the standard error of the coefficient captures?

A

The standard error of the coefficient is the estimated standard deviation of the coefficient estimate. It measures how precisely the model estimates the coefficient's unknown true value. The standard error of the coefficient is always positive.
The smaller the standard error, the more precise the estimate. Dividing the coefficient by its standard error gives the t-value. If the p-value associated with this t-statistic is less than your alpha level, you conclude that the coefficient is significantly different from zero.

2
Q

Do you know how to calculate the t-statistic using a formula?

A

t = b1 / SE(b1)
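For instance (hypothetical numbers): if b1 = 1.5 and SE(b1) = 0.5, then t = 1.5/0.5 = 3, which exceeds the usual large-sample 5% critical value of about 1.96, so the coefficient is significantly different from zero.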

3
Q

Do you know what a p-value is and what it tells us?

A

A p-value is a statistical measure used to evaluate a hypothesis against observed data.
The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. It is not the probability that the null hypothesis is true, and (1 – the p-value) is not the probability that the alternative hypothesis is true.
The lower the p-value, the greater the statistical significance of the observed difference.
Equivalently: the p-value is below α exactly when the test statistic (e.g. a t-, F- or z-statistic) exceeds the corresponding critical value (the one that we find in the table in the book).

4
Q

Do you know the difference between t-tests and F-tests?

A

The F-test and the t-test are two statistical tests used for hypothesis testing. They help the researcher decide whether to reject the null hypothesis or fail to reject it.

  • The t-test is a univariate hypothesis test, applied when the standard deviation is not known and the sample size is small.
  • The F-test is a statistical test that determines the equality of the variances of two normal populations.
  • The t-statistic follows Student's t-distribution under the null hypothesis.
  • The F-statistic follows Snedecor's F-distribution under the null hypothesis.
  • A t-test is used to compare the means of two populations (in regression: to test a single coefficient).
  • An F-test is used to compare two population variances (in regression: to test several coefficients jointly).
5
Q

Do you know the null and alternative hypotheses behind the p-value of the F-test in Stata outputs? They are reported in the top right corner.

A

The top right corner of the regression output reports the F-statistic and its p-value (Prob > F). The hypotheses of this overall F-test are:
- H0: the slope coefficients (βs) are jointly = 0
- Ha: at least one β is different from 0

We reject H0 when Prob > F ≤ α (commonly 0.05).

6
Q

Do you know how to conduct an F-test in Stata using the test command after running a regression?

A

Yes. After running the regression, write the command test followed by the independent variables you want to check (i.e. whether it makes sense to include them in the model, since omitting them could induce OVB). If the p-value of the F-test is below 0.05, reject the null that their coefficients are jointly zero and keep them in the model.
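A minimal sketch, assuming hypothetical variable names (wage, educ, exper, tenure):

  regress wage educ exper tenure
  test exper tenure

If the reported Prob > F is below 0.05, we reject H0 that the coefficients on exper and tenure are jointly zero and keep both variables in the model.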

7
Q

Do you know how to use the factor variable notations in Stata?

A

Factor variables are categorical variables. You need to add the i. prefix before such variable(s) when running a regression.
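For example, with a hypothetical categorical variable industry:

  regress wage i.industry

Stata creates a dummy for each industry category and omits one as the base category.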

8
Q

What is heteroskedasticity?

A

There is heteroskedasticity when the variance of the residuals is unequal across values of x (homoskedasticity is the opposite).
One of the assumptions of OLS is that there is no heteroskedasticity. To correct the standard errors for heteroskedasticity we add the option "robust" (or "r") to the regression command.
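For example, with hypothetical variables y and x:

  regress y x, robust

The coefficient estimates are identical to plain OLS; only the standard errors change.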

9
Q

Why is heteroskedasticity a problem?

A

Heteroskedasticity refers to a situation where the variance of the residuals is unequal over the range of measured values.
If heteroskedasticity exists, the usual OLS standard errors are wrong, so tests and confidence intervals based on them may be invalid.
Models involving a wide range of values are supposedly more prone to heteroskedasticity.
Heteroskedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population with constant variance (homoskedasticity) –> one of the assumptions of the Gauss-Markov Theorem.

10
Q

What are robust standard errors?

A

They are the standard errors we obtain when we add the robust (r) option, i.e. when we correct for heteroskedasticity. Robust standard errors are a technique to obtain consistent estimates of the standard errors of OLS coefficients under heteroskedasticity.

11
Q

What is OVB? What are the 2 conditions for OVB? Why is OVB a problem?

A

OVB arises when a relevant variable is omitted from the regression.

Both of the following conditions must hold for OVB:
1. The omitted variable (Z) is a determinant of Y (i.e. Z is part of the error term u); and
2. Z is correlated with the regressor X (i.e. corr(Z,X) ≠ 0)

An omitted variable biases the estimated coefficients and can lead the researcher to an erroneous conclusion.
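A compact way to see the bias in the one-regressor case: if the true model is Y = β0 + β1·X + β2·Z + v, the OLS estimate of β1 (with Z omitted) converges to β1 + β2·Cov(X,Z)/Var(X). The bias disappears only when β2 = 0 (condition 1 fails) or Cov(X,Z) = 0 (condition 2 fails).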

12
Q

Do you know the differences between perfect and imperfect collinearity?

A

Perfect multicollinearity: when one independent variable (or a set of independent variables) predicts the value of another independent variable perfectly. The information is redundant; one implies the other (e.g. male/female dummies).

Imperfect multicollinearity: two independent variables are highly, but not perfectly, correlated (e.g. height and weight).

13
Q

What is a dummy variable trap?

Why is it a problem?

A

Dummy Variable Trap: when the number of dummy variables created equals the number of values the categorical variable can take on, and all of them are included alongside the intercept. The dummies then sum to 1, duplicating the constant term.

It is a problem because it creates perfect multicollinearity, so the regression coefficients cannot be estimated uniquely (and the associated p-values are meaningless).
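A common way to avoid the trap is to include one fewer dummy than the number of categories, or to let Stata's factor notation handle it (hypothetical variable region):

  regress y i.region

The i. notation automatically omits one base category.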

14
Q

Do you know why collinearity is problematic?

A

A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.

However, when independent variables are correlated, changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently, because the independent variables tend to change in unison.

15
Q

What is regression?

A

A statistical method that uses data to test whether a relationship exists between two or more variables, and to quantify it.

16
Q

What are the objectives of regression analysis?

A

1) To estimate the effect of an independent variable on a dependent variable;

2) To test whether the effect is statistically different from zero (or from a certain value)

17
Q

What is a random variable?

A

A variable whose values are based on the outcome of a probabilistic event

18
Q

What does it mean when we say that X has a linear relationship with Y?

A

It means that at all levels of X, a one-unit change in X has the same effect on Y –> the slope is constant.

19
Q

What is SSE?

A

Errors (or residuals) in regression are the distances between our prediction (the fitted line) and the real, observed values. SSE is the sum of squared errors: SSE = Σ(yᵢ – ŷᵢ)². We square the errors to make sure that the under-predictions will not cancel out the over-predictions (or vice versa). It is a measure of the discrepancy between the data and the estimation model. A small SSE (also called RSS) indicates a tight fit of the model to the data.

20
Q

What does OLS mean?

A

It stands for Ordinary Least Squares –> a method to fit a line to data so that the sum of squared residuals (SSR or SSE, it's the same thing) is as small as possible, i.e. OLS chooses b0 and b1 to minimize Σ(yᵢ – b0 – b1·xᵢ)².

21
Q

What is R-squared? How to interpret it?

A

It is the proportion of the variance of Y explained by X: R² = 1 – SSR/SST, where SST is the total sum of squares of Y. Because it is a proportion, R-squared is bounded between 0 and 1. If R² = 1, the sum of squared residuals (SSR) = 0 → all the sample data (all the observations) fall exactly on the regression line. In practice, R² = 1 essentially never happens.

22
Q

How is the formula of the t-statistic related to the probability distribution of a random variable that follows a normal distribution?

A

The t-distribution describes the standardized distances of sample means from the population mean when the population standard deviation is not known and the observations come from a normally distributed population. The standard normal (z-) distribution assumes that you know the population standard deviation; the t-distribution is based on the sample standard deviation instead.

23
Q

What is a confidence interval?

Do you know how to interpret the confidence interval of the coefficient?

A

A confidence interval (CI) is a range of estimates for an unknown parameter. A confidence interval is computed at a designated confidence level; the 95% confidence level is most common, but other levels, such as 90% or 99%, are sometimes used. The confidence level represents the long-run proportion of corresponding CIs that contain the true value of the parameter. For example, out of all intervals computed at the 95% level, 95% of them should contain the parameter's true value. We are 95% confident that the true value of β1 falls within the confidence interval.

A confidence interval indicates where the population parameter is likely to reside. For example, a 95% confidence interval of the mean [9, 11] suggests you can be 95% confident that the population mean is between 9 and 11.
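For a regression coefficient, the large-sample 95% CI is b1 ± 1.96 × SE(b1). For instance (hypothetical numbers), b1 = 2 with SE(b1) = 0.5 gives [1.02, 2.98].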

24
Q

How to interpret the slope coefficients in a multiple regression model?

A

The coefficient tells you the marginal effect of a regressor: β1 is the partial effect on Y when X1 increases by 1 unit, holding all other regressors constant.

25
Q

Goodness of fit in multiple regression –> how to interpret the adjusted R-squared?

A

In simple regression, there is only one regressor (one independent variable). As we add more regressors, SSR can only decrease, so R² can only increase, even when the added regressors are irrelevant. The adjusted R-squared is a modified version of R-squared that penalizes the number of predictors; in other words, the adjusted R-squared shows whether adding additional predictors actually improves a regression model or not.
If adjusted R² = 1, then the sum of squared residuals (SSR) = 0 → all the sample data (all the observations) fall exactly on the regression line (same interpretation as in simple regression).
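The formula is: adjusted R² = 1 – (1 – R²)·(n – 1)/(n – k – 1), where n is the number of observations and k the number of regressors. The penalty grows with k, so adding a weak regressor can lower the adjusted R² even though R² itself rises.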

26
Q

What is a joint hypothesis test?

A

It is a hypothesis test involving multiple coefficients.

Example:
H0: β1 = β2 = 0
Ha: at least one βi ≠ 0

27
Q

Why do you run an F-test? How do you do that in Stata?

A

We run an F-test to check whether at least one of the independent variables we added to our regression affects the dependent variable Y, i.e. whether at least one of the βi is ≠ 0. To run the F-test, we first run the regression of the unrestricted model (the one with the added variables) and on the next line we write "test" followed by the names of the added independent variables (except the main independent variable). If the p-value < 0.05, we reject the null hypothesis (that all the tested βs are = 0) –> this means that at least one is ≠ 0, so we include them all in the model (because we cannot know which one is ≠ 0).

28
Q

Do you know what the beta coefficient of an independent variable tells us?

A
  • In simple regression, multiple regression, interaction effects, polynomial and logarithmic models: β1 is the marginal effect of X on Y.
  • In models with a binary dependent variable: β1 is the effect on the z-score (we need margins to see the effect on the probability; see the sketch after this list).
  • In models with categorical outcomes: the coefficient is alternative-specific → for each outcome you will have different coefficients (different βs). We need margins to see how much it affects the probability of having one outcome.
  • In models with ordinal outcomes: β1 only indicates a positive/negative effect on the probability of having a certain outcome. We need margins to see how much.
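A minimal sketch of the margins step for a binary outcome, with hypothetical variables employed (0/1) and educ:

  probit employed educ
  margins, dydx(educ)

The margins output reports the average marginal effect of educ on the probability that employed = 1.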
29
Q

Do you know what residuals or errors mean?

A

Errors (or residuals) in regression are the distances between our prediction (the fitted line) and the real, observed values.

30
Q

Can you state a few assumptions behind OLS estimation?

A

Five assumptions (out of 12) of the Gauss-Markov Theorem –> The estimator is BLUE (best linear unbiased estimator) when:

1) Linearity: the parameters we are estimating using the OLS method must be themselves linear.

2) Randomness: our data must have been randomly sampled from the population.

3) Non-Collinearity: the regressors being calculated aren’t perfectly correlated with each other.

4) Exogeneity: the regressors aren’t correlated with the error term.

5) Homoskedasticity: no matter what the values of our regressors might be, the variance of the error is constant.

31
Q

Do you know the difference between beta and b?

A

Beta (β) is the real coefficient (the population coefficient) and we can't know it. b is the estimate that we calculate through regression.

32
Q

Do you know how to interpret coefficient in a linear-log, log-linear and log-log regression model?

A
  • Linear-log: a 1% increase in X1 results in a 0.01·b1 unit increase/decrease in Y, on average.
  • Log-linear: a one-unit increase in X1 results in a 100·b1 percent increase/decrease in Y, on average (worked example after this list).
  • Log-log: a 1% increase in X1 results in a b1% increase in Y, on average (b1 is an elasticity).
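Worked example (hypothetical numbers): in a log-linear wage regression of ln(wage) on education, b1 = 0.08 means one more year of education is associated with roughly an 8% (100 × 0.08) higher wage, on average.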
33
Q

Do you know how to use factor variable notations in Stata to estimate a regression model?

A

Add i. before binary or categorical variables, and c. before continuous variables.
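A minimal sketch with hypothetical variables wage, female (binary) and age (continuous):

  regress wage i.female c.age i.female#c.age

The # operator builds the interaction; ## would also include the main effects automatically (i.female##c.age).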

34
Q

Do you know how to perform an F-test after an estimation using factor variable notations in Stata?

A

testparm i.year (to test whether we need to include time fixed effects in the model). If the p-value < 0.05, we include them.
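The full flow, with hypothetical panel variables y, x and year:

  xtreg y x i.year, fe r
  testparm i.year

testparm runs the joint F-test that all year-dummy coefficients are zero.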

35
Q

Do you know when you should apply which nonlinear models (probit/logit, multinomial logit, ordered probit/logit)?

A
  • Probit/logit: binary outcome
  • Multinomial logit: categorical outcome
  • Oprobit/ologit: ordinal outcome (see command examples after this list)
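Command examples with hypothetical variables:

  logit employed educ age           // binary outcome
  mlogit occupation educ age        // categorical (unordered) outcome
  ologit satisfaction income age    // ordinal outcome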
36
Q

Do you know the difference between percent and percentage points?

A

A percentage point is the simple numerical difference between two percentages.
A percentage is a number or ratio expressed as a fraction of 100.
An increase from 40 per cent to 50 per cent is often described as a 10 per cent increase. However, it is a 10 percentage point increase but a 25 per cent increase –> quite a difference.

37
Q

Do you know the advantages of panel data?

A

Panel data help mitigate OVB, but only OVB caused by unobserved heterogeneity that is fixed across units OR over time. We cannot control for bias that varies across units and over time simultaneously (e.g. national deficit, which varies over time and across countries and can still bias our analysis).

38
Q

In which way is a fixed-effects model superior to pooled OLS?

A

Pooled OLS: used when you have different surveys merged together (repeated cross-sections treated as one sample).

Limitations of pooled OLS: omitted-variable bias; it does not account for fixed effects (e.g. region/city-specific effects) or individual heterogeneity. A fixed-effects model removes the bias from unobserved heterogeneity that is constant within units.

39
Q

Do you know how least square dummy variable and within transformation work?

A
  • LSDV controls only for time FE, by adding a dummy variable for each year/month etc. (i.time_var);
  • The within transformation controls only for entity FE (xtreg …, fe r).
40
Q

Do you know how to set up everything in Stata before estimation?

A

If the dataset isnt’ in long format –> reshape.
Then –> xtset unit_var time_var
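A sketch, assuming a hypothetical wide-format dataset with columns gdp1990, gdp1991, … for each country:

  reshape long gdp, i(country) j(year)
  xtset country year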

41
Q

Do you know how to estimate a panel data model and interpret coefficients?

A

In a panel data model we add the subscript "t" to the variables and to the error term.
The error term u_it can be divided into a_i and v_it. The component a_i captures only unobserved heterogeneity across units (not across time).
The estimation method depends on your data and on your research question (command sketches after this list):

  • LSDV: when you only want to control for time fixed effects (unit-invariant) –> e.g. EU fiscal policies (they vary over time but not over entity, since they always apply to all EU members)
  • Within transformation: when you only want to control for unit fixed effects (time-invariant) –> e.g. geography (varies across countries but not over time)
  • Two-way (LSDV + within transformation): when you want to control for both.
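Command sketches for each case (hypothetical variables y, x and year; panel already xtset):

  regress y x i.year, r      // LSDV: time FE only
  xtreg y x, fe r            // within transformation: entity FE only
  xtreg y x i.year, fe r     // two-way: both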
42
Q

Do you know why we cluster standard errors?

A

The error term u in a model is likely to be correlated within a unit. For example, income level, which explains life satisfaction but is not included in the model, will be contained in the error term u. The errors for the same individual are likely to be correlated across time (if Mark is rich in 1998, he will probably be rich also in 1999 and 2000), while between individuals the correlation is much less likely (Mark's income in 1998 is unlikely to be correlated with Lisa's income in 2000).

Correlation of the errors within the same individual (or entity, e.g. class, country) affects the standard error of the estimated coefficient: although we may have, say, 4 observations for each individual (or entity), the information we get from 4 correlated observations is less than the information we get from 4 independent observations, so some adjustments need to be made when calculating the standard errors. The resulting standard errors are called clustered standard errors. Like heteroskedasticity-robust standard errors, clustered standard errors only affect the results of statistical tests; they do not affect the size of the coefficients. In Stata we cluster standard errors by adding cluster(unit_varname) as an option in the regression line (after r or fe r); we can cluster standard errors even without adding time fixed effects.

In short, we cluster standard errors to take into account error terms that are correlated within each cluster and specific to the cluster (within-group correlation).
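For example, with a hypothetical panel of individuals indexed by id:

  xtreg life_sat income, fe vce(cluster id)

vce(cluster id) is the modern spelling; the older cluster(id) option mentioned above works the same way in most estimation commands.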

43
Q

Do you know what endogeneity is?
Why is it a problem?

A

There is endogeneity when an independent variable is correlated with the error term. This can happen in the case of OVB, measurement error and/or reverse causality (simultaneity).

Endogeneity is a problem because it biases our estimates, so our results can no longer be generalized to the real population.

44
Q

Do you know the difference between external and internal validity?

A
  • Internal validity: when there is internal validity, causal inferences are valid for the population we are studying. We need to ask ourselves whether the sample is representative and whether statistical inferences about causal effects are valid for the population being studied.
  • External validity: can we generalize the results to other populations and settings? One issue is whether findings from a specific experiment (in a specific context) can be generalized (e.g. the training program for Indian women).
45
Q

Do you know what omitted variable bias, functional form misspecification, selection bias, simultaneity bias, and measurement error are?

A

They are the five threats to internal validity.

  • OVB: when we omit a variable that is correlated with an included independent variable and has an effect on the dependent variable.
  • Functional form misspecification: when we choose the wrong functional form for the regression (it can also be seen as a form of OVB).
  • Selection bias (missing data): threatens internal validity if the data are missing conditional on the dependent variable or the error term, OR if the missing data are due to systematic reasons.
  • Simultaneity bias (reverse causality): when X has an effect on Y, but Y also has an effect on X.
  • Measurement error: induces bias when the error is systematic (not random) and/or correlated with an independent variable.
46
Q

Do you know when missing data and measurement errors will induce bias?

A

They will induce bias when they are the result of a systematic error and they won’t be a problem if they are random.