Flashcards in Descriptive analysis and linear regression Deck (77):
What is cross-sectional data?
Data collected simultaneously from a random sample e.g. household surveys, population census
What is time-series data?
Observations of a variable at different times e.g. monthly inflation
What is pooled cross-sectional data?
Observing different random samples at different times
What is panel (longitudinal) data?
Observing the same random sample at different times
Write an example of a population regression function and a sample regression function.
Population: Yi = B1 + B2Xi + ui
Sample: Yi = b1 + b2Xi + ei
Interpret the terms Yi, Xi, ui, B1 and B2
Yi - the dependent/response/regressand variable
Xi - the independent/explanatory/regressor variable
ui - error term (we do not observe this)
B1 - the expected value of Yi when Xi=0
B2 - the average/expected change in Yi for a one-unit increase in Xi
Interpret E(Yi | Xi) = B1 + B2Xi
This is the conditional mean of Y given X, which represents the deterministic component of the regression function.
What is Ordinary Least Squares (OLS)?
OLS is a method for estimating the parameters of a linear function that relates a dependent variable to a set of independent variables.
It does this by minimising the sum of the squared differences between the observed values of the dependent variable in the sample and the fitted/predicted values given by the linear function.
In other words, the OLS method minimises the residual sum of squares (RSS).
For the two-variable model, OLS finds the pair of estimates, b1 and b2, which minimises RSS = ∑ei^2. (Some texts call this the error sum of squares; in this deck ESS refers to the explained sum of squares.)
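As an illustration of the minimisation described above, here is a minimal Python sketch of the closed-form OLS solution for the two-variable model. The data are made up and the function names (ols_simple, rss) are hypothetical, not part of the course material.

```python
# Simple OLS for Yi = b1 + b2*Xi + ei, using made-up illustrative data.

def ols_simple(x, y):
    """Return (b1, b2), the intercept and slope that minimise the RSS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


def rss(x, y, b1, b2):
    """Residual sum of squares for a candidate pair (b1, b2)."""
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, b2 = ols_simple(x, y)
print(round(b1, 4), round(b2, 4))
```

Moving either estimate away from the OLS values (for example b1 + 0.1) gives a larger RSS, which is what "minimises" means here.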
What is a confidence interval and how is it calculated?
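A confidence interval is a range of values, calculated from the sample, that will contain the true population parameter in a specified proportion of repeated samples, for example 95% of the time.
For example, the 95% confidence interval for B2:
[b2 - 1.96 x se(b2), b2 + 1.96 x se(b2)]
(1.96 is the large-sample critical value from the normal distribution; in small samples use the t critical value with n - k degrees of freedom)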
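A small Python sketch of the interval formula above. The estimate 0.5 and standard error 0.1 are hypothetical numbers chosen for illustration, and 1.96 assumes a large sample.

```python
def confidence_interval(estimate, std_error, critical_value=1.96):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


# Hypothetical estimate b2 = 0.5 with se(b2) = 0.1
lower, upper = confidence_interval(0.5, 0.1)
print(round(lower, 3), round(upper, 3))  # 0.304 0.696
```

Since 0 lies outside this interval, the null B2 = 0 would be rejected at the 5% level.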
What are the seven classical assumptions of OLS?
1. Model is LINEAR in the regression coefficients (parameters)
2. Regressors are assumed to be FIXED (i.e. NON-STOCHASTIC)
3. EXOGENEITY
4. HOMOSCEDASTICITY
5. UNCORRELATED ERRORS
6. NO MULTICOLLINEARITY
7. NORMALITY OF ui
How can a population regression function be split up into components, and what are these components called?
Deterministic component: E(Yi | Xi) = B1 + B2Xi
(for the sample regression function this part relates to the fitted values of Y, denoted by Y-hat)
Random component: ui
What do the standard errors of the coefficient estimates measure?
What does a large standard error imply?
Standard errors measure the sampling variability of the coefficient estimates: each one is an estimate of the standard deviation of the corresponding estimator across repeated samples.
The larger the standard error, the greater the variability and the less certainty there is about the true magnitude of the coefficient.
What does it mean that the regressors are fixed (non-stochastic)?
It means the values of X are treated as known constants that would stay the same in repeated samples, so the only source of randomness in Y is the error term. A variable is a random variable if a probability distribution p(x) describes the probability of each value it might take; a fixed regressor has no such distribution assumed for it.
Define exogeneity (also covered in part 2 of course)
Exogeneity means that there is no systematic relationship between the error terms, u, and the independent variable, X.
E(ui | X) = 0
Define homoscedasticity.
Homoscedasticity means that ui has constant variance given X:
var(ui | Xi) = σ^2
Write out formally that errors are uncorrelated.
Uncorrelated errors: cov(ui, uj | X) = 0 for i ≠ j
What does multicollinearity imply if there is just one regressor?
Multicollinearity means that there is a linear relationship between regressors.
With just one regressor, no multicollinearity means that Xi has to take at least two different values in the data (otherwise Xi is collinear with the intercept).
Write out formally that ui follows a normal distribution.
Normality of ui: ui ∼ N (0, σ^2)
What does it mean if OLS is BLUE?
Best Linear Unbiased Estimators (BLUE).
This means that, among all LINEAR UNBIASED estimators, the OLS estimators are BEST in the sense of having MINIMUM VARIANCE (i.e. efficient), and they are UNBIASED: on average, the ESTIMATED parameters are EQUAL to their TRUE VALUES i.e. E(bk) = Bk
Write out the formula for calculating the test statistic (t-ratio)
t = (b2 - B2)/se(b2), where B2 is the value hypothesised under the null (often 0)
Under what conditions do we reject the null hypothesis in a hypothesis test?
We reject the null hypothesis if the ABSOLUTE value of the t-ratio is GREATER than the CRITICAL VALUE, which is determined by the specified significance level.
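The t-ratio and the rejection rule can be sketched in a few lines of Python. The estimates and standard errors are hypothetical, and the default critical value of 1.96 assumes a two-sided test at the 5% level in a large sample.

```python
def t_ratio(estimate, hypothesised, std_error):
    """t = (b2 - B2) / se(b2), with B2 the value under the null."""
    return (estimate - hypothesised) / std_error


def reject_null(t, critical_value=1.96):
    """Reject when the absolute t-ratio exceeds the critical value."""
    return abs(t) > critical_value


t_large = t_ratio(0.5, 0.0, 0.2)   # 2.5 -> reject
t_small = t_ratio(0.3, 0.0, 0.2)   # 1.5 -> do not reject
print(reject_null(t_large), reject_null(t_small))
```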
Write out the equation for calculating the degrees of freedom
Degrees of freedom = n - k
n - number of observations
k - number of regression coefficients, including the intercept
Degrees of freedom are the number of values which remain free to vary once the k coefficients have been estimated. This is important for hypothesis testing because the degrees of freedom determine which t, chi-square or F distribution supplies the critical value for the test.
What is a type 1 error in hypothesis testing?
A type 1 error is the INCORRECT REJECTION of a TRUE NULL hypothesis.
In other words a "false positive".
The type 1 error rate is the significance level (e.g. 5%)
What is a type 2 error in hypothesis testing?
How is a type 2 error denoted?
How does the type 2 error rate depend on the magnitude of the coefficient concerned?
A type 2 error is the FAILURE to REJECT a FALSE NULL hypothesis.
In other words a "false negative".
A type 2 error is denoted β and relates to the power of a test (power = 1 - β)
The type 2 error rate depends on the magnitude of the true coefficient. The larger the coefficient (relative to its standard error), the more likely we are to reject a false null, so the lower the type 2 error rate.
Interpret the P-value
The p-value is how UNLIKELY it would be to see a T-RATIO of that magnitude if the NULL hypothesis were TRUE.
Alternatively, the p-value is the PROBABILITY of obtaining a result EQUAL to or MORE EXTREME than what was actually observed, given a TRUE NULL hypothesis.
For example, a p-value of 0.027 means that there is a 2.7% chance that an observation of this magnitude, or more extreme, would be obtained under the null. Therefore we do reject the null at the 5% level but not at the 1% level.
Given a particular p-value, how do we decide whether to reject the null hypothesis or not?
Given a particular p-value, we REJECT the NULL at any SIGNIFICANCE LEVEL GREATER than the p-value.
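As a rough sketch of this decision rule in Python: the two-sided p-value below uses the standard normal distribution as a large-sample approximation to the t distribution (an assumption; a small-sample analysis would use the t distribution with n - k degrees of freedom). The function names are illustrative.

```python
import math


def two_sided_p_value(t):
    """P(|Z| >= |t|) for a standard normal Z (large-sample approximation)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def reject_at(p_value, alpha):
    """Reject the null when the significance level exceeds the p-value."""
    return p_value < alpha


p = two_sided_p_value(1.96)
print(round(p, 3))
print(reject_at(0.027, 0.05), reject_at(0.027, 0.01))
```

The last line reproduces the p = 0.027 example: rejected at the 5% level but not at the 1% level.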
What is the notation for the significance level of a hypothesis test?
Significance level is denoted by α, alpha.
Define coverage probability
Coverage probability is the proportion of time that the confidence interval contains the true value of interest.
What is a dummy variable? What type of information do dummy variables capture?
A dummy variable only takes two values, usually 0 or 1, to indicate the absence or presence of some categorical effect that is expected to affect the outcome variable.
Dummy variables capture qualitative information such as gender, ethnicity etc.
Name three ways to test a hypothesis.
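1. Test statistic (t-ratio) compared with a critical value
2. P-value compared with the significance level
3. Confidence interval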
Name one problem of a regression function that may lead to the exogeneity assumption not holding, and correlation being confused with causation?
Omitted variable bias, which is one source of endogeneity.
The regression leaves out factors (confounders) which affect the outcome and are correlated with the included regressors, and this biases the estimates of the coefficients.
Define omitted variable bias or endogeneity.
What are the two types of bias that this can lead to?
Omitted variable bias occurs when a model leaves out one or more important factors, which leads to a 'bias' effect. The model compensates for the missing factor(s) by over or under-estimating the effect of the other factors on the outcome, therefore we will obtain incorrect and misleading estimates for the parameters contained within the regression function.
Positive bias, where the effect of one (or more) of the included regressors is over-estimated (this happens when the omitted variable's effect on the outcome and its correlation with the included regressor have the same sign)
Negative bias, where the effect of one (or more) of the included regressors is under-estimated (this happens when the omitted variable's effect on the outcome and its correlation with the included regressor have opposite signs)
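A small Python sketch of omitted variable bias, using made-up noise-free data so the arithmetic is exact. The true model is y = 1 + 2*x1 + 3*x2; regressing y on x1 alone gives a slope equal to the true coefficient plus (effect of x2) times (slope of x2 on x1). All names and numbers are hypothetical.

```python
def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]          # positively correlated with x1
true_b2, true_b3 = 2.0, 3.0
y = [1.0 + true_b2 * a + true_b3 * b for a, b in zip(x1, x2)]

b_short = slope(x1, y)    # slope when x2 is omitted
delta = slope(x1, x2)     # how x2 moves with x1
print(b_short, true_b2 + true_b3 * delta)
```

Here x2 raises y and is positively correlated with x1, so the short regression over-estimates the effect of x1 (positive bias).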
How do we capture a regressor which has a non-linear effect on the dependent variable in a regression function?
Include a quadratic term, such as
wagei = B1 + B2educi + B3(educi )^2 + ui
For the quadratic regression function:
wagei = B1 + B2educi + B3(educi )^2 + ui
Interpret the parameters B1, B2 and B3
B1 - represents the average/expected wage for an individual with 0 years of education
B2 - represents the slope of the function at educ=0
B3 - represents whether marginal returns are increasing (positive B3) or decreasing (negative B3) in education
Define what a marginal effect is and how we calculate this for linear and non-linear variables.
The marginal effect of a variable is the effect of increasing that variable by one unit on the dependent variable, holding all others equal/constant.
Marginal effect of a linear variable - this is simply the coefficient of the regressor
Marginal effect of a non-linear variable - this is found by taking the first partial derivative of the regression function with respect to that variable. The marginal effect will depend on the value of that variable (i.e. will change as the variable increases/decreases).
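For the quadratic wage function, the first derivative with respect to educ is B2 + 2*B3*educ. A minimal Python sketch, with hypothetical coefficient values chosen only for illustration:

```python
def marginal_effect_quadratic(b2, b3, educ):
    """d(wage)/d(educ) for wage = B1 + B2*educ + B3*educ^2."""
    return b2 + 2.0 * b3 * educ


B2, B3 = 1.2, -0.03   # hypothetical coefficients (decreasing returns)
me_at_0 = marginal_effect_quadratic(B2, B3, 0.0)    # equals B2
me_at_10 = marginal_effect_quadratic(B2, B3, 10.0)
turning_point = -B2 / (2.0 * B3)   # educ at which the marginal effect is zero
print(me_at_0, me_at_10, turning_point)
```

The marginal effect falls as educ rises because B3 is negative, in line with the interpretation of B3 given earlier.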
Define what an interaction is.
An interaction occurs when an independent/explanatory variable has a different effect on the dependent/outcome variable depending on the value taken by another independent/explanatory variable.
Give an example of when an interaction variable might be used in a regression function.
Interpret the coefficients in each regression function.
Example 1: To allow the effect of education on wages to differ between men and women:
wage = B1 + B2*female + B3*education + B4*female*education + ui
B3 - represents the difference in wages associated with one extra year of education for men
B4 - represents the difference BETWEEN women and men in the wage change associated with one extra year of education (if B4 is negative, an extra year of education raises women's wages by that amount less than men's)
Example 2: To allow the effect of bacteria on the height of a plant to differ at different levels of sunlight:
Height = B0 + B1*Bacteria + B2*Sun + B3*Bacteria*Sun + ui
B3 - represents how much the effect of bacteria on height changes for each one-unit increase in sunlight
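To make the wage example concrete: the effect of one extra year of education is B3 + B4*female, so it is B3 for men and B3 + B4 for women. A short Python sketch with hypothetical coefficient values:

```python
def education_effect(b3, b4, female):
    """Wage change for one more year of education: B3 + B4*female."""
    return b3 + b4 * female


B3, B4 = 0.9, -0.2   # hypothetical coefficients
effect_men = education_effect(B3, B4, 0)
effect_women = education_effect(B3, B4, 1)
print(effect_men, round(effect_women, 2))
```

The gap between the two effects is exactly B4, which is why B4 is read as a difference between groups rather than as a wage level.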
How do you transform a non-linear model into a linear one?
For a multiplicative model (e.g. Y = A x X2^B2 x X3^B3), take natural logarithms of both sides and add an error term. The result, ln(Y) = ln(A) + B2ln(X2) + B3ln(X3) + u, is linear in the parameters.
How do you interpret a log-log model?
The slope coefficients are elasticities. Each coefficient is the partial elasticity of the dependent/outcome variable with respect to the associated independent/explanatory variable, all else constant.
When the model is a production function, the sum of the slope coefficients indicates the returns to scale.
Constant returns to scale: sum of coefficients = 1
Increasing returns to scale: sum of coefficients>1
Decreasing returns to scale: sum of coefficients<1
How do you interpret a log-lin model?
The slope coefficients measure the relative change in the outcome variable for a 1-unit absolute change in the associated explanatory variable; multiplied by 100, a coefficient gives the approximate percentage change (this is known as semi-elasticity)
How do you interpret a lin-log model?
The slope coefficients measure the absolute change in the outcome variable for a relative change in the associated explanatory variable: a 1% increase in the explanatory variable changes the outcome by approximately (coefficient/100) units.
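The log-lin and lin-log readings are approximations that work well for small changes. The Python sketch below compares the approximate and exact figures; the coefficient values are hypothetical.

```python
import math


def log_lin_pct_change(b, dx=1.0):
    """ln(Y) = ... + b*X: percentage change in Y when X rises by dx units."""
    approx = 100.0 * b * dx
    exact = 100.0 * (math.exp(b * dx) - 1.0)
    return approx, exact


def lin_log_abs_change(b, pct_change_x=1.0):
    """Y = ... + b*ln(X): absolute change in Y when X rises by pct_change_x %."""
    approx = b * pct_change_x / 100.0
    exact = b * math.log(1.0 + pct_change_x / 100.0)
    return approx, exact


print(log_lin_pct_change(0.08))    # approx 8%, exact slightly higher
print(lin_log_abs_change(50.0))    # approx 0.5 units, exact slightly lower
```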
Define elasticity and give an example of an elasticity and how it is calculated.
Elasticity is a ratio; the percentage change in one variable, divided by the percentage change in another.
For example, the price elasticity of a product, where price is denoted P and the quantity produced of the product is Q is given by:
price elasticity = %∆Q/%∆P
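A worked calculation in Python, with made-up prices and quantities: the price rises from 10 to 11 (+10%) and the quantity falls from 100 to 95 (-5%).

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


def elasticity(q_old, q_new, p_old, p_new):
    """Price elasticity = %change in Q divided by %change in P."""
    return pct_change(q_old, q_new) / pct_change(p_old, p_new)


e = elasticity(100.0, 95.0, 10.0, 11.0)
print(round(e, 2))  # -0.5
```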
How do you test for constant returns to scale in a regression with multiple coefficients?
Create a hypothesis test:
H0: A1 + A2 = 1 (constant returns to scale)
H1: A1 + A2 ≠ 1 (not constant returns to scale)
The standard error of (a1 + a2) cannot be calculated from the two individual standard errors alone, because the estimation errors could be correlated: var(a1 + a2) = var(a1) + var(a2) + 2cov(a1, a2). However, we can use the lincom command in Stata to calculate this standard error.
lincom x2 + x3
(even though we use the variable names in the command, Stata tests the significance of the linear combination of their coefficient estimates; to test H0 directly, subtract the hypothesised value: lincom x2 + x3 - 1)
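The reason the covariance matters can be seen from var(a1 + a2) = var(a1) + var(a2) + 2cov(a1, a2). The Python sketch below applies that formula to hypothetical estimates, standard errors and covariance; it illustrates the arithmetic only and is not a substitute for the Stata command.

```python
import math


def se_of_sum(se1, se2, cov12):
    """Standard error of (a1 + a2) given se(a1), se(a2) and cov(a1, a2)."""
    return math.sqrt(se1 ** 2 + se2 ** 2 + 2.0 * cov12)


# Hypothetical estimates and (co)variances
a1, a2 = 0.60, 0.35
se1, se2, cov12 = 0.03, 0.04, -0.0005

se_sum = se_of_sum(se1, se2, cov12)
t = (a1 + a2 - 1.0) / se_sum   # t-ratio for H0: A1 + A2 = 1
print(round(se_sum, 4), round(t, 3))
```

With these numbers the absolute t-ratio is below 1.96, so constant returns to scale would not be rejected at the 5% level (large-sample critical value assumed).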
What does the lincom command in Stata do?
Lincom command reports the standard errors and confidence intervals for a linear combination of coefficients.
For example, lincom X2 + 3*X3 - 1
What does the coefficient of determination, R^2, measure and how do we calculate it?
What does R^2=0 and R^2=1 indicate?
R^2 measures the overall goodness-of-fit of the estimated regression, or alternatively the proportion of the total variation in the outcomes, Yi, that is explained by the regressors.
R^2 = (ESS/TSS) = 1 - (RSS/TSS)
R^2 = 0 implies no fit
R^2 = 1 implies perfect fit (all variation in the outcome variable is explained by the regressors)
What is the Total Sum of Squares (TSS) and how do we calculate it?
TSS tells you how well we can predict the outcomes without any regressors.
Alternatively, TSS represents how much variation there is in the dependent variable.
It is calculated by summing the squared differences between the actual/observed outcomes and the sample mean - in other words the sum of the squared deviations from the sample mean.
TSS = ∑(Yi - Y-bar)^2
What is the Explained/Model Sum of Squares (ESS) and how do we calculate it?
ESS tells you how much of variation in the outcome/dependent variable is explained by the regressors in the model.
It is calculated by summing the squared differences between the predicted/fitted outcomes and the sample mean.
ESS = ∑(Y-hat_i - Y-bar)^2
What is the Residual Sum of Squares (RSS) and how is it calculated?
RSS tells us how much of the variation in the outcome/dependent variable is not explained by the regressors in the model, and which is therefore captured in the residuals instead.
In other words, RSS tells us everything we cannot predict.
It is calculated by summing the squared residuals.
RSS = ∑ei^2
How are TSS, ESS and RSS related?
TSS = ESS + RSS
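The decomposition can be checked numerically. The Python sketch below fits a simple regression to made-up data and computes the three sums of squares; the function name is hypothetical.

```python
def sums_of_squares(x, y):
    """Fit Yi = b1 + b2*Xi by OLS and return (TSS, ESS, RSS)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b1 = y_bar - b2 * x_bar
    fitted = [b1 + b2 * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ess = sum((fi - y_bar) ** 2 for fi in fitted)          # explained
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    return tss, ess, rss


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
tss, ess, rss = sums_of_squares(x, y)
r_squared = ess / tss
print(round(tss, 3), round(ess, 3), round(rss, 3), round(r_squared, 4))
```

The identity TSS = ESS + RSS holds for OLS with an intercept, which is also why R^2 = ESS/TSS and 1 - RSS/TSS give the same number.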
What happens to R^2 when more regressors are added to a regression function, why is this a problem and how can we remedy this?
R^2 will never decrease, and will almost always increase, even if arbitrary regressors are added (i.e. indicating that the goodness of fit has improved, and more variation is explained by the regressors).
This makes it difficult to accurately compare the fit of different regression specifications with different numbers of regressors.
To enable us to compare different regression specifications accurately, we must use the adjusted R^2 which adjusts for the number of regressors in a regression.
How do we calculate adjusted R^2?
adjR^2 = 1 - [(1 - R^2)(n - 1)/(n - k)]
n - number of observations
k - number of regression coefficients, including the intercept
What values can adjusted R^2 take?
Adjusted R^2 can take both positive and negative values.
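A short Python sketch of the adjustment, assuming the usual formula 1 - (1 - R^2)(n - 1)/(n - k) with k counting the intercept. The inputs are hypothetical; the second example shows a negative value.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 with n observations and k coefficients (incl. intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


adj_ok = adjusted_r_squared(0.50, 21, 3)    # moderately good fit
adj_neg = adjusted_r_squared(0.05, 11, 3)   # weak fit, few observations
print(round(adj_ok, 4), round(adj_neg, 4))
```

A negative adjusted R^2 arises when the regressors explain so little that the penalty for the number of coefficients outweighs the fit.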
What is a necessary condition for using R^2 or adjusted R^2 to compare regression models?
The models must have the same dependent variable.
For example, if two models use wage and ln(wage) respectively, then R^2 or adjusted R^2 cannot be used to compare them.
Define an F-statistic/F-test of overall significance and state the null and alternative hypotheses.
The F-statistic tests the null hypothesis that all of the slope coefficients (not including the intercept) are simultaneously equal to zero.
Alternatively, the F-test compares a model with no predictors/regressors (called an intercept only model) to the specified model.
H0: the fit of the intercept only model and the specified model are equal, meaning that none of the regressors explain the variation in the dependent variable.
H1: the fit of the intercept only model is significantly worse than that of the specified model, meaning that at least one of the regressors explains some of the variation in the outcome variable.
How do we calculate the F-statistic?
F = [ESS/(k - 1)] / [RSS/(n - k)]
k - number of coefficients, including the intercept
n - number of observations
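The F formula can be written in terms of the sums of squares or, equivalently, in terms of R^2. A minimal Python sketch with hypothetical inputs (ESS = 39.601, RSS = 0.107, n = 5, k = 2):

```python
def f_statistic(ess, rss, n, k):
    """F = [ESS/(k - 1)] / [RSS/(n - k)], with k including the intercept."""
    return (ess / (k - 1)) / (rss / (n - k))


def f_from_r_squared(r2, n, k):
    """Equivalent form: F = [R^2/(k - 1)] / [(1 - R^2)/(n - k)]."""
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))


ess, rss, n, k = 39.601, 0.107, 5, 2
f = f_statistic(ess, rss, n, k)
r2 = ess / (ess + rss)
print(round(f, 1), round(f_from_r_squared(r2, n, k), 1))
```

The computed F is compared with the critical value of the F distribution with (k - 1, n - k) degrees of freedom.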
What is multicollinearity?
Multicollinearity is when one of the explanatory variables in a multiple regression model is a linear combination of some of the other explanatory variables.
In other words, variables are collinear if you can accurately predict one by knowing the values of the others (e.g. male/female are perfectly collinear; age/education are highly collinear)
What is the positive consequence of multicollinearity, and what are the negative consequences?
If there is multicollinearity then the OLS estimators are still BLUE.
However, the estimated impact of the regressors on the dependent variable is less accurate than if the regressors were uncorrelated because it is difficult to isolate the impact of each explanatory variable on the outcome variable.
The standard errors of the affected coefficients tend to be large, which results in small t-ratios, so we are more likely to fail to reject a false null of no effect (Type 2 error).
There will be wider confidence intervals, but R^2 may still be high which is misleading.
Small changes in the sample data set can lead to large changes in the model.
What does no multicollinearity look like for a regression with only one regressor?
For a regression with only one regressor, no multicollinearity means Xi has to take at least two different values in the data.
Define perfect multicollinearity.
Perfect multicollinearity means that there is a linear relationship between two or more regressors, where one regressor is always equal to a linear combination of the others
Define imperfect multicollinearity.
Imperfect multicollinearity means that one explanatory variable is a linear combination of other explanatory variables plus a small error term.
In this case, distinguishing between the effects of different regressors is possible BUT may be very hard if they are strongly correlated
How can Stata help with collinearity problems?
Stata can detect perfect collinearity and will drop variables to remove this effect.
However, Stata does not drop variables when there is imperfect multicollinearity.
What are two solutions for a multicollinearity problem?
1. Find data that provides more independent variation of the variables
2. Remove non-essential variables - for example by combining related variables.
How many dummy variables do we need in order to distinguish between m categories?
To distinguish between m categories, we need to introduce (m-1) dummy variables in the regression if the regression model includes an intercept.
What is the dummy variable trap?
The dummy variable trap is where m dummy variables are included to distinguish between m categories, meaning that one of the dummy variables is redundant.
This means the dummy variables are perfectly collinear with the intercept, because the m dummies always sum to 1.
To solve this problem, one of the collinear categories needs to be removed - it does not matter which as the one which is removed will be the reference category.
Give an example of a dummy variable trap.
An example of a dummy variable trap could be in the case of gender (male/female), dummy variables for both male AND female are introduced into the model. These are perfectly collinear since if you know the value of one variable, you can accurately predict the value of the other.
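The gender example can be written out directly. In this Python sketch (made-up data), the male and female dummies add up to the intercept column in every row, which is exactly the perfect collinearity that defines the trap.

```python
# Hypothetical sample of five individuals
female = [1, 0, 1, 1, 0]
male = [1 - f for f in female]          # male is fully determined by female
intercept = [1] * len(female)           # the constant term in the regression

dummy_sum = [m + f for m, f in zip(male, female)]
trap = dummy_sum == intercept           # True: male + female = intercept
print(male, trap)
```

Dropping either dummy breaks the exact linear relationship, and the dropped category becomes the reference category.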
What is heteroscedasticity?
Heteroscedasticity is where the variability of the error terms is unequal across the range of values of X (the predictor variable).
var(ui | Xi) = σi^2
What are the consequences of heteroscedasticity?
OLS estimators remain consistent and unbiased.
However, we get incorrect standard errors, which leads to incorrect t-tests and F-tests. Additionally, OLS is no longer efficient with minimum variance.
What are three solutions for the heteroscedasticity problem?
1. Change the specification of the model
e.g. logarithmic transformation - this helps because variables that present heteroscedasticity tend to have higher levels of variation for higher values, therefore taking the log smooths this trend
2. Compute heteroscedasticity-robust standard errors (AKA Huber-White or HC standard errors)
3. Use Weighted Least Squares to estimate the regression
What are heteroscedasticity-robust standard errors and what are they used for?
Heteroscedasticity-robust standard errors (aka Huber-White or HC standard errors) correct standard errors, t-statistics, p-values and confidence intervals to allow for heteroscedasticity WITHOUT CHANGING the OLS coefficients.
HC standard errors can be calculated using Stata.
It is advisable to always use robust standard errors.
What is the disadvantage of heteroscedasticity-robust standard errors?
The disadvantage of HC standard errors is that you need large samples to estimate them accurately - though there is no precise rule.
What is Weighted Least Squares and what is it used for?
Weighted Least Squares is another approach in regression analysis, like OLS, to approximate the unknown parameters in a regression model.
WLS can be more efficient than OLS in the presence of heteroscedasticity - containing less variance.
WLS weights each observation by the inverse of an estimate of its error variance, which reduces the influence of 'noisy' observations with large error variances.
However, the true error variances are not known, so a conjecture is used.
This approach is not widely used.
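A minimal Python sketch of WLS for a single regressor, assuming weights equal to 1/variance. The data and the conjecture that the error standard deviation is proportional to x are made up for illustration, and the function name is hypothetical.

```python
def wls_simple(x, y, weights):
    """Minimise sum of w_i * (y_i - b1 - b2*x_i)^2; return (b1, b2)."""
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum   # weighted mean
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Conjecture (an assumption): error s.d. is proportional to x
weights = [1.0 / xi ** 2 for xi in x]
b1_wls, b2_wls = wls_simple(x, y, weights)

# With equal weights WLS reduces to OLS
b1_ols, b2_ols = wls_simple(x, y, [1.0] * len(x))
print(round(b1_wls, 4), round(b2_wls, 4), round(b1_ols, 4), round(b2_ols, 4))
```

Observations with small x get the largest weights here because, under the conjecture, they are the least noisy.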
Name two different types of problem associated with regression analysis, and explain their consequences.
1. Inclusion of irrelevant variables/regressors in the regression model. This leads to...
- inferences become less precise BUT are not invalidated
- estimates remain unbiased and consistent
- estimates become inefficient (large variance and standard errors)
- coefficient estimates become less precise
- possible problems with multicollinearity
2. Omission of relevant variables (i.e. omitted variable bias)
This relates to the exogeneity assumption in classical OLS...
- exogeneity assumption no longer holds: E(ui | Xi) ≠ 0.
- in other words the model is not 'correctly specified' and the error terms will be correlated with the regressors.
What is the skewness of a normal distribution?
The skewness of a normal distribution is zero.
What do negative values for skewness indicate?
Negative values for skewness indicate left skew (i.e. large/long left tail).
What do positive values of skewness indicate?
Positive values for skewness indicate right skew (i.e. large/long right tail)
What is kurtosis and what is the kurtosis value for a standard normal distribution?
Kurtosis is a measure of tail thickness.
The kurtosis value for a standard normal distribution is three.
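Skewness and kurtosis can be computed from central moments. The Python sketch below uses the simple moment-based definitions (third moment over the 1.5 power of the variance, and fourth moment over the squared variance), so a normal distribution would have kurtosis 3; software packages may apply small-sample corrections or report excess kurtosis instead. The data are made up.

```python
def central_moment(data, order):
    """Average of (value - mean)^order."""
    n = len(data)
    mean = sum(data) / n
    return sum((v - mean) ** order for v in data) / n


def skewness(data):
    """Third central moment divided by variance^1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Fourth central moment divided by variance^2 (normal = 3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]
left_skewed = [-10.0, -2.0, -1.0, -1.0, -1.0]
print(skewness(symmetric), skewness(right_skewed) > 0, skewness(left_skewed) < 0)
print(round(kurtosis(symmetric), 2))
```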