Descriptive analysis and linear regression Flashcards

Question

Interpret the P-value

Answer 1

The p-value is how UNLIKELY it would be to see a T-RATIO of that magnitude if the NULL hypothesis were TRUE. Alternatively, the p-value is the PROBABILITY of obtaining a result EQUAL to or MORE EXTREME than what was actually observed, given a TRUE NULL hypothesis. For example, a p-value of 0.027 means that there is a 2.7% chance that an observation of this magnitude, or more extreme, would be obtained under the null. Therefore we reject the null at the 5% level (0.027 < 0.05) but not at the 1% level (0.027 > 0.01).

Answer 2

Given a particular p-value, we REJECT the NULL at any SIGNIFICANCE LEVEL GREATER than the p-value.
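
The decision rule above can be sketched in a few lines of Python. This is only an illustration: the t-ratio of 2.21 is a hypothetical value, and the p-value uses a standard normal approximation (a large-sample assumption) rather than the exact t distribution.

```python
import math

def two_sided_p_value(t_ratio: float) -> float:
    """Two-sided p-value under a standard normal approximation (assumes a large sample)."""
    return math.erfc(abs(t_ratio) / math.sqrt(2))

def reject_null(p_value: float, alpha: float) -> bool:
    """Reject the null whenever the significance level exceeds the p-value."""
    return p_value < alpha

p = two_sided_p_value(2.21)  # hypothetical t-ratio
decisions = {alpha: reject_null(p, alpha) for alpha in (0.10, 0.05, 0.01)}
print(round(p, 3), decisions)
```

With this t-ratio the p-value is close to 0.027, so the null is rejected at the 10% and 5% levels but not at the 1% level.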

Answer 3

Significance level is denoted by α, alpha.

Answer 4

Coverage probability is the proportion of time that the confidence interval contains the true value of interest.

Answer 5

A dummy variable takes only two values, usually 0 or 1, to indicate the absence or presence of some categorical effect that is expected to affect the outcome variable. Dummy variables capture qualitative information such as gender, ethnicity etc.

Answer 6

1. T-ratio 2. P-value 3. Confidence interval

Answer 7

Omitted variable bias, a form of endogeneity. | This arises when the model ignores factors (confounders) that affect the relationship, which biases the estimates of the coefficients.

Answer 8

Omitted variable bias occurs when a model leaves out one or more important factors that are correlated with the included regressors, which leads to a 'bias' effect. The model compensates for the missing factor(s) by over- or under-estimating the effect of the other factors on the outcome, so we obtain incorrect and misleading estimates of the parameters in the regression function. Positive bias: the effect of one (or more) of the included regressors is biased upward (over-estimated). This happens when the omitted variable's effect on the outcome and its correlation with the included regressor have the same sign. Negative bias: the effect of one (or more) of the included regressors is biased downward (under-estimated). This happens when the omitted variable's effect on the outcome and its correlation with the included regressor have opposite signs.

Answer 9

Include a quadratic term, such as | wage_i = B1 + B2*educ_i + B3*(educ_i)^2 + u_i

Answer 10

B1 - represents the average/expected wage for an individual with 0 years of education. B2 - represents the slope of the function at educ = 0. B3 - represents whether marginal returns to education are increasing (positive B3) or decreasing (negative B3).

Answer 11

The marginal effect of a variable is the effect on the dependent variable of increasing that variable by one unit, holding all others equal/constant. Marginal effect of a linear variable: this is simply the coefficient of the regressor. Marginal effect of a non-linear variable: this is found by taking the first partial derivative of the regression function with respect to that variable. The marginal effect will depend on the value of that variable (i.e. will change as the variable increases/decreases).
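
As an illustration, for a quadratic wage equation of the form wage = B1 + B2*educ + B3*educ^2 + u, the first derivative with respect to educ is B2 + 2*B3*educ. The sketch below evaluates it at several education levels. The coefficient values are hypothetical, chosen only to show diminishing returns.

```python
def marginal_effect_quadratic(b2: float, b3: float, educ: float) -> float:
    """First derivative of b1 + b2*educ + b3*educ**2 with respect to educ."""
    return b2 + 2.0 * b3 * educ

# Hypothetical coefficients: positive b2, negative b3 (decreasing marginal returns).
b2, b3 = 1.5, -0.05
effects = {years: marginal_effect_quadratic(b2, b3, years) for years in (0, 10, 15)}
print(effects)
```

At educ = 0 the marginal effect equals B2. With a negative B3 it shrinks as education rises.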

Answer 12

An interaction occurs when an independent/explanatory variable has a different effect on the dependent/outcome variable depending on the value taken by another independent/explanatory variable.

Answer 13

Example 1: To allow the effect of education on wages to differ between men and women: wage = B1 + B2*female + B3*education + B4*female*education + u_i. B3 - represents the difference in wages associated with one extra year of education for men. B4 - represents the difference in that return to one extra year of education BETWEEN women and men (if B4 is negative, an extra year of education raises women's wages by that amount less than men's).
Example 2: To allow the effect of bacteria on the height of a plant to differ at different levels of sunlight: Height = B0 + B1*Bacteria + B2*Sun + B3*Bacteria*Sun + u_i. B3 - represents how much the effect of bacteria on height changes for each additional unit of sunlight.

Answer 14

To transform a non-linear model into a linear one take natural logarithms of both sides and add an error term.

Answer 15

The slope coefficients are elasticities. Each coefficient is the partial elasticity of the dependent/outcome variable with respect to the associated independent/explanatory variable, all else constant. The sum of the coefficients indicates whether the function has constant returns to scale. Constant returns to scale: sum of coefficients = 1. Increasing returns to scale: sum of coefficients > 1. Decreasing returns to scale: sum of coefficients < 1.
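
A minimal sketch of the classification rule, using hypothetical elasticity estimates. It compares point estimates only. Deciding whether the sum differs significantly from 1 requires a formal hypothesis test.

```python
def returns_to_scale(elasticities, tol=1e-9):
    """Classify returns to scale from the sum of log-log slope coefficients."""
    total = sum(elasticities)
    if abs(total - 1.0) <= tol:
        return "constant"
    return "increasing" if total > 1.0 else "decreasing"

# Hypothetical elasticity estimates for two inputs.
print(returns_to_scale([0.3, 0.7]))
print(returns_to_scale([0.6, 0.6]))
print(returns_to_scale([0.3, 0.5]))
```

The tolerance guards against floating-point rounding when the coefficients sum to 1.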

Answer 16

The slope coefficients measure the relative change in the outcome variable for a 1-unit absolute change in the associated explanatory variable. Multiplied by 100, a coefficient gives the approximate percentage change (this is known as a semi-elasticity).
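
The "coefficient times 100" reading is an approximation that works well for small coefficients. When the outcome is in natural logs, the exact percentage change is 100*(exp(b) - 1). The sketch below compares the two for a hypothetical coefficient of 0.08.

```python
import math

def approx_pct_change(b: float) -> float:
    """Approximate percentage change in the outcome for a 1-unit change in the regressor."""
    return 100.0 * b

def exact_pct_change(b: float) -> float:
    """Exact percentage change implied by a coefficient b when the outcome is in natural logs."""
    return 100.0 * (math.exp(b) - 1.0)

b = 0.08  # hypothetical semi-elasticity
approx, exact = approx_pct_change(b), exact_pct_change(b)
print(approx, round(exact, 2))
```

The gap between the two grows with the size of the coefficient, so the exact form is safer for large effects.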

Answer 17

The slope coefficients measure the absolute change in the outcome variable for a relative change in the associated explanatory variable: a 1% increase in the explanatory variable changes the outcome by approximately (coefficient/100) units.

Answer 18

Elasticity is a ratio; the percentage change in one variable, divided by the percentage change in another. For example, the price elasticity of a product, where price is denoted P and the quantity produced of the product is Q is given by: price elasticity = %∆Q/%∆P

Answer 19

Create a hypothesis test: H0: A1 + A2 = 1 (constant returns to scale) H1: A1 + A2 ≠ 1 (not constant returns to scale). The standard error associated with (a1 + a2) cannot be obtained by simply adding the individual standard errors, because the estimation errors could be correlated. However, we can use the lincom command in Stata to calculate this standard error: lincom x2 + x3 - 1 (even though we use the variable names in the command, Stata tests the significance of the linear combination of their coefficient estimates; because lincom tests the combination against zero, subtracting 1 makes the test correspond to H0).
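
If the covariance between the two estimates is available (for example from the estimated variance-covariance matrix), the standard error of the sum follows from var(a1 + a2) = var(a1) + var(a2) + 2*cov(a1, a2). The sketch below applies that formula. All numbers are hypothetical and it is not a substitute for the Stata output.

```python
import math

def se_of_sum(se1: float, se2: float, cov12: float) -> float:
    """Standard error of a1 + a2 given each standard error and their covariance."""
    return math.sqrt(se1 ** 2 + se2 ** 2 + 2.0 * cov12)

def t_ratio_for_crs(a1: float, a2: float, se_sum: float) -> float:
    """t-ratio for H0: a1 + a2 = 1 (constant returns to scale)."""
    return (a1 + a2 - 1.0) / se_sum

# Hypothetical estimates, standard errors and covariance.
se_uncorrelated = se_of_sum(0.3, 0.4, 0.0)
se_correlated = se_of_sum(0.3, 0.4, -0.06)
t = t_ratio_for_crs(0.65, 0.45, se_correlated)
print(se_uncorrelated, round(se_correlated, 4), round(t, 3))
```

A negative covariance, as in the second call, makes the standard error of the sum smaller than it would be if the estimates were uncorrelated.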

Answer 20

Lincom command reports the standard errors and confidence intervals for a linear combination of coefficients. For example, lincom X2 + 3*X3 - 1

Answer 21

R^2 measures the overall goodness-of-fit of the estimated regression, or alternatively the proportion of the total variation in the outcomes, Yi, that is explained by the regressors. R^2 = (ESS/TSS) = 1 - (RSS/TSS). R^2 = 0 implies no fit. R^2 = 1 implies perfect fit (all variation in the outcome variable is explained by the regressors).

Answer 22

TSS tells you how well we can predict the outcomes without any regressors. Alternatively, TSS represents how much variation there is in the dependent variable. It is calculated by summing the squared differences between the actual/observed outcomes and the sample mean - in other words, the sum of the squared deviations from the sample mean. TSS = ∑(Yi - Y-bar)^2

Answer 23

ESS tells you how much of the variation in the outcome/dependent variable is explained by the regressors in the model. It is calculated by summing the squared differences between the predicted/fitted outcomes and the sample mean. ESS = ∑(Yi-hat - Y-bar)^2

Answer 24

RSS tells us how much of the variation in the outcome/dependent variable is not explained by the regressors in the model, and which is therefore captured in the error terms instead. In other words, RSS tells us everything we cannot predict. It is calculated by summing the squared residuals. RSS = ∑ei^2

Answer 25

TSS = ESS + RSS
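
The decomposition can be checked numerically. The sketch below fits a one-regressor OLS line (with an intercept) to a small made-up data set and computes the three sums of squares. The data values and function names are illustrative only.

```python
def ols_fit(x, y):
    """Intercept and slope of a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sums_of_squares(x, y):
    """Return (TSS, ESS, RSS) for the fitted simple regression."""
    intercept, slope = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ess = sum((fi - y_bar) ** 2 for fi in fitted)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return tss, ess, rss

# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
tss, ess, rss = sums_of_squares(x, y)
r_squared = ess / tss
print(tss, ess, rss, r_squared)
```

The identity holds because the regression includes an intercept. Without one, ESS and RSS need not add up to TSS.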

Answer 26

R^2 never decreases when regressors are added, even arbitrary ones (i.e. it suggests that the goodness of fit has improved, and that more variation is explained by the regressors). This makes it difficult to accurately compare the fit of different regression specifications with different numbers of regressors. To enable us to compare different regression specifications accurately, we must use the adjusted R^2, which adjusts for the number of regressors in a regression.

Answer 27

adjR^2 = 1 - [(1 - R^2)(n - 1)/(n - k)] where: k - number of coefficients, including the intercept n - number of observations

Answer 28

Adjusted R^2 can take both positive and negative values.
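
A short sketch shows how a negative value can arise. It assumes the convention that k counts all estimated coefficients including the intercept, so the penalty term is (n - 1)/(n - k). The input values are hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2, with k counting all coefficients including the intercept (assumed convention)."""
    if n <= k:
        raise ValueError("need more observations than coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)

# Hypothetical fits with n = 20 observations and k = 5 coefficients.
weak_fit = adjusted_r_squared(0.05, 20, 5)
strong_fit = adjusted_r_squared(0.90, 20, 5)
print(round(weak_fit, 4), round(strong_fit, 4))
```

When R^2 is low relative to the number of regressors, the penalty outweighs the fit and the adjusted value drops below zero.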

Answer 29

The models must have the same dependent variable. | For example, if two models use wage and ln(wage) respectively, then R^2 or adjusted R^2 cannot be used to compare them.

Answer 30

The F-statistic tests the null hypothesis that all of the slope coefficients (not including the intercept) are simultaneously equal to zero. Alternatively, the F-test compares a model with no predictors/regressors (called an intercept-only model) to the specified model. H0: the fit of the intercept-only model and the specified model are equal, meaning that none of the regressors explain the variation in the dependent variable. H1: the fit of the intercept-only model is significantly worse than that of the specified model, meaning that at least one of the regressors explains some of the variation in the outcome variable.

Answer 31

F = [ESS/(k - 1)] / [RSS/(n - k)] where: k - number of coefficients, including the intercept n - number of observations
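
The formula translates directly into code. Because ESS = R^2 * TSS and RSS = (1 - R^2) * TSS, the same statistic can also be written in terms of R^2. The sketch below shows both forms with hypothetical inputs and assumes k counts all coefficients including the intercept.

```python
def f_statistic(ess: float, rss: float, n: int, k: int) -> float:
    """Overall F-statistic from the explained and residual sums of squares."""
    return (ess / (k - 1)) / (rss / (n - k))

def f_from_r_squared(r2: float, n: int, k: int) -> float:
    """Equivalent form using R^2, since ESS = R^2 * TSS and RSS = (1 - R^2) * TSS."""
    return (r2 / (k - 1)) / ((1.0 - r2) / (n - k))

# Hypothetical values: ESS = 80, RSS = 20 (so R^2 = 0.8), n = 25, k = 5.
f_a = f_statistic(80.0, 20.0, 25, 5)
f_b = f_from_r_squared(0.8, 25, 5)
print(f_a, f_b)
```

The resulting value is compared with the critical value of an F distribution with (k - 1, n - k) degrees of freedom.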

Answer 32

Multicollinearity is when one of the explanatory variables in a multiple regression model is a linear combination of some of the other explanatory variables. In other words, variables are collinear if you can accurately predict one by knowing the values of the others (e.g. male/female are perfectly collinear; age/education are highly collinear).

Answer 33

If there is multicollinearity then the OLS estimators are still BLUE. However, the estimated impact of the regressors on the dependent variable is less accurate than if the regressors were uncorrelated because it is difficult to isolate the impact of each explanatory variable on the outcome variable. The standard errors of the affected coefficients tend to be large, which results in small t-ratios, so we are more likely to fail to reject a false null of no effect (Type 2 error). There will be wider confidence intervals, but R^2 may still be high, which is misleading. Small changes in the sample data set can lead to large changes in the estimated model.

Answer 34

For a regression with only one regressor, no multicollinearity means Xi has to take at least two different values in the data.

Answer 35

Perfect multicollinearity means that there is a linear relationship between two or more regressors, where one regressor is always equal to a linear combination of the others

Answer 36

Imperfect multicollinearity means that one independent variable (regressor) is a linear combination of other independent variables plus a small error term. In this case, distinguishing between the effects of different regressors is possible BUT may be very hard if they are strongly correlated.

Answer 37

Stata can detect perfect collinearity and will drop variables to remove this effect. However, Stata does not drop variables when there is imperfect multicollinearity.

Answer 38

1. Find data that provides more independent variation of the variables 2. Remove non-essential variables - for example by combining related variables.

Answer 39

To distinguish between m categories, we need to introduce (m-1) dummy variables in the regression if the regression model includes an intercept.

Answer 40

The dummy variable trap is where m dummy variables are included to distinguish between m categories in a model with an intercept, meaning that one of the dummy variables is redundant. This means the dummy variables, together with the intercept, are perfectly collinear. To solve this problem, one of the dummy variables needs to be removed - it does not matter which, as the category whose dummy is removed becomes the reference category.

Answer 41

An example of a dummy variable trap could be in the case of gender (male/female), dummy variables for both male AND female are introduced into the model. These are perfectly collinear since if you know the value of one variable, you can accurately predict the value of the other.
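
The trap can be seen numerically. In the sketch below (made-up data), the male and female dummies always sum to the intercept column, so the matrix X'X built from all three columns has a zero determinant and cannot be inverted. Dropping one dummy restores a non-zero determinant.

```python
def gram(columns):
    """X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Made-up sample of five individuals.
female = [1, 0, 1, 1, 0]
male = [1 - f for f in female]
intercept = [1] * len(female)

sums_to_intercept = all(m + f == c for m, f, c in zip(male, female, intercept))
det_trap = det3(gram([intercept, male, female]))   # both dummies plus intercept
det_fixed = det2(gram([intercept, female]))        # male dropped (reference category)
print(sums_to_intercept, det_trap, det_fixed)
```

Here male becomes the reference category. Dropping female instead would work equally well.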

Answer 42

Heteroscedasticity is where the variability of the error terms is unequal across the range of values of X (the predictor variable). var(u_i | X_i) = σ_i^2

Answer 43

OLS estimators remain consistent and unbiased. However, we get incorrect standard errors, which leads to incorrect t-tests and F-tests. Additionally, OLS is no longer efficient (it no longer has minimum variance).

Answer 44

1. Change the specification of the model, e.g. logarithmic transformation - this helps because variables that present heteroscedasticity tend to have higher levels of variation for higher values, therefore taking the log smooths this trend
2. Compute heteroscedasticity-robust standard errors (AKA Huber-White or HC standard errors)
3. Use Weighted Least Squares to estimate the regression

Answer 45

Heteroscedasticity-robust standard errors (aka Huber-White or HC standard errors) correct standard errors, t-statistics, p-values and confidence intervals to allow for heteroscedasticity WITHOUT CHANGING the OLS coefficients. HC standard errors can be calculated using Stata. It is advisable to always use robust standard errors.
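
To make the idea concrete, the sketch below computes the slope of a one-regressor OLS model with both the classical standard error and the simplest robust variant (HC0, with no finite-sample adjustment, so it may differ slightly from what a statistical package reports). The data are made up, and the spread of y grows with x.

```python
import math

def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC0 robust SE) for a simple OLS regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical: one common error variance, estimated by RSS / (n - 2).
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0: each observation contributes its own squared residual.
    hc0 = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, classical, hc0

# Made-up data for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 2.1, 2.7, 4.6, 4.2, 7.5]
slope, classical_se, robust_se = slope_standard_errors(x, y)
print(slope, classical_se, robust_se)
```

The slope is the same whichever standard error is reported. Only the measure of its precision changes.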

Answer 46

The disadvantage of HC standard errors is that you need large samples to estimate them accurately - though there is no precise rule.

Answer 47

Weighted Least Squares is another approach in regression analysis, like OLS, to estimate the unknown parameters in a regression model. WLS can be more efficient than OLS in the presence of heteroscedasticity, i.e. its estimates have lower variance. WLS weights each observation by the inverse of an estimate of its error variance, which reduces the influence of 'noisy' observations with large error variance. However, the true error variances are not known, so an estimate or an assumed form must be used. This approach is not widely used.

Answer 48

1. Inclusion of irrelevant variables/regressors in the regression model. This leads to:
- inferences become less precise BUT are not invalidated
- estimates remain unbiased and consistent
- estimates become inefficient (large variance and standard errors)
- coefficient estimates become less precise
- possible problems with multicollinearity
2. Omission of relevant variables (i.e. omitted variable bias). This relates to the exogeneity assumption in classical OLS:
- the exogeneity assumption no longer holds: E(u_i | X_i) ≠ 0
- in other words, the model is not 'correctly specified' and the error terms will be correlated with the regressors

Answer 49

The skewness of a normal distribution is zero.

Answer 50

Negative values for skewness indicate left skew (i.e. large/long left tail).

Answer 51

Positive values for skewness indicate right skew (i.e. large/long right tail)

Answer 52

Kurtosis is a measure of tail thickness. | The kurtosis value for a standard normal distribution is three.
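
These shape measures can be computed from the central moments of a sample. The sketch below uses the simple moment-based (population) definitions, with kurtosis reported on the scale where a normal distribution scores 3. Statistical packages may apply small-sample corrections, so their output can differ slightly. The data sets are made up.

```python
def skewness_and_kurtosis(data):
    """Moment-based skewness (m3 / m2^1.5) and kurtosis (m4 / m2^2)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

symmetric = skewness_and_kurtosis([1, 2, 3, 4, 5])
right_skewed = skewness_and_kurtosis([1, 1, 1, 2, 10])
left_skewed = skewness_and_kurtosis([-10, -2, -1, -1, -1])
print(symmetric, right_skewed, left_skewed)
```

The symmetric sample has zero skewness, the sample with one large value has positive skewness, and its mirror image has negative skewness.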

Answer 53

Normal distributions are continuous.

(77 cards)