Chapter 3: Multiple Regression Panel Data Flashcards
1
Q
Linear Regression
A
- Investigate the relationship between two variables
- Does blood pressure relate to age?
- Does income relate to education?
- Do sales relate to years of experience?
- Regressions identify relationships between dependent and independent variables
- Is there an association between the two variables
- Estimation of impact of an independent variable
- Used for numerical prediction and time series forecasting
- Regression as a fairly established statistical technique:
- Sir Francis Galton (1822-1911) studied the relationship between a father’s height and the son’s height
2
Q
The Simple Linear Regression Model
A
- Linear regression is a statistical tool for numerical prediction.
- The first-order linear model:
- y = beta0 + beta1 * x + epsilon
- y = dependent variable
- x = independent variable
- beta0 = y-intercept
- beta1 = slope of the line
- epsilon = error variable (residual)
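The first-order model above can be fitted by ordinary least squares in closed form. A minimal sketch with made-up data (the variable names and numbers are illustrative):

```python
# Fit y = beta0 + beta1*x + epsilon by ordinary least squares.
# slope = sample covariance(x, y) / sample variance(x); intercept from the means.

def fit_simple_ols(x, y):
    """Return (b0, b1), the least-squares intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Example: the data lie exactly on y = 2 + 3x, so OLS recovers the coefficients.
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)  # 2.0 3.0
```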
3
Q
Estimating the Coefficients
A
- The coefficient estimates are random variables (they vary from sample to sample)
- (Ordinary least squares) estimates are determined by
- drawing a sample from the population of interest,
- calculating sample statistics, and
- producing the straight line that best fits the data (minimizes the sum of squared errors).
4
Q
The Multiple Linear Regression Model
A
- A 𝑝 variable regression model can be expressed as a series of equations
- Condensed into matrix form, the equations give the general linear model
- b coefficients are known as partial regression coefficients
- 𝑋1,𝑋2, for example,
- 𝑋1=‘years of experience’
- 𝑋2=‘age’
- 𝑦=‘salary’
- Estimated equation:
- y-hat = b0 + b1 X1 + b2 X2
5
Q
Matrix Notation
A
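In matrix notation the model is y = Xβ + ε, where X is the n × (p + 1) design matrix whose first column is all ones (for the intercept). A minimal numpy sketch with made-up data, using the example predictors from the previous card:

```python
# Build the design matrix X for two predictors plus an intercept column.
# The numbers are illustrative, not real data.
import numpy as np

x1 = np.array([3.0, 5.0, 7.0])      # e.g. years of experience
x2 = np.array([30.0, 40.0, 50.0])   # e.g. age
y = np.array([50.0, 65.0, 80.0])    # e.g. salary

# n x (p+1) design matrix: leading column of ones for beta0
X = np.column_stack([np.ones_like(x1), x1, x2])
print(X.shape)  # (3, 3)
```

Each row of X is one observation; the model y = Xβ + ε then stacks all n regression equations into a single matrix equation.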
6
Q
OLS Estimation
A
- Sample-based counterpart to the population regression model:
- Population: y = Xβ + ε
- Sample: y = Xb + e
- OLS chooses the estimated coefficients b such that the error sum of squares (SSE) is as small as possible for the sample.
- SSE = eTe = (y − Xb)T (y − Xb)
- Need to differentiate with respect to the unknown coefficients.
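Differentiating SSE with respect to the coefficients and setting the derivative to zero yields the normal equations XᵀX b = Xᵀy. A minimal numpy sketch with made-up data:

```python
# Solve the OLS normal equations X'X b = X'y and compute the SSE.
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # intercept column plus one predictor
y = np.array([3.0, 5.0, 7.0, 9.0])  # lies exactly on y = 1 + 2x

b = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimate
e = y - X @ b                          # residual vector
sse = e @ e                            # error sum of squares
print(b, sse)  # [1. 2.], SSE ~ 0
```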
7
Q
Selected Statistics
A
- Adjusted R2 (penalizes additional parameters)
- It represents the proportion of variability of Y explained by the X's. R2 is adjusted so that models with different numbers of variables can be compared.
- The F-test (If any parameter has influence)
- Significant F indicates a linear relationship between Y and at least one of the X’s.
- The t-test of each partial regression coefficient (whether an individual predictor has influence)
- Significant t indicates that the variable in question influences the response variable while controlling for other explanatory variables.
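These statistics follow directly from an OLS fit. A sketch with made-up data (n = 4 observations, p = 1 predictor):

```python
# Compute R^2, adjusted R^2, and the overall F statistic from an OLS fit.
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 3.0, 5.0, 4.0])
n, p = X.shape[0], X.shape[1] - 1      # observations, predictors

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sse = e @ e                             # unexplained variation
sst = ((y - y.mean()) ** 2).sum()       # total variation
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra parameters
f_stat = ((sst - sse) / p) / (sse / (n - p - 1))
print(round(r2, 3), round(adj_r2, 3), round(f_stat, 3))  # 0.64 0.46 3.556
```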
8
Q
Gauss-Markov Assumptions
A
The OLS estimator is the best linear unbiased estimator (BLUE) if
- 1) There is a linear relationship among the predictors and 𝑦
- 2) Expected value of the residual vector is 0
- 3) There is no correlation between the 𝑖th and 𝑗th residual terms
- 4) The residuals follow a Gauss distribution and exhibit constant variance (homoscedasticity)
- 5) The covariance between the 𝑋’s and residual terms is 0
- Usually satisfied if the predictor variables are fixed and non-stochastic
- 6) No multicollinearity
- The rank of the data matrix X is p, the number of columns
- p < n, the number of observations
- No exact linear relationship among the X variables
- A basic check of multicollinearity is to calculate the correlation coefficient for each pair of predictor variables
- Large correlations (both positive and negative) indicate problems.
- One interpretation of large is: greater than the correlations between the predictors and the response
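The basic pairwise-correlation check can be sketched as follows; the data are invented so that the two predictors are nearly collinear:

```python
# Check multicollinearity via pairwise correlations among predictors,
# compared against each predictor's correlation with the response.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor 1
x2 = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # nearly 2 * x1: highly collinear
y = np.array([3.0, 5.0, 6.5, 9.0, 11.0])   # response

r_x1x2 = np.corrcoef(x1, x2)[0, 1]  # correlation between the two predictors
r_x1y = np.corrcoef(x1, y)[0, 1]    # correlation of x1 with the response

# "Large" in the sense above: predictors more correlated with each other
# than with the response
problem = abs(r_x1x2) > abs(r_x1y)
print(round(r_x1x2, 3), problem)  # 0.999 True
```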
9
Q
Heteroscedasticity
A
- When the requirement of a constant variance is violated we have heteroscedasticity.
- The Breusch-Pagan test or the White test can be used to check for heteroscedasticity.
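A sketch of the Breusch-Pagan idea (not a full implementation): regress the squared OLS residuals on the predictors; under homoscedasticity, n·R² of that auxiliary regression is asymptotically chi-square with p degrees of freedom. The data below are simulated with error spread growing in x:

```python
# Breusch-Pagan LM statistic on simulated heteroscedastic data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
# Heteroscedastic errors: the noise spread grows with x
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.2 * x)

b = np.linalg.solve(X.T @ X, X.T @ y)   # OLS fit
e2 = (y - X @ b) ** 2                   # squared residuals

# Auxiliary regression of e^2 on the same predictors
g = np.linalg.solve(X.T @ X, X.T @ e2)
resid = e2 - X @ g
r2_aux = 1.0 - (resid @ resid) / (((e2 - e2.mean()) ** 2).sum())
lm = n * r2_aux                         # Breusch-Pagan LM statistic
print(round(lm, 1))                     # compare with chi2(1) 5% critical value 3.841
```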
10
Q
Homoscedasticity
A
- When the requirement of a constant variance is not violated we have homoscedasticity.
11
Q
Outliers
A
- An outlier is an observation that is unusually small or large. Several possibilities need to be investigated when an outlier is observed:
- There was an error in recording the value.
- The point does not belong in the sample.
- The observation is valid.
- Identify outliers from the scatter diagram.
- There are also methods for “robust” regression.
12
Q
Modeling: Nominal Predictor Variables
A
- Binary variables are coded 0, 1.
- For example a variable 𝑋1 (Gender) is coded male = 0, female = 1.
- Then in the regression equation Y = β0 + β1X1 + β2X2, when X1 = 1 the value of Y indicates what is obtained for females;
- when X1 = 0 the value of Y indicates what is obtained for males.
- If we have a nominal variable with more than two categories, one can create a number of new dummy (also called indicator) binary variables, one fewer than the number of categories.
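Dummy coding for a multi-category nominal variable can be sketched as follows; the function name and data are illustrative, with one category held out as the reference level:

```python
# Turn a nominal variable with k categories into k-1 binary dummy columns.
def dummy_code(values, reference):
    """Return (levels, rows): one 0/1 column per non-reference category."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

colors = ["red", "green", "blue", "green"]
levels, rows = dummy_code(colors, reference="red")
print(levels)  # ['blue', 'green']
print(rows)    # [[0, 0], [0, 1], [1, 0], [0, 1]]
```

The reference category ("red" here) is represented by all dummies being 0, so its effect is absorbed into the intercept.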
13
Q
Model Comparisons
A
- Hundreds of predictor variables – what to do?
- Too many “irrelevant” attributes can negatively impact the performance of a model
- Our interest is in parsimonious modeling
- We seek a minimum set of 𝑋 variables to predict variation in 𝑌 response variable.
- Goal is to reduce the number of predictor variables to arrive at a more parsimonious description of the data.
- Does leaving out one of the β's significantly diminish the variance explained by the model?
- Compare a Saturated (full) to an Unsaturated model
- Note there are many possible Unsaturated models.
14
Q
"Stepwise" Linear Regression
A
- Considers all possible simple regressions.
- Starts with the variable with largest correlation with 𝑌
- Considers next the variable that makes the largest contribution to the regression’s sum of squares
- Tests significance of the contribution
- Checks that individual contributions of variables already in the equation are still significant
- Repeats until all possible additions are non-significant and all possible deletions are significant
- We will discuss attribute selection later in the course in more detail.
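The forward part of the procedure above can be sketched as a greedy loop. This is a simplified version: it compares SSE improvements against a made-up threshold instead of running the formal significance tests, and it omits the deletion step:

```python
# Forward stepwise selection: greedily add the predictor that most reduces SSE.
import numpy as np

def sse_of_fit(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return float(e @ e)

def forward_select(X_all, y, min_improvement=1e-6):
    n, k = X_all.shape
    selected = []
    current = np.ones((n, 1))           # start with the intercept-only model
    best_sse = sse_of_fit(current, y)
    while True:
        candidates = [j for j in range(k) if j not in selected]
        if not candidates:
            break
        # Try adding each remaining column; keep the one with the smallest SSE
        trials = [(sse_of_fit(np.column_stack([current, X_all[:, j]]), y), j)
                  for j in candidates]
        sse, j = min(trials)
        if best_sse - sse < min_improvement:
            break                        # no significant improvement: stop
        selected.append(j)
        current = np.column_stack([current, X_all[:, j]])
        best_sse = sse
    return selected

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])  # irrelevant noise column
y = 2.0 + 3.0 * x1                              # depends only on x1
print(forward_select(np.column_stack([x1, x2]), y))  # [0]
```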
15
Q
Applications of Linear Regressions to Time Series Data
A
Average hours worked per week by manufacturing workers: