Chapter 3: Multiple Regression Panel Data Flashcards by Andreas Hein

Linear Regression

Investigate the relationship between two variables
- Does blood pressure relate to age?
- Does income relate to education?
- Do sales relate to years of experience?
Regressions identify relationships between dependent and independent variables
- Is there an association between the two variables?
- Estimation of the impact of an independent variable
- Used for numerical prediction and time series forecasting
Regression is a well-established statistical technique:
- Sir Francis Galton (1822-1911) studied the relationship between a father’s height and his son’s height

The Simple Linear Regression Model

Linear regression is a statistical tool for numerical predictions.
The first-order linear model:
y = β₀ + β₁x + ε
- y = dependent variable
- x = independent variable
- beta₀ = y-intercept
- beta₁ = slope of the line
- epsilon = error variable (residual)

Estimating the Coefficients

The estimated coefficients are random variables because they depend on the sample drawn
(Ordinary least squares) estimates are determined by
- drawing a sample from the population of interest,
- calculating sample statistics, and
- producing the straight line that fits the data best, i.e. with the smallest sum of squared residuals.
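
The three steps above can be sketched in a few lines. This is a minimal illustration, not the course's own code: the sample values and the helper name `fit_simple_ols` are made up, and the formulas are the standard least-squares solution for one predictor (slope = S_xy / S_xx, intercept = ȳ − slope · x̄).

```python
# Hypothetical sample: years of experience (x) and sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample statistics: co-variation of x and y, and variation of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through the means
    return b0, b1


b0, b1 = fit_simple_ols(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

A different sample from the same population would give slightly different `b0` and `b1`, which is why the estimates are treated as random variables.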

The Multiple Linear Regression Model

A 𝑝-variable regression model can be expressed as a series of equations
Condensing the equations into matrix form gives the general linear model
The b coefficients are known as partial regression coefficients
For example, with two predictors 𝑋1 and 𝑋2:
- 𝑋1=‘years of experience’
- 𝑋2=‘age’
- 𝑦=‘salary’
Estimated equation:
- ŷ = b₀ + b₁𝑋1 + b₂𝑋2

Matrix Notation

OLS Estimation

Sample-based counterpart to the population regression model:
- Population: y = Xβ + ε
- Sample: y = Xb + e
OLS chooses the estimated coefficients b such that the error sum of squares (SSE) is as small as possible for the sample.
- SSE = e^T e = (y − Xb)^T (y − Xb)
To minimize, differentiate SSE with respect to the unknown coefficients b and set the derivative to zero.
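
Setting the derivative of SSE to zero gives the normal equations X^T X b = X^T y. The sketch below solves them with NumPy. The salary data are hypothetical and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: salary (y) explained by years of experience and age.
X_raw = np.array(
    [[1, 25], [3, 30], [5, 28], [7, 40], [9, 45], [11, 50]], dtype=float
)
y = np.array([30, 38, 45, 55, 62, 70], dtype=float)

# Data matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(len(y)), X_raw])

# Normal equations: (X^T X) b = X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)

e = y - X @ b          # residual vector
sse = float(e @ e)     # error sum of squares, e^T e

print("coefficients:", b)
print("SSE:", sse)
```

At the minimum the residuals are orthogonal to every column of X (X^T e = 0), which is a quick way to check the solution. In practice `np.linalg.lstsq` is usually the numerically safer choice.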

Selected Statistics

Adjusted R² (adding parameters never increases the fitting error, so plain R² cannot fairly compare models of different size)
- It represents the proportion of the variability of Y explained by the X’s. R² is adjusted so that models with different numbers of variables can be compared.
The F-test (does any parameter have influence?)
- A significant F indicates a linear relationship between Y and at least one of the X’s.
The t-test of each partial regression coefficient (does parameter X have influence?)
- A significant t indicates that the variable in question influences the response variable while controlling for the other explanatory variables.
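
As a rough sketch of how these three statistics are computed from an OLS fit, the function below uses the standard textbook formulas with p predictors and n observations. The simulated data and the name `regression_summary` are assumptions made for illustration. Turning F and t into p-values needs the F and t distributions, which are left out here.

```python
import numpy as np


def regression_summary(X, y):
    """X must contain the intercept column. Returns R2, adjusted R2, F, t."""
    n, k = X.shape
    p = k - 1                                    # number of predictors
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    f_stat = (r2 / p) / ((1.0 - r2) / (n - p - 1))
    sigma2 = sse / (n - p - 1)                   # estimated error variance
    t_stats = b / np.sqrt(sigma2 * np.diag(xtx_inv))
    return r2, adj_r2, f_stat, t_stats


# Simulated example: y depends on x1 but not on x2.
rng = np.random.default_rng(0)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

r2, adj_r2, f_stat, t_stats = regression_summary(X, y)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}, F = {f_stat:.1f}")
print("t statistics (intercept, x1, x2):", np.round(t_stats, 2))
```

The t statistic for `x1` should come out large in absolute value, while the one for the irrelevant `x2` should usually stay small.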

Gauss-Markov Assumptions

The OLS estimator is the best linear unbiased estimator (BLUE) if

1) There is a linear relationship between the predictors and 𝑦
2) The expected value of the residual vector is 0
3) There is no correlation between the 𝑖th and 𝑗th residual terms
4) The residuals exhibit constant variance (homoscedasticity); a Gaussian distribution of the residuals is additionally assumed for exact t- and F-tests, but is not required for BLUE
5) The covariance between the 𝑋’s and the residual terms is 0
- Usually satisfied if the predictor variables are fixed and non-stochastic
6) No multicollinearity
- The data matrix 𝑋 has full column rank: r(𝑋) = 𝑝 + 1, i.e. the 𝑝 predictor columns plus the intercept column
  - 𝑝 < 𝑛, the number of observations
  - No exact linear relationship among the 𝑋 variables
- A basic check of multicollinearity is to calculate the correlation coefficient for each pair of predictor variables
  - Large correlations (both positive and negative) indicate problems.
  - One interpretation of large is: greater than the correlations between the predictors and the response
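
The pairwise correlation check can be sketched as follows. The data are simulated, and reading "large" as "larger than both predictors' correlations with the response" is one possible interpretation of the rule of thumb above, not the only one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(size=n)

# Correlation matrix of all predictors and the response (last column).
corr = np.corrcoef(np.column_stack([x1, x2, x3, y]), rowvar=False)
pred_corr = corr[:3, :3]    # predictor vs. predictor
resp_corr = corr[:3, 3]     # predictor vs. response

# Flag predictor pairs whose correlation exceeds both of their
# correlations with the response (assumed reading of the rule of thumb).
flagged = [
    (i, j)
    for i in range(3)
    for j in range(i + 1, 3)
    if abs(pred_corr[i, j]) > max(abs(resp_corr[i]), abs(resp_corr[j]))
]
print("suspicious predictor pairs (0-based indices):", flagged)

# The rank condition guards against exact linear dependence only.
X = np.column_stack([np.ones(n), x1, x2, x3])
print("rank of X:", np.linalg.matrix_rank(X), "columns:", X.shape[1])
```

Note that `X` still has full column rank here: near-collinearity passes the rank condition but shows up in the correlation check.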

Heteroscedasticity

When the requirement of a constant error variance is violated, we have heteroscedasticity.
The Breusch-Pagan test or the White test can be used to check for heteroscedasticity.
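
To make the idea behind the Breusch-Pagan test concrete, the sketch below computes Koenker's studentized variant of the statistic by hand: regress the squared OLS residuals on the predictors and use LM = n · R² of that auxiliary regression. The data are simulated and the function names are illustrative. Under homoscedasticity, LM is approximately chi-square distributed with as many degrees of freedom as there are predictors.

```python
import numpy as np


def ols_residuals(X, y):
    """Residuals of the OLS fit of y on X (X includes the intercept)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ b


def breusch_pagan_lm(X, y):
    """LM = n * R^2 from regressing the squared residuals on X."""
    e2 = ols_residuals(X, y) ** 2
    aux_resid = ols_residuals(X, e2)
    r2 = 1.0 - (aux_resid @ aux_resid) / np.sum((e2 - e2.mean()) ** 2)
    return len(y) * r2


rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])

y_homo = 1.0 + 2.0 * x + rng.normal(size=n)          # constant variance
y_hetero = 1.0 + 2.0 * x + rng.normal(size=n) * x    # spread grows with x

lm_homo = breusch_pagan_lm(X, y_homo)
lm_hetero = breusch_pagan_lm(X, y_hetero)
print(f"LM (homoscedastic data):   {lm_homo:.2f}")
print(f"LM (heteroscedastic data): {lm_hetero:.2f}")
```

With one predictor, the 5% critical value of the chi-square distribution with 1 degree of freedom is about 3.84, so a value far above that points to heteroscedasticity.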

Homoscedasticity

When the error variance is constant across all observations, we have homoscedasticity.

Outliers

An outlier is an observation that is unusually small or large. Several possibilities need to be investigated when an outlier is observed:
- There was an error in recording the value.
- The point does not belong in the sample.
- The observation is valid.
Identify outliers from the scatter diagram.
There are also methods for “robust” regression.

Modeling: Nominal Predictor Variables

Binary variables are coded 0, 1.
For example, a variable 𝑋₁ (gender) is coded male = 0, female = 1.
- Then in the regression equation 𝑌 = 𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂, when 𝑋₁ = 1 the value of 𝑌 is the prediction for females;
- when 𝑋₁ = 0 the value of 𝑌 is the prediction for males.
If a nominal variable has more than two categories, one can create a number of new binary dummy (also called indicator) variables.
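
A minimal sketch of dummy coding for a nominal variable with more than two categories follows. The `region` variable and the helper `dummy_code` are hypothetical. One category serves as the reference and gets no column, so k categories yield k − 1 dummies.

```python
def dummy_code(values, reference):
    """Return (column_names, rows): one 0/1 column per non-reference category."""
    categories = sorted(set(values) - {reference})
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal variable with three categories.
regions = ["north", "south", "west", "south", "north"]
names, rows = dummy_code(regions, reference="north")

print(names)
for region, row in zip(regions, rows):
    print(region, row)
```

An observation in the reference category has zeros in every dummy column, so each dummy coefficient measures the difference to that reference category.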

Model Comparisons

Hundreds of predictor variables: what to do?
- Too many “irrelevant” attributes can negatively impact the performance of a model
Our interest is in parsimonious modeling
- We seek a minimum set of 𝑋 variables to predict the variation in the response variable 𝑌.
- The goal is to reduce the number of predictor variables to arrive at a more parsimonious description of the data.
Does leaving out one of the 𝛽’s significantly diminish the variance explained by the model?
- Compare a saturated (full) model to an unsaturated (reduced) model
- Note that there are many possible unsaturated models.
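
One common way to compare a full model with a reduced one is the partial F-test, sketched below on simulated data. The function names are illustrative. The statistic relates the increase in SSE caused by dropping q coefficients to the residual variance of the full model.

```python
import numpy as np


def sse(X, y):
    """Error sum of squares of the OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)


def partial_f(X_full, X_reduced, y):
    """F statistic for dropping the columns missing from X_reduced."""
    n, k_full = X_full.shape
    q = k_full - X_reduced.shape[1]          # number of dropped coefficients
    sse_full, sse_red = sse(X_full, y), sse(X_reduced, y)
    return ((sse_red - sse_full) / q) / (sse_full / (n - k_full))


# Simulated example: y depends on x1 only.
rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
ones = np.ones(n)

X_full = np.column_stack([ones, x1, x2])
f_drop_x2 = partial_f(X_full, np.column_stack([ones, x1]), y)
f_drop_x1 = partial_f(X_full, np.column_stack([ones, x2]), y)
print(f"F for dropping x2: {f_drop_x2:.2f}")
print(f"F for dropping x1: {f_drop_x1:.2f}")
```

A small F means the reduced model explains almost as much as the full one, so the dropped variable can go. A large F means leaving it out significantly diminishes the explained variance.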

“Stepwise” Linear Regression

Considers all possible simple regressions.
- Starts with the variable with largest correlation with 𝑌
Considers next the variable that makes the largest contribution to the regression’s sum of squares
- Tests significance of the contribution
- Checks that individual contributions of variables already in the equation are still significant
Repeats until all possible additions are non-significant and all possible deletions are significant
We will discuss attribute selection later in the course in more detail.
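
The procedure can be sketched as a loop that alternates an addition step and a deletion step. This is a simplified illustration on simulated data, not a reference implementation: the fixed thresholds `f_enter` and `f_remove` stand in for proper significance tests, and all names are made up. The first addition picks the variable with the largest correlation with 𝑌, because that variable gives the largest drop in SSE from the intercept-only model.

```python
import numpy as np


def sse(X, y):
    """Error sum of squares of the OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)


def design(X, cols):
    """Intercept column plus the selected predictor columns."""
    return np.column_stack([np.ones(len(X))] + [X[:, j] for j in cols])


def stepwise(X, y, f_enter=4.0, f_remove=3.9):
    n, p = X.shape
    selected = []
    for _ in range(4 * p):                  # hard cap so the loop terminates
        changed = False
        # Addition: candidate with the largest partial F above f_enter.
        base = sse(design(X, selected), y)
        best, best_f = None, f_enter
        for j in (j for j in range(p) if j not in selected):
            s = sse(design(X, selected + [j]), y)
            f = (base - s) / (s / (n - len(selected) - 2))
            if f > best_f:
                best, best_f = j, f
        if best is not None:
            selected.append(best)
            changed = True
        # Deletion: drop the weakest variable if it is no longer significant.
        if len(selected) > 1:
            full = sse(design(X, selected), y)
            df = n - len(selected) - 1
            f_out = {
                j: (sse(design(X, [k for k in selected if k != j]), y) - full)
                / (full / df)
                for j in selected
            }
            worst = min(f_out, key=f_out.get)
            if f_out[worst] < f_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)


# Simulated data: only predictors 0 and 3 truly influence y.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)
chosen = stepwise(X, y)
print("selected predictors:", chosen)
```

Because each step is a significance decision, an irrelevant predictor can occasionally slip in by chance, which is one reason attribute selection deserves the more detailed treatment mentioned above.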

Applications of Linear Regressions to Time Series Data

Example: average hours worked per week by manufacturing workers (figure not reproduced).

Autocorrelation

3rd Gauss-Markov assumption: There is no correlation between the 𝑖th and 𝑗th residual terms
When the residuals are examined over time, no pattern should be observed if the errors are independent.
Autocorrelation can be detected by graphing the residuals against time, or with the Durbin-Watson statistic

… the statistical procedures used for regression may no longer be applicable

Detection

Use the Durbin-Watson statistic to test for first-order autocorrelation. D-W takes values within [0, 4]. If there is no serial correlation, a value close to 2 (e.g., 1.5-2.5) is expected.
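
The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The short sketch below computes it for two made-up residual series, chosen so that the result is easy to verify by hand.

```python
def durbin_watson(residuals):
    """d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, always within [0, 4]."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals that flip sign every period (negative autocorrelation).
alternating = [1.0, -1.0] * 10
# Hypothetical residuals that keep their sign for long runs (positive autocorrelation).
persistent = [1.0] * 10 + [-1.0] * 10

dw_alternating = durbin_watson(alternating)   # 19 jumps of size 2: 76 / 20
dw_persistent = durbin_watson(persistent)     # a single jump of size 2: 4 / 20
print(dw_alternating, dw_persistent)
```

Values near 0 indicate positive first-order autocorrelation and values near 4 indicate negative autocorrelation. Both example series fall well outside the 1.5-2.5 band.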

Modeling Seasonality

This uses a regression to estimate both the trend and additive seasonal indexes
- 1 Create dummy variables which indicate the season
- 2 Regress on time and the seasonal variables
- 3 Use the multiple regression model to forecast
For any season, e.g. season 1, create a column with 1 for time periods which are season 1, and 0 for all other time periods
- Only 𝑠 − 1 dummy variables are required for 𝑠 seasons
The model which is fitted (assuming quarterly data) is
- 𝑦_𝑡 = 𝛽₀ + 𝛽₁𝑡 + 𝛽₂𝑄₁ + 𝛽₃𝑄₂ + 𝛽₄𝑄₃
This is an additive model
It allows testing for seasonality
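
The three steps can be sketched for quarterly data as follows. The series is simulated with a known trend and known seasonal effects so that the fitted coefficients can be compared with the truth. All numbers and names are assumptions for illustration.

```python
import numpy as np

n_years = 6
t = np.arange(1, 4 * n_years + 1, dtype=float)     # time index 1..24
quarter = (np.arange(4 * n_years) % 4) + 1         # 1, 2, 3, 4, 1, ...

# Step 1: dummy variables for quarters 1-3 (quarter 4 is the baseline).
Q = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3)])

# Hypothetical series: linear trend + additive seasonal effect + noise.
rng = np.random.default_rng(5)
season_effect = np.array([5.0, -2.0, 3.0, 0.0])
y = 10.0 + 0.5 * t + season_effect[quarter - 1] + rng.normal(scale=0.1, size=t.size)

# Step 2: regress on time and the seasonal dummies.
X = np.column_stack([np.ones_like(t), t, Q])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta0, beta1, beta2, beta3, beta4:", np.round(beta, 2))

# Step 3: forecast the next period (t = 25, which is a first quarter).
x_next = np.array([1.0, 25.0, 1.0, 0.0, 0.0])
forecast = float(x_next @ beta)
print(f"forecast for t = 25: {forecast:.2f}")
```

Each dummy coefficient is the seasonal offset relative to the baseline quarter. Testing for seasonality then amounts to testing whether these coefficients are jointly zero.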

Dummy Variables

Testing for Structural Breaks

There are various tests for comparing regression models, such as encompassing tests or the Chow test
- The Chow test is an econometric test of whether the coefficients in two linear regressions on different data sets are equal, i.e., it tests for structural breaks.
The null hypothesis of the Chow test asserts that the coefficients of both models 1 and 2 are the same as those of the combined model 𝐶.
- 𝑛₁ and 𝑛₂ are the numbers of observations in the two groups and 𝑝 is the number of parameters.
- The test statistic follows an 𝐹-distribution with 𝑝 and 𝑛₁ + 𝑛₂ − 2𝑝 degrees of freedom.
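
The test statistic itself is not shown above. In its standard form it is F = [(S_C − (S_1 + S_2)) / 𝑝] / [(S_1 + S_2) / (𝑛₁ + 𝑛₂ − 2𝑝)], where S_C, S_1 and S_2 are the error sums of squares of the combined model and of the two group models. The sketch below applies it to simulated data with and without a break. The names are illustrative.

```python
import numpy as np


def sse(X, y):
    """Error sum of squares of the OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)


def chow_f(X1, y1, X2, y2):
    """Chow statistic for equal coefficients in groups 1 and 2."""
    p = X1.shape[1]                       # parameters incl. intercept
    n1, n2 = len(y1), len(y2)
    s_c = sse(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    s_1, s_2 = sse(X1, y1), sse(X2, y2)
    return ((s_c - (s_1 + s_2)) / p) / ((s_1 + s_2) / (n1 + n2 - 2 * p))


rng = np.random.default_rng(6)
n = 50
x1, x2 = rng.uniform(0, 10, size=n), rng.uniform(0, 10, size=n)
X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x2])

y1 = 1.0 + 1.0 * x1 + rng.normal(size=n)
y2_same = 1.0 + 1.0 * x2 + rng.normal(size=n)     # same coefficients
y2_break = 1.0 + 3.0 * x2 + rng.normal(size=n)    # slope changes: a break

f_same = chow_f(X1, y1, X2, y2_same)
f_break = chow_f(X1, y1, X2, y2_break)
print(f"F without break: {f_same:.2f}")
print(f"F with break:    {f_break:.2f}")
```

With 𝑝 = 2 and 𝑛₁ + 𝑛₂ − 2𝑝 = 96 degrees of freedom, the 5% critical value is roughly 3.1, so the series with the changed slope should be rejected clearly.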

Panel Data vs. Cross-Section Data

Cross-section data are data that observe many subjects (such as individuals, firms or countries/regions) at the same point in time, or without regard to differences in time. (Many independent subjects)
Panel data sets have several advantages over cross-section data sets:
- They may make it possible to overcome a problem of bias caused by unobserved heterogeneity. (The same objects are observed multiple times)

Analyzing Panel Data

A panel data set, or longitudinal data set, is one where there are repeated observations on the same units.
The units may be individuals, households, enterprises, countries, or any set of entities that remain stable through time.
- The US National Longitudinal Survey of Youth (NLSY) is an example. Since 1994 respondents have been interviewed every two years.
- Also sales data regularly contain observations about the same individuals.
A balanced panel is one where every unit is surveyed in every time period. The NLSY is unbalanced because some individuals have not been interviewed in some years. Some could not be located, some refused, and a few have died.

The Panel Data Structure

Omitted Variable Bias in Panel Data

Endogeneity arises when an independent variable is correlated with the error term, i.e. their covariance is nonzero
- In contrast, the Gauss-Markov assumptions state that the error term is uncorrelated with the regressors.
- One reason for endogeneity might be that relevant variables are omitted from the model (underfitting). Such an omitted variable is also called a confounding variable.
  - For example, the enthusiasm or the willingness to take risks of an individual in the panel.
Various techniques have been developed to address endogeneity, including fixed-effects models, propensity score matching, and instrumental variables.

Treatment of Individual Effects

There are two main options for the treatment of individual effects in panel data:
- Fixed effects – assume the λ_i are constants, one for each individual (appropriate when there is endogeneity)
- Random effects – assume the λ_i are drawn independently from some probability distribution
Statistical tests such as the Hausman test (are random effects appropriate?) can help decide on one or the other.
Specific packages in R (e.g., plm) are available for random effects models and for mixed effects models, which combine both.
- Fixed effects can be modeled by including a dummy for each individual.

The Fixed Effect Model

Treat λ_i as a constant for each individual:
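
The card's formula is not reproduced in this text. A common way to write the fixed effects model, assumed for this sketch, is y_it = λ_i + β x_it + ε_it. The simulation below compares three estimates of β: pooled OLS that ignores λ_i, a regression with one dummy per individual, and the equivalent "within" estimator that subtracts each individual's mean. All data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n_units, n_periods = 30, 5
unit = np.repeat(np.arange(n_units), n_periods)      # individual index i

lam = rng.normal(scale=2.0, size=n_units)            # individual effects
x = lam[unit] + rng.normal(size=unit.size)           # x is correlated with the effect
y = lam[unit] + 1.5 * x + rng.normal(scale=0.5, size=unit.size)

# Pooled OLS ignoring the individual effects (biased by the omitted effect).
X_pooled = np.column_stack([np.ones_like(x), x])
b_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][1]

# Dummy-variable regression: one 0/1 column per individual, no extra intercept.
D = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
b_lsdv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]


def demean(v):
    """Subtract each individual's own mean (the within transformation)."""
    means = np.bincount(unit, weights=v) / np.bincount(unit)
    return v - means[unit]


xd, yd = demean(x), demean(y)
b_within = float(xd @ yd) / float(xd @ xd)

print(f"pooled OLS slope: {b_pooled:.3f}")
print(f"dummy regression: {b_lsdv:.3f}")
print(f"within estimator: {b_within:.3f}")
```

The dummy regression and the within estimator give the same slope, close to the true 1.5, while pooled OLS is pulled upward because x is correlated with the omitted individual effect.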

Fixed Effects Models (Linear) (Graphic)

Graphic

Random Effects Model

- The random effects assumption (made in a random effects model) is that the individual-specific effects are uncorrelated with the independent variables. The unobserved heterogeneity is orthogonal to the other covariates.
- The fixed effects assumption is that the individual-specific effect is correlated with the independent variables.

Other Phenomena Caused by Omitted Variables

Consider the acceptance rates for the following groups of men and women who applied to college. A higher percentage of men were accepted: Is there evidence of discrimination? Explanations?
- Omitted variable: applications were split between Computer Science (240) and the School of Management (320).
- Within each school a higher percentage of women were accepted than men.
- There is no discrimination against women.
- This is an example of Simpson’s paradox.
- When the omitted variable (type of school) is ignored, the data seem to suggest discrimination against women.
- However, when the type of school is considered, the association is reversed and suggests discrimination against men.

Simpson’s Paradox

- When studying the relationship between two variables, there may exist a confounding variable: the direction of the relationship when this lurking variable is ignored is the reverse of its direction when the lurking variable is considered.
- The confounding (or lurking) variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables.
- An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.
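
A small numeric sketch makes the reversal concrete. The admission counts below are invented for illustration: only the school totals of 240 and 320 applicants come from the example above, and the split by gender and the number accepted are assumptions.

```python
# Hypothetical counts: (applicants, accepted) per school and gender.
# Only the school totals (240 and 320 applicants) come from the example.
admissions = {
    "Computer Science": {"men": (200, 140), "women": (40, 30)},
    "School of Management": {"men": (60, 12), "women": (260, 65)},
}


def rate(applied, accepted):
    return accepted / applied


# Acceptance rates within each school.
within = {
    school: {g: rate(*counts) for g, counts in groups.items()}
    for school, groups in admissions.items()
}

# Acceptance rates when the type of school is ignored.
overall = {}
for gender in ("men", "women"):
    applied = sum(groups[gender][0] for groups in admissions.values())
    accepted = sum(groups[gender][1] for groups in admissions.values())
    overall[gender] = rate(applied, accepted)

for school, rates in within.items():
    print(f"{school}: men {rates['men']:.1%}, women {rates['women']:.1%}")
print(f"Overall: men {overall['men']:.1%}, women {overall['women']:.1%}")
```

In these made-up numbers women have the higher acceptance rate in both schools, yet men have the higher rate overall, because most women applied to the school that accepts fewer applicants.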
