Chapter 3: Multiple Regression Panel Data Flashcards Preview


Flashcards in Chapter 3: Multiple Regression Panel Data Deck (29):

Linear Regression 

  • Investigate the relationship between two variables
    • Does blood pressure relate to age?
    • Does income relate to education?
    • Do sales relate to years of experience?
  • Regressions identify relationships between dependent and independent variables
    • Is there an association between the two variables?
    • Estimation of impact of an independent variable
    • Used for numerical prediction and time series forecasting
  • Regression as a fairly established statistical technique:
    • Sir Francis Galton (1822-1911) studied the relationship between a father’s height and the son’s height 


The Simple Linear Regression Model 

  • Linear regression is a statistical tool for numerical prediction.
  • The first-order linear model: y = beta0 + beta1·x + epsilon
    • y = dependent variable
    • x = independent variable
    • beta0 = y-intercept
    • beta1 = slope of the line
    • epsilon = error variable (residual) 


Estimating the Coefficients 

  • Estimated coefficients are random variables: their values vary from sample to sample
  • (Ordinary least squares) estimates are determined by
    • drawing a sample from the population of interest,
    • calculating sample statistics, and
    • producing a straight line that cuts into the data. 
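
The estimation steps above can be sketched for the simple model y = beta0 + beta1·x + epsilon. This is a minimal illustration, not a reference implementation: the function name `fit_simple_ols` and the sample values are made up for the example. The slope is the sample covariance of x and y divided by the sample variance of x, and the intercept follows from the fact that the least-squares line passes through the point of means.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the sum of squared errors of y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Illustrative (made-up) sample: years of experience vs. sales.
experience = [1, 2, 3, 4, 5]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_simple_ols(experience, sales)
print(b0, b1)  # prints 1.0 2.0
```

Because the made-up sample lies exactly on a line, the estimates reproduce it; with real data a different sample would give different estimates.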


The Multiple Linear Regression Model 

  • A 𝑝-variable regression model can be expressed as a series of equations
  • Condensing these equations into matrix form gives the general linear model
  • The b coefficients are known as partial regression coefficients
  • For example, with two predictors 𝑋1 and 𝑋2:
    • 𝑋1=‘years of experience’
    • 𝑋2=‘age’
    • 𝑦=‘salary’
  • Estimated equation: 


Matrix Notation 


OLS Estimation 

  • Sample-based counterpart to the population regression model:
    • Population: y = Xβ + ε
    • Sample: y = Xb + e
  • OLS chooses the values of the estimated coefficients b such that the error sum of squares (SSE) is as small as possible for the sample.
    • SSE = eᵀe = (y − Xb)ᵀ(y − Xb)
  • To minimize SSE, differentiate it with respect to the unknown coefficients and set the derivative to zero. This yields the normal equations XᵀXb = Xᵀy.
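
Setting the derivative of the SSE to zero leads to the normal equations (XᵀX)b = Xᵀy. The sketch below solves them with plain Gauss-Jordan elimination, assuming X has full column rank. The helper names (`solve_linear_system`, `ols`, `sse`) and the salary data are hypothetical and only serve the illustration; in practice a numerical library would be used instead.

```python
def solve_linear_system(a, rhs):
    """Solve a·z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix [a | rhs], copied so the inputs stay untouched.
    m = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Return b solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def sse(X, y, b):
    """Error sum of squares e'e for a coefficient vector b."""
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(X, y))


# Made-up data: columns are intercept, years of experience, age; y is salary.
X = [[1, 1, 25], [1, 3, 30], [1, 5, 28], [1, 7, 40], [1, 9, 45], [1, 4, 33]]
y = [30, 36, 41, 50, 58, 40]
b = ols(X, y)
print(b, sse(X, y, b))
```

Any other coefficient vector gives a larger SSE on this sample, and the residuals are orthogonal to every column of X, which is exactly what the normal equations state.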


Selected Statistics 

  • Adjusted R2 (additional parameters never increase the error on the sample, hence the adjustment)
    • It represents the proportion of variability of Y explained by the X’s. R2 is adjusted so that models with different numbers of variables can be compared.
  • The F-test (does any parameter have influence?)
    • Significant F indicates a linear relationship between Y and at least one of the X’s.
  • The t-test of each partial regression coefficient 𝛽𝑖 (does predictor 𝑋𝑖 have influence?)
    • Significant t indicates that the variable in question influences the response variable while controlling for other explanatory variables. 
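
The first two statistics can be computed directly from R2. The sketch below uses the standard textbook formulas, assuming n observations and p predictors (not counting the intercept). The function names and the numbers are illustrative only.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def overall_f_statistic(r2, n, p):
    """F statistic for H0: all p slope coefficients are zero."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# Same R2 and sample size, but a different number of predictors.
small_model = adjusted_r_squared(0.80, n=30, p=3)
large_model = adjusted_r_squared(0.80, n=30, p=10)
print(small_model, large_model)
print(overall_f_statistic(0.80, n=30, p=3))
```

With identical R2, the model with ten predictors receives the lower adjusted R2, which is the penalty that makes models of different size comparable. The F value would be compared against an F-distribution with p and n − p − 1 degrees of freedom.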


Gauss-Markov Assumptions

The OLS estimator is the best linear unbiased estimator (BLUE) if

  • 1) There is a linear relationship among the predictors and 𝑦
  • 2) Expected value of the residual vector is 0
  • 3) There is no correlation between the 𝑖th and 𝑗th residual terms
  • 4) The residuals exhibit constant variance (homoscedasticity); a Gauss (normal) distribution of the residuals is additionally assumed for exact t- and F-tests
  • 5) The covariance between the 𝑋’s and residual terms is 0
    • Usually satisfied if the predictor variables are fixed and non-stochastic
  • 6) No multicollinearity
    • The data matrix 𝑋 has full column rank, i.e., its rank equals its number of columns (𝑝 + 1 for 𝑝 predictors plus an intercept column)
    • The number of columns is smaller than 𝑛, the number of observations
    • No exact linear relationship among the 𝑋 variables
    • A basic check of multicollinearity is to calculate the correlation coefficient for each pair of predictor variables
      • Large correlations (both positive and negative) indicate problems.
      • One interpretation of large is: greater than the correlations between the predictors and the response
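
The pairwise correlation check can be sketched as follows. One assumption is made explicit here: a pair of predictors is flagged when the absolute correlation between them exceeds the larger of their two absolute correlations with the response, which is only one way to read the rule of thumb. The function names and data are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    sab = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    saa = sum((x - a_mean) ** 2 for x in a)
    sbb = sum((y - b_mean) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def flag_collinear_pairs(predictors, response):
    """Return (name_a, name_b, r) for predictor pairs with suspiciously large |r|."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r_ab = pearson(a, b)
        # Assumed threshold: the stronger of the two predictor-response correlations.
        threshold = max(abs(pearson(a, response)), abs(pearson(b, response)))
        if abs(r_ab) > threshold:
            flagged.append((name_a, name_b, r_ab))
    return flagged


# Made-up data: x2 is almost exactly twice x1, x3 is unrelated to both.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],
    "x3": [3, 1, 4, 1, 5, 2],
}
response = [5, 3, 8, 6, 9, 7]
flagged = flag_collinear_pairs(predictors, response)
print(flagged)
```

Only the pair x1 and x2 is flagged for this data. Note that pairwise correlations cannot reveal a linear dependence that involves three or more predictors at once.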



  • When the requirement of a constant variance is violated we have heteroscedasticity.

  • The Breusch-Pagan test or the White test is used to check for heteroscedasticity. 



  • When the requirement of a constant variance is not violated we have homoscedasticity. 



  • An outlier is an observation that is unusually small or large. Several possibilities need to be investigated when an outlier is observed:
    • There was an error in recording the value.
    • The point does not belong in the sample.
    • The observation is valid.
  • Identify outliers from the scatter diagram.
  • There are also methods for “robust” regression. 


Modeling: Nominal Predictor Variables 

  • Binary variables are coded 0, 1.
  • For example, a variable 𝑋1 (gender) is coded male = 0, female = 1.
    • Then in the regression equation 𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2, when 𝑋1 = 1 the value of 𝑌 indicates what is obtained for females;
    • when 𝑋1 = 0 the value of 𝑌 indicates what is obtained for males.
  • If we have a nominal variable with more than two categories, we can create a number of new binary dummy (also called indicator) variables.
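
A small sketch of dummy coding for a nominal variable with more than two categories. It assumes the common convention of k − 1 indicator columns for k categories, with the omitted category serving as the reference level. The function name `dummy_code` and the region data are hypothetical.

```python
def dummy_code(values, reference=None):
    """Return (levels, rows): k-1 indicator columns for k categories."""
    categories = sorted(set(values))
    if reference is None:
        # Assumption: the alphabetically first category is the reference.
        reference = categories[0]
    levels = [c for c in categories if c != reference]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


regions = ["north", "south", "west", "south", "north"]
levels, rows = dummy_code(regions)
print(levels)  # prints ['south', 'west']
print(rows)    # prints [[0, 0], [1, 0], [0, 1], [1, 0], [0, 0]]
```

Observations in the reference category ("north" here) have zeros in every dummy column, so each dummy coefficient measures the difference to that category.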


Model Comparisons 

  • Hundreds of predictor variables – what to do?
    • Too many “irrelevant” attributes can negatively impact the performance of a model
  • Our interest is in parsimonious modeling
    • We seek a minimum set of 𝑋 variables to predict variation in 𝑌 response variable.
    • Goal is to reduce the number of predictor variables to arrive at a more parsimonious description of the data.
  • Does leaving out one of the 𝛽’s significantly diminish the variance explained by the model?
    • Compare a Saturated (full) to an Unsaturated model
    • Note there are many possible Unsaturated models. 


“Stepwise” Linear Regression 

  • Considers all possible simple regressions.
    • Starts with the variable with largest correlation with 𝑌
  • Considers next the variable that makes the largest contribution to the regression’s sum of squares
    • Tests significance of the contribution
    • Checks that individual contributions of variables already in the equation are still significant
  • Repeats until all possible additions are non-significant and all possible deletions are significant
  • We will discuss attribute selection later in the course in more detail. 


Applications of Linear Regressions to Time Series Data 

Average hours worked per week by manufacturing workers: 




  • 3rd Gauss-Markov assumption: There is no correlation between the 𝑖th and 𝑗th residual terms
  • Examining the residuals over time, no pattern should be observed if the errors are independent.
  • Autocorrelation can be detected by graphing the residuals against time or with the Durbin-Watson statistic 

... the statistical procedures used for regression may no longer be applicable


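  • Use the Durbin-Watson statistic to test for first-order autocorrelation. D-W takes values within [0, 4]. For no serial correlation, a value close to 2 (e.g., 1.5-2.5) is expected. Values well below 2 point to positive autocorrelation, values well above 2 to negative autocorrelation. 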
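
The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, with made-up residual series chosen to show the two extremes:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that flip sign every period: negative autocorrelation.
alternating = durbin_watson([1, -1, 1, -1, 1, -1])
# Residuals that stay on one side for a long run: positive autocorrelation.
persistent = durbin_watson([1, 1, 1, -1, -1, -1])
print(alternating, persistent)
```

The alternating series gives a value above 2 (about 3.33) and the persistent series a value below 2 (about 0.67), both outside the 1.5-2.5 band mentioned above.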


Modeling Seasonality 

  • This uses a regression to estimate both the trend and additive seasonal indexes

    • 1 Create dummy variables which indicate the season

    • 2 Regress on time and the seasonal variables

    • 3 Use the multiple regression model to forecast

  • For any season, e.g. season 1, create a column with 1 for time periods which are season 1, and zero for other time periods (only 𝑠 – 1 dummy variables are required) 

  • The model which is fitted (assuming quarterly data) is

    • 𝑦𝑡 = 𝛽0 + 𝛽1𝑡 + 𝛽2𝑄1 +𝛽3𝑄2 +𝛽4𝑄3

  • This is an additive model

  • Allows testing for seasonality 
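
The quarterly model 𝑦𝑡 = 𝛽0 + 𝛽1𝑡 + 𝛽2𝑄1 + 𝛽3𝑄2 + 𝛽4𝑄3 can be sketched as follows. The coefficient values are hypothetical placeholders (in practice they come from the regression in step 2), and the sketch assumes that period t = 1 falls in quarter 1 and that quarter 4 is the reference season without a dummy.

```python
def quarter_of(t):
    """Quarter (1-4) of period t, assuming t = 1 is a first quarter."""
    return (t - 1) % 4 + 1


def design_row(t):
    """Regressors [1, t, Q1, Q2, Q3]; quarter 4 is the reference season."""
    q = quarter_of(t)
    return [1, t] + [1 if q == season else 0 for season in (1, 2, 3)]


def forecast(beta, t):
    """Trend plus additive seasonal index for period t."""
    return sum(b * x for b, x in zip(beta, design_row(t)))


# Hypothetical fitted coefficients: beta0, beta1 (trend), beta2..beta4 (Q1..Q3).
beta = [100.0, 2.0, -5.0, 3.0, 8.0]
print(design_row(9))       # prints [1, 9, 1, 0, 0]
print(forecast(beta, 9))   # prints 113.0
print(forecast(beta, 12))  # prints 124.0
```

Each seasonal coefficient is the additive shift of that quarter relative to quarter 4. Testing for seasonality then amounts to testing whether the dummy coefficients differ significantly from zero.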


Dummy Variables 


Testing for Structural Breaks 

  • Various tests, such as encompassing tests or the Chow test, allow regression models to be compared
    • The Chow test is an econometric test of whether the coefficients in two linear regressions on different data sets are equal, i.e., it tests for structural breaks.
  • The null hypothesis of the Chow test asserts that the coefficients of both models 1 and 2 are the same as those of the combined model 𝐶.
    • 𝑛 is the number of observations in a group and 𝑝 the number of parameters.
    • The test statistic follows an 𝐹-distribution with 𝑝 and 𝑛1 + 𝑛2 − 2𝑝 degrees of freedom. 
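
The Chow statistic compares the error sum of squares of the combined model 𝐶 with the summed error sums of squares of the two separate models. The sketch below uses the standard form of the statistic; the function name and the SSE values are made up for illustration.

```python
def chow_statistic(sse_combined, sse_1, sse_2, n1, n2, p):
    """Chow F statistic with p and n1 + n2 - 2p degrees of freedom."""
    sse_separate = sse_1 + sse_2
    # Extra error caused by forcing both groups onto one set of coefficients.
    numerator = (sse_combined - sse_separate) / p
    denominator = sse_separate / (n1 + n2 - 2 * p)
    return numerator / denominator


# Hypothetical values: two groups of 30 observations, 3 parameters each.
f_value = chow_statistic(sse_combined=120.0, sse_1=40.0, sse_2=50.0,
                         n1=30, n2=30, p=3)
print(f_value)
```

A large value, judged against the critical value of the 𝐹-distribution with 𝑝 and 𝑛1 + 𝑛2 − 2𝑝 degrees of freedom, leads to rejecting the null hypothesis of equal coefficients, i.e., it indicates a structural break.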


Panel Data vs. Cross-Section Data 

  • Cross-section data refers to data observing many subjects (such as individuals, firms or countries/regions) at the same point of time, or without regard to differences in time. (Many subjects / independent)
  • Panel data sets have several advantages over cross-section data sets:
    • They may make it possible to overcome a problem of bias caused by unobserved heterogeneity. (Observe Objects multiple times)


Analyzing Panel Data 

  • A panel data set, or longitudinal data set, is one where there are repeated observations on the same units.
  • The units may be individuals, households, enterprises, countries, or any set of entities that remain stable through time.
    • The US National Longitudinal Survey of Youth (NLSY) is an example. Since 1994 respondents have been interviewed every two years.
    • Also sales data regularly contain observations about the same individuals.
  • A balanced panel is one where every unit is surveyed in every time period. The NLSY is unbalanced because some individuals have not been interviewed in some years. Some could not be located, some refused, and a few have died. 


The Panel Data Structure 


Omitted Variable Bias in Panel Data 

  • Endogeneity arises when an independent variable is correlated with the error term, i.e., their covariance is nonzero
    • In contrast, the Gauss-Markov assumptions state that the error term is uncorrelated with the regressors.
    • One reason for endogeneity might be that relevant variables are omitted from the model (underfitting). Such an omitted variable is also called a confounding variable.
      • For example, enthusiasm or willingness to take risks of an individual in the panel.
  • Various techniques have been developed to address endogeneity, including fixed-effects models, propensity score matching, and instrumental variables. 


Treatment of Individual Effects 

  • There are two main options for the treatment of individual effects in panel data:
    • Fixed effects – assume λi are constants for each individual (they may be correlated with the regressors, i.e., there is endogeneity)
    • Random effects – assume λi are drawn independently from some probability distribution
  • Statistical tests such as the Hausman test (are the individual effects uncorrelated with the regressors, so that random effects are appropriate?) can help decide on one or the other.
  • Specific packages in R are available for random effects models and for mixed effects models, which combine both (e.g., plm).
    • Fixed effects can be modeled by including a dummy for each individual. 
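
For a single regressor, including one dummy per individual gives the same slope as the "within" transformation: subtract each individual's own mean from x and y, then regress the demeaned values. The sketch below illustrates this with a made-up panel of two individuals whose constant effects differ. The function name `within_slope` is hypothetical.

```python
from collections import defaultdict


def within_slope(ids, x, y):
    """Slope after removing each individual's mean from x and y."""
    groups = defaultdict(list)
    for unit, xi, yi in zip(ids, x, y):
        groups[unit].append((xi, yi))
    sxy = sxx = 0.0
    for observations in groups.values():
        x_mean = sum(o[0] for o in observations) / len(observations)
        y_mean = sum(o[1] for o in observations) / len(observations)
        for xi, yi in observations:
            sxy += (xi - x_mean) * (yi - y_mean)
            sxx += (xi - x_mean) ** 2
    return sxy / sxx


# Made-up panel: y = effect_i + 2*x with effect_A = 10 and effect_B = 0.
ids = ["A", "A", "A", "B", "B", "B"]
x = [1, 2, 3, 4, 5, 6]
y = [12, 14, 16, 8, 10, 12]

fixed_effects_slope = within_slope(ids, x, y)
# Treating all rows as one group ignores the individual effects (pooled OLS).
pooled_slope = within_slope(["all"] * len(x), x, y)
print(fixed_effects_slope, pooled_slope)
```

The fixed-effects slope recovers 2.0, while the pooled slope is negative because the individual effect is correlated with x. This is the omitted variable bias that the fixed-effects treatment removes.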


The Fixed Effect Model 

Treat λi as a constant for each individual:



Fixed Effects Models (Linear) (Graphic)



Random Effects Model 

  • The random effects assumption (made in a random effects model) is that the individual specific effects are uncorrelated with the independent variables. The unobserved heterogeneity is orthogonal to the other covariates.

  • The fixed effect assumption is that the individual specific effect is correlated with the independent variables. 


Other Phenomena Caused by Omitted Variables 

Consider the acceptance rates for the following groups of men and women who applied to college. 

A higher percentage of men were accepted: Is there evidence of discrimination? 


  • Omitted variable: Applications were split between the Computer Science (240) and the School of Management (320).
  • Within each school a higher percentage of women were accepted than men.
    • There is no discrimination against women.
  • This is an example of Simpson’s paradox.
    • When the omitted variable (Type of School) is ignored the data seem to suggest discrimination against women.
    • However, when the type of school is considered, the association is reversed and suggests discrimination against men. 


Simpson’s Paradox 

  • When studying the relationship between two variables, a confounding (lurking) variable may reverse the direction of the relationship: the direction observed when the lurking variable is ignored is the opposite of the direction observed when it is considered.

  • The confounding (or lurking) variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables 

An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.
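
The reversal can be reproduced with a few lines of arithmetic. The counts below are hypothetical: they only match the school totals mentioned above (240 and 320 applications) and are not the figures from the original table.

```python
# (school, gender) -> (applied, accepted); hypothetical counts.
applications = {
    ("Computer Science", "men"): (40, 8),
    ("Computer Science", "women"): (200, 50),
    ("School of Management", "men"): (240, 180),
    ("School of Management", "women"): (80, 64),
}


def acceptance_rate(gender, school=None):
    """Acceptance rate for a gender, overall or within one school."""
    applied = accepted = 0
    for (s, g), (n_applied, n_accepted) in applications.items():
        if g == gender and (school is None or s == school):
            applied += n_applied
            accepted += n_accepted
    return accepted / applied


print("overall:", acceptance_rate("men"), acceptance_rate("women"))
for school in ("Computer Science", "School of Management"):
    print(school, acceptance_rate("men", school), acceptance_rate("women", school))
```

In this made-up data men have the higher acceptance rate overall, yet women have the higher rate within each school. The reason is that most women applied to the school with the lower acceptance rate.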