Linear Regression
 Investigate the relationship between two variables
 Does blood pressure relate to age?
 Does income relate to education?
 Do sales relate to years of experience?
 Regressions identify relationships between dependent and independent variables
 Is there an association between the two variables?
 Estimation of impact of an independent variable
 Used for numerical prediction and time series forecasting
 Regression is a well-established statistical technique:
 Sir Francis Galton (1822–1911) studied the relationship between a father’s height and his son’s height
The Simple Linear Regression Model
 Linear regression is a statistical tool for numerical prediction. The first-order linear model (illustrated in the R sketch after the variable definitions) is
 y = β_0 + β_1x + ε
 y = dependent variable
 x = independent variable
 β_0 = y-intercept
 β_1 = slope of the line
 ε = error variable (residual)
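A minimal sketch in R of fitting this first-order model by ordinary least squares; the data frame and the variable names (age, blood_pressure) are illustrative assumptions, not from the slides.

# Hypothetical data: blood pressure vs. age
df <- data.frame(
  age = c(25, 34, 41, 48, 55, 62, 70),
  blood_pressure = c(118, 121, 127, 130, 135, 141, 148)
)

# Fit y = b0 + b1*x + e by ordinary least squares
fit <- lm(blood_pressure ~ age, data = df)

coef(fit)       # estimated intercept b0 and slope b1
residuals(fit)  # estimated error terms e_i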
Estimating the Coefficients
 The estimated coefficients are random variables
 Ordinary least squares (OLS) estimates are determined by
 drawing a sample from the population of interest,
 calculating sample statistics, and
 producing the straight line that fits the data most closely.
The Multiple Linear Regression Model
 A p-variable regression model can be expressed as a series of equations
 Condensing the equations into matrix form gives the general linear model
 The β coefficients are known as partial regression coefficients
 X_1, X_2, for example:
 X_1 = ‘years of experience’
 X_2 = ‘age’
 y = ‘salary’
 Estimated equation: ŷ = b_0 + b_1X_1 + b_2X_2, with the b’s estimated from the sample (see the R sketch below)
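A hedged R sketch of the salary example above; the data frame and its values are invented purely for illustration.

# Hypothetical data: salary explained by years of experience and age
salaries <- data.frame(
  experience = c(1, 3, 5, 8, 10, 13, 15),
  age        = c(24, 29, 31, 36, 40, 44, 48),
  salary     = c(38, 45, 52, 59, 67, 73, 80)   # e.g., in thousands
)

# b1 and b2 are the partial regression coefficients for experience and age
fit <- lm(salary ~ experience + age, data = salaries)
coef(fit)   # b0 (intercept), b1 (experience), b2 (age)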
Matrix Notation
OLS Estimation
 Sample-based counterpart to the population regression model:
 Population model: y = Xβ + ε
 Sample (fitted) model: y = Xβ̂ + e, where e contains the residuals
 OLS chooses the estimated coefficients such that the error sum of squares (SSE) is as small as possible for the sample.
 SSE = e^{T}e = (y − Xβ̂)^{T}(y − Xβ̂)
 Minimizing SSE requires differentiating with respect to the unknown coefficients (see the derivation below).
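Differentiating SSE with respect to β̂ and setting the derivative to zero gives the normal equations and the familiar closed-form OLS solution (assuming X^{T}X is invertible):

∂SSE/∂β̂ = −2X^{T}(y − Xβ̂) = 0  ⇒  X^{T}Xβ̂ = X^{T}y  ⇒  β̂ = (X^{T}X)^{−1}X^{T}y

The same result can be computed directly in R; the simulated data below are illustrative only.

# Closed-form OLS estimate from the design matrix X and response y
set.seed(1)
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(20)

X <- model.matrix(~ x1 + x2, data = df)        # adds the intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% df$y)   # solves the normal equations
beta_hat                                       # matches coef(lm(y ~ x1 + x2, data = df))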
Selected Statistics
 Adjusted R² (penalizes additional parameters)
 It represents the proportion of variability of Y explained by the X’s. R² is adjusted so that models with different numbers of variables can be compared.
 The F-test (tests whether any predictor has influence)
 A significant F indicates a linear relationship between Y and at least one of the X’s.
 The t-test of each partial regression coefficient (tests whether an individual predictor has influence)
 A significant t indicates that the variable in question influences the response variable while controlling for the other explanatory variables (as illustrated in the R example below).
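All three statistics are reported by summary() for a fitted lm object in R; the simulated data below are illustrative (x2 deliberately has no true effect).

set.seed(2)
df <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
df$y <- 2 + 1.5 * df$x1 + rnorm(30)

fit <- lm(y ~ x1 + x2, data = df)
summary(fit)
# "Adjusted R-squared": proportion of variability of y explained, penalized for extra predictors
# "F-statistic" and its p-value: is there a linear relationship with at least one predictor?
# "Coefficients" table: t-test and p-value for each partial regression coefficient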
Gauss-Markov Assumptions
The OLS estimator is the best linear unbiased estimator (BLUE) if the following assumptions hold:
 1) There is a linear relationship between the predictors and y
 2) The expected value of the residual vector is 0
 3) There is no correlation between the i-th and j-th residual terms (i ≠ j)
 4) The residuals follow a Gaussian distribution and exhibit constant variance (homoscedasticity)
 5) The covariance between the X’s and the residual terms is 0
 Usually satisfied if the predictor variables are fixed and non-stochastic
 6) No multicollinearity
Assumption 6 (no multicollinearity) in more detail:
 The rank of the data matrix X equals its number of columns (full column rank): with an intercept column and p predictors, r(X) = p + 1
 The number of columns must be smaller than n, the number of observations
 There is no exact linear relationship among the X variables
A basic check for multicollinearity is to calculate the correlation coefficient for each pair of predictor variables (see the sketch below)
 Large correlations (both positive and negative) indicate problems.
 One interpretation of “large” is: greater in absolute value than the correlations between the predictors and the response
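A quick pairwise-correlation check in R (variance inflation factors, e.g. via car::vif(), are a common complement); the simulated predictors below are illustrative, with x2 constructed to be nearly collinear with x1.

set.seed(3)
predictors <- data.frame(x1 = rnorm(50))
predictors$x2 <- predictors$x1 + rnorm(50, sd = 0.1)  # nearly an exact linear copy of x1
predictors$x3 <- rnorm(50)

# Pairwise correlations among the predictors; large absolute values flag potential multicollinearity
round(cor(predictors), 2)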


Heteroscedasticity

 When the requirement of constant variance is violated, we have heteroscedasticity.
 The Breusch-Pagan test or the White test can be used to check for heteroscedasticity (see the sketch below).
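A hedged sketch of the Breusch-Pagan test using bptest() from the lmtest package; the data are simulated so that the error variance grows with x.

library(lmtest)   # provides bptest()

set.seed(4)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)   # error standard deviation increases with x

fit <- lm(y ~ x)
bptest(fit)   # a small p-value indicates heteroscedasticity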
Homoscedasticity

 When the requirement of constant variance is not violated, we have homoscedasticity.
Outliers
 An outlier is an observation that is unusually small or large. Several possibilities need to be investigated when an outlier is observed:
 There was an error in recording the value.
 The point does not belong in the sample.
 The observation is valid.
 Identify outliers from the scatter diagram.
 There are also methods for “robust” regression.
Modeling: Nominal Predictor Variables
 Binary variables are coded 0, 1.
 For example, a variable X_1 (gender) is coded male = 0, female = 1.
 Then in the regression equation Y = β_0 + β_1X_1 + β_2X_2, when X_1 = 1 the value of Y indicates what is obtained for females;
 when X_1 = 0 the value of Y indicates what is obtained for males.
 If we have a nominal variable with more than two categories, one can create new dummy (also called indicator) binary variables, one fewer than the number of categories (see the R sketch below).
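In R, a factor predictor is expanded into 0/1 dummy variables automatically; the sketch below is illustrative and fixes male as the reference category, matching the coding above.

df <- data.frame(
  gender = factor(c("male", "female", "female", "male", "female", "male"),
                  levels = c("male", "female")),   # male = reference level (coded 0)
  x2 = c(3, 5, 2, 7, 4, 6)
)
set.seed(5)
df$y <- 10 + 2 * (df$gender == "female") + 0.5 * df$x2 + rnorm(6)

fit <- lm(y ~ gender + x2, data = df)
coef(fit)   # "genderfemale" is the dummy coefficient: the shift in y for females vs. males

model.matrix(~ gender + x2, data = df)   # shows the 0/1 dummy column that lm() creates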
Model Comparisons
 Hundreds of predictor variables – what to do?
 Too many “irrelevant” attributes can negatively impact the performance of a model
 Our interest is in parsimonious modeling
 We seek a minimum set of 𝑋 variables to predict variation in 𝑌 response variable.
 Goal is to reduce the number of predictor variables to arrive at a more parsimonious description of the data.
 Does leaving out one of the β’s significantly diminish the variance explained by the model?
 Compare a saturated (full) model to an unsaturated (reduced) model (see the sketch below)
 Note there are many possible unsaturated models.
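A saturated and an unsaturated model can be compared with a partial F-test via anova() in R; the simulated data are illustrative (x2 and x3 are irrelevant by construction).

set.seed(6)
df <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
df$y <- 1 + 2 * df$x1 + rnorm(40)

full    <- lm(y ~ x1 + x2 + x3, data = df)   # saturated (full) model
reduced <- lm(y ~ x1, data = df)             # unsaturated model without x2 and x3

anova(reduced, full)   # a non-significant F: leaving out x2, x3 does not diminish explained variance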
“Stepwise” Linear Regression
 Considers all possible simple regressions.
 Starts with the variable with largest correlation with 𝑌
 Considers next the variable that makes the largest contribution to the regression’s sum of squares
 Tests significance of the contribution
 Checks that individual contributions of variables already in the equation are still significant
 Repeats until all possible additions are non-significant and all possible deletions are significant (see the R sketch below)
 We will discuss attribute selection later in the course in more detail.
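Base R’s step() performs a comparable stepwise search, although it uses the AIC criterion rather than significance tests; the simulated data below are illustrative.

set.seed(7)
df <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60), x4 = rnorm(60))
df$y <- 3 + 2 * df$x1 - df$x3 + rnorm(60)   # only x1 and x3 matter

full <- lm(y ~ x1 + x2 + x3 + x4, data = df)
step(full, direction = "both")   # adds and drops predictors until no move improves the AIC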
Applications of Linear Regressions to Time Series Data
Average hours worked per week by manufacturing workers:
Autocorrelation
 3rd Gauss-Markov assumption: there is no correlation between the i-th and j-th residual terms
 Examining the residuals over time, no pattern should be observed if the errors are independent.
 Autocorrelation can be detected by graphing the residuals against time, or by the Durbin-Watson statistic
If autocorrelation is present, the statistical procedures used for regression may no longer be applicable
Detection
 Use the Durbin-Watson statistic to test for first-order autocorrelation. DW takes values within [0, 4]. For no serial correlation, a value close to 2 (e.g., 1.5 to 2.5) is expected (see the R sketch below).
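The Durbin-Watson test is available as dwtest() in the lmtest package; the AR(1) errors below are simulated for illustration.

library(lmtest)   # provides dwtest()

set.seed(8)
n <- 100
t <- 1:n
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))   # positively autocorrelated errors
y <- 5 + 0.2 * t + e

fit <- lm(y ~ t)
dwtest(fit)   # DW clearly below 2 with a small p-value: first-order autocorrelation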
Modeling Seasonality

 This uses a regression to estimate both the trend and additive seasonal indexes
 1. Create dummy variables which indicate the season
 2. Regress on time and the seasonal variables
 3. Use the multiple regression model to forecast
 For any season, e.g. season 1, create a column with 1 for time periods which fall in season 1 and zero for other time periods (only s – 1 dummy variables are required)
 The model which is fitted (assuming quarterly data) is
 y_t = β_0 + β_1t + β_2Q_1 + β_3Q_2 + β_4Q_3 + ε_t
 This is an additive model
 It allows testing for seasonality (see the R sketch below)
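A sketch of the additive trend-plus-seasonality regression for quarterly data; the series is simulated and the quarter dummies are generated with factor(), quarter 1 serving as the reference category.

set.seed(9)
n <- 40                                        # 10 years of quarterly observations
t <- 1:n
quarter <- factor(rep(1:4, length.out = n))
y <- 100 + 0.5 * t + c(10, -5, 3, -8)[as.integer(quarter)] + rnorm(n, sd = 2)

fit <- lm(y ~ t + quarter)   # only s - 1 = 3 quarter dummies enter the model
summary(fit)                 # significant quarter coefficients indicate seasonality

# Forecast the next period (t = 41 falls in quarter 1)
predict(fit, newdata = data.frame(t = 41, quarter = factor(1, levels = 1:4)))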
Dummy Variables
Testing for Structural Breaks
 There are various tests for comparing regression models, such as encompassing tests or the Chow test
 The Chow test is an econometric test of whether the coefficients in two linear regressions on different data sets are equal, i.e., it tests for structural breaks.
 The null hypothesis of the Chow test asserts that the coefficients of both models 1 and 2 are the same as those of the combined model C.
 n_1 and n_2 are the numbers of observations in the two groups and p is the number of parameters per model.
 The test statistic follows an F-distribution with p and n_1 + n_2 − 2p degrees of freedom (see the sketch below).
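A manual Chow test on simulated data with a break between two groups; strucchange::sctest(..., type = "Chow") offers a packaged alternative. Group sizes, the break, and the data are illustrative.

set.seed(10)
n1 <- 50; n2 <- 50
x <- rnorm(n1 + n2)
y <- c(1 + 2 * x[1:n1], 4 + 0.5 * x[(n1 + 1):(n1 + n2)]) + rnorm(n1 + n2)

p <- 2                                   # parameters per model: intercept and slope
sse <- function(fit) sum(residuals(fit)^2)

fit_c <- lm(y ~ x)                                            # combined model C
fit_1 <- lm(y[1:n1] ~ x[1:n1])                                # model for group 1
fit_2 <- lm(y[(n1 + 1):(n1 + n2)] ~ x[(n1 + 1):(n1 + n2)])    # model for group 2

F_chow <- ((sse(fit_c) - (sse(fit_1) + sse(fit_2))) / p) /
          ((sse(fit_1) + sse(fit_2)) / (n1 + n2 - 2 * p))
pf(F_chow, df1 = p, df2 = n1 + n2 - 2 * p, lower.tail = FALSE)   # small p-value: structural break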
Panel Data vs. Cross-Section Data
 Cross-section data refers to data observing many subjects (such as individuals, firms, or countries/regions) at the same point in time, or without regard to differences in time. (Many independent subjects)
 Panel data sets have several advantages over cross-section data sets:
 They may make it possible to overcome a problem of bias caused by unobserved heterogeneity. (Subjects are observed multiple times)
Analyzing Panel Data
 A panel data set, or longitudinal data set, is one where there are repeated observations on the same units.
 The units may be individuals, households, enterprises, countries, or any set of entities that remain stable through time.
 The US National Longitudinal Survey of Youth (NLSY) is an example. Since 1994 respondents have been interviewed every two years.
 Also sales data regularly contain observations about the same individuals.
 A balanced panel is one where every unit is surveyed in every time period. The NLSY is unbalanced because some individuals have not been interviewed in some years. Some could not be located, some refused, and a few have died.
The Panel Data Structure
Omitted Variable Bias in Panel Data

Endogeneity arises when an independent variable is correlated with the error term, i.e., their covariance is non-zero
 In contrast, the Gauss-Markov assumptions state that the error term is uncorrelated with the regressors.
 A reason for endogeneity might be that relevant variables are omitted from the model (underfitting). Such an omitted variable is also called a confounding variable.
 For example, the enthusiasm or willingness to take risks of an individual in the panel.
 Various techniques have been developed to address endogeneity, including fixed-effects models, propensity score matching, or instrumental variables.
Treatment of Individual Effects
 There are two main options for the treatment of individual effects in panel data:
 Fixed effects – assume the λ_i are constants, one for each individual (the individual effect may be correlated with the regressors, i.e., endogeneity is allowed)
 Random effects – assume the λ_i are drawn independently from some probability distribution
 Statistical tests such as the Hausman test (are random effects appropriate?) can help decide on one or the other.
 Specific packages in R are available for random and mixed effects models, which combine both (e.g., plm); see the sketch below.
 Fixed effects can be modeled by including a dummy for each individual.
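A hedged sketch using the plm package mentioned above; the panel is simulated with an individual effect λ_i that is correlated with the regressor, so the fixed-effects specification should be preferred.

library(plm)   # panel data models

set.seed(11)
pdata <- expand.grid(id = 1:30, year = 2018:2022)   # balanced panel: 30 individuals, 5 years
lambda <- rnorm(30)[pdata$id]                       # individual effect lambda_i
pdata$x1 <- rnorm(nrow(pdata)) + 0.5 * lambda       # x1 correlated with lambda_i (endogeneity)
pdata$y  <- 2 + 1.5 * pdata$x1 + lambda + rnorm(nrow(pdata))

fe <- plm(y ~ x1, data = pdata, index = c("id", "year"), model = "within")   # fixed effects
re <- plm(y ~ x1, data = pdata, index = c("id", "year"), model = "random")   # random effects

phtest(fe, re)   # Hausman test: a small p-value favors the fixed-effects model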
The Fixed Effect Model
Treat λ_i as a constant for each individual: y_it = λ_i + β^{T}x_it + ε_it
Fixed Effects Models (Linear) (graphic)
Random Effects Model

The random effects assumption (made in a random effects model) is that the individual specific effects are uncorrelated with the independent variables. The unobserved heterogeneity is orthogonal to the other covariates.

The fixed effect assumption is that the individual specific effect is correlated with the independent variables.
Other Phenomena Caused by Omitted Variables
Consider the acceptance rates for the following groups of men and women who applied to college.
A higher percentage of men were accepted. Is there evidence of discrimination?
Explanations?
 Omitted variable: Applications were split between the Computer Science (240) and the School of Management (320).
 Within each school a higher percentage of women were accepted than men.
 There is no discrimination against women.
 This is an example of Simpson’s paradox.
 When the omitted variable (Type of School) is ignored the data seem to suggest discrimination against women.
 However, when the type of school is considered, the association is reversed and suggests discrimination against men.
Simpson’s Paradox

When studying the relationship between two variables, there may exist a confounding (lurking) variable that reverses the direction of the relationship: the association observed when the lurking variable is ignored is the opposite of the association observed when it is taken into account.

The confounding (or lurking) variable creates subgroups, and failure to take these subgroups into consideration can lead to misleading conclusions regarding the association between the two variables.
An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.