Chapter 11 - panel data Flashcards
(31 cards)
what is cross sectional data
data that spans different categories (entities) at the same point in time, or, more generally, data where time is not a variable.
what is panel data
combination of cross sectional and time series data
name some examples of cross sectional variables
market cap
PE ratio
age of the firm
Things that do not change much dynamically.
how do we refer to panel data
“a panel of data contains both time series data and cross sectional data”
what is a pooled regression
A pooled regression combines all observations into a single dataset and fits one model to it. The implicit assumption is that all relationships are the same across entities and over time, because an estimator like OLS will treat the data this way.
For instance, if the regressors are PE ratio and SIZE, we assume each firm has the same relationship between “return” and size and PE.
elaborate on what actually constitutes panel data
it is about the structure. The data contains both a time dimension and a cross-sectional dimension. In other words, the data has two dimensions: one is time and the other is the entity.
what is cross sectional data
Cross sectional data is data across many entities, such as firms or individuals, AT a single point in time.
A good example is a snapshot of a population’s obesity. We’d draw a sample at some point in time, and make models trying to explain what percentage is obese etc. We could also try to predict obesity based on the cross sectional information.
However, this does not provide us any insight into the trend. We do not know whether obesity is increasing or decreasing, because we drew all the data from a single point in time; it is cross sectional data.
Therefore, cross sectional data contrast with time series data.
Time series data actually refers to a single entity and how it evolves through time. A time series would therefore give us answers to questions like “how does the obesity of this person/entity evolve?”.
Panel data allows us to observe how entities evolve over time and to identify patterns that generalize across the population, controlling for individual-specific traits. This makes it possible to study both within-entity dynamics and between-entity relationships.
give the simplest setup of a panel data model
y_{it} = alpha + beta x_{it} + u_{it}
So, we have multiple entities in the time series, given by the variable “i”, and we have records for these at different points in time, given by “t”.
alpha is a constant, the same for all entities and time periods.
beta is a vector of coefficients, one value for each regressor.
Such a model assumes that a relationship that holds for a single entity also holds for the entire population.
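To make the pooled setup concrete, here is a minimal NumPy sketch on simulated data (the entity count, time span, and true coefficients are all illustrative): all N*T observations are stacked into one regression, so a single alpha and beta are estimated for everyone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated panel: N entities observed over T periods, one regressor
N, T = 5, 20
x = rng.normal(size=(N, T))
u = rng.normal(scale=0.1, size=(N, T))
y = 1.0 + 2.0 * x + u          # true alpha = 1, beta = 2, same for every entity

# Pooled OLS: stack all N*T observations into one regression
X = np.column_stack([np.ones(N * T), x.ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
alpha_hat, beta_hat = coef
print(alpha_hat, beta_hat)     # close to the true values 1 and 2
```

Because the simulated entities genuinely share one intercept and one slope, pooled OLS recovers them well here; the later flashcards cover what happens when that assumption fails.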
main limitation of the pooled regression OLS approach?
it assumes that the relationships remain the same through time and across entities
why not use all as independent time series
then we don't generalize anything. A goal is to provide insights about the population as a whole.
what is meant by the data in this chapter
“A panel of data” refers to the combination of time series and cross sectional data
what two primary methods are we working with?
Fixed effect method
random effect method
what is a balanced panel?
A balanced panel has the same number of time series observations for each cross-sectional unit.
what is an unbalanced panel?
a different number of time series observations for the various cross-sectional entities
elaborate on SUR
SUR models each entity separately, so each entity gets its own fitted model. However, the coefficients are constant over time.
We assume the errors are correlated across entities.
We use GLS to transform the regressions based on the correlation between the errors. The outcome is a regression where all the errors are uncorrelated.
So we still have one regression per entity, and the regressions are still time-invariant. The key is that we have now weighted in the correlation between entities.
is SUR good?
if the covariance matrix of the errors is correctly specified, it produces the true result. However, this is unlikely in practice.
limitations of SUR
1) The number of time series observations must be at least as large as the number of cross-sectional units. This is one limitation.
2) The covariance matrix of the errors must be estimated. An entity with T observations has T errors, and we have N entities, so there are NT error terms in total. The covariance matrix of these errors is therefore NT x NT, which can be insanely large.
elaborate on the size of the covariance matrix in SUR
the book and professor claim NT x NT, because we have NT errors in total.
However, SUR assumes that errors are independent through time. Under that assumption, we only need an N x N matrix of contemporaneous covariances.
If we do not make this assumption, we need the full NT x NT covariance matrix to allow GLS to re-weight properly.
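The difference in scale is easy to see with a quick calculation (the panel dimensions below are made up for illustration):

```python
# Size of the SUR error covariance matrix for a modest panel
N, T = 30, 60             # 30 entities, 60 time periods (illustrative)
full = (N * T) ** 2       # treating all NT errors as potentially correlated
contemporaneous = N ** 2  # if errors are assumed independent through time
print(full, contemporaneous)  # 3240000 entries vs 900
```

Even this small panel already requires millions of covariance entries in the unrestricted case, which is why the independence-through-time assumption matters so much.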
briefly introduce “fixed effects method”
it allows the intercept to vary cross-sectionally, but not through time.
The slope estimates remain fixed through time and across cross-sections.
So, each entity gets its own intercept term.
elaborate on the fixed effects method
we take the error term, u_{it}, and decompose it into two parts:
1) mu_i
2) v_{it}
mu_i is the new, entity-specific intercept.
v_{it} is the remaining error term, with the same interpretation as before: it encapsulates everything about y_{it} that the model does not explain.
The model becomes: y_{it} = alpha + beta x_{it} + mu_i + v_{it}
The model can then be estimated using a dummy variable approach, with one dummy for each entity. This is called LSDV: least squares dummy variables.
Such a model keeps the regressors as before, but the entity-specific intercepts are now captured by the dummies.
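A minimal NumPy sketch of LSDV on simulated data (entity count, fixed-effect values, and noise level are all illustrative): the common intercept column is replaced by one dummy column per entity, and a single slope is shared by everyone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated panel with entity-specific intercepts
N, T = 4, 30
mu = np.array([0.0, 1.0, 2.0, 3.0])        # fixed effect for each entity
x = rng.normal(size=(N, T))
y = mu[:, None] + 2.0 * x + rng.normal(scale=0.1, size=(N, T))

# LSDV: one dummy column per entity replaces the common intercept
D = np.kron(np.eye(N), np.ones((T, 1)))    # (N*T) x N block of entity dummies
X = np.column_stack([D, x.ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
mu_hat, beta_hat = coef[:N], coef[-1]
print(mu_hat, beta_hat)  # intercepts near 0, 1, 2, 3; slope near 2
```

Note that the dummy block `D` lines up with `y.ravel()` because NumPy's row-major ravel keeps each entity's T observations consecutive.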
elaborate on testing the fixed-effect method for whether it is actually necessary in regards to panel data
Since the slope parameters are fixed through time and across entities, the intercepts are the only thing distinguishing it from regular OLS.
As a result, we can treat the regular pooled OLS version as a restricted variant of the fixed-effects LSDV approach. The restriction is that all intercepts must be equal.
We can use the Chow test for this. If the test does not reject, the intercepts are not significantly different from each other, and a pooled regression suffices.
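The restricted-vs-unrestricted comparison can be sketched by computing the F-statistic by hand on simulated data (a minimal sketch; the panel dimensions and coefficients are illustrative, and in practice the statistic would be compared against the F(N-1, NT-N-k) critical value):

```python
import numpy as np

rng = np.random.default_rng(2)

# Panel where the intercepts genuinely differ across entities
N, T = 4, 30
mu = np.array([0.0, 1.0, 2.0, 3.0])
x = rng.normal(size=(N, T))
y = mu[:, None] + 2.0 * x + rng.normal(scale=0.1, size=(N, T))

def rss(X, yv):
    """Residual sum of squares from an OLS fit of yv on X."""
    coef, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return np.sum((yv - X @ coef) ** 2)

yv = y.ravel()
X_pooled = np.column_stack([np.ones(N * T), x.ravel()])  # restricted: one intercept
D = np.kron(np.eye(N), np.ones((T, 1)))
X_lsdv = np.column_stack([D, x.ravel()])                 # unrestricted: N intercepts

rss_r, rss_u = rss(X_pooled, yv), rss(X_lsdv, yv)
k = 1  # one slope parameter
F = ((rss_r - rss_u) / (N - 1)) / (rss_u / (N * T - N - k))
print(F)  # very large here, so we reject equal intercepts and prefer fixed effects
```

Because the simulated intercepts really do differ, the pooled RSS is far larger than the LSDV RSS and the F-statistic is huge; with identical intercepts it would be small and pooling would suffice.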
how many parameters must be estimated with the LSDV method?
N + k
the number of entities (one intercept each) plus the number of regressors
what can we do to avoid estimating so many dummy variable parameters?
Use the within transformation
elaborate on the within transformation
subtract each entity's time-mean from the values of the variable. We do this for all variables, including y.
The result is a new, demeaned regression.
Why do we want to do this?
It removes the need for intercept terms: every demeaned variable has mean zero, so each entity's regression line passes through the origin, and the intercept is 0.
So, now we have removed the dummy variables.
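The within transformation can be sketched in a few lines of NumPy on simulated data (same illustrative fixed-effects setup as before): after demeaning by entity, a regression through the origin recovers the same slope the LSDV regression would give, with no dummies needed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated panel with entity-specific intercepts
N, T = 4, 30
mu = np.array([0.0, 1.0, 2.0, 3.0])
x = rng.normal(size=(N, T))
y = mu[:, None] + 2.0 * x + rng.normal(scale=0.1, size=(N, T))

# Within transformation: subtract each entity's time-mean from y and x
y_dm = y - y.mean(axis=1, keepdims=True)
x_dm = x - x.mean(axis=1, keepdims=True)

# Demeaned regression: no intercept, no dummies (OLS through the origin)
beta_hat = np.sum(x_dm * y_dm) / np.sum(x_dm ** 2)
print(beta_hat)  # near the true slope of 2
```

The entity-specific intercepts mu_i are swept out by the demeaning, which is exactly why the transformation lets us drop the N dummy variables.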