L9 - The Regression Model Flashcards Preview

18ECA005 - Data Analysis II > L9 - The Regression Model > Flashcards


What is Econometrics?

= Econometrics literally means economic measurement: it is the use of statistical methods to analyze economic models, i.e. formal, mathematical relationships between variables: Y = f(X)
- The dependent variable Y is determined by the independent variable(s) X. This is different from correlation: the causal link X => Y is based on (economic) theory:
- Wages depend on experience and schooling;
- Demand depends on own price, income and the price of substitute goods, etc.


What else do we use Econometrics for?

- We use econometrics to quantify these relationships [Y = f(X)] and:
- Measure by how much Y is affected by X;
- Test theories and hypotheses;
- Predict the value of Y given the value of X.

- One way to quantify such relationships is regression analysis. A very popular method of estimation is Least Squares (LS, or OLS - Ordinary Least Squares).


What is the Simple Regression Model?

- Simple because it only has one independent variable. Economic model (eg. consumption Y depends on income X):
- Y{i} = α + βX{i} + u{i}
- α and β: the population parameters, the “true” parameters of this relationship that define the Population Regression Function. To know them we would need access to the whole population.

- Y{i} --> the actual value of Y for each individual i
- α + βX{i} --> E(Y|X{i}) --> the expected value of Y given X: the deterministic part of the equation, the Population Regression Function (PRF), the values on the line --> this is the average REVIEW
- u{i} --> the stochastic part, the error


What does the regression model look like on a graph?

- The points --> the actual data, the (Y{i}, X{i}) pairs: eg, the actual consumption and income level of each individual i (or time t if we have time series data (Y{t}, X{t})).
- The best-fit line --> red line = the modelled relationship E(Y|X{i}) = α + βX{i}: the expected consumption of each individual given their income level, i.e. what people at that income level are expected, on average, to consume according to the model.
- The vertical distance between the two is the error.
- The model is then Y{i} = α + βX{i} + u{i}: it represents what happens in theory for the whole population; it is the “true” relationship and it is characterised by the parameters α and β.


What is the deterministic part of the Simple Regression Model?

- The deterministic part E(Y|X{i}) = α + βX{i} is the Population Regression Function or PRF. It represents the average, or expected, level of Y for each given level of X (the expected value of Y conditional upon X).
- The errors allow for differences between modelled behaviour and reality (randomness in human behaviour, unobserved variables, errors of measurement, etc.). They are distributed above and below the regression line and it is assumed they cancel each other out on average (they have a 0 mean).


What do we do when we don't have access to the population?

- The key issue is that we have no access to the population => we use a sample and derive the Sample Regression Function (SRF).
- Rather than the true α and β we estimate a and b, their sample estimators:
- Y{i} = a + bX{i} + u(hat){i}, or Y{i} = Y(hat){i} + u(hat){i}
- The deterministic part (the SRF) again is a line of points: Y(hat){i} = a + bX{i}
- Again the actual value of Y (Y{i}) is equal to the expected value (Y(hat){i}) plus an error (u(hat){i}).
- The estimators we get, a and b, will vary with the sample we have; they will not be identical across samples or to the true parameters --> two different samples can give us different values of a and b.
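The sample-to-sample variation of a and b can be illustrated by simulation. This is only a sketch: the "true" parameters ALPHA and BETA, the error distribution, the sample size and the seed are all made-up assumptions, not from the flashcards.

```python
import random

random.seed(0)
ALPHA, BETA = 2.0, 0.5   # assumed "true" population parameters for the demo

def draw_sample_and_fit(n=50):
    """Draw one sample from Y = ALPHA + BETA*X + u and return the LS estimates (a, b)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [ALPHA + BETA * x + random.gauss(0, 1) for x in xs]
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Two different samples give two different (a, b) pairs, both near the truth.
fits = [draw_sample_and_fit() for _ in range(2)]
print(fits)
```

Each call draws a fresh sample, so the estimates differ across samples even though the underlying parameters are fixed.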


How can we summarise the True Model of PRF and the Sample Model of SRF?

True Model: PRF
- Y{i}= α + βX{i} + u{i}
- α = intercept parameter
- β = slope parameter
- u = error, or noise
- Parameters are fixed

Sample: SRF
- Y{i}= a + bX{i} + u(hat){i}
- a = sample estimator of α
- b = sample estimator of β
- u(hat) = residual
- Estimators are random variables (they vary across samples)
- Our job is to calculate the estimators (a and b ) and make inference about the true parameters (α and β) on the basis of them.
- There are several methods of estimation but we focus only on LS


What is Ordinary Least Squares?

REVIEW - What is the definition of this - do I just need DA I for it?

- Given our model Y{i} = a + bX{i} + u(hat){i}, the method of LS finds a and b by minimizing the sum of the squared residuals, or RSS (residual sum of squares);
- Intuitively, we draw the line as close to the data as possible, i.e. we make the distance between line and points (the residuals) as small as possible.
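A minimal sketch of LS by hand, using the closed-form solution b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄ (the data below are invented for illustration):

```python
# Minimal OLS by hand: minimize RSS, the residual sum of squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # illustrative income data (made up)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]    # illustrative consumption data (made up)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form LS estimates: b = Cov(X, Y) / Var(X), a = y_bar - b * x_bar.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Residuals u(hat){i} = Y{i} - (a + bX{i}); LS makes their squared sum minimal.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
rss = sum(u ** 2 for u in residuals)
print(a, b, rss)
```

Note that the LS residuals sum to (numerically) zero, matching the "errors cancel out on average" assumption.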


How are relationships often expressed in Econometrics?

- Many economic (econometric) models are expressed in natural logs.
- If both the dependent and independent variables of your regression are
in natural logs then the slope coefficient is an elasticity

Y{i} = 0.9 + 0.16X{i} + u{i}
- means that a 1-unit increase in X leads to a 0.16-unit increase in Y.

ln(Y{i}) = 0.9 + 0.16 ln(X{i}) + u{i}
- means that a 1% increase in X will lead to a 0.16% increase in Y.
- (This is because β = dln(Y)/dln(X) and changes in logs are relative changes.)
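The elasticity reading of the log-log model can be checked numerically. This sketch uses the flashcard's coefficients (0.9 and 0.16) and an arbitrary starting value of X:

```python
import math

beta0, beta1 = 0.9, 0.16   # coefficients from the flashcard's log-log example

def y_of(x):
    """Log-log model ln(Y) = 0.9 + 0.16 ln(X), i.e. Y = exp(0.9) * X**0.16."""
    return math.exp(beta0 + beta1 * math.log(x))

x = 100.0   # arbitrary starting level of X
# Percentage change in Y when X rises by 1%.
pct_change_y = (y_of(1.01 * x) / y_of(x) - 1) * 100
print(round(pct_change_y, 3))   # close to 0.16, i.e. beta1 in percent
```

The response is approximately (not exactly) β1 percent because the elasticity interpretation is a derivative-based approximation for small changes.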


How can we use the result Y(hat){i} = a + bX{i}?

- Y(hat){i} = a + bX{i} is also E[Y|X]: the expected value of Y conditional upon X, the points on the regression line --> it shows us the expected value of Y for any given X.
- Since dY/dX = b we can also calculate dY = b(dX) --> the expected change in Y when X changes.

- Calculating the expected value of Y for a given X is called forecasting.
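Both uses can be sketched in two lines. The estimates a and b and the chosen X values below are hypothetical, not from the flashcards:

```python
a, b = 0.9, 0.16   # hypothetical LS estimates of the intercept and slope
x = 50.0           # hypothetical level of X

y_hat = a + b * x  # forecast: expected value of Y given X = 50
dy = b * 2.0       # expected change in Y when X rises by 2 units (dY = b * dX)
print(y_hat, dy)
```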


What is the nature of the problem when testing hypotheses of a regression model?

- The nature of the problem: we want to know something about the true α and β using the information we have, which is a and b.
- Since each of a and b is an estimator, it is a variable that comes from a distribution: it can take many different values with different probability.
- The values will depend on the sample used, exactly as we saw for sample means etc.
- Under certain conditions it can be shown that the LS estimator is unbiased, efficient and consistent, and that it follows a Normal distribution:

- a ~ N(α, σ^2{a})
- b ~ N(β, σ^2{b})

REVIEW -->BLUE estimator


How do you find a confidence interval of a regression model?

- Just look at the value of b (it is the same for a with its own respective values):
- b ~ N(β, σ^2{b})
- Z = (b - β)/σ{b} ~ N(0,1)
- t = (b - β)/s{b} ~ t(n-k)
- i.e. if we knew the true standard deviation of each estimator we could standardise the normal into a standard normal Z ~ N(0,1).
- Since we have to use the estimated standard deviations, the resulting distribution is a t distribution with n-k degrees of freedom
- (where k is the number of coefficients including the intercept, so here 2).

- As before, we choose a probability range of size (1 - α) (eg 0.95), so that
- P[-t* < (b - β)/s{b} < +t*] = (1 - α)
- β --> the true parameter, hence the true mean of b's distribution
- P[b - s{b}t* < β < b + s{b}t*] = (1 - α)
- α is the significance level; this is often (but not necessarily) equal to 0.05 (5%), so that (1 - α) is 0.95 or 95%.
- As before, I am constructing a distribution centred around my b and I construct an interval around this estimate to have an idea about where the true β could be.
- Do not confuse the two meanings of α (the intercept parameter vs the significance level)!
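The interval b ± s{b}·t* is simple to compute once t* is looked up. Here the estimate, its standard error and the sample size are hypothetical; t* ≈ 2.101 is the tabulated critical value for a t distribution with n − k = 18 degrees of freedom at α = 0.05:

```python
b, s_b = 0.16, 0.04   # hypothetical slope estimate and its standard error
n, k = 20, 2          # hypothetical sample size; k coefficients incl. intercept
t_star = 2.101        # table value: t critical value, 18 df, alpha = 0.05

# 95% confidence interval for the true beta: b - s_b*t*, b + s_b*t*.
lower = b - s_b * t_star
upper = b + s_b * t_star
print(lower, upper)
```

With 95% confidence the true β lies between roughly 0.076 and 0.244 in this made-up example.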


How do you do a specific test of a regression model?

- We formulate a null hypothesis H{0} about the parameter, for ex: H{0}: β= β*;
- We formulate an alternative hypothesis H{1}: this is what happens if H{0} is not true:
- H{1}: β ≠ β*: composite alternative, two-tailed test;
- H{1} : β > β*: one-sided alternative, right-tailed test;
- H{1} : β < β*: one-sided alternative, left-tailed test;
- you can then use either the confidence interval or t-statistic to test the hypotheses


How do you use the confidence intervals to test hypotheses of regression models?

- We do not reject H{0}: β = β* if the hypothesised value β* falls inside the (1 - α) confidence interval [b - s{b}t*, b + s{b}t*]; we reject H{0} if β* falls outside it.



How do you use the t-statistic to test hypotheses of regression models?

- Alternatively we can calculate the t-statistic directly; call it t(hat):
- t(hat) = (b - β*)/s{b}
- P[-t* < (b - β*)/s{b} < +t*] = (1 - α)
- We do not reject the null if the statement above is true and the test statistic falls between +/-t* (the acceptance region); we reject it if it falls in the tails.
- Note in fact what the formula means: IF the null is true and β is really β*, THEN the above statement must be correct.
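As a worked sketch with hypothetical numbers (the estimate, standard error, hypothesised value β* and the t* for 18 degrees of freedom are all assumptions, not from the flashcards):

```python
b, s_b = 0.16, 0.04   # hypothetical slope estimate and its standard error
beta_star = 0.10      # hypothesised value of beta under H0
t_star = 2.101        # table value: t critical value, 18 df, alpha = 0.05

# t(hat) = (b - beta*) / s_b; reject H0 if it falls in the tails beyond +/-t*.
t_hat = (b - beta_star) / s_b
reject = abs(t_hat) > t_star
print(t_hat, reject)   # t(hat) is about 1.5: inside the acceptance region
```

Here t(hat) ≈ 1.5 < 2.101, so we do not reject H{0}: the sample is compatible with β = 0.10.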


When testing hypothesis what do we assume H{0} is?

- When we test a hypothesis we assume H{0} is true unless the evidence is sufficiently compelling to suggest otherwise.
- This does not mean that we are absolutely certain that the null is true or false. We are simply saying that it is compatible with the
data, it is supported by the sample evidence with a certain probability.
- This judgment is not fool-proof, we might make errors; there are two types of errors that we can make and that we already looked
at: Type I and Type II error.
- Type I error: to reject a true null hypothesis. Its probability equals the significance level α.
- Type II error: not to reject a false null hypothesis.


What is the test of significance for regression models?


- To test whether X actually affects Y, i.e. whether it has an effect on Y, we test whether its parameter is 0.
- Recall Y{i} = α + βX{i} + u{i}:
- if β = 0, X has zero effect on Y: when X changes, Y does not change.
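The test of significance is just the t-test above with β* = 0. Reusing the same hypothetical estimate, standard error and critical value:

```python
b, s_b = 0.16, 0.04   # hypothetical slope estimate and its standard error
t_star = 2.101        # table value: t critical value, 18 df, alpha = 0.05

# Test of significance: H0 says beta = 0, so t(hat) = b / s_b.
t_hat = (b - 0.0) / s_b
significant = abs(t_hat) > t_star
print(t_hat, significant)   # t(hat) is about 4: reject H0, X matters for Y
```

Since |t(hat)| ≈ 4 > 2.101, we reject H{0}: β = 0 in this made-up example, i.e. X has a statistically significant effect on Y.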


Choice of alternative hypothesis