Chapter 3 Brooks Flashcards
(32 cards)
what is regression
Describing and evaluating the relationship between one variable and one or more other variables
What are we trying to do with regression
We are trying to explain movements in a dependent variable, y, by movements in one or more explanatory variables (the x's).
what is correlation
The degree of linear association between two variables
what is the danger with correlation
1) Assuming causality
2) Misunderstand what linear association is
Correlation is given as:
corr(X,Y) = cov(X,Y) / (sigma_X * sigma_Y)
where cov(X,Y) = E[(X - mu_X)(Y - mu_Y)]
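The two formulas above can be tried on a small sample. This is a minimal sketch in plain Python, not a reference implementation: the data are made up for illustration, and the expectation E[.] is replaced by a sample average that divides by n (an assumption; many textbooks divide by n - 1 instead, which changes the covariance but not the correlation).

```python
# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample analogue of E[(X - mu_X)(Y - mu_Y)], dividing by n."""
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """cov(X,Y) / (sigma_X * sigma_Y), where sigma is the square root of cov(X,X)."""
    sigma_x = covariance(xs, xs) ** 0.5
    sigma_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sigma_x * sigma_y)


cov_xy = covariance(x, y)
corr_xy = correlation(x, y)
print("covariance:", cov_xy)
print("correlation:", corr_xy)
```

The covariance depends on the units of X and Y, while the correlation always lies between -1 and 1, which is why the correlation is the one used to judge the degree of linear association.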
what is the interpretation of covariance
Covariance is the expected value of the product of X's deviation from its mean and Y's deviation from its mean. It is positive when X and Y tend to be above (or below) their means at the same time, and negative when one tends to be above its mean while the other is below.
elaborate on the role of correlation
The role of correlation is to provide an understanding of the linear relationship between two variables. If the linear relationship is perfect, it means that movement in one variable can perfectly explain the movement in the other.
HOWEVER: The correlation coefficient does not tell us how much one variable moves when the other moves. It only tells us the degree (strength) of the linear relationship.
We need a linear method to capture the relationship itself. This is where linear regression comes into play.
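A small illustration of why the correlation coefficient alone is not enough: two exact lines with very different slopes both have a correlation of 1. This is only a sketch with made-up data. The helper names are hypothetical, and the slope is computed as cov(X,Y) / var(X), which is the slope of the best fitting line.

```python
def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / len(xs)


def corr(xs, ys):
    return cov(xs, ys) / (cov(xs, xs) * cov(ys, ys)) ** 0.5


def slope(xs, ys):
    # Slope of the best fitting line: cov(X,Y) / var(X).
    return cov(xs, ys) / cov(xs, xs)


x = [1.0, 2.0, 3.0, 4.0]
y_flat = [1 + 2 * v for v in x]    # exact line with slope 2
y_steep = [1 + 10 * v for v in x]  # exact line with slope 10

corr_flat, corr_steep = corr(x, y_flat), corr(x, y_steep)
slope_flat, slope_steep = slope(x, y_flat), slope(x, y_steep)

print("correlations:", corr_flat, corr_steep)
print("slopes:", slope_flat, slope_steep)
```

Both correlations come out the same, yet a one-unit move in x means a very different move in y in the two cases. That size of the move is what the regression slope measures.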
first step to see if there is a relationship between two variables
Plot it visually
elaborate on:
y = a + bx
This is an exact line. The problem with this is that it doesn't account for errors.
This is a model, a best fitting line, but it is not realistic
What is a better model than
y = a + bx?
y_t = a + bx_t + u_t
This model assumes a relationship, but adds a random disturbance term u_t that always exists, for example because it is impossible to capture every influence on y_t.
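To see what the disturbance term does, the model can be simulated. This is a rough sketch under assumptions that are not in the cards: the values of a, b and sigma are made up, and u_t is drawn from a normal distribution with mean zero.

```python
import random

random.seed(0)  # fixed seed so the draw is reproducible

a, b = 1.0, 2.0  # hypothetical "true" intercept and slope
sigma = 0.5      # hypothetical standard deviation of u_t

x = [float(t) for t in range(1, 11)]             # x values, treated as fixed
u = [random.gauss(0.0, sigma) for _ in x]        # random disturbances u_t
y_exact = [a + b * x_t for x_t in x]             # exact line y = a + bx
y = [y_e + u_t for y_e, u_t in zip(y_exact, u)]  # observed y_t = a + b x_t + u_t

for x_t, y_e, y_t in zip(x, y_exact, y):
    print(f"x={x_t:4.1f}  exact={y_e:6.2f}  observed={y_t:6.2f}")
```

The observed points scatter around the exact line instead of lying on it, which is the situation real data is in.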
generally speaking, how do we find values for a and b (alpha and beta)
We need to find the a and b that make the sum of vertical distances between the data points y_t and the fitted line as small as possible.
why vertical and not horizontal distances?
We assume that the x-values are fixed in repeated samples, so they contain no random element. All the randomness is in y_t, so we measure the distance from each point to the line in the y (vertical) direction.
Most common line fitting method
OLS
general procedure of OLS
Square the vertical distances between each y_t and the line. Sum them. Find the line that makes this sum the smallest.
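The procedure above can be sketched for the one-variable case. This is an illustration, not the book's code: the data are made up, the function names are hypothetical, and the fit uses the standard closed-form solution (slope = sum of cross-deviations divided by sum of squared x-deviations, intercept = mean of y minus slope times mean of x).

```python
def ols_fit(xs, ys):
    """Return (a_hat, b_hat) for the line minimising the sum of squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b_hat = s_xy / s_xx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


def sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = ols_fit(x, y)
best = sum_of_squares(x, y, a_hat, b_hat)

# Nudging the line in any direction should give a larger sum of squares.
nudged = [
    sum_of_squares(x, y, a_hat + da, b_hat + db)
    for da in (-0.1, 0.1)
    for db in (-0.1, 0.1)
]

print("a_hat, b_hat:", a_hat, b_hat)
print("sum of squares at the fit:", best)
print("sums of squares for nudged lines:", nudged)
```

The comparison with the nudged lines is only a sanity check that the fitted line really is the one with the smallest sum.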
with correct notation how do we describe the method of OLS
Minimize the sum of squared differences between y_t and y_t-pred
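Written out for the one-variable case, the minimization looks roughly as follows. The notation is an assumption made here for readability: hats stand for the "_pred" suffix used in these cards, bars denote sample means, and T is the number of observations.

```latex
\min_{\hat{a},\,\hat{b}} \; \sum_{t=1}^{T} \left( y_t - \hat{y}_t \right)^2
  = \sum_{t=1}^{T} \left( y_t - \hat{a} - \hat{b}\, x_t \right)^2

% Setting the derivatives with respect to \hat{a} and \hat{b} to zero gives
\hat{b} = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^{T} (x_t - \bar{x})^2},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}
```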
what is y_t and what is y_t_pred
y_t is the data point, as collected.
y_t_pred is the predicted data point.
Other way of describing the method of OLS
Minimizing the sum of squared errors/residuals
what is the residual mathematically?
The difference between y_t and y_t_pred
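A short worked example of residuals and their sum of squares (the RSS). The sample is made up, and the values of a_hat and b_hat are simply the OLS estimates for this particular sample, taken as given here.

```python
# Hypothetical sample data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# OLS estimates for this particular sample, taken as given.
a_hat, b_hat = 0.05, 1.99

y_pred = [a_hat + b_hat * x_t for x_t in x]             # predicted points
residuals = [y_t - y_p for y_t, y_p in zip(y, y_pred)]  # y_t minus y_t_pred
rss = sum(e ** 2 for e in residuals)                    # residual sum of squares

for x_t, y_t, y_p, e in zip(x, y, y_pred, residuals):
    print(f"x={x_t:3.1f}  y={y_t:5.2f}  y_pred={y_p:5.2f}  residual={e:+.2f}")
print("RSS:", rss)
```

Some residuals are positive and some negative. With an intercept in the model, the OLS residuals sum to zero, which is one reason they are squared before being added up.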
what is PRF
Population regression function, the function that is thought to be producing the data
give the PRF
y_t = a + bx_t + u_t
Why does the PRF contain an error term?
Because even though it is the true process, the true process can contain random elements.
What is SRF?
Sample regression function
Give the SRF
y_t_pred = a_pred + b_pred x_t
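why is there no error term in the SRF
Because the SRF is the fitted line itself: it gives the predicted value y_t_pred, which is fully determined by a_pred, b_pred and x_t. The part of y_t that the line does not explain is the residual, y_t - y_t_pred, which OLS makes as small as possible by minimizing the RSS.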
what is CLRM
the model of:
y_t = a + bx_t + u_t
along with the 5 assumptions