Multiple regression Flashcards
what are the assumptions of regression?
- residuals are normally distributed with a mean of 0 and variance sigma squared
- homoskedasticity (the variance of the residuals remains constant no matter the value of X)
- residuals are not correlated with one another
how exactly do we get the regression line? what method is used
The line is fit using the method of least squares, which chooses the intercept and slope that minimise the sum of the squared residuals.
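A minimal sketch of least squares, assuming Python with numpy and made-up data: the slope and intercept formulas below are the closed-form least-squares solution for one predictor.

```python
import numpy as np

# Hypothetical data (made-up values for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares: choose b0, b1 to minimise the sum of squared residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(b0, b1)
print(np.sum(residuals ** 2))  # the quantity least squares minimises
```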
what are we estimating with regression?
the intercept and slope of the population regression line
why center a variable?
to allow for a more meaningful interpretation. E.g., imagine centering age to 46, the mean age of the sample. The intercept then tells us what the DV is for a 46-year-old with all other predictors set to 0.
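A sketch of this in Python with numpy, using hypothetical ages (whose mean happens to be 46, as in the card): centering leaves the slope unchanged and only moves the intercept to the mean of the predictor.

```python
import numpy as np

# Hypothetical sample: age (predictor, mean = 46) and an outcome score
age = np.array([25.0, 38.0, 46.0, 51.0, 70.0])
score = np.array([10.0, 14.0, 17.0, 18.0, 25.0])

def fit(x, y):
    # Closed-form least-squares intercept and slope for one predictor
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

b0_raw, b1_raw = fit(age, score)           # intercept = predicted score at age 0
b0_c, b1_c = fit(age - age.mean(), score)  # intercept = predicted score at age 46

print(b1_raw, b1_c)  # slopes are identical
print(b0_c)          # equals the mean score: the prediction for a mean-aged person
```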
if you standardise a variable e.g., age, how does this impact the interpretation of the regression line
what now does the slope reflect?
After standardising a predictor, the slope is interpreted as the change in the outcome for a one-SD change in that predictor.
if I standardise both the predictor and outcome variable, what does the coefficient for the X variable reflect?
Pearson's correlation coefficient
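This equivalence can be checked numerically. A sketch with simulated data (assuming numpy): standardise both variables, fit the least-squares slope, and compare it to `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # simulated linear relationship

def z(v):
    # Standardise: mean 0, SD 1
    return (v - v.mean()) / v.std()

zx, zy = z(x), z(y)
slope = np.sum(zx * zy) / np.sum(zx ** 2)  # least-squares slope of zy on zx
r = np.corrcoef(x, y)[0, 1]                # Pearson's r
print(slope, r)  # the two match
```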
in simple regression what part of the equation captures the explained and unexplained variance?
- Systematic part: B0 + B1X (the explained part)
- Random part: the residual, e (the unexplained part)
describe the form of a regression equation
Response = systematic part + random part
in simple regression how do we capture the residual variability?
sigma squared
to estimate the residual variability using sigma squared - what assumption needs to be made
that the residuals are normally distributed
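A small numeric sketch (assuming numpy, with made-up data): the usual estimate of the residual variance divides the sum of squared residuals by n - 2, since two degrees of freedom go to estimating the intercept and slope.

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.9, 3.1, 4.8, 5.2, 6.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Estimate of sigma squared: sum of squared residuals over n - 2
n = len(x)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)
print(sigma2_hat)
```

With an intercept in the model, the residuals also sum to zero, which is a handy sanity check on the fit.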
What is explained variance in Y, unexplained variance in Y and the total variance
- Explained variance = the variance in Y that is explained by X
- Unexplained variance = residual
- Total variance = sum of explained and unexplained variance.
what is R-squared? What is the formula to calculate this?
- The amount of variance in Y explained by X
- Formula: explained variance / total variance = R-squared
in simple regression what is R squared the same thing as?
The square of the Pearson’s correlation coefficient
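Both cards can be verified in one sketch (assuming numpy, made-up data): decompose the total sum of squares into explained plus unexplained, form R-squared, and compare it to the squared Pearson correlation.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 4.0, 7.0, 8.0])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

total = np.sum((y - y.mean()) ** 2)          # total sum of squares
explained = np.sum((fitted - y.mean()) ** 2) # explained by X
unexplained = np.sum((y - fitted) ** 2)      # residual sum of squares

r_squared = explained / total
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)  # identical in simple regression
```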
what two things affect the standard error
> sample size (SE decreases as the sample gets larger)
> the amount of variability in X and the amount of variance in Y unexplained by X (the residual variance)
Specifically, SE DECREASES with more variability in X and INCREASES with more residual variance.
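A simulation sketch of both effects (assuming numpy; the data and scalings are made up): the SE of the slope is sqrt(residual variance / sum of squared X deviations), so spreading X out shrinks it and noisier residuals inflate it.

```python
import numpy as np

def slope_se(x, y):
    # Standard error of the least-squares slope for one predictor
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    sigma2 = np.sum(resid ** 2) / (len(x) - 2)
    return np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
noise = rng.normal(size=n)

se_base = slope_se(x, 2 * x + noise)              # baseline
se_wide = slope_se(3 * x, 2 * (3 * x) + noise)    # more spread in X -> smaller SE
se_noisy = slope_se(x, 2 * x + 5 * noise)         # more residual variance -> larger SE
print(se_base, se_wide, se_noisy)
```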
How do we use the SE to calculate the 95% confidence interval for the slope estimate (B1)?
let's say the SE is 0.001
CI = slope +/- (1.96 x 0.001)
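As a worked version of that arithmetic (the slope value here is hypothetical; the SE of 0.001 is from the card):

```python
# 95% CI for a slope: estimate +/- 1.96 * SE
slope = 0.45  # hypothetical slope estimate, made up for illustration
se = 0.001    # the SE from the card
lower = slope - 1.96 * se
upper = slope + 1.96 * se
print(lower, upper)
```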
how do we use the SE to calculate the test statistic (aka the Z or t-ratio)
- SE of sex is 0.025
- coefficient is -0.156
slope / SE = test statistic
so the Z ratio is -0.156 / 0.025 = -6.24
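The same calculation as a one-liner, using the card's values:

```python
coef, se = -0.156, 0.025  # coefficient and SE from the card
z = coef / se
print(z)  # -6.24, well beyond +/-1.96, so significant at the 5% level
```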
we have 2 groups, treatment and control. We want to test whether there is a difference in the means of these groups.
I could use ANOVA or regression. Which is better?
Regression as it allows you to control for the effects of other predictors
multiple regression with a categorical predictor: measuring hunger scores in different countries.
Germany, UK, France
UK is the reference country.
HUNGER = B0 + B1(Germany) + B2(France) + e
what exactly do the B1 and B2 slopes reflect?
- B1 = the difference in means Germany vs UK
- B2 = difference in means France vs UK
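A dummy-coding sketch (assuming numpy; the scores are made up): with UK as the reference, the intercept recovers the UK mean and each dummy coefficient recovers that country's mean difference from the UK.

```python
import numpy as np

# Hypothetical hunger scores per country (made-up values)
scores = {"UK": [3.0, 4.0, 5.0], "Germany": [6.0, 7.0, 8.0], "France": [2.0, 3.0, 4.0]}

y = np.concatenate([scores["UK"], scores["Germany"], scores["France"]])
germany = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)  # dummy: 1 if Germany
france = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=float)   # dummy: 1 if France

# Design matrix: intercept column plus the two dummies (UK is the reference)
X = np.column_stack([np.ones_like(y), germany, france])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0)  # UK mean: 4.0
print(b1)  # Germany mean - UK mean: 7.0 - 4.0 = 3.0
print(b2)  # France mean - UK mean: 3.0 - 4.0 = -1.0
```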
how can we express the null hypothesis for the difference in hunger scores for Germany vs UK? Then again for France vs UK?
- H0: B1 = 0
- H0: B2 = 0
multiple regression with a categorical predictor: measuring hedonism scores in different countries.
Germany, UK, France
hedonism = B0 + B1(Germany) + B2(France) + e
what does the regression with dummy coded variables tell us
the mean hedonism score for respondents in:
- uk
- germany
- france
multiple regression with more than one explanatory variable.
Y = b0 + b1x1 + b2x2 + e
what value do we expect each predictor to be at the level of the intercept?
0: the intercept (b0) is the expected value of Y when every predictor is set to 0.
multiple regression with more than one explanatory variable.
Y = b0 + b1x1 + b2x2 + e
interpret the coefficients B1 and B2
- B1, the coefficient for X1, is interpreted as the change in Y for a one-unit change in X1 while holding X2 constant.
- Likewise, B2 is the change in Y for a one-unit change in X2 while holding X1 constant.
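A sketch of a two-predictor fit on simulated data (assuming numpy; the "true" coefficients below are invented for illustration): the recovered estimates land close to the values used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Assumed true model for the simulation: Y = 1 + 2*X1 - 0.5*X2 + e
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# Design matrix with an intercept column; lstsq gives the least-squares fit
X = np.column_stack([np.ones(n), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b0, b1, b2)  # estimates near 1, 2, -0.5
```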
multiple regression with more than one explanatory variable.
Y = b0 + b1x1 + b2x2 + e
what does the residual mean in this equation?
The variance in Y that isn't accounted for by X1 or X2
multiple regression with more than one explanatory variable.
Y = b0 + b1x1 + b2x2 + e
what is the null and alternative hypothesis for this equation?
- H0: Bk = 0
- H1: Bk != 0