Multiple linear regression Flashcards

1
Q

disadvantage of doing regressions separately?

A

ignores potential synergy (interaction) effects between predictors, which can lead to misleading results

2
Q

RSE?

A

RSE = sqrt(RSS/(n-p-1)); the denominator n-p-1 = n-(p+1) is the residual degrees of freedom
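A minimal R sketch of this formula (using the built-in mtcars data rather than the deck's Boston set), checking the hand-computed RSE against the value summary() reports:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # p = 2 predictors
rss <- sum(residuals(fit)^2)             # residual sum of squares
n <- nrow(mtcars); p <- 2
rse <- sqrt(rss / (n - p - 1))           # RSE with n-p-1 degrees of freedom
rse
summary(fit)$sigma                       # summary() reports the same value
```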

3
Q

why does R squared increase when non-zero inputs are added?

A

RSS always decreases when another (non-zero) predictor is added: the coefficients are chosen by minimising RSS, and with an extra coefficient the minimisation can only do as well or better

4
Q

what does adjusted R squared do?

A

adds a penalisation factor to account for the number of predictors included in the model

5
Q

formula of adjusted R square?

A

1-(n-1)/(n-1-p)*RSS/TSS. Always smaller than R squared (and can be negative)
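The formula can be verified directly in R (a sketch using the built-in mtcars data, not the deck's Boston set):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # p = 2 predictors
n <- nrow(mtcars); p <- 2
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
adj_r2 <- 1 - (n - 1)/(n - 1 - p) * rss/tss  # adjusted R squared by hand
adj_r2
summary(fit)$adj.r.squared                   # same value from summary()
```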

6
Q

what is the null hypothesis H0?

A

all slope coefficients are zero simultaneously (beta1 = beta2 = ... = betap = 0, intercept excluded)

7
Q

formula of F stats?

A

F = ((TSS-RSS)/p) / (RSS/(n-p-1))
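The standard F statistic, F = ((TSS-RSS)/p) / (RSS/(n-p-1)), can be checked by hand in R (a sketch using the built-in mtcars data, not the deck's Boston set):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # p = 2 predictors
n <- nrow(mtcars); p <- 2
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
f_stat <- ((tss - rss)/p) / (rss/(n - p - 1))  # F statistic by hand
f_stat
summary(fit)$fstatistic[1]                     # same value from summary()
```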
8
Q

when no relation what is F stat?

A

1

9
Q

when to reject null hypothesis?

A

p-value<0.05

10
Q

how does forward selection work?

A
1. start with the null model (intercept but no predictors)
2. successively include the most informative variable (lowest RSS, highest R squared)
3. stop when the stopping rule is reached (e.g. no remaining variable would have p-value < 0.05 if added)
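The steps above can be sketched with step() (note step() uses AIC rather than p-values as its stopping rule, so this approximates the card's procedure; mtcars and the predictor set are stand-ins for the deck's Boston example):

```r
null_model <- lm(mpg ~ 1, data = mtcars)             # 1. null model: intercept only
fwd <- step(null_model,
            scope = ~ wt + hp + disp + qsec,         # candidate predictors
            direction = "forward", trace = 0)        # 2-3. add variables until AIC stops improving
formula(fwd)                                         # the selected model
```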
11
Q

how does backward elimination work?

A
1. start with the full model (intercept and all predictors)
2. successively remove the least informative variable (largest p-value, i.e. smallest increase in RSS when removed)
3. stop when the stopping rule is reached (all remaining variables have p-value < 0.05)
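Backward elimination can likewise be sketched with step() (again AIC-based rather than p-value-based, on the stand-in mtcars data):

```r
full_model <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)  # 1. full model
bwd <- step(full_model, direction = "backward", trace = 0)    # 2-3. drop variables until AIC stops improving
formula(bwd)                                                  # the selected model
```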
12
Q

how does cross validation work?

A
1. split dataset into training and testing set
2. train model using training set
3. validate fitted model using testing set
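A minimal validation-set sketch of these three steps (mtcars stands in for the deck's Boston data; the 50/50 split is an arbitrary choice):

```r
set.seed(1)
train <- sample(nrow(mtcars), nrow(mtcars)/2)          # 1. random training indices
fit <- lm(mpg ~ wt + hp, data = mtcars, subset = train) # 2. train on the training set
pred <- predict(fit, newdata = mtcars[-train, ])        # 3. predict on the testing set
test_mse <- mean((mtcars$mpg[-train] - pred)^2)         # validation error rate (test MSE)
test_mse
```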
13
Q

how is validation error rate assessed?

A

test mean squared error, MSE = RSS/n, computed on the testing set

14
Q

process of leave one out CV?

A

1. fit a model on the training data (obs = n-1)
2. validate the model using the testing set (the single held-out obs)
3. compute the test MSE for that round
4. repeat 1-3 n times to obtain n MSEs
5. construct the LOOCV estimate as the average of the n MSEs
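The loop above can be written out by hand (a sketch on mtcars; the deck's later cards do the same thing with cv.glm from the boot package):

```r
n <- nrow(mtcars)
mse <- rep(0, n)
for (i in 1:n) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])  # train on the other n-1 obs
  pred <- predict(fit, newdata = mtcars[i, ])    # validate on the held-out obs
  mse[i] <- (mtcars$mpg[i] - pred)^2             # test MSE for this round
}
loocv <- mean(mse)                               # LOOCV estimate = average of n MSEs
loocv
```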
15
Q

K-fold CV?

A
1. randomly split observations into k groups (folds)
2. fit a model on the k-1 training folds (obs = n-n1)
3. validate the model using the held-out fold (obs = n1)
4. compute the test MSE for that round
5. repeat 2-4 k times to obtain k MSEs
6. construct the K-fold CV estimate as the average of the k MSEs
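Written out by hand (a sketch on mtcars with k = 5, an arbitrary choice; card 30 does the same with cv.glm):

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # 1. random fold labels
mse <- rep(0, k)
for (j in 1:k) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != j, ])  # 2. fit on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == j, ])   # 3. validate on held-out fold
  mse[j] <- mean((mtcars$mpg[folds == j] - pred)^2)      # 4. test MSE for this round
}
kfold_cv <- mean(mse)                                    # 6. average of the k MSEs
kfold_cv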
16
Q

lm for multiple linear regression?

A

lm_fit=lm(y~var1+var2,data=Boston)

summary(lm_fit)

17
Q

get model with all variables?

A

lm_fit1=lm(y~. , data=Boston)

18
Q

remove one or two variables?

A

lm_fit2=lm(y~. -var1 , data=Boston)

lm_fit3=lm(y~.-var1 -var2, data=Boston)

19
Q

get null set?

A

lm_fit4=lm(y~1,data=Boston)

20
Q

get correlation for all inputs?

A

round(cor(Boston),2)

21
Q

how to visualise the pair-wise correlation matrix?

A

install.packages('corrplot')
library(corrplot)
cor_matrix=round(cor(Boston), 2)
corrplot(cor_matrix, type = 'upper', order = 'alphabet',
tl.col = 'black', tl.srt = 45, tl.cex = 0.9,
method = 'circle')

22
Q

for variables with high correlation, how to remove additive assumption?

A

include an interaction term: lm(y~var1*var2, data=Boston) fits var1, var2, and the interaction var1:var2
23
Q

scatterplot with linear assumption?

A

attach(Boston)

plot(var1,y,pch=16,col='gray50')

24
Q

for polynomial?

A

lm_nonlinear=lm(y~poly(var1,2),data=Boston)

25
Q

add line to scatterplot of non linear?

A

lines(sort(var1),fitted(lm_nonlinear)[order(var1)],lwd=2,col='deeppink3') #sort by the predictor var1, not y

26
Q

find just the coefficient of variable n intercept?

A

glm_fit=glm(y~x,data=Boston)

coef(glm_fit) #same as lm_fit

27
Q

find LOOCV estimates?

A

library(boot) #cv.glm comes from the boot package

cv_err=cv.glm(Boston,glm_fit)

cv_err$delta[1]

28
Q

finding CV of different models and lowest MSE?

A

glm_fit2=glm(y~poly(x,2), data=Boston)
cv_err=cv.glm(Boston,glm_fit2)
OR
USE FOR LOOP
cv_error=rep(0,10)
for (i in 1:10){
  glm_fit=glm(y~poly(x,i), data=Boston)
  cv_error[i]=cv.glm(Boston,glm_fit)$delta[1]
}

29
Q

plotting CV errors?

A

plot(cv_error,xlab='polynomial',main='test MSE',ylab='',type='b',pch=16)

30
Q

CV for K fold?

A

cv_error=rep(0,10)
for (i in 1:10){
  glm_fit=glm(y~poly(x,i), data=Boston)
  cv_error[i]=cv.glm(Boston,glm_fit,K=10)$delta[1]
}