# Simple linear regression Flashcards

1
Q

what is linear regression?

A

a simple approach to supervised learning. it models the relationship between one or more input variables (x) and a continuous response variable (y); simple linear regression uses a single input variable

2
Q

assumed model?

A

Y = β0 + β1X + e, where e is the random error term

3
Q

distance between observed and predicted values?

A

residual ei = Yi - predicted Yi = Yi - (β0 + β1Xi), using the estimated β0 and β1

4
Q

residual sum of squares (RSS)?

A

the sum of the squared residuals over all data points, i.e. the total magnitude of the deviations. residuals may be positive or negative, so they are squared before summing
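
A minimal sketch of this in R, assuming simulated data (the seed, sample size and coefficients are made up for illustration). It computes RSS by hand for the fitted line and for a slightly shifted line, which should fit worse.

```r
# Hypothetical simulated data: the true values 1 and 2 are arbitrary.
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)

# RSS for any candidate line b0 + b1 * x
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)
b <- unname(coef(fit))

rss_fit   <- rss(b[1], b[2])        # RSS at the least squares estimates
rss_shift <- rss(b[1] + 0.5, b[2])  # RSS after shifting the intercept by 0.5

print(c(rss_fit = rss_fit, rss_shift = rss_shift))
```

The hand-computed `rss_fit` should match `sum(resid(fit)^2)`, and any other line should give a larger RSS.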

5
Q

how to find β0 and β1 using least squares estimation?

A

take the first-order partial derivatives of RSS with respect to β0 and β1 separately, set each to 0, and solve
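
Written out, this step looks roughly as follows. This is a sketch in the notation of the earlier cards, with lowercase x_i, y_i for the observations and sums over i = 1..n.

```latex
\begin{aligned}
\mathrm{RSS} &= \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_1} &= -2 \sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \\
\Rightarrow\quad \hat\beta_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2},
\qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x
\end{aligned}
```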

6
Q

estimated β1?

A

cov(x,y)/var(x)

7
Q

estimated β0?

A

mean(y) - β1*mean(x)
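
The two formulas can be checked against `lm()`. A small sketch, assuming simulated data (the seed and the true values 2 and 3 are arbitrary choices, not from the cards):

```r
# Hypothetical simulated data
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

b1 <- cov(x, y) / var(x)       # slope estimate: cov(x,y)/var(x)
b0 <- mean(y) - b1 * mean(x)   # intercept estimate: mean(y) - b1*mean(x)

fit <- lm(y ~ x)

# The hand-computed values should agree with lm() up to rounding error
print(rbind(by_hand = c(b0, b1), lm = unname(coef(fit))))
```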

8
Q

what is the standard error (SE) an estimate of?

A

how much the estimates vary under repeated sampling (the standard deviation of the estimator's sampling distribution)

9
Q

hypothesis testing for relationship between x and y?

A

H0: β1=0 (no relationship between x and y), H1: β1≠0

10
Q

t-statistic (to test the null hypothesis)?

A

t=(β1-0)/SE(β1), with n-2 degrees of freedom

11
Q

critical value and confidence level when n is large?

A

critical value 1.96 for a 95% confidence level (as n increases, the t-distribution gets closer to the normal distribution)

12
Q

p-value definition?

A

probability, assuming H0 is true, of observing a test statistic at least as extreme as |t|
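
A sketch of how the t-statistic and the two-sided p-value could be computed by hand in R and compared with `summary()`. The data are simulated and the weak slope of 0.3 is an arbitrary assumption.

```r
# Hypothetical simulated data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 0.3 * x + rnorm(n)

fit <- lm(y ~ x)
tab <- coef(summary(fit))      # coefficient table from summary()

est <- tab["x", "Estimate"]
se  <- tab["x", "Std. Error"]

t_stat <- (est - 0) / se       # t = (b1 - 0) / SE(b1)

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(t = t_stat, p = p_val))
```

Both values should match the `t value` and `Pr(>|t|)` columns that `summary(fit)` reports for `x`.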

13
Q

how to calculate the 95% confidence interval?

A

[β1 - 1.96*SE(β1), β1 + 1.96*SE(β1)]
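
A sketch comparing the 1.96 approximation with the exact t-based interval, assuming simulated data (seed and coefficients are arbitrary). `confint()` uses the t quantile with n-2 degrees of freedom, so for moderate n its interval is slightly wider than the 1.96 version.

```r
# Hypothetical simulated data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 0.3 * x + rnorm(n)

fit <- lm(y ~ x)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

ci_normal <- est + c(-1, 1) * 1.96 * se                    # large-n approximation
ci_t      <- est + c(-1, 1) * qt(0.975, df = n - 2) * se   # exact t quantile
ci_r      <- confint(fit, "x", level = 0.95)               # R's built-in interval

print(rbind(normal = ci_normal, t = ci_t, confint = as.numeric(ci_r)))
```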

14
Q

when to reject null hypothesis?

A

when |t| is larger than 1.96, we can reject H0 at the 5% significance level (95% confidence)

15
Q

what does residual standard error (RSE) mean?

A

RSE measures lack of fit. e.g. if RSE=3.259, then on average Y deviates from the regression line by about 3.259 units

16
Q

what is R squared for?

A

measures how well the regression model describes the data. e.g. if R squared is 0.6119, X explains 61.19% of the variance in Y

17
Q

how to measure RSE?

A

sqrt(RSS/(n-2)), where n-2 is the residual degrees of freedom

18
Q

how to measure R squared?

A

1-(RSS/TSS). TSS is the total sum of squares, sum((y-mean(y))^2), i.e. the total variability in the response variable y. (R squared ranges from 0 to 1)
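
A sketch that computes RSE and R squared by hand and compares them with what `summary()` reports. The data are simulated; the seed and coefficients are arbitrary assumptions.

```r
# Hypothetical simulated data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)       # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares

rse <- sqrt(rss / (n - 2))     # residual standard error
r2  <- 1 - rss / tss           # R squared

print(c(rse = rse, r2 = r2))
```

These should match `summary(fit)$sigma` and `summary(fit)$r.squared`.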

19
Q

for the 95% CI, use β1 or β0?

A

β1. β0 is the intercept and has nothing to do with the relationship between X and Y

20
Q

how to install package MASS

A

install.packages('MASS')

21
Q

how to load package MASS?

A

library(MASS)

22
Q

how to load the Boston data set?

A

data(Boston)

23
Q

documentation for the data set?

A

?Boston

24
Q

number of missing values?

A

sum(is.na(Boston))

25
Q

number of duplicated rows?

A

sum(duplicated(Boston))

26
Q

find outliers for both variables?

A

boxplot.stats(Boston$var1)$out

boxplot.stats(Boston$var2)$out
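
A tiny illustration of what `$out` returns, using a made-up vector rather than the Boston data. With the default settings, `boxplot.stats()` flags points that lie more than 1.5 times the interquartile range beyond the hinges.

```r
# Hypothetical toy vector with one obvious outlier
x <- c(1, 2, 3, 4, 5, 100)

# Hinges are 2 and 5, so the IQR is 3 and the fences are -2.5 and 9.5
out <- boxplot.stats(x)$out

print(out)  # 100
```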

27
Q

reduce dataset to subset of the 2 variables?

A

name=subset(Boston, select=c(var1,var2))

28
Q

scatterplot? var1 being y and var2 being x

A

```r
# scatterplot of var1 (y) against var2 (x) from the subset created above
plot(var1~var2, data=name, main='Scatterplot of var1 vs var2',
     xlab='var2=name',
     ylab='var1=name',
     pch=20,
     col='gray50')
```

29
Q

simple linear regression?

A

lmfit=lm(var1~var2,data=name)

30
Q

summary of lm?

A

summary(lmfit)

31
Q

upper and lower bounds of the 95% CI?

A

confint(lmfit, level=0.95)

32
Q

Regression when X is binary

create a dummy variable that equals one if x (e.g. rm) is at or above the sample median

A

mydata$dummy=ifelse(mydata$x>=median(mydata$x),1,0)
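
A sketch of what the regression on a dummy gives, assuming simulated data (the data frame contents, seed and coefficients are made up). With a 0/1 regressor, the intercept is the mean of y in the group with dummy=0 and the slope is the difference between the two group means.

```r
# Hypothetical simulated data frame with columns x and y
set.seed(1)
mydata <- data.frame(x = rnorm(200))
mydata$y <- 1 + 2 * mydata$x + rnorm(200)

# 1 if x is at or above the sample median, 0 otherwise
mydata$dummy <- ifelse(mydata$x >= median(mydata$x), 1, 0)

fit <- lm(y ~ dummy, data = mydata)

# Mean of y within each dummy group (names are "0" and "1")
group_means <- tapply(mydata$y, mydata$dummy, mean)

print(coef(fit))
print(group_means)
```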

33
Q

plot scatterplot with fitted line?

A

lmfit1=lm(var1~dummy, data=mydata)

plot(var1~dummy, data=mydata, main='scatterplot of var1 vs dummy', xlab='dummy', ylab='var1', pch=20, col='gray50')

abline(lmfit1, lwd=2, col='deeppink3')