Lecture 6 - Statistical Tests II: Linear Regression Flashcards
(34 cards)
in what instance would we select a linear regression?
interested in association → interested in trend [where x is continuous] → “experiment” → y-continuous → y-normal → linear regression
linear regression is a statistical model which shows:
the relationship between 2 continuous variables
what questions should we be asking when we are choosing a statistical test?
(1) what type of response variable? [continuous, discrete/count, proportion, binary]
(2) what type of explanatory variable? [continuous, discrete/count, proportion, binary, categorical]
(3) interested in differences or trends/relationships?
(4) paired or independent samples?
(5) normal or non-normal distribution?
what type of variables are present when we select a chi-squared statistical test?
when we are dealing with two categorical variables (y-counts, x categorical)
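a minimal R sketch of how such a test might be run, using hypothetical factor vectors colour and site:
chisq.test(table(colour, site)) # table() builds the contingency table of counts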
what variables must be present in order for us to carry out a linear regression statistical analysis?
for a linear regression we must have both a continuous X and a continuous Y variable
gradient =
change in y / change in x
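for example, a line through the points (1, 2) and (3, 6) has gradient (6 - 2) / (3 - 1) = 4 / 2 = 2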
what are the three stages when trying to calculate a linear regression?
(1) choose your model: linear / non-linear
(2) estimate the parameters of the model
(3) model fit: how well does the model describe our data
what is the ȳ (Y-bar) line found horizontally across the span of a graph?
the ȳ line, indicated by a dotted line labelled with a Y with a bar on top of it, shows the mean value of Y in your data
how do you calculate the total sum of squares?
total sum of squares (SST) is the sum of all the squared distances between your data points and the ȳ (mean) line: SST = Σ(yᵢ − ȳ)²
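a minimal R sketch of this calculation, using a hypothetical response vector y:
y <- c(3, 5, 4, 7, 6) # hypothetical response values
sst <- sum((y - mean(y))^2) # squared distances from the mean line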
what is the equation of the fitted line and what are its terms?
ŷ = a + bx
where:
a = intercept
b = slope
what is the error sum of squares (residuals)?
error sum of squares (the residual sum of squares) = the sum of all the squared distances between each individual data point and the line of best fit (ŷ = a + bx): SSE = Σ(yᵢ − ŷᵢ)²
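a minimal R sketch of this calculation, using hypothetical vectors x and y and a model named m:
x <- c(1, 2, 3, 4, 5) # hypothetical explanatory values
y <- c(3, 5, 4, 7, 6) # hypothetical response values
m <- lm(y ~ x) # fit the line of best fit
sse <- sum(residuals(m)^2) # squared distances from the fitted line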
what must all lines of best fit pass through and what allows us to choose what line of best fit is the most appropriate one?
all lines of best fit need to go through the point (x̄, ȳ), where the mean lines of X and Y cross
we select the best line as the one where the unexplained variation in our response is the smallest - when our residuals (the error sum of squares) are the smallest
with regression, if the slope is positive or negative, what does this show about the relationship between the two variables?
if the slope is positive: the relationship between the variables is positive
if the slope is negative: the relationship between the variables is negative
what happens to the total sum of squares, SST, if we add additional data points?
the value gets larger
how to calculate mean sum of squares:
mean variability = mean sum of squares (MS) = a sum of squares divided by its degrees of freedom
mean sum of squares = sum of squared deviations from the mean / degrees of freedom
how do you construct and fill out an ANOVA table?
source     | SS  | d.o.f. | MS
regression | SSR | 1      | SSR/1 = SSR
error      | SSE | n-2    | s² = SSE/(n-2)
total      | SST | n-1    |
[for regression you need an additional column called 'F' for the test statistic, where F = SSR / s²]
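a minimal R sketch (continuing the hypothetical x, y and m from above) showing how the table's quantities relate:
sst <- sum((y - mean(y))^2) # total SS, n-1 d.o.f.
sse <- sum(residuals(m)^2) # error SS, n-2 d.o.f.
ssr <- sst - sse # regression SS, 1 d.o.f.
f <- ssr / (sse / (length(y) - 2)) # F = SSR / s²
summary.aov(m) # R's ANOVA table should match these values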
degrees of freedom regarding SST & SSE:
- SST requires estimation of 1 parameter (mean of Y) => n-1 degrees of freedom
- SSE requires estimation of 2 parameters (mean of Y, slope) => n-2 degrees of freedom
SSR + SSE =
SST
F-distribution percentile (5%) command in R:
qf(0.95,1,n-2)
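for example, with n = 20 data points:
qf(0.95, 1, 18) # returns approximately 4.41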
what if F is larger than the critical value?
this means that we reject the null hypothesis and accept the alternative hypothesis - we infer that the probability that the relationship is due to chance alone is <0.05
we are only allowed to add a trend line if:
we are only allowed to finally add our trend line if we reject the null hypothesis that the slope = 0; if the slope is not significantly different from 0, we must not add a trend line
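a minimal R sketch (continuing the hypothetical x, y and m) of adding the trend line once the slope is significant:
plot(x, y) # scatterplot of the raw data
abline(m) # draw the fitted line ŷ = a + bx over the points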
we are only allowed to carry out a linear regression if certain assumptions are fulfilled:
- residuals must be normally distributed
- the variance associated with the distribution of the residuals is constant (i.e. variation in y does not increase with increasing x)
- individual measurements are independent
- data comes from a random sample
how can we test whether our assumptions are met or violated when deciding whether we can fit a linear regression?
we can see if our assumptions are violated by using diagnostic plots in R
residuals vs fitted: in the first plot we ask whether the variance is constant [we want it to look scattered, like the sky at night]
normal Q-Q: in the second plot we check whether the residuals are normally distributed, and we want them to fall (just about) along the straight reference line
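a minimal R sketch (continuing the hypothetical model m) that calls just these two diagnostic plots:
plot(m, which = 1) # residuals vs fitted: check for constant variance
plot(m, which = 2) # normal Q-Q: check residuals for normality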
full and complete command sequence needed to run a linear regression in R:
(1) data<-read.csv("excel_sheet1.csv", header = T, stringsAsFactors = T)
(2) attach(data)
(3) names(data)
(4) m1<-lm(y_variable~x_variable)
# "m1" is simply your model name; replace y_variable and x_variable with your own column names
(5) summary.lm(m1)
(6) summary.aov(m1)
(7) plot(m1)
once you see the sky at night in the residuals vs fitted plot and a straight line in the normal Q-Q plot, you can assume the assumptions are met and therefore plot your linear regression
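a note on reading the output, assuming the model is named m1 as above: summary.lm(m1) prints a coefficients table whose Pr(>|t|) entry on the x-variable row is the p-value for testing slope = 0, and summary.aov(m1) prints the ANOVA table with the matching F statistic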