3. Linear Regression Flashcards
Sample covariance and correlation, least squares regression, alternative regression models
What is regression used for?
- many data sets have observations of several variables for each individual
- the aim of regression is to ‘predict’ the value of one variable, y, using observations from another variable, x
What is linear regression used for?
-linear regression is used for numerical data and uses a relation of the form:
y ≈ α + βx
-in a plot of y as a function of x, this relation describes a straight line
Paired Samples
- to fit a linear model we need observations of x and y
- it is important that these are paired samples, i.e. that for each iϵ{1,…,n} the observations xi and yi belong to the same individual
Examples of Paired Samples
- weight and height of a person
- engine power and fuel consumption of a car
Linear Regression
Constructing a Model
-assume we have observed data (xi,yi) for iϵ{1,…,n}
-to construct a model for these data, we use random variables Y1,…,Yn such that:
Yi = α + βxi + εi
-for all iϵ{1,…,n} where ε1,…,εn are i.i.d. random variables with E(εi)=0 and Var(εi)=σ²
-here we assume that the x-values are fixed and known
-thus the only random quantities in the model are Yi and εi
-the values α, β and σ² are parameters of the model; to fit the model to data we need to estimate these parameters
Linear Regression
Residuals/Errors
-starting with the model:
Yi = α + βxi + εi
-the random variables εi are called residuals or errors
-in a scatter plot they correspond to the vertical distance between the samples and the regression line
-often we assume that εi~N(0,σ²) for all i
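-as an illustration, here is a minimal R sketch (with made-up parameter values) that simulates data from this model:
set.seed(1)                            # for reproducibility
n     <- 50
alpha <- 1; beta <- 2; sigma <- 0.5    # assumed (made-up) parameter values
x   <- runif(n, 0, 10)                 # fixed, known x-values
eps <- rnorm(n, mean = 0, sd = sigma)  # residuals eps_i ~ N(0, sigma^2)
y   <- alpha + beta * x + eps          # Y_i = alpha + beta*x_i + eps_i
plot(x, y)                             # points scatter around a straight line
abline(alpha, beta)                    # the true line alpha + beta*x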
Linear Regression
Expectation of Yi
-we have the linear regression model:
Yi = α + βxi + εi
-then the expectation is given by:
E(Yi) = E(α + βxi + εi)
-the expectation of a constant is just the constant itself, and remember that xi represents a known value here:
E(Yi) = α + βxi + E(εi)
-recall that εi are modeled as random variables with E(εi)=0:
E(Yi) = α + βxi
-thus the expectation of Yi depends on xi and, at least for β≠0, the random variables Yi are not identically distributed
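-as an illustrative check in R (with made-up values), averaging many simulated copies of Yi for one fixed xi approximates α + βxi:
set.seed(1)
alpha <- 1; beta <- 2; sigma <- 0.5   # assumed (made-up) parameter values
xi <- 3                               # one fixed x-value
yi <- alpha + beta * xi + rnorm(1e5, mean = 0, sd = sigma)  # many copies of Y_i
mean(yi)           # approximately 7
alpha + beta * xi  # the exact expectation E(Y_i) = alpha + beta*x_i = 7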
What are sample covariance and correlation used for?
-to study the dependence between paired numeric variables
Sample Covariance
Definition
-the sample covariance of x1,…,xnϵℝ and y1,…,ynϵℝ is given by:
σxy = 1/(n-1) Σ(xi-x^)(yi-y^)
-where the sum is taken from i=1 to i=n, and x^ and y^ are the sample means
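-for example, computing this by hand in R on made-up data and comparing with the built-in cov():
x <- c(1, 2, 4, 7); y <- c(2, 3, 9, 12)                      # made-up paired samples
sxy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # formula above
sxy        # sample covariance computed by hand
cov(x, y)  # built-in function gives the same value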
Sample Correlation
Definition
-the sample correlation of x1,…,xnϵℝ and y1,…,ynϵℝ is given by:
ρxy = σxy / √(σx²σy²)
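-again, a small R check on made-up data against the built-in cor():
x <- c(1, 2, 4, 7); y <- c(2, 3, 9, 12)  # made-up paired samples
cov(x, y) / sqrt(var(x) * var(y))        # sample correlation from the formula above
cor(x, y)                                # built-in function gives the same value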
What is the sample covariance of a sample with itself?
- we can show that the sample covariance of a sample with itself equals the sample variance
- i.e. σxx = σx²
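-a one-line check in R (made-up data):
x <- c(1, 2, 4, 7)
cov(x, x)  # sample covariance of x with itself
var(x)     # sample variance; identical value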
What values can correlation take?
-the correlation of two samples is always in the interval [-1,1]
Interpreting Correlation
ρxy≈1
- strong positive correlation, ρxy≈1 indicates that the points (xi,yi) lie close to a straight line with positive slope
- in this case y is almost completely determined by x
Interpreting Correlation
ρxy≈-1
- strong negative correlation, ρxy≈-1 indicates that the points (xi,yi) lie close to a straight line with negative slope
- in this case y is almost completely determined by x
Interpreting Correlation
ρxy≈0
- this means that there is no linear relationship between x and y which helps to predict y from x
- this could be because x and y are independent or because the relationship between x and y is non-linear
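-an illustrative R sketch: a relationship can be exact but non-linear, giving correlation near 0:
x <- seq(-3, 3, by = 0.1)  # values symmetric around 0
y <- x^2                   # y is fully determined by x, but not linearly
cor(x, y)                  # essentially 0 despite the exact relationship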
How can the sample covariance be used to estimate the covariance of random variables?
-if (X1,Y1),…,(Xn,Yn) are i.i.d. pairs of random variables, then we can show:
lim σxy(X1,…,Xn,Y1,…,Yn) = Cov(X1,Y1)
-where the limit is taken as n tends to infinity
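-a small simulation sketch in R (made-up distribution) illustrating this convergence:
set.seed(1)
n <- 1e5
x <- rnorm(n)          # X_i ~ N(0, 1)
y <- 2 * x + rnorm(n)  # so Cov(X_i, Y_i) = 2 by construction
cov(x, y)              # close to 2 for large n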
Correlation and Covariance in R
-the functions to compute sample covariances and correlations in R are cov() and cor()
Correlation and Covariance in R
Handling Missing Data
- both functions, cov() and cor(), have an optional argument use=… which controls how missing data are handled
- if use="everything" or use is not specified, the functions return NA if any input data are missing
- if use="all.obs", the functions abort with an error if any input data are missing
- if use="complete.obs", any pairs (xi,yi) where either xi or yi is missing are ignored and the covariance/correlation is computed from the remaining samples
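-for example (made-up vectors):
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 8)
cov(x, y)                        # NA: use defaults to "everything"
cov(x, y, use = "complete.obs")  # drops the pair (NA, 6), uses the rest
# cov(x, y, use = "all.obs")     # would abort with an error here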
What is least squares regression?
- least squares is a method for determining the parameter values α, β and σ²
- methods for doing this differ mainly in how they treat outliers in the data
Least Squares Regression
Minimising the Residual Sum of Squares - Formula
-we estimate the parameters α, β and σ² using the values which minimise the residual sum of squares:
r(α,β) = Σ (yi - (α + βxi))²
-where the sum is taken from i=1 to i=n
-for given α and β, the value r(α,β) measures how close the given data points (xi,yi) lie to the regression line α+βx
-by minimising r(α, β) we find the regression line which is closest to the data
Least Squares Regression
Minimising the Residual Sum of Squares - Lemma
-assume σx²>0
-then the function r(α,β) from above takes its minimum at the point (α,β) given by:
β = σxy/σx²
α = y^ - βx^
-where x^, y^ are the sample means, σxy is the sample covariance and σx² is the sample variance
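-a short R sketch (made-up data) checking the lemma against R's built-in lm():
x <- c(1, 2, 4, 7); y <- c(2, 3, 9, 12)  # made-up paired samples
beta  <- cov(x, y) / var(x)              # beta = sigma_xy / sigma_x^2
alpha <- mean(y) - beta * mean(x)        # alpha = y-bar - beta * x-bar
c(alpha, beta)
coef(lm(y ~ x))                          # R's least squares fit gives the same values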
Least Squares Regression
Minimising the Residual Sum of Squares - Lemma Proof
-obtain a simplified expression for r(α,β) using the centred values:
xi~ = xi - x^
yi~ = yi - y^
-for fixed β, the terms involving α are minimised by taking α = y^ - βx^
-differentiate the remaining expression with respect to β and set it equal to 0 to find the stationary point
-the second derivative is positive, showing that this value of β gives the minimum of r(α,β)
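-as a numerical sanity check (made-up data), minimising r(α,β) directly with optim() recovers the same point:
x <- c(1, 2, 4, 7); y <- c(2, 3, 9, 12)          # made-up paired samples
r <- function(p) sum((y - (p[1] + p[2] * x))^2)  # r(alpha, beta)
optim(c(0, 0), r)$par                            # numerical minimiser (approximate)
beta <- cov(x, y) / var(x)
c(mean(y) - beta * mean(x), beta)                # closed form from the lemma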
Least Squares Regression
Fitted Regression Line
-now that we have used the method of least squares to determine α^ and β^, the values which minimise r(α,β)
-we can consider the fitted regression line:
y = α^ + β^x
-this is an approximation to the unknown true mean α+βx from the model
Least Squares Regression
Fitted Values
-now that we have used the method of least squares to determine α^ and β^, the values which minimise r(α,β)
-we can consider the fitted values:
yi^ = α^ + β^xi
-these are the y-values of the fitted regression line at the points xi
-if we consider εi as being the ‘noise’ or ‘errors’, then we can consider the values yi^ to be the versions of yi with the noise removed
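-for example, in R the fitted values are returned by fitted() (made-up data):
x <- c(1, 2, 4, 7); y <- c(2, 3, 9, 12)  # made-up paired samples
fit <- lm(y ~ x)                         # least squares fit
fitted(fit)                              # fitted values yi^ at each xi
coef(fit)[1] + coef(fit)[2] * x          # the same values computed by hand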