normal linear models Flashcards
(18 cards)
the line of best fit
The line of best fit is the line that lies as close as possible to all of the points.
For any line we can calculate the sum of squared residuals, which measures how close the points are to that line.
Basically, the line of best fit is the line that minimises the sum of the squared residuals.
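A minimal sketch of the idea above, using made-up data: the helper `sum_squared_residuals` and the two candidate lines are hypothetical names and values chosen for illustration. It shows only that a line closer to the points gets a smaller sum of squared residuals.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Two candidate lines: one close to the points, one far from them.
ssr_close = sum_squared_residuals(xs, ys, intercept=0.0, slope=2.0)
ssr_far = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.0)

print(ssr_close, ssr_far)  # the closer line has the smaller value
```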
equation of a line
The equation of a line between 2 variables X and Y is
Y = β0 + β1 X,
(Y = intercept term + slope term multiplied by X, where β1 is the slope and β0 is the intercept.)
what does the slope term tell us
The slope term tells us how much the value of the Y variable changes as X increases by 1 unit.
what does the intercept tell us
The intercept tells us where the line crosses the Y axis, i.e., the value of Y when X is zero.
residuals
Residuals (represented by ϵi) are the vertical distances between each point and the line.
what is the distance between yi and β0 + β1 xi denoted by
ϵi.
These are known as residuals.
so how do we find the line of best fit?
Overall, we find the line of best fit by minimizing the sum of squared residuals, ∑_{i=1}^{n} ϵ_i^2.
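For a single predictor, this minimization has a well-known closed-form solution: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below assumes that standard formula; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x, so the fit should recover it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

With real data the points will not lie exactly on a line, and the same function returns the line with the smallest sum of squared residuals.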
Simple linear regression as a statistical model
We have some outcome variable (also known as the dependent variable, measurement variable, etc) and a single predictor variable (also known as the independent variable, explanatory variable, etc).
Rather than just saying that simple linear regression is finding a line that best fits a sample of points from these two variables, we say that it is a statistical model describing the general relationship between the predictor and the outcome variable and we are fitting that model to our data.
normal linear model
• For every value of the predictor variable, the outcome variable is normally distributed.
• As the value of the predictor variable changes, the mean of the distribution of the outcome variable changes linearly (linear = changes by a proportional amount). In other words,
mean of outcome = linear function of predictor,
mean of outcome = β0 + β1 × predictor (intercept term + slope term × predictor).
what is a normal distribution
A probability distribution over a continuous variable.
The normal, or Gaussian, distribution is a probability distribution over a continuous random variable. It has two parameters: the mean, usually denoted by μ, and the variance, usually denoted by σ^2. We denote a normally distributed random variable X with mean μ and variance σ^2 by X ∼ N(μ, σ^2).
what does μ mean
The location parameter, pronounced "mu": the mean of the distribution, which for a normal distribution is also its median and mode.
what does σ mean
The scale parameter, pronounced "sigma": the standard deviation.
Sigma tells us the width of the normal distribution, so the larger the value of sigma, the wider that normal distribution is.
The area under the normal distribution over any range of values can be worked out using formulas.
o Around 2/3 of the area under the normal distribution is within 1 standard deviation above/below the mean.
o Around 95% of the area under the normal distribution is within 2 standard deviations above/below the mean.
o Around 99% of the area under the normal distribution is within 2.5 standard deviations above/below the mean.
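These areas can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper `area_within` is a hypothetical name introduced here. Because the areas depend only on the number of standard deviations, a standard normal (mean 0, standard deviation 1) is enough.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 2.5):
    print(k, round(area_within(k), 3))
```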
What are linear functions?
• If Y is a linear function of X, then whenever X changes by a certain amount, Y changes by a constant proportion of that amount.
• For one dependent and one independent variable, the linear equation is
Y = β0 + β1 X
• For example, if β0 = 1 and β1 = 2, then if X = 10,
Y = 1 + 2 × 10 = 21
• If we increase X by 1 to X = 11, we have
Y = 1 + 2 × 11 = 23, so Y has increased by β1 = 2.
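The same arithmetic as a small sketch. The function name `linear` is hypothetical; the default coefficients are the β0 = 1 and β1 = 2 from the example.

```python
def linear(x, b0=1.0, b1=2.0):
    """Linear function Y = b0 + b1 * X."""
    return b0 + b1 * x

y_at_10 = linear(10)  # 21.0
y_at_11 = linear(11)  # 23.0

# Increasing X by 1 changes Y by b1; increasing X by 3 changes Y by 3 * b1.
change_for_1 = linear(11) - linear(10)
change_for_3 = linear(13) - linear(10)
print(y_at_10, y_at_11, change_for_1, change_for_3)
```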
Linear functions with multiple independent variables?
If Y is a dependent variable and we have more than one independent variable, e.g., two independent variables X1 and X2, and Y is a linear function of the independent variables, then as any one of the independent variables changes by a certain amount (with the others held fixed), Y changes by a constant proportion of that amount.
e.g. if we change X1 by a certain amount, then Y changes by a constant proportion of that amount.
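A small sketch with two independent variables, assuming the usual form Y = β0 + β1 X1 + β2 X2. The function name `linear2` and the coefficient values are hypothetical, chosen only for illustration.

```python
def linear2(x1, x2, b0=1.0, b1=2.0, b2=-3.0):
    """Linear function of two independent variables: Y = b0 + b1*X1 + b2*X2."""
    return b0 + b1 * x1 + b2 * x2

# Increase X1 by 1 while holding X2 fixed, at two different values of X2.
change_when_x2_is_0 = linear2(11, 0) - linear2(10, 0)
change_when_x2_is_5 = linear2(11, 5) - linear2(10, 5)

# Both changes equal b1, whatever value X2 is held at.
print(change_when_x2_is_0, change_when_x2_is_5)
```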
Simple linear regression: Model
- We have n observations, each indexed by i ∈ {1, 2, …, n} (i identifies the observation: observation 1, observation 2, and so on).
- The outcome variable for observation i is yi.
- The predictor variable for observation i is xi.
- Then the formula for the normal linear model is as follows: for all i ∈ {1, 2, …, n},
Yi ∼ N(μi, σ^2), (the outcome variable Yi is normally distributed (N) with mean μi and variance σ^2, i.e. the standard deviation σ squared)
μi = β0 + β1 xi. (that mean is a linear function of the predictor variable)
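One way to get a feel for this model is to simulate from it. The sketch below draws outcomes for many observations that share the same predictor value, then checks that their average is close to β0 + β1 x and their spread is close to σ. The function name `simulate_outcomes` and the parameter values are hypothetical.

```python
import random
from statistics import mean, stdev

def simulate_outcomes(xs, b0, b1, sigma, seed=0):
    """Draw one outcome per x from N(b0 + b1 * x, sigma^2)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [rng.gauss(b0 + b1 * x, sigma) for x in xs]

# Hypothetical parameter values, chosen only for illustration.
b0, b1, sigma = 1.0, 2.0, 3.0
xs = [10.0] * 10_000  # many observations that all share x = 10
ys = simulate_outcomes(xs, b0, b1, sigma)

print(mean(ys))   # should be close to b0 + b1 * 10 = 21
print(stdev(ys))  # should be close to sigma = 3
```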
terciles
Let’s look at the distribution of weight for each tercile of height. The height terciles are the grouping of the height variable into three groups. The first group is from the minimum height to the height at the 33rd percentile. The second group is from the 33rd to the 67th percentile. The third group is from the 67th percentile to the maximum height. Basically, we can see the terciles as the groupings of those of short, medium, and tall heights.
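A minimal sketch of splitting a variable into terciles, using `quantiles` from Python's standard `statistics` module to find the two cut points. The heights are made-up values in centimetres and the helper `tercile` is a hypothetical name.

```python
from statistics import quantiles

# Hypothetical heights in cm, made up for illustration.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190]

# n=3 returns the two cut points that split the data into three groups.
lower_cut, upper_cut = quantiles(heights, n=3)

def tercile(height):
    """Label a height as short, medium, or tall relative to the cut points."""
    if height <= lower_cut:
        return "short"
    if height <= upper_cut:
        return "medium"
    return "tall"

groups = [tercile(h) for h in heights]
print(list(zip(heights, groups)))
```

The distribution of weight could then be examined separately within each of the three groups.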