06.a Linear Regression Flashcards

1
Q

In linear regression, what is the output variable called and what are the input variables called?

A

Output variable = dependent variable

Input variable = independent variable

2
Q

What is linear regression?

A

Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. A key assumption is that the relationship between an input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to properly transform the input or outcome variables to achieve a linear relationship between the modified input and outcome variables.

3
Q

What is the theory behind linear regression?

A

Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Regression analysis is a useful explanatory tool that can identify the input variables that have the greatest statistical influence on the outcome.

4
Q

Name four common use cases for linear regression.

A

Real estate
Demand forecasting
Medical correlation
Engineering

5
Q

What is the relationship expression for linear regression?

A

Y = β0 + β1X1 + β2X2 +…+ βnXn + ε
Y = dependent variable (continuous outcome variable)
Xj = independent variables (input variables, j = 1, 2, …, n)
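β0 = intercept (the expected value of Y when each Xj equals 0)
β1, …, βn = coefficients (βj is the change in Y for a one-unit change in Xj, with the other inputs held fixed)
ε = random error term (the part of Y that the linear model does not explain)
We know Y and Xj from a historical dataset and have to estimate the regression coefficients (β0, β1, …, βn).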
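
As a rough illustration of the expression above, the sketch below simulates data from known coefficients and lets lm() estimate them back. The coefficient values (3, 2, -1.5), the noise level and the variable names are made-up assumptions, not taken from any real dataset.

```r
# Simulate Y = b0 + b1*X1 + b2*X2 + e with assumed "true" coefficients.
set.seed(42)
n   <- 200
x1  <- runif(n, 0, 10)
x2  <- runif(n, 0, 5)
eps <- rnorm(n, mean = 0, sd = 1)      # the random error term
y   <- 3 + 2 * x1 - 1.5 * x2 + eps

# Estimate the coefficients from the simulated "historical" data.
fit <- lm(y ~ x1 + x2)
estimates <- coef(fit)
print(estimates)
```

The estimates should land close to the assumed values, but not exactly on them, because of the error term.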

6
Q

What does OLS mean and what does it do?

A

Ordinary Least Squares. The goal is to find the line that best approximates the relationship between the outcome variable and the input variables. With OLS, the objective is to find the line through the data points that minimises the sum of the squared vertical differences between each point and the line.
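
A minimal sketch of the idea, assuming a tiny made-up dataset: the coefficients are computed directly from the normal equations and compared with lm(). The helper name sse is hypothetical and only used for this illustration.

```r
# Five made-up points that lie roughly on a line.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Design matrix: a column of 1s for the intercept plus the input variable.
X <- cbind(1, x)

# Solve the normal equations (X'X) beta = X'y for the OLS coefficients.
beta <- solve(t(X) %*% X, t(X) %*% y)

# Sum of squared vertical differences between the points and a candidate line.
sse <- function(b) sum((y - X %*% b)^2)

# lm() should arrive at the same coefficients.
lm_coef <- coef(lm(y ~ x))
print(cbind(normal_equations = as.numeric(beta), lm = as.numeric(lm_coef)))
```

Nudging either coefficient away from beta increases sse(), which is what "least squares" refers to.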

7
Q

Name three graph types that are used in linear regression.

A

Scatter plot - visualise the linear relationship
Box plot - to help spot outliers
Density plot - to see the distribution of the predictor variable. Ideally it is close to a normal distribution (a bell-shaped curve) without being skewed to the left or right.
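
The three plots could be drawn in base R roughly as follows. This sketch assumes the built-in cars dataset (speed and stopping distance) purely as example data.

```r
# Scatter plot: is the relationship between speed and distance roughly linear?
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Scatter plot")

# Box plot: points drawn beyond the whiskers are candidate outliers.
boxplot(cars$dist, main = "Box plot of stopping distance")

# Density plot: check whether the predictor looks roughly bell-shaped.
speed_density <- density(cars$speed)
plot(speed_density, main = "Density plot of speed")
```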

8
Q

What is the correlation coefficient in linear regression?

A

The correlation coefficient (r) quantifies both the strength and direction of the linear relationship between two measurement variables on a scatterplot. 1 = perfectly uphill, 0 = no linear correlation, -1 = perfectly downhill.
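
A short sketch of how r behaves, assuming made-up perfectly linear data for the two extremes and the built-in cars dataset for a realistic case. The manual formula (covariance divided by the product of the standard deviations) is shown only for comparison with cor().

```r
x <- 1:10

# Perfectly uphill and perfectly downhill straight lines.
r_uphill   <- cor(x, 2 * x + 1)
r_downhill <- cor(x, -3 * x + 5)

# A real-world pair of variables falls somewhere in between.
r_cars <- cor(cars$speed, cars$dist)

# The same value from the definition of Pearson's r.
r_manual <- cov(cars$speed, cars$dist) / (sd(cars$speed) * sd(cars$dist))

print(c(uphill = r_uphill, downhill = r_downhill, cars = r_cars))
```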

9
Q

What does Anscombe’s quartet show?

A

The data sets in Anscombe’s quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but are graphically very different. This illustrates the pitfalls of relying solely on a fitted model, without plotting the data, to understand the relationship between variables.
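
R ships the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4), so the claim can be checked with a sketch like the one below. The name quartet_stats is just a label chosen for this example.

```r
# Fit the same model to each of the four data sets and collect the key numbers.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r         = cor(x, y))
})
print(round(quartet_stats, 3))

# The numbers agree closely, yet the four scatter plots look very different.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```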

10
Q

Where is a good place to start in R when assessing whether variables are correlated?

A

Use a scatterplot matrix (for example pairs() in base R), which shows plots of all pairwise combinations of variables, and look for correlations.
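
One possible starting point, assuming the built-in mtcars dataset and an arbitrary choice of four columns as the example variables:

```r
# Pick a few numeric variables to compare.
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Scatterplot matrix: one panel for every pair of variables.
pairs(vars, main = "Scatterplot matrix")

# The matching table of correlation coefficients.
cor_matrix <- cor(vars)
print(round(cor_matrix, 2))
```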

11
Q

What is R squared?

A

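R squared is the coefficient of determination: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

For an OLS model with an intercept it ranges from 0 to 1, where 1 is a perfect fit and 0 means the model explains none of the variance.

It compares the variance left unexplained by the model with the total variance of the dependent variable: R squared = 1 - (residual sum of squares / total sum of squares).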
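
The calculation can be sketched by hand and checked against the value reported by summary(). The built-in cars dataset is assumed here only as example data.

```r
fit <- lm(dist ~ speed, data = cars)

# Variation left over after fitting the model.
ss_res <- sum(resid(fit)^2)

# Total variation of the dependent variable around its mean.
ss_tot <- sum((cars$dist - mean(cars$dist))^2)

r_squared <- 1 - ss_res / ss_tot
print(c(by_hand = r_squared, from_summary = summary(fit)$r.squared))
```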

12
Q

What is the syntax in R for linear regression?

A

results = lm(y ~ x1 + x2, data = set1)
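
A runnable variant of the same call, assuming the built-in mtcars dataset in place of set1, with mpg as the outcome and wt and hp as the inputs. The new_car values are invented for the prediction step.

```r
# Fit fuel economy as a linear function of weight and horsepower.
results <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients, standard errors, R squared and more.
print(summary(results))

# Predict the outcome for a new, hypothetical observation.
new_car <- data.frame(wt = 3.0, hp = 110)
predicted_mpg <- predict(results, newdata = new_car)
print(predicted_mpg)
```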

13
Q

What is a quick function in R to draw a straight line onto a scatter plot?

A

abline(), for example abline(lm(y ~ x)) adds the fitted regression line to the current plot
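
A small usage sketch, assuming the built-in cars dataset as example data:

```r
# Draw the scatter plot first; abline() adds to the current plot.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Passing a fitted lm object draws its regression line.
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red", lwd = 2)

# abline() can also draw horizontal (h =) or vertical (v =) reference lines.
abline(h = mean(cars$dist), lty = 2)
```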

14
Q

What is heteroscedasticity?

A

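Heteroscedasticity means the spread (variance) of the data around the line of best fit is not constant. A common pattern is the spread increasing as the data point values increase. The opposite, a constant spread, is homoscedasticity.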
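
The effect can be sketched with simulated data in which the noise grows with x. All numbers below (coefficients, noise levels, the split at x = 50) are assumptions chosen for illustration.

```r
set.seed(7)
x <- seq(1, 100, length.out = 400)

# Heteroscedastic: the noise standard deviation grows with x.
y_hetero <- 2 + 0.5 * x + rnorm(400, sd = 0.1 * x)

# Residuals from a straight-line fit.
res_hetero <- resid(lm(y_hetero ~ x))

# Compare the residual spread in the lower and upper halves of x.
spread_low  <- sd(res_hetero[x <= 50])
spread_high <- sd(res_hetero[x > 50])
print(c(low = spread_low, high = spread_high))

# A residuals-versus-x plot shows the typical funnel shape.
plot(x, res_hetero, ylab = "Residual", main = "Residual spread grows with x")
abline(h = 0, lty = 2)
```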

15
Q

What is multicollinearity?

A

Multicollinearity means that two or more of the input variables are strongly linearly related to each other. This makes the individual coefficient estimates unstable and complicates the interpretation of a linear regression model.
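
A hedged sketch of the effect, using simulated inputs where x2 is nearly a copy of x1. The variance inflation factor is computed by hand as 1 / (1 - R squared) from regressing one input on the other; all numbers are assumptions for illustration.

```r
set.seed(99)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # almost a copy of x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

# The two inputs are very strongly correlated.
r12 <- cor(x1, x2)

# Variance inflation factor for x1, computed by hand.
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# The standard error of the x1 coefficient grows when x2 is also in the model.
se_both  <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_alone <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
print(c(cor = r12, vif = vif_x1, se_both = se_both, se_alone = se_alone))
```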

16
Q

What is a Q-Q plot and what is it used for?

A

A Q–Q (quantile-quantile) plot is a probability plot: a graphical method for comparing two probability distributions by plotting their quantiles against each other. In regression it is good at showing whether the residuals differ from a normal distribution.
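
In base R, a normal Q-Q plot of regression residuals could be drawn roughly like this, assuming the built-in cars dataset as example data:

```r
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Sample quantiles of the residuals against theoretical normal quantiles.
qq <- qqnorm(res, main = "Normal Q-Q plot of residuals")

# Reference line: points far from it indicate departure from normality.
qqline(res, col = "red")
```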

17
Q

What is N-Fold Cross-Validation?

A

N-Fold Cross-Validation is a technique employed when you don’t have sufficient data for a standard supervised training and test approach. It involves splitting the data set into N subsets (folds). N-1 folds are used as the training data set and the remaining fold is used as the test set. This is repeated so that every fold serves as the test set once, and the errors are averaged across the N folds.
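
The procedure can be sketched in base R without extra packages. This example assumes the built-in cars dataset, N = 5 folds, and mean squared error as the error measure; those choices are illustrative only.

```r
set.seed(123)
n_folds <- 5

# Randomly assign each row to one of the N folds (equal sizes here).
fold_id <- sample(rep(1:n_folds, length.out = nrow(cars)))

# For each fold: train on the other N-1 folds, test on the held-out fold.
fold_mse <- sapply(1:n_folds, function(k) {
  train <- cars[fold_id != k, ]
  test  <- cars[fold_id == k, ]
  fit   <- lm(dist ~ speed, data = train)
  pred  <- predict(fit, newdata = test)
  mean((test$dist - pred)^2)
})

# Average the errors across the N folds.
cv_mse <- mean(fold_mse)
print(fold_mse)
print(cv_mse)
```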

18
Q

What are the residuals in linear regression?

A

The residuals are the differences between the actual dependent variable data points and the modelled dependent variable data points.
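
As a quick check of the definition, the sketch below computes the residuals by hand (actual minus modelled values) and compares them with resid(). The built-in cars dataset is assumed as example data.

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = actual value minus the value modelled by the fit.
manual_res  <- cars$dist - fitted(fit)
builtin_res <- resid(fit)

print(head(cbind(actual = cars$dist, modelled = fitted(fit), residual = manual_res)))

# With an intercept in the model, OLS residuals sum to (numerically) zero.
print(sum(builtin_res))
```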

19
Q

How do you evaluate the residuals in linear regression?

A

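Using the OLS method: the model is fitted by minimising the sum of squared residuals, so that sum (and R squared, which is derived from it) measures how closely the model follows the data. Note that an R squared result of 1 means the residuals are all zero, which may suggest over-fitting.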

20
Q

How do you calculate variance?

A

Calculate the mean
Determine the difference from the mean for each term
Square those differences
Sum the results
Divide by n-1 (this gives the sample variance; divide by n instead for the population variance)
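
The steps above can be followed literally in R and compared with the built-in var(), which also divides by n-1. The numbers in x are made up for the example.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Step 1: calculate the mean.
x_mean <- mean(x)

# Step 2: difference from the mean for each term.
diffs <- x - x_mean

# Step 3: square those differences.
sq_diffs <- diffs^2

# Step 4: sum the results.
total <- sum(sq_diffs)

# Step 5: divide by n-1 to get the sample variance.
manual_var <- total / (length(x) - 1)

print(c(by_hand = manual_var, builtin = var(x)))
```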