06.a Linear Regression Flashcards

1
Q

In linear regression what is the output variable called and what are the input variables called

A

Output variable = dependent variable

Input variable = independent variable

2
Q

What is linear regression

A

Linear regression is an analytical technique used to model the relationship between several input variables and a continuous outcome variable. A key assumption is that the relationship between an input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to properly transform the input or outcome variables to achieve a linear relationship between the modified input and outcome variables.

3
Q

What is the theory behind linear regression

A

Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Regression analysis is a useful explanatory tool that can identify the input variables that have the greatest statistical influence on the outcome.

4
Q

Name four common use cases for Linear Regression

A

Real estate
Demand forecasting
Medical correlation
Engineering

5
Q

What is the relationship expression for Linear Regression

A

Y = β0 + β1X1 + β2X2 +…+ βnXn + ε
Y = dependent variable (continuous outcome variable)
Xj = independent variables (input variables, j = 1, 2, …, n)
β0 = intercept (the value of Y when each Xj equals 0)
β1..n = coefficients
ε = error between data and the model
We know Y and Xj (historical dataset) and we have to find the regression coefficients (β0, β1..n)
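
As a rough sketch of how the expression above turns inputs into a predicted outcome, the snippet below evaluates β0 + β1X1 + … + βnXn for one observation. The function name `predict` and the coefficient values are assumptions made up for illustration; the error term ε is left out because it is unknown at prediction time.

```python
def predict(intercept, coefficients, inputs):
    """Return b0 + b1*x1 + ... + bn*xn for one observation (error term omitted)."""
    if len(coefficients) != len(inputs):
        raise ValueError("need exactly one coefficient per input variable")
    return intercept + sum(b * x for b, x in zip(coefficients, inputs))


# Hypothetical coefficients and inputs, chosen only for illustration
y_hat = predict(2.0, [0.5, -1.0], [4.0, 3.0])
print(y_hat)  # 2.0 + 0.5*4.0 - 1.0*3.0 = 1.0
```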

6
Q

What does OLS mean and what does it do

A

Ordinary Least Squares. The goal is to find the line that best approximates the relationship between the outcome variable and the input variables. With OLS, the objective is to find the line through these points that minimises the sum of the squares of the difference between each point and the line in the vertical direction.
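
A minimal sketch of OLS for the single-input case, assuming the standard closed-form solution (slope = covariance of x and y divided by variance of x). The function name `fit_simple_ols` and the sample data are illustrative, not taken from the original cards.

```python
def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by minimising the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data generated from y = 1 + 2x, so the fit recovers those values
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(xs, ys)
```

With more than one input variable the same idea applies, but the coefficients are usually found with matrix methods rather than this two-line formula.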

7
Q

Name three graph types which are used in Linear Regression

A

Scatter plot - visualise the linear relationship
Box plot - to help spot outliers
Density plot - to see the distribution of the predictor variable. A close-to-normal distribution (a bell-shaped curve) that is not skewed to the left or right is preferred.

8
Q

What is the correlation coefficient in Linear Regression

A

The correlation coefficient (r) quantifies both the strength and direction of the linear relationship between two measurement variables on a scatterplot. 1 = perfectly uphill, 0 = no linear correlation, -1 = perfectly downhill
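
One way to compute r by hand, assuming the usual Pearson formula (sum of cross-deviations divided by the square root of the product of the squared deviations). The function name `pearson_r` and the data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


uphill = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # perfectly uphill
downhill = pearson_r([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])  # perfectly downhill
```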

9
Q

What does Anscombe’s quartet show

A

The four data sets in Anscombe’s quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but look very different when plotted. This illustrates the pitfalls of relying solely on a fitted model, rather than also plotting the data, to understand the relationship between variables.

10
Q

Where is a good place to start in R when assessing if variables have correlation

A

Draw a scatterplot matrix (for example with pairs()) to show plots of every pairwise combination of variables, and look for correlations

11
Q

What is R squared

A

R squared is the coefficient of determination and is the proportion of the variance in the dependent variable that is predictable from the independent variable(s)

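It ranges from 0 to 1, where 1 is a perfect fit and 0 means the model explains none of the variance.

It compares the variance left unexplained by the model (the residuals) with the total variance of the dependent variable: R squared = 1 - (sum of squared residuals / total sum of squares)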
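
A small sketch of the R squared calculation, assuming the common definition 1 - SS_res / SS_tot. The function name `r_squared` and the numbers are illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(actual, actual)            # predictions match the data exactly
mean_only = r_squared(actual, [6.0] * 4)       # always predicting the mean (6.0)
```

Predicting every point exactly gives 1, while always predicting the mean gives 0, which matches the two ends of the range.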

12
Q

What is the syntax in R for linear regression

A

results = lm(y ~ x1 + x2, data = set1)

13
Q

What is a quick function in R to draw a straight line onto a scatter plot

A

abline(), for example abline(lm(y ~ x)) adds the fitted regression line to an existing scatter plot

14
Q

What is Heteroscedasticity

A

Heteroscedasticity means the spread (variance) of the data around the line of best fit is not constant. A common pattern is the spread widening as the data point values increase. The opposite is homoscedasticity (constant spread).

15
Q

What is Multicollinearity

A

Multicollinearity means there are strong linear relationships between the input (independent) variables. This makes the individual coefficients unstable and harder to interpret in a linear regression model.

16
Q

What is a Q-Q plot and what is it used for

A

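A Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. It is good at showing where data (for example, regression residuals) depart from a normal distribution: points that follow a normal distribution fall close to a straight line.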
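
A sketch of how the points of a normal Q-Q plot could be computed without any plotting library. The plotting positions (i + 0.5) / n are one common convention and are an assumption here; the function name `normal_qq_points` is also illustrative.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair each sorted sample value with the matching standard normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n-1
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


points = normal_qq_points([3.0, 1.0, 2.0])
```

Plotting the first element of each pair on the x axis and the second on the y axis gives the Q-Q plot.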

17
Q

What is N-Fold Cross-Validation

A

N-Fold Cross Validation is a technique employed when you don’t have sufficient data for a standard supervised training and test approach. It involves splitting the data set into N subsets. Then N-1 subsets are used as the training data set, with the remaining one used as the test set. Repeat so that every subset is used once as the test set. Average the errors across the N folds.
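
The procedure above can be sketched as follows. The function names, the interleaved fold assignment, and the stand-in "model" (predicting the training mean) are all assumptions made for illustration; in practice the data is usually shuffled first and a real regression model is fitted on each training set.

```python
def n_fold_splits(n_items, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_items))
    # Interleaved assignment: item i goes to fold i % n_folds
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def cross_validate(n_items, n_folds, train_and_score):
    """Average the error returned by train_and_score(train, test) over all folds."""
    scores = [train_and_score(train, test)
              for train, test in n_fold_splits(n_items, n_folds)]
    return sum(scores) / len(scores)


ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # illustrative outcome values


def mean_model_mse(train, test):
    # Stand-in "model": predict the mean of the training outcomes
    prediction = sum(ys[i] for i in train) / len(train)
    return sum((ys[i] - prediction) ** 2 for i in test) / len(test)


average_error = cross_validate(len(ys), 3, mean_model_mse)
```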

18
Q

What are the residuals in Linear Regression

A

The residuals are the differences between the actual dependent variable data points and the modelled dependent variable data points.
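
In code, the residuals are a single subtraction per data point. The function name `residuals` and the values below are illustrative.

```python
def residuals(actual, predicted):
    """Actual value minus modelled value, one residual per data point."""
    return [y - y_hat for y, y_hat in zip(actual, predicted)]


# Illustrative actual outcomes and model predictions
errors = residuals([3.0, 5.0, 7.5], [2.5, 5.5, 7.5])
print(errors)  # [0.5, -0.5, 0.0]
```

A positive residual means the model under-predicted that point, a negative one means it over-predicted.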

19
Q

How do you evaluate the residuals in Linear Regression

A

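OLS fits the model by minimising the sum of the squared residuals, so the residual sum of squares (and R squared, which is derived from it) summarises how large the residuals are. Note that an R squared result of 1 means the residuals are all zero, which may suggest over-fitting.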

20
Q

How do you calculate variance

A

Calculate the mean
Determine the difference from mean for each term
Square those differences
Sum the results
Divide by n-1
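
The steps above can be sketched as follows. The function name `sample_variance` and the input values are illustrative; dividing by n-1 gives the sample variance.

```python
def sample_variance(values):
    """Sample variance, following the five steps one at a time."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to divide by n-1")
    mean = sum(values) / n                      # 1. calculate the mean
    differences = [v - mean for v in values]    # 2. difference from the mean
    squares = [d * d for d in differences]      # 3. square those differences
    total = sum(squares)                        # 4. sum the results
    return total / (n - 1)                      # 5. divide by n-1


variance = sample_variance([1.0, 2.0, 3.0, 4.0])  # 5.0 / 3
```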