# 06.a Linear Regression Flashcards

In linear regression, what is the output variable called and what are the input variables called?

Output variable = dependent variable

Input variable = independent variable

What is linear regression?

Linear regression is an analytical technique used to model the relationship between one or more input variables and a continuous outcome variable. A key assumption is that the relationship between each input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to transform the input or outcome variables (for example, by taking logarithms) so that the relationship between the transformed variables is linear.
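As a rough illustration of the transformation idea, the sketch below simulates an outcome that grows exponentially with the input and compares a fit on the raw scale with a fit on the log scale. The data, the seed, and the names `fit_raw` and `fit_log` are assumptions made up for this example.

```r
# Simulated data (assumed for illustration): y grows exponentially with x.
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- exp(0.5 + 0.3 * x) * exp(rnorm(50, sd = 0.1))

# Fit a straight line to the raw outcome and to the log-transformed outcome.
fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# Compare how much variance each straight-line fit explains.
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, log = r2_log))
```

On the log scale the relationship is a straight line by construction, so the second fit should explain more of the variance than the first.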

What is the theory behind linear regression?

Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Regression analysis is a useful explanatory tool that can identify the input variables that have the greatest statistical influence on the outcome.

Name four common use cases for Linear Regression

Real estate

Demand forecasting

Medical correlation

Engineering

What is the relationship expression for Linear Regression?

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

Y = dependent variable (continuous outcome variable)

Xj = independent variables (input variables, j = 1, 2, …, n)

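β0 = intercept (the expected value of Y when every Xj equals 0)

β1, …, βn = coefficients (βj is the expected change in Y for a one-unit increase in Xj, with the other inputs held fixed)

ε = random error (the part of Y that the linear model does not explain)

We know Y and Xj (historical dataset) and we have to estimate the regression coefficients (β0, β1, …, βn)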
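A minimal sketch of this setup: generate Y from known coefficients plus random error, then check that `lm()` recovers values close to them. The coefficient values (2, 1.5, -0.8), the sample size, and the seed are assumptions chosen for illustration.

```r
# Simulate a dataset from Y = 2 + 1.5*X1 - 0.8*X2 + error (values assumed).
set.seed(42)
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, sd = 0.5)
y   <- 2 + 1.5 * x1 - 0.8 * x2 + eps

# Estimate the coefficients from the "historical" data.
fit <- lm(y ~ x1 + x2)
print(coef(fit))  # estimates of beta0, beta1, beta2
```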

What does OLS mean and what does it do?

Ordinary Least Squares. The goal is to find the line (or, with several input variables, the plane) that best approximates the relationship between the outcome variable and the input variables. OLS chooses the coefficients that minimise the sum of the squared vertical distances between each observed data point and the fitted line.
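To make the minimisation concrete, the sketch below computes the least-squares coefficients directly from the normal equations and compares them with `lm()`. The simulated data and the helper `rss()` are assumptions for illustration. `lm()` itself uses a QR decomposition rather than this formula, though both should agree on a well-behaved dataset.

```r
# Simulated single-input data (assumed for illustration).
set.seed(7)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)

# Design matrix with a column of ones for the intercept.
X <- cbind(1, x)

# Solve the normal equations (X'X) b = X'y for the coefficients.
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Reference fit from lm().
fit <- lm(y ~ x)

# Sum of squared vertical distances for a given coefficient vector.
rss <- function(b) sum((y - X %*% b)^2)

print(cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit))))
```

Nudging either coefficient away from `beta_hat` and calling `rss()` again gives a larger value, which is what "least squares" means.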

Name three graph types which are used in Linear Regression

Scatter plot - visualise the linear relationship

Box plot - to help spot outliers

Density plot - to see the distribution of the predictor variable. Ideally the distribution is close to normal (a bell-shaped curve) and not skewed to the left or right.
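The three plots can be sketched with base R graphics. The built-in `mtcars` dataset and the choice of `wt` as predictor and `mpg` as outcome are assumptions for illustration.

```r
# Three plots side by side, using the built-in mtcars data (assumed example).
par(mfrow = c(1, 3))

# Scatter plot: is the relationship roughly linear?
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg", main = "Scatter plot")

# Box plot: points drawn beyond the whiskers are candidate outliers.
bp <- boxplot(mtcars$wt, main = "Box plot of wt")

# Density plot: is the predictor roughly bell shaped?
dens <- density(mtcars$wt)
plot(dens, main = "Density plot of wt")
```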

What is the correlation coefficient in Linear Regression?

The correlation coefficient (r) quantifies both the strength and direction of the linear relationship between two quantitative variables on a scatterplot. It ranges from -1 to 1: 1 = perfectly uphill, 0 = no linear correlation, -1 = perfectly downhill.
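A short sketch with base R's `cor()`, which computes the Pearson correlation by default. The toy vectors and the use of `mtcars` are assumptions for illustration.

```r
x <- 1:10

# Points exactly on a rising line: r is 1 (up to floating-point rounding).
r_up <- cor(x, 2 * x + 1)

# Points exactly on a falling line: r is -1 (up to floating-point rounding).
r_down <- cor(x, -3 * x + 5)

# Real data: car weight against fuel economy, a strong negative relationship.
r_cars <- cor(mtcars$wt, mtcars$mpg)

print(c(up = r_up, down = r_down, cars = r_cars))
```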

What does Anscombe's quartet show?

The four data sets in Anscombe's quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but look very different when plotted. This illustrates the pitfalls of relying solely on a fitted model or summary statistics to understand the relationship between variables: always plot the data as well.
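The quartet ships with R as the built-in data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`), so the point can be checked directly. The sketch below fits one model per data set and plots all four. The names `fits` and `coefs` are assumptions for this example.

```r
# Fit y_i ~ x_i for each of the four data sets.
fits <- lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  lm(y ~ x)
})

# One row per data set: intercept and slope are nearly identical.
coefs <- t(sapply(fits, coef))
print(round(coefs, 2))

# The plots, however, look nothing alike.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Data set", i))
  abline(fits[[i]])
}
```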

Where is a good place to start in R when assessing whether variables are correlated?

Use a scatterplot matrix (for example, the base R function pairs()) to plot every pairwise combination of variables, and look for pairs that show a linear pattern.
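A minimal sketch using base R's `pairs()` alongside a correlation matrix from `cor()`. The `mtcars` dataset and the four columns picked are assumptions for illustration.

```r
# A handful of numeric columns from the built-in mtcars data (assumed example).
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Scatterplot matrix: one panel per pair of variables.
pairs(vars)

# The matching matrix of pairwise correlation coefficients.
cm <- cor(vars)
print(round(cm, 2))
```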

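What is R squared?

R squared is the coefficient of determination: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

For an OLS model with an intercept, it ranges from 0 to 1, where 1 is a perfect fit and 0 means the model explains none of the variance.

It compares the variation left unexplained by the model (the residual sum of squares) with the total variation of the dependent variable: R squared = 1 - (residual sum of squares / total sum of squares).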
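The definition can be checked by hand against the value `summary()` reports. The model `mpg ~ wt` on the built-in `mtcars` data is an assumed example.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Variation the model leaves unexplained.
ss_res <- sum(residuals(fit)^2)

# Total variation of the outcome around its mean.
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R squared from the definition, next to the value reported by summary().
r2_manual <- 1 - ss_res / ss_tot
print(c(manual = r2_manual, summary = summary(fit)$r.squared))
```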

What is the syntax in R for linear regression?

results = lm(y ~ x1 + x2, data = set1)

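What is a quick function in R to draw a straight line onto a scatter plot?

abline(). For example, abline(results) adds the fitted line from a single-predictor lm model to the current plot.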
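Putting `lm()` and `abline()` together, as a sketch on the built-in `mtcars` data (the dataset and the `mpg ~ wt` model are assumptions for illustration):

```r
# Fit a single-predictor model.
fit <- lm(mpg ~ wt, data = mtcars)

# Scatter plot of the raw data.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")

# Add the fitted regression line to the existing plot.
abline(fit, col = "red")

# abline() also draws horizontal (h =) and vertical (v =) reference lines.
abline(h = mean(mtcars$mpg), lty = 2)

print(coef(fit))
```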

What is Heteroscedasticity?

It means the spread of the data around the line of best fit is not constant. A common pattern is a spread that increases as the data values increase. The opposite, constant spread, is Homoscedasticity.
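A rough sketch of the difference using simulated data: one outcome whose noise grows with x and one with constant noise. The data, the seed, and the helper `spread_ratio()` are assumptions for illustration. A residuals-against-fitted plot is the usual visual check.

```r
set.seed(3)
x <- seq(1, 10, length.out = 200)

# Heteroscedastic: noise standard deviation grows with x.
y_het <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)

# Homoscedastic: noise standard deviation is constant.
y_hom <- 2 + 3 * x + rnorm(200, sd = 2)

res_het <- residuals(lm(y_het ~ x))
res_hom <- residuals(lm(y_hom ~ x))

# Residual spread for the larger half of x divided by the smaller half
# (x is sorted, so positions 1:100 are the smaller x values).
spread_ratio <- function(res) sd(res[101:200]) / sd(res[1:100])
print(c(het = spread_ratio(res_het), hom = spread_ratio(res_hom)))

# Visual check: a funnel shape suggests heteroscedasticity.
par(mfrow = c(1, 2))
plot(x, res_het, main = "Heteroscedastic", ylab = "residual")
plot(x, res_hom, main = "Homoscedastic", ylab = "residual")
```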

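What is Multicollinearity?

Multicollinearity means there are strong linear relationships between the input variables themselves. This makes the individual coefficient estimates unstable and hard to interpret in a linear regression model, because the model cannot cleanly separate the effect of one input from another.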
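One common diagnostic is the variance inflation factor (VIF), computed for each input as 1 / (1 - R squared) from regressing that input on the others. The sketch below uses simulated data in which `x2` is nearly a copy of `x1`. The data and the helper `vif()` are assumptions for illustration, not a function from base R.

```r
set.seed(11)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
x3 <- rnorm(n)                 # unrelated to the other inputs
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# Hypothetical helper: VIF of one input given a matrix of the other inputs.
vif <- function(target, others) {
  r2 <- summary(lm(target ~ others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif(x1, cbind(x2, x3))
vif_x3 <- vif(x3, cbind(x1, x2))
print(c(x1 = vif_x1, x3 = vif_x3))

# Collinear inputs get much larger coefficient standard errors.
se <- summary(lm(y ~ x1 + x2 + x3))$coefficients[, "Std. Error"]
print(se)
```

A large VIF for `x1` alongside a VIF near 1 for `x3` shows which inputs are entangled. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.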