Linear Regression Flashcards

1
Q

What is one of the most common methods of prediction?

A

Regression Analysis

it is used whenever we have a causal relationship between variables

2
Q

What is a Linear Regression?

A

a linear regression is a linear approximation of a causal relationship between two or more variables

3
Q

How is the Dependent Variable labeled? (the predicted variable)

A

as Y

4
Q

How are Independent Variables labeled? (the predictors)

A

as x1, x2, etc.

5
Q

In Y hat - what does the hat denote?

A

An estimated or predicted value

6
Q

What is the simple linear regression formula?

A

Y hat = b0 + b1 * x1

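
The formula can be sketched directly in NumPy. The data below is made up for illustration (hypothetical SAT scores and GPAs, not taken from the course), and the coefficient formulas are the standard OLS estimates:

```python
import numpy as np

# Hypothetical data: SAT scores (x1) and GPAs (y), for illustration only.
x = np.array([1714., 1664., 1760., 1685., 1693., 1670., 1764., 1850.])
y = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 2.91, 3.00, 3.38])

# OLS estimates for Y hat = b0 + b1 * x1
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x  # the fitted (predicted) values
```

With an intercept included, the residuals of an OLS fit sum to zero, which is a quick sanity check on the estimates.
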
7
Q

You have an ice-cream shop. You noticed a relationship between the number of cones you order and the number of ice-creams you sell. Is this a suitable situation for regression analysis?

Yes

No

A

No

8
Q

You are trying to predict the amount of beer consumed in the US, depending on the state. Is this regression material?

Yes

No

A

Yes

9
Q

What does correlation measure?

A

The degree of relationship of two variables

it doesn’t capture causality but shows that two variables move together (no matter in which direction)

10
Q

What is the purpose of regression analysis?

A

To see how one variable affects another, or what changes it causes in the other

unlike correlation, it shows not the degree of connection but cause and effect

11
Q

Which statement is false?

Correlation does not imply causation.

Correlation is symmetrical regarding both variables.

Correlation could be represented as a line.

Correlation does not capture the direction of the causal relationship.

A

Correlation could be represented as a line.

12
Q

What does it mean if x and y have a positive correlation?

An increase in x translates to a decrease in y.

An increase in y translates to a decrease in x.

The variables x and y tend to move in the same direction.

None of the above

A

The variables x and y tend to move in the same direction.

13
Q

Assume you have the following sample regression: y = 6 + x. If we draw the regression line, what would be its slope?

1

6

x

None of the above

A

1

14
Q

What does a p-value of 0.503 suggest about the intercept coefficient?

It is significantly different from 0.

It is not significantly different from 0.

It is equal to 0.503.

None of the above.

A

It is not significantly different from 0.

15
Q

What does a p-value of 0.000 suggest about the coefficient (x)?

It is significantly different from 0.

It is not significantly different from 0.

It does not tell us anything.

None of the above.

A

It is significantly different from 0.

16
Q

What is the predicted GPA of students with an SAT score of 1850? (Unlike in the lectures, this time assume that any coefficient with a p-value greater than 0.05 is not significantly different from 0)

3.42

3.06

3.23

3.145

A

3.145

Using the values of the coefficients in front of const and SAT, the corresponding linear regression formula is:
GPA = 0.2750 + 0.0017 * SAT
The const variable has a p-value of 0.503, which makes it statistically insignificant. The question asks for a prediction excluding such insignificant variables, which reduces the equation to:
GPA = 0.0017 * SAT
Plugging in SAT = 1850, we obtain the desired result.


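
The arithmetic behind this answer is small enough to check directly; the coefficient value is the one quoted in the explanation above:

```python
# const has p = 0.503 > 0.05, so it is treated as insignificant and dropped.
b_sat = 0.0017   # coefficient in front of SAT, from the quoted summary
sat_score = 1850

predicted_gpa = b_sat * sat_score  # 3.145
```
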
17
Q

What is the Sum of Squares Total?

A

denoted: SST, or TSS

the sum of squared differences between the observed (dependent) variable and its mean

measures the total variability of the dataset

18
Q

What is the Sum of Squares Regression?

A

SSR or ESS

the sum of squared differences between the predicted values and the mean of the dependent variable

a measure of how well your line fits the data

if equal to SST then the model captures all the variability and is perfect

19
Q

What is the Sum of Squares Error?

A

SSE or RSS

the sum of squared differences between the observed values and the predicted values

the smaller the error the better the estimation power of the regression

20
Q

What is the connection between SST, SSR, and SSE?

A

SST = SSR + SSE

the total variability of the dataset = the explained variability by the regression line + the unexplained variability

a lower error means a more powerful regression

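
The decomposition SST = SSR + SSE can be verified numerically on any OLS fit with an intercept; the data below is made up:

```python
import numpy as np

# Made-up data for a minimal OLS fit.
x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.0, 2.8, 3.6, 4.5, 5.1])

# Fit y_hat = b0 + b1 * x by OLS.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability
```
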
21
Q

Which of the following is true?

SST = SSR + SSE

SSR = SST + SSE

SSE = SST + SSR

A

SST = SSR + SSE

22
Q

What is the OLS?

A

Ordinary Least Squares

The most common method to estimate the linear regression equation

23
Q

What software do beginner statisticians prefer?

A

Excel, SPSS, SAS, STATA

24
Q

What software do data scientists prefer?

A

Programming languages like R and Python

they offer limitless capabilities and unmatched speed

25
Q

What are other methods for determining the regression line?

A
  • Generalized Least Squares
  • Maximum likelihood estimation
  • Bayesian Regression
  • Kernel Regression
  • Gaussian Process Regression
26
Q

Since OLS (Ordinary Least Squares) is simple enough to understand, why do advanced statisticians prefer using programming languages to solve regressions?

Limitless capabilities and unmatched speed.

Other software cannot compute so many calculations.

Huge datasets cannot be used in Excel

None of the above.

A

Limitless capabilities and unmatched speed.

27
Q

What is the R-squared?

A

R2 = SSR/SST

it measures the goodness of fit of your model - the more factors you include in your regression, the higher the R-squared

a relative measure, taking values from 0 to 1

R2 = 0 means your regression line explains none of the variability of the data.

R2 = 1 means your regression line explains all of the variability and is perfect

Typical range: 0.2 - 0.9

28
Q

SST = 1245, SSR = 945, SSE = 300. What is the R-squared of this regression?

0.24

0.52

0.76

0.87

A

0.76
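
Under the definition R-squared = SSR/SST from the previous card, the answer follows in one line:

```python
# Sums of squares given in the question.
sst, ssr, sse = 1245, 945, 300

r_squared = ssr / sst  # share of total variability explained by the model
```
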

29
Q

The R-squared measures:

How well your data fits the regression line

How well your regression line fits your data

How well your data fits your model

How well your model fits your data

A

How well your model fits your data

i.e. it measures how much of the total variability is explained by our model

30
Q

What is the best fitting model?

A

The one with the lowest SSE

the lower the SSE, the higher the SSR, and the more powerful the model

31
Q

Why do we prefer using a multiple linear regression model to a simple linear regression model?

Easier to compute.

Having more independent variables makes the graphical representation clearer.

More realistic - things often depend on 2, 3, 10 or even more factors.

None of the above.

A

More realistic - things often depend on 2, 3, 10 or even more factors.

also, a multiple regression is always at least as good as a simple one: with each additional variable you add, the explanatory power can only increase or stay the same.

32
Q

What is the Adjusted R-squared?

A

A goodness-of-fit measure that penalizes the excessive use of variables; it is almost always smaller than the R-squared

33
Q

The adjusted R-squared is a measure that:

measures how well your model fits the data

measures how well your model fits the data but penalizes the excessive use of variables

measures how well your model fits the data but penalizes excessive use of p-values

measures how well your data fits the model but penalizes the excessive use of variables

A

measures how well your model fits the data but penalizes the excessive use of variables

34
Q

The adjusted R-squared is:

usually bigger than the R-squared

usually smaller than the R-squared

usually the same as the R-squared

incomparable to the R-squared

A

usually smaller than the R-squared

35
Q

What can you tell about a new variable if adding it increases R-squared but decreases the adjusted R-squared?

The variable improves our model

The variable can be omitted since it holds no predictive power

It has a quadratic relationship with the dependent variable

None of the above

A

The variable can be omitted since it holds no predictive power

36
Q

What is the F-statistic?

A

It is used for testing the overall significance of the model

the lower the F-statistic, the closer the model is to being non-significant

it follows an F-distribution

37
Q

What are the 5 linear regression assumptions?

A
  1. Linearity
  2. No endogeneity
  3. Normality and homoscedasticity
  4. No autocorrelation
  5. No multicollinearity
38
Q

What is one of the biggest mistakes you can make in OLS?

A

To perform a regression that violates one of the 5 assumptions.

39
Q

If a regression assumption is violated:

Some things change.

You cannot perform a regression.

Performing regression analysis will yield an incorrect result.

It is no big deal.

A

Performing regression analysis will yield an incorrect result.

40
Q

Why is a Linear Regression called linear?

A

Because the equation is linear

41
Q

How can you verify if the relationship between two variables is linear?

A

Plot the independent variable x1 against the dependent variable y on a scatter plot; if the result looks like a line, then a linear regression model is suitable

if the relationship is non-linear, you should not use the data before transforming it appropriately

42
Q

What are some fixes for when a relationship between x1 and y is not linear?

A
  1. Run a non-linear regression
  2. Exponential transformation
  3. Log transformation

*if the relationship is non-linear, you should not use the data before transforming it appropriately
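
To illustrate why the log transformation works: data generated from an exact exponential relationship y = a * exp(b * x) becomes a straight line after taking logs, so plain OLS recovers the parameters. The numbers below are synthetic:

```python
import numpy as np

# Synthetic exponential data: y = 2.0 * exp(0.7 * x)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * np.exp(0.7 * x)

# After the log transformation, log(y) = log(2.0) + 0.7 * x is linear in x,
# so simple OLS on (x, log(y)) recovers both parameters.
log_y = np.log(y)
b1 = np.sum((x - x.mean()) * (log_y - log_y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = log_y.mean() - b1 * x.mean()
```
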

43
Q

What should you do if you want to employ a linear regression but the relationship in your data is not linear?

Not use it.

Ignore it and proceed with your analysis

Transform it appropriately before using it.

None of the above.

A

Transform it appropriately before using it.

44
Q

What is No Endogeneity of regressors?

A

Endogeneity refers to situations in which a predictor in a linear regression model is correlated to the error term.

The error becomes correlated with everything else

45
Q

What is Omitted Variable Bias?

A

It happens when you forget to include a relevant variable

everything you don’t explain in your model goes into the error

Leads to biased and counterintuitive estimates

46
Q

What are the sources of Endogeneity?

A

There is a wide range of sources of Endogeneity. The common sources of Endogeneity can be classified as: omitted variables, simultaneity, and measurement error.

47
Q

The easiest way to detect an omitted variable bias is through:

the error term

the independent variables

the dependent variable

sophisticated software

A

the error term

48
Q

What should you do if the data exhibits heteroscedasticity?

Try to identify and remove outliers

Try a log transformation

Try to reduce bias by accounting for omitted variables

All of the above.

A

All of the above.

49
Q

How does one detect autocorrelation?

A

Plot all the residuals on a graph and look for patterns. If you can’t find any, you are safe.

or

Durbin-Watson test: values range from 0 to 4
2 -> no autocorrelation
< 1 and > 3 are cause for alarm
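
The Durbin-Watson statistic is simple enough to compute by hand (statsmodels also ships it as `statsmodels.stats.stattools.durbin_watson`). A minimal sketch on hypothetical residual series:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); ranges from 0 to 4."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Alternating residuals: strong negative autocorrelation, DW well above 3.
dw_negative = durbin_watson([1, -1, 1, -1, 1, -1])

# Slowly drifting residuals: strong positive autocorrelation, DW near 0.
dw_positive = durbin_watson([1.0, 1.1, 1.2, 1.3, 1.4, 1.5])
```
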

50
Q

Autocorrelation is not likely to be observed in:

time series data

sample data

panel data

cross-sectional data

A

cross-sectional data

51
Q

How do you fix autocorrelation when using a linear regression model?

Try to identify and remove outliers.

Use log transformation.

Try to reduce bias by accounting for omitted variables.

None of the above.

A

None of the above.

52
Q

What is multicollinearity?

A

When two or more variables have a high correlation

53
Q

How do we fix multicollinearity?

A
  1. Drop one of the two variables;
  2. Transform them into one;
  3. Keep them both.
54
Q

How do we determine multicollinearity?

A

Before creating the regression, by checking the correlation between each pair of independent variables

55
Q

No multicollinearity is:

easy to spot and easy to fix

easy to spot but hard to fix

hard to spot but easy to fix

hard to spot and hard to fix

A

easy to spot and easy to fix

56
Q

What is a Dummy Variable?

A

A variable that is used to include categorical data into a regression model
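
A dummy variable can be built by hand, or with `pd.get_dummies` in pandas. A minimal sketch, mapping a hypothetical yes/no attendance column to 0/1:

```python
# Hypothetical categorical column.
attendance = ["Yes", "No", "Yes", "Yes", "No"]

# Dummy variable: 1 for "Yes", 0 for "No" (the reference category).
attendance_dummy = [1 if value == "Yes" else 0 for value in attendance]
```
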

58
Q

How do you determine the variables that are unneeded in a model?

A

feature selection through p-values

if a variable has a p-value > 0.05, we can disregard it

the R-squared will also show how well the model fits

59
Q

What does feature selection do?

A

It simplifies models, improves speed, and prevents a series of unwanted issues arising from having too many features

60
Q

What is a common problem when working with numerical data in linear regressions? What is the fix?

A

Differences in magnitudes

The fix is: standardization or feature scaling or normalization - all the same thing

How: subtracting the mean and dividing by the standard deviation
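
The "how" on this card is a one-liner; sklearn's `StandardScaler` does the same thing. A sketch with made-up numbers:

```python
import numpy as np

def standardize(values):
    """Feature scaling: subtract the mean, divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

scaled = standardize([10, 20, 30, 40, 50])
```

After standardization every feature has mean 0 and standard deviation 1, so coefficient magnitudes become comparable across features.
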

61
Q

What are Standardized Coefficients or Weights?

A

Coefficients obtained after standardizing the variables; the bigger the weight, the bigger its impact on the result

62
Q

What is another name for Intercepts in ML?

A

Bias (included if the model needs to have one)

63
Q

How do you interpret Weights in result summaries?

A

The closer a weight is to 0, the smaller its impact;
the bigger the weight, the bigger its impact

64
Q

What is overfitting and how do we deal with it?

A

The model has focused on the particular training set so much that it has “missed the point”

split the dataset into two - a training set and a test set

Splits of 80/20 or 90/10 are common
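
An 80/20 split can be sketched with a shuffled index array (sklearn's `train_test_split` offers the same, plus stratification); the dataset size here is arbitrary:

```python
import numpy as np

n = 100                       # hypothetical number of observations
rng = np.random.default_rng(42)

indices = rng.permutation(n)  # shuffle before splitting
cut = int(n * 0.8)            # 80/20 split
train_idx, test_idx = indices[:cut], indices[cut:]
```
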

65
Q

What is underfitting?

A

The model has not captured the underlying logic of the data

it provides an answer that is far from correct

You’ll realize that there are no relationships to be found, or that you need a different model

66
Q

What is one of the best ways of checking for multicollinearity?

A

Through VIF - variance inflation factor

VIF = 1 -> no multicollinearity at all (the minimum value of the measure)
1 < VIF <= 5 -> considered perfectly okay
VIF > 5 -> could be unacceptable
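
VIF can be computed by regressing each predictor on the others: VIF_j = 1 / (1 - R_j^2). statsmodels ships this as `variance_inflation_factor`; below is a hand-rolled sketch on synthetic data where x2 is nearly collinear with x1:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of predictor matrix X."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])  # add intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)             # unrelated predictor
X = np.column_stack([x1, x2, x3])
```

Here vif(X, 0) and vif(X, 1) should come out far above 5, flagging the x1/x2 pair, while vif(X, 2) stays close to 1.
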