Statsmodels Flashcards

1
Q

What is the linear regression process in short?

A

Get sample data

Design a model that works for that sample

Make predictions for the whole population

2
Q

Explain dependent and independent variables

A

There’s a dependent variable labeled Y (predicted)

And independent variables labeled x1, x2, …, xk (predictors)

Y = F(x1, x2, …, xk)

The dependent variable Y is a function of the independent variables x1 to xk

3
Q

Explain the coefficients in y = ß0 + ß1x1 + ε

A

These are the ß values in the model formula

ß1 –> Quantifies the effect of x on y –> Differs per country for example (income example)

ß0 –> Constant –> Like minimum wage in the income vs education example

ε –> Represents the error of estimation –> Is 0 on average

4
Q

What is the simplest regression model, in formula form?

A

Simple linear regression model

y = ß0 + ß1x1 + ε

y –> Variable we are trying to predict –> Dependent variable

x –> independent variable

Goal is to predict the value of y provided we have the value of x

5
Q

What is the sample data equivalent of the simple linear regression equation?

A

ŷ = b0 + b1x1

ŷ –> Estimated / Predicted value

b1 –> Coefficient –> Quantifies the effect of x1 on ŷ

x1 –> Sample data for independent variable

6
Q

Correlation vs. Regression

A

Correlation
Measures the relationship between two variables
They move together
Formula is symmetric –> ρ(x,y) = ρ(y,x)
Graph –> A single point

Regression
Shows how one variable affects the other, or what changes it causes in the other
Cause and effect
Formula –> One-way
Graph –> A line

7
Q

What can you do with the following modules:

numpy
pandas
scipy
statsmodels.api
matplotlib
seaborn
sklearn

A

numpy
Working with multi-dimensional arrays

pandas
Enhances numpy
Organize data in tabular form
Along with descriptive data

scipy
numpy, pandas and matplotlib are part of the SciPy ecosystem

statsmodels.api
Built on top of numpy and scipy –> Regressions and statistical models

matplotlib
2D plotting library specifically designed for visualizing numpy computations

seaborn
Python visualization library based on matplotlib

sklearn
scikit learn –> Machine learning libraries
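
A minimal import cell using the conventional aliases (a sketch; the aliases np, pd, sm, plt and sns are community conventions, not requirements):

import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn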

8
Q

Check which packages are installed

A

Anaconda Navigator

CMD.exe Prompt

Write "conda list"
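
Alternatively, a quick check from inside a notebook (the import fails with ImportError if the package is not installed):

import statsmodels
print(statsmodels.__version__)  # prints the installed version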

9
Q

Upload and access data

A

Put the csv file in the same folder as the notebook file

import pandas as pd
data = pd.read_csv('filename')  # replace 'filename' with the actual csv name

Write data on its own line –> The data will show up in the output

10
Q

Pull up statistical data of your dataset

A

data.describe()

11
Q

What are the steps for plotting regression data?

A

Import relevant libraries
Load the data
Declare the dependent and the independent variables
Explore the data
Regression itself
Plot the regression line on the initial scatter
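
A sketch of these steps in code, assuming a hypothetical csv with columns 'SAT' (predictor) and 'GPA' (target); the filename and column names are illustrative, not from the card:

# 1. Import relevant libraries
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# 2. Load the data
data = pd.read_csv('my_data.csv')      # hypothetical file

# 3. Declare the dependent and the independent variables
y = data['GPA']
x1 = data['SAT']

# 4. Explore the data
plt.scatter(x1, y)

# 5. Regression itself
x = sm.add_constant(x1)                # adds the column of 1s that carries ß0
results = sm.OLS(y, x).fit()
print(results.summary())

# 6. Plot the regression line on the initial scatter
yhat = results.params['const'] + results.params['SAT'] * x1
plt.plot(x1, yhat, c='red')
plt.show()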

12
Q

How to find ß0?

A

Read the regression results table to find the numbers for plotting the regression line

In the coef column:

const –> ß0
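
The same number can be pulled programmatically (a sketch; results is the fitted OLS model from the earlier example):

b0 = results.params['const']   # the 'const' row of the coef column
print(results.params)          # all estimated coefficients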

13
Q

How to determine whether variable is significant?

A

Hypothesis test with H0: ß = 0
In the results table these are the t and P>|t| columns
p-value < 0.05 means that the variable is significant
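
The p-values are also exposed as an attribute (a sketch; results is a fitted OLS model):

print(results.pvalues)          # the P>|t| column of the summary table
print(results.pvalues < 0.05)   # True where the variable is significant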

14
Q

What to do if ß0 is not significantly different from 0?

A

It is left out of the formula and thus excluded from the prediction of the expected value

15
Q

Explore determinants of a good regression

A

Sum of squares total (SST or TSS)

Sum of squares regression (SSR)

Sum of squares error (SSE)

16
Q

Sum of squares total (SST) formula plus meaning?

A

∑(yi - ȳ)²

Think of it as the dispersion of the observed values around their mean

Measures the total variability of the dataset

17
Q

Sum of squares regression (SSR) formula plus meaning?

A

∑(ŷi - ȳ)²

Sum of the squared differences between the predicted values and the mean of the dependent variable

Measures how well the line fits the data

If SSR = SST –> The regression line is perfect, meaning all the points lie ON the line

18
Q

Sum of squares error (SSE) formula plus meaning?

A

∑ei² = ∑(yi - ŷi)²

ei is the difference between the observed value yi and the predicted value ŷi

The smaller the error, the better the estimation power of the regression

19
Q

What is the connection between SST, SSR & SSE

A

SST = SSR + SSE

In words: The total variability of the dataset is equal to the variability explained by the regression line plus the unexplained variability (error)
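
A quick numerical check of the identity (a sketch with tiny made-up arrays; for OLS with an intercept the identity holds up to floating-point error):

import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up data
y = np.array([1.0, 2.0, 2.0, 4.0])
results = sm.OLS(y, sm.add_constant(x)).fit()
yhat = results.fittedvalues

sst = np.sum((y - y.mean()) ** 2)    # total variability
ssr = np.sum((yhat - y.mean()) ** 2) # explained by the regression
sse = np.sum((y - yhat) ** 2)        # unexplained (error)
print(sst, ssr + sse)                # the two numbers match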

20
Q

What is OLS?

A

OLS –> Ordinary least squares

Most common method to estimate the linear regression equation

Least squares stands for minimizing the SSE (error) –> Lower error –> better explanatory power

OLS is the line with the smallest error –> Closest to all points simultaneously

There are other methods to calculate regression. OLS is simple and powerful enough for most problems

21
Q

What is R-squared and how to interpret it?

A

R-squared –> How well your model fits your data

Intuitive tool when in the right hands
R² = SSR / SST
R² = 0 –> Regression explains NONE of the variability
R² = 1 –> Regression explains ALL of the variability

What you will typically observe are values from 0.2 to 0.9
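
statsmodels reports it directly (rsquared is an attribute of a fitted OLS results object):

print(results.rsquared)   # R² = SSR / SST, between 0 and 1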

22
Q

What is a good R squared?

A

No rule of thumb!

Depends on the context and complexity of the topic whether the number is a strong indicator

With a mediocre R², one might need additional indicators to explain the relationship

The more factors you include in your regression –> The higher the R squared

23
Q

Why multiple regressions?

A

Good models require multiple regressions in order to address the higher complexity of problems

Multiple regression model (sample form)

More independent variables (more than one)
ŷ –> Predicted value
b0 –> Intercept
x1…xk –> Independent variables
b1…bk –> Coefficients

ŷ = b0 + b1x1 + b2x2 + b3x3 + … + bkxk
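
A sketch of a multiple regression in statsmodels, assuming hypothetical predictor columns 'SAT' and 'Attendance':

x = sm.add_constant(data[['SAT', 'Attendance']])   # two predictors plus the intercept column
results = sm.OLS(data['GPA'], x).fit()
print(results.summary())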

24
Q

What is the goal of building a model?

A

Goal is to limit the SSE as much as possible

With each additional variable we increase the explanatory power!

25
Q

What is adjusted R-Squared?

A

Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases when the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected.

The R-squared measures how much of the total variability is explained by our model

Multiple regressions are always better than simple ones –> As explained, more variables lead to better explanatory power

Denoted as: r̄²

26
Q

How is the adjusted R-squared formula built up, and with which variables?

A

r̄² = 1 - (1 - R²) × (n - 1)/(n - p - 1)

So:
R² = the model's R-squared
n = total sample size
p = number of predictors
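
As a sanity check, the formula can be computed by hand and compared with statsmodels' built-in attribute (a sketch; results is a fitted OLS model):

n = results.nobs                       # total sample size
p = results.df_model                   # number of predictors (excluding the constant)
adj_r2 = 1 - (1 - results.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, results.rsquared_adj)    # the two values match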

27
Q

How big is Adjusted R squared compared to R squared?

A

r̄² is always smaller than R²

It penalizes excessive use of variables

28
Q

How to use p-value to determine if a variable should stay?

A

A variable is only significant (and should stay) when its p-value < 0.05

29
Q

How to interpret Adjusted R squared?

A

Compare the adjusted R-squared before and after adding the variable to see whether it increased or decreased

30
Q

What are the consequences of adding useless data?

A

The formula changes –> A different value for the intercept ß0

Thus the bias of the useless variable is reflected in the coefficients of the other variables

31
Q

What is the simplicity/explanatory power tradeoff

A

Simplicity is rewarded more than high explanatory power!

32
Q

What is the F-statistic and how is it used?

A

F-statistic –> Follows an F-distribution
It is used for testing the overall significance of the model

F-test
Null hypothesis: all betas are equal to 0 –> H0: ß1 = ß2 = ß3 = 0
H1: at least one ßi ≠ 0

If all betas are 0, then the model is useless

Compare the F-statistic with and without a variable –> A lower F-statistic means the model is closer to non-significant

Prob(F-statistic) can still be significant, but notice the change –> If it increased, drop the variable
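
Both numbers from the summary table are also exposed as attributes (a sketch; results is a fitted OLS model):

print(results.fvalue)     # F-statistic
print(results.f_pvalue)   # Prob(F-statistic)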

33
Q

Interpretation of F-statistic?

A

Prob(F-statistic) very low –> We say the overall model is significant

The lower the F-statistic itself –> The closer the model is to being non-significant

Don't forget to look for the 3 zeroes after the dot in Prob(F-statistic) –> 0.000 means the model is significant

34
Q

How to verify linearity?

A

Plot the data –> If data points form something that can be explained as a straight line –> Then linear regression is suitable

35
Q

Regression assumption:
Explain Endogeneity

A

σ(x, ε) = 0 : ∀x, ε

The assumption is violated when the error (the difference between observed and predicted values) is correlated with the independent variable –> This problem is referred to as 'omitted variable bias'

36
Q

Explain Omitted variable bias

A

In general
Omitted variable bias occurs when you forget to include a variable. This is reflected in the error term as the factor you forgot about is included in the error. In this way, the error is not random but includes a systematic part (the omitted variable).

Whether you include or omit the variable changes the error term –> Therefore x and ε end up somewhat correlated

37
Q

Regression assumption:
Explain Normality and homoscedasticity

A

ε ~ N(0, σ²)

Comprises:
Normality –> We assume the error term is normally distributed

Zero mean –> The error is 0 on average

Homoscedasticity –> The error has constant variance σ² across observations

38
Q

When in doubt about including variable, what should you do?

A

Just include the variable –> Worst thing that can happen is that it leads to inefficient estimates

If it turns out to be insignificant, you can immediately drop that variable

Leaving out a great variable does a lot more harm!

39
Q

What if error term is not normally distributed in Normality and homoscedasticity?

A

CLT Applies

Remember:

In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

40
Q

What does diffusing mean?

A

Diffusing means the pattern is tight for lower values but spreads out for higher values –> We don't like this pattern –> Heteroscedasticity

41
Q

Example of heteroscedasticity?

A

Poor person will have the same dinner every day –> Low variability

Rich person will eat out and then dine in the next day –> High variability thus we expect heteroscedasticity

42
Q

How to prevent heteroscedasticity?

A

Check for omitted variable bias (OVB)

Look for outliers

Log Transform –> A statistician’s best friend

43
Q

Apply logarithmic axes

A

Changing the scale of x reduces the width of the graph
The new model is called: Semi-log model

Denoted as:
ŷ = b0 + b1(log x1)
Meaning: As x increases by 1 percent, y increases by b1/100 units

or:
log ŷ = b0 + b1x1
Meaning: As x increases by 1 unit, y increases by roughly b1 × 100 percent

44
Q

Log-log model?

A

When using log on both axes:
log ŷ = b0 + b1(log x1)

Interpretation: As X increases by 1 percent, Y increases by b1 percent

Relation is known as elasticity
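
A sketch of fitting the log-log model (np.log is the natural log; column names are hypothetical, and the semi-log variants simply log only one side):

import numpy as np

log_y = np.log(data['GPA'])                  # hypothetical columns
log_x = sm.add_constant(np.log(data['SAT']))
results = sm.OLS(log_y, log_x).fit()         # the slope is the elasticity of y with respect to x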

45
Q

Regression assumption: Explain ‘No autocorrelation’

A

Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It’s conceptually similar to the correlation between two different time series, but autocorrelation uses the same time series twice: once in its original form and once lagged one or more time periods.

No autocorrelation
σ(εi, εj) = 0 : ∀i ≠ j

Errors are assumed to be uncorrelated
Highly unlikely to find it in cross-sectional data
Very common in time-series data such as stock prices

46
Q

Spot autocorrelation

A

Look at the graph –> If you can't find any pattern, you are safe

Durbin-Watson test

Generally its values fall between 0 and 4

2 –> No autocorrelation

<1 and >3 are a cause for alarm
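
The summary table reports Durbin-Watson automatically; it can also be computed from the residuals (durbin_watson is a statsmodels function):

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(results.resid)   # residuals of the fitted model
print(dw)                           # ~2 –> no autocorrelation; <1 or >3 –> alarm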

Conclusion: When in the presence of autocorrelation avoid the linear regression model

47
Q

Regression assumption: No Multicollinearity

A

ρ(xi, xj) ≈ 1 : ∀i, j; i ≠ j

Is observed when 2 or more variables have a high correlation among each other

Example: a = 2 + 5 * b

In this case there is no point in using both a and b because they are correlated

ρ(c, d) = 0.9 –> Imperfect multicollinearity
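
A quick way to spot such pairs is the pairwise correlation matrix (a sketch with pandas; the column list is hypothetical):

corr = data[['a', 'b', 'c', 'd']].corr()   # hypothetical predictor columns
print(corr)                                # entries near ±1 flag multicollinearity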

48
Q

How to deal with categorical data?

A

Use a dummy instead and explain it later on –> Transform yes and no into 1 and 0

You can do calculations with 0 and 1
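
A minimal sketch of creating the dummy, assuming a hypothetical 'Attendance' column holding 'Yes'/'No':

data['Attendance'] = data['Attendance'].map({'Yes': 1, 'No': 0})   # hypothetical column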

49
Q

What will the categorical data graph look like?

A

You will get two different models:

Dummy = 0: ŷ = b0 + b1x1 + b2·0 = b0 + b1x1
Dummy = 1: ŷ = b0 + b1x1 + b2·1 = (b0 + b2) + b1x1

Results in two lines with equal slope but different intercept
