Regression diagnostics / Logistic regression Flashcards

(38 cards)

1
Q

Why do we need to consider the assumptions for a linear regression?

A

So we can rely on those statistics: coefficients, SE, etc.

2
Q

Which regression assumptions are there?

A

Lists differ a bit, but roughly: linearity, homoskedastic & normally distributed residuals, no multicollinearity

3
Q

What happens if you violate regression assumptions?

A

1) Coefficients become unreliable --> biased
2) SEs become unreliable --> any hypothesis test becomes unreliable (including p-values, t-statistics, etc.)

4
Q

Linearity

A

Assumption: the average outcome is linearly related to each term in the model, holding all others fixed --> technically, the “linear” in “linear regression” means the outcome is linear in the parameters, the β’s

Problem: biased coefficients (when the true form is curvilinear)

Diagnostic: component-plus-residual plot (a significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable)

Solution: Polynomial, Spline, Collapse into categories
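To illustrate the polynomial fix, a minimal numpy sketch with simulated curvilinear data (all numbers here are made up for illustration): adding x² as an extra term keeps the model linear in the parameters while capturing the bend a straight line misses.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x - 0.08 * x**2 + rng.normal(0, 0.5, 200)  # curvilinear truth

# A straight-line fit misses the curvature; adding x^2 as an extra column
# keeps the model linear in the parameters (the betas).
X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x**2])

def fit(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def rss(X, beta):
    # residual sum of squares: lower = better fit
    return float(np.sum((y - X @ beta) ** 2))

b_lin, b_quad = fit(X_lin), fit(X_quad)
print(rss(X_lin, b_lin), rss(X_quad, b_quad))  # quadratic RSS is much smaller
```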

5
Q

Homoskedastic / Normally distributed Residuals

A

Assumption: residuals have constant variance (homoskedasticity) and are normally distributed

Problem: standard errors are usually incorrect (often underestimated); influential observations may also be present, which can affect the coefficients as well

Diagnostic: heteroskedasticity --> plot residuals (e.g. vs. fitted values); normality --> histograms, qnorm plots, studentized-residual plots

Solution: log transformation / power transformation, robust SEs, correct coding errors
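A numpy-only sketch of the "robust SEs" solution, using the HC0 "sandwich" formula on simulated data whose error variance grows with x (illustrative, not tied to any particular package):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 5, n)
# error spread grows strongly with x --> heteroskedasticity
y = 1 + 2 * x + rng.normal(0, 1, n) * (0.2 + 0.3 * x**2)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)
# Classical SEs assume one constant error variance
se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - 2))
# HC0 "sandwich" SEs let each observation keep its own squared residual
meat = X.T @ (X * e[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
print(se_classical[1], se_robust[1])  # robust slope SE is larger here
```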

6
Q

No multicollinearity

A

Assumption: predictors should be independent of each other, i.e. only very weakly correlated (not an issue in SLR, only in MLR)

Problem: “holding constant” is not possible with correlated variables --> 1) interpretation becomes impossible, since the model cannot tell which variable made the difference; 2) loss of precision (inflated standard errors)

Diagnostic: look at correlations; variance inflation factor (VIF) assesses each variable --> the higher the VIF, the more of that variable’s information is already contained in the other predictors = high multicollinearity

Solutions: get crafty (find a similar but not collinear variable), construct an index, get more data, mean-center interaction variables
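The VIF can be sketched directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the others (simulated, near-collinear data for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)                    # independent predictor

def vif(X, j):
    # Regress column j on the remaining columns (plus an intercept);
    # VIF_j = 1 / (1 - R^2_j)
    others = np.column_stack(
        [np.ones(len(X))] + [X[:, k] for k in range(X.shape[1]) if k != j]
    )
    target = X[:, j]
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))  # x1 heavily inflated, x3 close to 1
```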

7
Q

Which diagnostic is important to consider apart from regression assumptions?

A

Influential observations

  • pull the regression fit towards themselves --> results (predictions, parameter estimates, CIs, p-values) can be quite different with and without these cases included in the analysis
  • do not necessarily violate any regression assumptions, but they can cast doubt on the conclusions drawn from your sample. If a regression model is being used to inform real-life decisions, one would hope those decisions are not overly influenced by just one or a few observations…
8
Q

What is more important: Coefficient vs SE?

A

Estimation first, then the SE: a correct SE is of no use if the estimate itself is biased.

9
Q
A
10
Q
A
11
Q
A
12
Q

Which one is problematic?

A

B - an unusual x-value (high leverage) combined with a large residual, so it pulls the regression line down; deleting it would change the regression line drastically

13
Q

Are these two influential observations?

A

NO
1) large sample
2) no unusual x-value (low leverage)

14
Q

When to delete an outlier?

A

When Cook’s D is approximately 1 or larger (rule of thumb)
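A numpy sketch of Cook’s D from its textbook formula, D_i = e_i² h_i / (p s² (1 − h_i)²), with one planted influential point (simulated data, numbers made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 50)
y = 1 + 2 * x + rng.normal(0, 0.2, 50)
# plant one influential point: unusual x-value AND a large residual
x = np.append(x, 5.0)
y = np.append(y, -10.0)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                             # leverage of each observation
p = X.shape[1]                             # number of parameters
s2 = (e @ e) / (len(y) - p)                # residual variance estimate
cooks_d = e**2 * h / (p * s2 * (1 - h) ** 2)
print(cooks_d[-1])  # the planted point is far above the cutoff of ~1
```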

15
Q
A

Not a problem of bias per se but of a lack of data: very little variance, plus groups that are hard to separate, makes it hard to be precise

16
Q
A

a) hard decision to make

17
Q

Potential problems, diagnostics, potential solutions: Influential
observations

18
Q

Which is the link function?

OLS vs Logistic Regression

A

Linear = Identity link (just the mean --> a model of the mean)
Logistic = Logit link (log odds of the mean --> a model of the log odds, also called the logit)

19
Q

Which distribution?

OLS vs Logistic Regression

A

Linear = Gaussian (continuous)
Logistic = Binomial (discrete)

20
Q

What are the problems with an OLS model when it comes to binary outcomes

A

1) Binary outcome --> we want to model a probability, but a linear model can produce negative values, and negative probabilities have no meaning
2) Unrealistic assumption of constant effects
3) The normal-residual assumption is violated
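Point 1 can be demonstrated with a quick simulation: an OLS fit to a binary outcome (a linear probability model) happily predicts values outside [0, 1] (illustrative numbers, not from the course data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0, 10, n)
p = 1 / (1 + np.exp(-(-4 + 0.8 * x)))  # true probability follows a logit
y = rng.binomial(1, p)                 # observed binary outcome

# Linear probability model: regress the 0/1 outcome directly with OLS
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
preds = X @ beta
print(preds.min(), preds.max())  # predictions escape the [0, 1] range
```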

21
Q

Interpret OLS model

A

A 1-unit increase on the trust scale decreases the mean of the AfD vote by 0.05, i.e. 5 percentage points

22
Q

What does the logistic regression model?

23
Q

How does ML work?

A

It iterates over candidate parameter values until it finds those that maximize the likelihood of the observed data

24
Q

How to interpret the Log Odds?

A

Very hard; not intuitive

25
Q

How to interpret the Odds Ratio?

A

An odds ratio of 1 --> no difference (above 1: higher odds; below 1: lower odds)
26
Q

Limitation of Odds

A

We can only say what the odds are and how they increase, BUT NOT how likely something actually is (probabilities)
27
Q

How to get back to probabilities?

A

Apply the inverse logit (e to the power of the linear function, over 1 plus the same). Example: plug in 0 for men --> constant = 10%; plug in 1 for women --> constant - coefficient = 5%
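A minimal sketch of the back-transformation, assuming made-up log-odds coefficients chosen so the two groups land at 10% and 5% (the coefficient here is negative, so adding it lowers the probability, matching the example):

```python
import math

def inv_logit(z):
    # logistic function: maps a log-odds value back to a probability
    return 1 / (1 + math.exp(-z))

# hypothetical coefficients on the log-odds scale (illustrative numbers)
b0 = math.log(0.10 / 0.90)       # constant: men (female = 0) land at 10%
b1 = math.log(0.05 / 0.95) - b0  # coefficient: women (female = 1) land at 5%

p_men = inv_logit(b0)
p_women = inv_logit(b0 + b1)
print(round(p_men, 2), round(p_women, 2))  # 0.1 0.05
```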
28
Q

What is the average marginal effect?

A

The average of the marginal effects over all observations (across the values of the explanatory variable)
29
Q

How do the average marginal effect and linear regression relate to each other?

A

Very similar: the linear regression coefficient approximates the AME
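This closeness can be checked numerically with assumed logit coefficients (illustrative simulation; on the probability scale the marginal effect at each point is dp/dx = β₁ · p · (1 − p)):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 2000)
b0, b1 = -0.5, 0.8                     # assumed (made-up) logit coefficients
p = 1 / (1 + np.exp(-(b0 + b1 * x)))   # true probabilities
y = rng.binomial(1, p)                 # observed binary outcome

# AME: average the per-observation marginal effects dp/dx = b1 * p * (1 - p)
ame = float(np.mean(b1 * p * (1 - p)))

# Linear probability model: the OLS slope of y on x approximates the AME
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(ame, 3), round(float(beta[1]), 3))  # the two are close
```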
30
Q

How to model this relationship with a linear model?

A

Polynomial of Trust in Parliament would pick up the slowing of the decline
31
Q

How to make this log reg table more interpretable?

A

Transform to the probability scale (Stata: margins, dydx) --> average marginal effects (makes it more interpretable but also takes out the s-shape, meaning it works well in the middle but less at the edges)
32
Q

Polynomials vs log regression

A

decline in non-linear models
33
Q

What is the problem with interactions in logistic regressions?

A

Logistic regression already implies interaction even when none is explicitly specified (effects on the probability scale depend on the levels of the other variables)
34
Q

What is the problem with mediation in logistic regressions?

A

Adding more variables rescales the logistic coefficients, so the mediation estimate changes and no cross-model (e.g. cross-country) comparison is possible --> mediation is generally underestimated
35
Q

What is the solution for logistic mediation?

A

The KHB method (can also be used for linear models) --> the difference gives you the mediation effect
36
Important to consider the absolute as well as the relative ratio
37
Logistic vs linear models | Binary variables
38
Q

Why would we prefer logistic regression sometimes?

A

It captures non-linear relationships really well; with OLS we would need to model them ourselves