Logistic Regression

Models of discrete choice have been a topic in (Micro) Econometrics and are nowadays widely used in Marketing research.

Logit and probit models extend the principles of the general linear model (e.g., regression) to better handle dichotomous and categorical target variables.

They focus on categorical dependent variables, looking at all levels of possible interaction effects.

McFadden received the 2000 Nobel Prize in Economics for his fundamental contributions to discrete choice modeling.
Application of Logistic Regression

Why do commuters choose to fly or not to fly to a destination when there are alternatives?

Available modes = Air, Train, Bus, Car

Observed:

Choice

Attributes: Cost, terminal time, other

Characteristics of commuters: Household income

Choose to fly iff U_{fly} > 0

U_{fly} = β_{0}+β_{1}Cost + β_{2}Time + γIncome + ε
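The utility rule above can be sketched in code. This is a minimal illustration, assuming the error term ε is logistic (so the choice probability is a logistic function of the systematic utility); the coefficient values are hypothetical, not estimates from the text.

```python
import math

def p_fly(cost, time, income, b0=-1.0, b1=-0.01, b2=-0.05, g=0.02):
    """Pr(fly) implied by the latent utility
    U_fly = b0 + b1*Cost + b2*Time + g*Income + eps.
    With eps logistic, Pr(U_fly > 0) is a logistic function of the
    systematic part. Coefficients here are illustrative only."""
    u = b0 + b1 * cost + b2 * time + g * income
    return 1.0 / (1.0 + math.exp(-u))
```

With these hypothetical signs, higher cost lowers the probability of flying and higher income raises it.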
The Linear Probability Model

The predicted probabilities of the linear model can be
greater than 1 or less than 0

ε is not normally distributed because Y takes on only two values

The error terms are heteroscedastic
Gauss-Markov Assumptions
 The OLS estimator is the best linear unbiased estimator (BLUE), iff
 there is a linear relationship between predictors x and y
 the error variable is a normally distributed random variable with E(ε)=0.
 the error variance is constant for all values of x (homoscedasticity).
 The errors ε are independent of each other.
 No multicollinearity among the predictors (i.e., no high correlations).
The Logistic Regression Model
 The "logit" model solves these problems of the linear model:
 ln[p/(1 − p)] = β_{0} + β_{1}X_{1} + ε
 p is the probability that the event Y occurs, Pr(Y = 1 | X_{1})
 p/(1 − p) describes the odds
 A 20% probability of winning corresponds to odds of 0.20/0.80 = 0.25
 A 50% chance of winning leads to odds of 1
 ln[p/(1 − p)] is the log odds, or "logit"
 p = 0.50, then logit = 0
 p = 0.70, then logit = 0.84
 p = 0.30, then logit = −0.84
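The odds and logit values above can be verified with a short sketch:

```python
import math

def odds(p):
    # odds = p / (1 - p); ranges from 0 to +infinity
    return p / (1 - p)

def logit(p):
    # log-odds: 0 at p = 0.5, positive above, negative below (symmetric)
    return math.log(odds(p))
```

For example, `odds(0.2)` gives 0.25, and `logit(0.7)` and `logit(0.3)` give ±0.84.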
Logistic Function

The logistic function Pr(Y | X) constrains the estimated probabilities to lie between 0 and 1 (0 <= Pr(Y | X) <= 1).

Pr(Y | X) = e^{β0+β1X1} / (1 + e^{β0+β1X1})

Pr(Y | X) is the estimated probability that the ith case is in a category, and β_{0} + β_{1}X_{1} is the regular linear regression equation

This means that the probability of a success (Y = 1) given the predictor variable (X) is a nonlinear function, specifically a logistic function

if you let β_{0} + β_{1}X_{1} = 0, then p = .50

as β_{0} + β_{1}X_{1} gets really big, p approaches 1

as β_{0} + β_{1}X_{1} gets really small, p approaches 0

The values in the regression equation, β_{0} and β_{1}, take on slightly different meanings.

β_{0}: shifts the curve left or right

β_{1}: determines the steepness of the curve

−β_{0}/β_{1}: the value of X_{1} at which p = .50
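A minimal sketch of the logistic function, confirming the limiting behavior described above (p = .50 at z = 0, p → 1 for large z, p → 0 for small z):

```python
import math

def pr(x, b0, b1):
    """Logistic function: Pr(Y | X) = e^z / (1 + e^z), with z = b0 + b1*x."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))
```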
Odds and Logit
By algebraic manipulation, the logistic regression equation can be written in terms of the odds of success:
 p/(1 − p) = e^{β0+β1X1}
 Odds range from 0 to positive infinity
 If p/(1 − p) is
 less than 1, then the probability is less than .50

greater than 1, then the probability is greater than .50
The Logit
Finally, taking the natural log of both sides, we can write the equation in terms of logits (log-odds):
 Probability is constrained between 0 and 1
 Log-odds are a linear function of the predictors
 The logit now ranges between −∞ and +∞ (as the dependent variable of a linear regression)
 The regression coefficients go back to their old interpretation (kind of)
 β_{1} is the amount the logit (log-odds) changes with a one-unit change in X_{1}
Estimating the Coefficients of a Logistic Regression
 Maximum Likelihood Estimation (MLE) is a statistical method for estimating the coefficients of a model
 The likelihood function (l) measures the probability of observing the particular set of dependent variable values that occur in the sample
 MLE involves finding the coefficients that make the log of the likelihood function (ll < 0) as large as possible
The Likelihood Function for Logit Model
 Suppose 10 individuals make travel choices between auto (A) and public transit (T).
 All travelers are assumed to possess identical attributes (unrealistic), and so the probabilities are not functions of β's but simply a function of p, the probability p of choosing auto.

l = p^{x}(1 − p)^{n−x} = p^{7}(1 − p)^{3} (here x = 7 of the n = 10 travelers choose auto)

ln(l) = 7 ln(p) + 3 ln(1 − p), maximized at p = 0.7

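The maximization can be checked numerically, here with a simple grid search over p:

```python
import math

# x = 7 of n = 10 travelers choose auto; p is the probability of choosing auto.
def log_likelihood(p, x=7, n=10):
    return x * math.log(p) + (n - x) * math.log(1 - p)

# Grid search over p in (0, 1); the maximum lies at x/n = 0.7.
best_p = max((i / 1000 for i in range(1, 1000)), key=log_likelihood)
```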
Evaluating the Logistic Regression
 The log likelihood function (ll) is one metric to compare two logistic regression models (the higher, the better)
 Also the AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) measure the goodness-of-fit
 There are several measures intended to mimic the R^{2} of linear regression (Pseudo-R^{2}, e.g., McFadden-R^{2} or Nagelkerke-R^{2}), but their interpretation is different
 A Wald test or t-test is used to test the statistical significance of each coefficient in the model (null hypothesis that β_{i} = 0)
 The Chi-square statistic and its associated p-value show whether the model coefficients as a group differ from zero
 Larger Chi-squares and smaller p-values indicate greater confidence in rejecting the null hypothesis of no explanatory power
 Use also error rates and gain curves to evaluate the performance
McFadden R2 / Pseudo R2
R^{2}_{McFadden} = 1 − (ll)/(ll_{0})

If the full model does much better than just a constant, in a discrete-choice model this value will be close to 1.

1 − (−80.9658)/(−123.757) = 0.3458 for the logit model on the previous-to-last slide


If the full model doesn’t explain much at all, the value will be close to 0.

Typically, the values are lower than those of R2 in a linear regression and need to be interpreted with care.

>0.2 is acceptable, >0.4 is already good
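The computation from the slide, as a one-line sketch:

```python
def mcfadden_r2(ll_full, ll_null):
    """Pseudo-R^2: 1 - ll / ll_0 (both log-likelihoods are negative)."""
    return 1 - ll_full / ll_null

# Values quoted on the slide: ll = -80.9658, ll_0 = -123.757
r2 = mcfadden_r2(-80.9658, -123.757)
```

A model no better than the constant-only model gives a value of 0.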

Calculating Error Rates from a Logistic Regression
 Assume that if the estimated p is greater than or equal to .5, then the event is expected to occur, and not to occur otherwise.
 By assigning these probabilities 0s and 1s and comparing these to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.
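A minimal sketch of this scoring procedure, assigning 0s and 1s at the .5 cutoff and comparing against the actual outcomes:

```python
def classification_report(probs, actuals, cutoff=0.5):
    """Assign 1 if the estimated p >= cutoff, else 0, then compare
    predictions with the actual 0/1 outcomes."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    n_correct = sum(pred == act for pred, act in zip(preds, actuals))
    return preds, n_correct / len(actuals)
```

Per-class rates (% correct Yes, % correct No) follow the same comparison restricted to the cases with actual 1s or 0s.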
Simple Interpretation of the Coefficients

If β_{1} < 0, then an increase in X_{1} => (0 < exp(β_{1}) < 1)

then the odds go down

If β_{1} > 0, then an increase in X_{1} => (exp(β_{1}) > 1)

then the odds go up

Always check the significance of the coefficients

But can we say more than this when interpreting the coefficient values?
Multicollinearity and Irrelevant Variables
 The presence of multicollinearity will not lead to biased coefficients, but it will have an effect on the standard errors.
 If a variable which you think should be statistically significant is not, consult the correlation coefficients.
 If two variables are correlated at a rate greater than .6, .7, .8, etc. then try dropping the least theoretically important of the two.
 The inclusion of irrelevant variables can result in poor model fit.
 You can consult your Wald statistics and remove irrelevant variables.
Multiple Logistic Regression
 More than one independent variable
 Dichotomous, ordinal, nominal, continuous ...
ln(p/(1 − p)) = β_{0} + β_{1}X_{1} + β_{2}X_{2} + ... + β_{n}X_{n}

Interpretation of β_{i}

Increase in log-odds for a one-unit increase in X_{i}, with all the other X's held constant
p(Y=1) = 1 / (1 + e^{−(β0+β1X1+...+βnXn)})

Effect modification

Interaction effects can be modelled by including interaction terms, e.g., the interaction effect of age and income
ln(p/(1 − p)) = β_{0} + β_{1}X_{1} + β_{2}X_{2} + β_{3}X_{1}X_{2}

Discrete choice models take many forms, including:

Binary logit, multinomial logit, conditional logit (variables vary over alternatives), ordered logit (good/bad/ugly), etc.
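The multiple logistic regression equation above can be sketched directly; an interaction term enters simply as an additional predictor x1*x2:

```python
import math

def prob_y1(xs, betas):
    """p(Y=1) = 1 / (1 + e^{-(b0 + b1*x1 + ... + bn*xn)}).
    betas[0] is the intercept; an interaction effect is modelled by
    appending the product x1*x2 to xs with its own coefficient."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return 1 / (1 + math.exp(-z))
```

For example, with hypothetical coefficients, `prob_y1([x1, x2, x1 * x2], [b0, b1, b2, b3])` evaluates the interaction model from the slide.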
Multinomial Logit Models

The dependent variable, Y, is a discrete variable that represents a choice, or category, from a set of mutually exclusive choices or categories.

Examples are brand selection, transportation mode selection, etc.

Still, the residuals need to be i.i.d.

Model:

Choice between J > 2 categories

Dependent variable y = 1,2,3, ... J

If characteristics vary over alternatives (e.g., prices, travel distances, etc.), the multinomial logit is often called a “conditional logit”.
Generalized Linear Models (GLM)
 The models in this class are examples of generalized linear models
 GLMs are a general class of linear models that are made up of three components: Random, Systematic, and Link Function
 Random component: Identifies dependent variable (Y) and its probability distribution
 Systematic Component: Identifies the set of explanatory variables (X_{1}, ..., X_{k})
 Link Function: Identifies a function of the mean that is a linear function of the explanatory variables
 g(μ) = α + β_{1}X_{1} + ... + β_{k}X_{k}
 Link function:

Identity link (form used in normal regression models): g(μ) = μ

Log link (used when μ cannot be negative as when data are Poisson counts): g(μ) = log(μ)

Logit link (used when μ is bounded between 0 and 1, as when data are binary):
g(μ) = log(μ / (1 − μ))
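The three link functions listed above, as a minimal sketch:

```python
import math

def identity_link(mu):
    # normal linear regression: g(mu) = mu
    return mu

def log_link(mu):
    # Poisson counts: g(mu) = log(mu), requires mu > 0
    return math.log(mu)

def logit_link(mu):
    # binary data: g(mu) = log(mu / (1 - mu)), requires 0 < mu < 1
    return math.log(mu / (1 - mu))
```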
Count Variables as Dependent Variables
 Many dependent variables are counts: Nonnegative integers
 # Crimes a person has committed in lifetime
 # Children living in a household
 # new companies founded in a year (in an industry)
 # of social protests per month in a city

Count variables can be modeled with OLS regression... but:

1. Linear models can yield negative predicted values... whereas counts are never negative

2. Count variables are often highly skewed

Ex: # crimes committed this year... most people are zero or very low; a few people are very high

Extreme skew violates the normality assumption of OLS regression.

Count Models
 Two most common count models:
 Poisson regression model (a.k.a. log-linear model)
 Negative binomial regression model
 Both assume the observed count is distributed according to a Poisson distribution:
 μ = expected count (and variance)
 y = observed count

P(y | μ) = e^{−μ} μ^{y} / y!
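The Poisson probability mass function above, as a direct sketch:

```python
import math

def poisson_pmf(y, mu):
    """P(y | mu) = e^{-mu} * mu^y / y!, for observed count y
    and expected count (and variance) mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)
```

The probabilities sum to 1 over all nonnegative counts, and the distribution's mean and variance both equal μ.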
Poisson Regression for Count Data
 Strategy: Model the log of μ as a function of the Xs
 Quite similar to modeling log odds in logit
 Again, the log form avoids negative values

ln(μ_{i}) = ∑_{j=1}^{k} β_{j} X_{ji}

Which can be written as:

μ_{i} = e^{∑_{j} β_{j} X_{ji}}

Distribution: Poisson (restriction: E(Y) = V(Y))

When the mean and variance are not equal (overdispersion), often the Poisson distribution is replaced with a negative binomial distribution

Link Function: Can be identity link, but typically use the log link:

g(μ) = ln(μ) = β_{0} + β_{1}X_{1} + ... + β_{k}X_{k}

μ(X_{1}...X_{k}) = e^{β0 +β1X1 +...+βkXk}
Interpreting Coefficients Poisson regression

In Poisson regression, μ is typically conceptualized as a rate...

Positive coefficients indicate higher rate; negative = lower rate

Like logit, Poisson models are nonlinear

Coefficients don’t have a simple linear interpretation

Like logit, model has a log form; exponentiation aids interpretation

Exponentiated coefficients are multiplicative

Analogous to odds ratios... but called “incidence rate ratios”.

Exponentiated coefficients: indicate effect of unit change of X on rate

e^{b }= 2.0 indicates that the rate doubles for each unit change in X

e^{b} = 0.5 indicates that the rate drops by half for each unit change in X

Recall: Exponentiated coefs are multiplicative

If e^{b} = 5.0, a 2-point change in X doesn't multiply the rate by 10; it multiplies it by 5 · 5 = 25

Also: you must invert to see opposite effects

If e^{b} = 5.0, a 1-point decrease in X doesn't change the rate by −5; it multiplies it by 1/5 = 0.2

Again, exponentiated coefficients (rate ratios) can be converted to % change

Formula: (e^{b } 1) ∗ 100%

Coefficient = −0.693

(e^{−0.693} − 1) * 100% = −50%, i.e., a 50% decrease in the rate
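The rate-ratio arithmetic above can be checked with a short sketch:

```python
import math

def rate_pct_change(b):
    """Percent change in the rate for a one-unit increase in X:
    (e^b - 1) * 100%."""
    return (math.exp(b) - 1) * 100
```

For b = −0.693, this gives about −50% (a halving of the rate); for b = ln(2), +100% (a doubling). Because exponentiated coefficients are multiplicative, a 2-point change with e^{b} = 5 multiplies the rate by 25.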

Poisson Model Assumptions
 Poisson regression makes a big assumption: that the variance of Y equals μ (“equidispersion”)
 In other words, the mean and variance are the same
 This assumption is often not met in real data
 Dispersion is often greater than μ: overdispersion
 Consequence of overdispersion: Standard errors will be underestimated
 Potential for overconfidence in results; rejecting H0 when you shouldn’t!
 Note: overdispersion doesn’t necessarily affect predicted counts (compared to alternative models).

Overdispersion is most often caused by highly skewed dependent variables

Often due to variables with high numbers of zeros

Ex: Number of traffic tickets per year

Most people have zero, some can have 50!

Mean of variable is low, but SD is high

Other examples of skewed outcomes

# of scholarly publications

# cigarettes smoked per day

# riots per year (for sample of cities in US).
General Remarks
 Poisson & negative binomial models suffer all the same basic issues as “normal” regression, and you should be careful about
 Model specification / omitted variable bias
 Multicollinearity
 Outliers/influential cases
 Also, it uses maximum likelihood
 N > 500 = fine; N < 100 may be problematic
 Results aren’t necessarily wrong if N < 100;
 But it is a possibility; and hard to know when problems crop up
 Plus ~10 cases per independent variable.
 Example: Estimate hours worked from characteristics of employees such as age, education, and family status. For unemployed people we do not observe the number of hours they would have worked had they been employed.