advanced topics Flashcards
logistic regression
predicts the probability of the outcome y, P(y), from our predictors (the xs)
P(yi) = 1 / (1 + e^(-(β0 + β1xi)))
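a minimal R sketch of this formula (the coefficient and x values here are made up):
logistic <- function(b0, b1, x) 1 / (1 + exp(-(b0 + b1 * x)))  # inverse logit: log odds -> probability
logistic(b0 = -1, b1 = 0.5, x = 2)  # 0.5
plogis(-1 + 0.5 * 2)                # same result using base R's plogis()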
what is probability
ranges from 0 to 1
binary outcomes
binary variables = type of categorical variable with only two levels
we code them as 0 and 1 according to whether an event did or did not happen - this is NOT the same as dummy coding
what are odds
odds of an event occurring = the ratio of it occurring : it not occurring
odds can only ever be a positive value
odds = probability/(1-probability) - e.g. a probability of 0.8 gives odds of 0.8/0.2 = 4
what are log odds
natural log of the odds - when plotted, the log odds are linear and form a continuous DV
logodds = ln[P(y=1) / (1 - P(y=1))]
logodds above about +4 correspond to probabilities very close to 1, and below about -4 to probabilities very close to 0 - since 0 is in the middle of these, logodds of 0 = a probability of 50%
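a quick R sketch of these conversions (p = 0.8 is an arbitrary example value):
p <- 0.8
odds <- p / (1 - p)   # 4
log(odds)             # ~1.39, the log odds (natural log)
qlogis(p)             # same log odds via base R
plogis(4)             # ~0.982 - log odds of +4 already give a probability very close to 1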
maximum likelihood estimation
MLE is used to estimate logistic regression models as MLE finds the logistic regression coefficients that maximise the likelihood of the observed data having occurred.
MLE maximises the log-likelihood - a higher (less negative) log-likelihood indicates a better model
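a toy R sketch of the idea - estimating a single probability p by maximising a Bernoulli log-likelihood (the outcome vector is made up):
y <- c(1, 0, 1, 1, 0, 1, 1, 0)  # made-up binary outcomes
negloglik <- function(p) -sum(dbinom(y, size = 1, prob = p, log = TRUE))  # minimising this = maximising the log-likelihood
optimise(negloglik, interval = c(0.001, 0.999))  # minimum at p = mean(y) = 0.625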
evaluating logistic regression models
compare our model to a null model (with no predictors) and assess the improvement in fit
we compare our model to our baseline model using deviance
- deviance = -2 * loglikelihood (aka -2LL)
we calculate the difference in deviances between our model and the baseline - this difference follows a chi-square distribution (df = the number of extra parameters), giving a p-value to assess significance
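a hedged R sketch of this comparison, using simulated data (variable names are arbitrary):
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))  # simulate a real effect of x
null_mod <- glm(y ~ 1, family = binomial)    # baseline model: no predictors
full_mod <- glm(y ~ x, family = binomial)
dev_diff <- deviance(null_mod) - deviance(full_mod)  # improvement in fit (difference in -2LL)
pchisq(dev_diff, df = 1, lower.tail = FALSE)         # p-value (1 extra parameter)
anova(null_mod, full_mod, test = "Chisq")            # same comparison in one call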
generalised linear model
in R this is the glm() function, which is used to conduct logistic regression. it uses the same format as lm() but with the addition of a family = " " argument to specify what kind of regression we want / how the outcome is distributed (family = "binomial" for logistic regression)
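a minimal sketch of the call format, with a made-up data frame (names hypothetical):
dat <- data.frame(outcome = c(0, 1, 0, 1, 1, 0, 1, 0),
                  predictor = c(1, 3, 2, 4, 5, 2, 2, 4))
mod <- glm(outcome ~ predictor, data = dat, family = "binomial")  # same format as lm() plus family
summary(mod)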
binomial distribution
a discrete probability distribution of the number of successes in a fixed number of independent trials, each with the same probability of success
probability mass function
probability that a discrete random variable is exactly equal to some value
f(k; n, p) = Pr(X = k) = (n choose k) * p^k * q^(n-k)
where:
- k = number of successes
- n = number of trials
- p = probability of success
- q = probability of failure (1-p)
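checking the formula against R's built-in binomial PMF (the numbers are arbitrary):
n <- 10; k <- 3; p <- 0.4; q <- 1 - p
choose(n, k) * p^k * q^(n - k)  # PMF by hand: ~0.215
dbinom(k, size = n, prob = p)   # same value from base R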
interpreting glm() output
the computation of residuals is different now that we're dealing with deviance (rather than variance) - a model with lower residual deviance is better.
our β coefficients for the IVs give the change in the log odds of y for each one-unit increase in x
what is odds ratio
logodds don't provide easily interpretable results; therefore, the β coefficients (which are on the logodds scale) are converted to odds ratios, which are easier to interpret.
the odds ratio is obtained by exponentiating the β coefficients
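a short self-contained R sketch (simulated data again, names arbitrary):
set.seed(2)
x <- rnorm(150)
y <- rbinom(150, 1, plogis(0.2 + 0.8 * x))
mod <- glm(y ~ x, family = binomial)
coef(mod)       # β coefficients on the log odds scale
exp(coef(mod))  # odds ratios: multiplicative change in the odds per unit increase in x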
interpreting odds ratio
1 = no effect (the odds are unchanged)
<1 = negative effect - e.g. 0.8 = a 20% decrease in the odds
>1 = positive effect - e.g. 1.2 = a 20% increase in the odds
likelihood ratio test
method of logistic model comparison = tests whether a more complex model gives a significant improvement in likelihood over a simpler, nested model
- alternative to the z-test but can only be used for nested models (non-nested models need AIC/BIC)
z-test
tests the statistical significance of individual predictors (can be prone to Type II errors)
z = β / SE(β)
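a quick check using R's built-in mtcars data (am is a binary outcome):
est <- summary(glm(am ~ wt, data = mtcars, family = binomial))$coefficients
est[, "Estimate"] / est[, "Std. Error"]  # z by hand
est[, "z value"]                         # matches the z column R reports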
power analysis
power is the probability of CORRECTLY detecting an effect that exists - tells us what percentage of the time we would correctly reject a false null
power = 1 - β, where β is the Type II error rate (NOT THE SAME β AS IN A REGRESSION)
power depends on:
- sample size
- effect size
- significance level
conventional value for power
0.8
power calculations in R
use the pwr package
examples:
t test
pwr.t.test( n = group size, d = effect size, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "greater")
- this is just an example so values may differ and not all of the above things may be included - in practice exactly one of n, d, power and sig.level is left out so the function can solve for it
correlation
pwr.r.test
- basically the same as above but d becomes r (the correlation coefficient)
f-tests
pwr.f2.test( u = k, v = (n-k-1), f2 = effect size, sig.level = 0.05, power = 0.8)
- again just an example so there will be actual numbers where i've just put general symbols - note the pwr function for the general linear model is pwr.f2.test (u and v are the numerator and denominator degrees of freedom)
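a runnable sketch, assuming the pwr package is installed - here n is left out so the function solves for it:
library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "two.sample")  # n per group: ~64
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)                       # n: ~85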
what is causality??
one event directly leads to another
- this does not have to be a direct 1:1 relationship
conditions for causality
- covariance = two variables change together
- plausibility = does the relationship make sense
- temporal precedence = if A causes B then A must always occur before B
- no reasonable alternative other than A causes B
testing causality
identifying causal relationships is usually done through study design (e.g. randomised experiments) rather than statistical tests - it is harder to do this with observational studies, but we can use:
- propensity score matching (simulated control group)
- instrumental variable analysis (simulates the effect of randomly assigning people to groups)
… to make causal claims from observational data
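for propensity score matching, one common route in R is the MatchIt package - a minimal sketch using the example lalonde dataset bundled with MatchIt (the covariates chosen here are arbitrary):
library(MatchIt)
data("lalonde", package = "MatchIt")  # example observational dataset
m <- matchit(treat ~ age + educ + re74, data = lalonde, method = "nearest")  # match on covariates
matched <- match.data(m)  # matched sample approximating a control group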
endogeneity
a condition that affects our ability to make a causal claim.
- theoretically = occurs when the marginal distribution of a predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable
- practically = occurs when a predictor variable is correlated with the error term (causing bias in our β coefficients)
problems with endogeneity
- can't easily tell if our variables are endogenous (i.e. whether x and the error term are correlated)
- even if you successfully identify endogeneity in your model you must determine why it is there to solve the problem
sources of endogeneity: simultaneity bias
causality goes both ways (x causes y, y causes x)
- solution = use statistical models developed specifically for this (e.g. two-stage least squares (2SLS) regression)
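in R, two-stage least squares can be run with ivreg() from the AER package - a sketch with hypothetical names, where z is an instrument for the endogenous predictor x:
library(AER)
iv_mod <- ivreg(y ~ x | z, data = dat)  # z instruments for x
summary(iv_mod)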