advanced topics Flashcards

1
Q

logistic regression

A

predicts the probability of the outcome, P(y = 1), from our predictors (xs)

P(yi = 1) = 1 / (1 + e^(−(β0 + β1xi)))
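a minimal R sketch of this formula (the coefficient and predictor values are hypothetical):

b0 <- -1.5; b1 <- 0.8; xi <- 2  # hypothetical intercept, slope, and predictor value
1 / (1 + exp(-(b0 + b1 * xi)))  # inverse logit: linear predictor -> probability (~0.52)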

2
Q

what is probability

A

ranges from 0 to 1 (0 = the event cannot happen, 1 = it is certain)

3
Q

binary outcomes

A

binary variables = type of categorical variable with only two levels

we code them 0 and 1 in terms of whether an event did (1) or did not (0) happen - this is NOT the same as dummy coding

4
Q

what are odds

A

odds of an event occurring = the ratio of it occurring : it not occurring
odds can only ever take positive values (unlike probability, they have no upper bound)
odds = probability / (1 − probability)
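a quick worked example: if P = 0.75, then odds = 0.75 / (1 − 0.75) = 3, i.e. 3:1 in favour of the event occurring.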

5
Q

what are log odds

A

natural log of the odds - when plotted, the log odds are linear and form a continuous DV

logodds = ln[ P(y=1) / (1 − P(y=1)) ]

log odds above +4 (or below −4) correspond to probabilities very close to 1 (or 0) - since 0 is the midpoint, log odds of 0 = 50% probability
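in R, the base functions qlogis() and plogis() convert between the two scales (a small sketch):

p <- 0.75
qlogis(p)          # log odds: ln(0.75/0.25) ~ 1.1
plogis(qlogis(p))  # back to probability, 0.75
plogis(4)          # ~0.98 - log odds of +4 is close to, but not exactly, probability 1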

6
Q

maximum likelihood estimation

A

MLE is used to estimate logistic regression models as MLE finds the logistic regression coefficients that maximise the likelihood of the observed data having occurred.
MLE maximises the log-likelihood (equivalently, it minimises the deviance, −2 × log-likelihood) - a higher log-likelihood indicates a better model

7
Q

evaluating logistic regression models

A

compare our model to a null model (with no predictors) and assess the improvement in fit

we compare our model to our baseline model using deviance
- deviance = -2 * loglikelihood (aka -2LL)
we calculate the difference in deviances between our model and the baseline - this difference follows a χ² distribution (df = number of added predictors), giving a p-value to assess significance
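a minimal R sketch of this comparison, assuming a data frame dat with outcome y and predictor x (names hypothetical):

m0 <- glm(y ~ 1, family = binomial, data = dat)  # baseline model with no predictors
m1 <- glm(y ~ x, family = binomial, data = dat)  # our model
anova(m0, m1, test = "Chisq")                    # difference in deviance and its p-value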

8
Q

generalised linear model

A

in R this is the glm() function, used to conduct logistic regression. It uses the same format as lm() but with the addition of family = "" to determine what kind of regression we want / how the data are assumed to be distributed
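for example (dat, y and x are hypothetical names):

m <- glm(y ~ x, family = binomial, data = dat)  # logistic regression
summary(m)                                      # coefficients are on the log-odds scale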

9
Q

binomial distribution

A

a discrete probability distribution - gives the probability of each possible number of successes in n independent trials, each with success probability p

10
Q

probability mass function

A

probability that a discrete random variable is exactly equal to some value

f(k; n, p) = Pr(X = k) = (n choose k) × p^k × q^(n−k)
where:
- k = number of successes
- n = number of trials
- p = probability of successes
- q = probability of failure (1-p)
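in R, dbinom() computes this directly - e.g. the probability of exactly 3 successes in 10 trials with p = 0.5:

dbinom(3, size = 10, prob = 0.5)  # ~0.117
choose(10, 3) * 0.5^3 * 0.5^7     # same value from the formula above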

11
Q

interpreting glm() output

A

computation of residuals is different now we're dealing with deviance (rather than variance) - a model with less residual deviance is better.
our β coefficients for the IVs are the change in log odds of y for each unit increase in x

12
Q

what is odds ratio

A

log odds don't provide easily interpretable results; therefore, the β coefficients (which are in log odds) are converted to odds ratios, which are easier to interpret.
the odds ratio is obtained by exponentiating the β coefficients
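in R (assuming a fitted glm model m, as in the earlier cards):

exp(coef(m))             # converts the log-odds coefficients into odds ratios
exp(confint.default(m))  # Wald confidence intervals on the odds ratio scale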

13
Q

interpreting odds ratio

A

1 = no effect (the odds are unchanged)
<1 = negative effect - e.g. 0.8 = a 20% decrease in the odds
>1 = positive effect - e.g. 1.2 = a 20% increase in the odds

14
Q

likelihood ratio test

A

method of logistic model comparison = tests whether a more complex model significantly improves the likelihood of the observed data over a simpler one
- alternative to the z-test but can only be used for nested models (non-nested models need AIC/BIC)

15
Q

z-test

A

tests the statistical significance of predictors (can be prone to type 2 errors)

z = β / SE(β)

16
Q

power analysis

A

power is the probability of CORRECTLY detecting an effect that exists - tells us what percentage of the time we would reject the null when it is false

power = 1 - β (NOT THE SAME β AS IN A REGRESSION)

power depends on:
- sample size
- effect size
- significance level

17
Q

conventional value for power

A

0.8

18
Q

power calculations in R

A

use the pwr package

examples:
t-test
pwr.t.test(n = group size, d = effect size, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "greater")
- this is just an example so values may differ and not all of the above arguments will be included - leave out the one quantity you want pwr to solve for (e.g. omit n to find the required sample size)

correlation
pwr.r.test()
- basically the same as above but d becomes r (the correlation coefficient)

f-tests
pwr.f2.test(u = k, v = n - k - 1, f2 = effect size, sig.level = 0.05, power = 0.8)
- again just an example, so replace the general symbols with actual numbers

19
Q

what is causality?

A

one event directly leads to another

  • this does not have to be a direct 1:1 relationship
20
Q

conditions for causality

A
  • covariance = two variables change together
  • plausibility = does the relationship make sense
  • temporal precedence = if A causes B then A must always occur before B
  • no reasonable alternative other than A causes B
21
Q

testing causality

A

identifying causal relationships is often possible through study design rather than statistical tests - it is harder to do this with observational studies but we can use:
- propensity score matching (simulated control group)
- instrumental variable analysis (simulates the effect of randomly assigning people to groups)
… to make causal claims from observational data

22
Q

endogeneity

A

a condition that affects our ability to make a causal claim.

  • theoretically = occurs when the marginal distribution of a predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable
  • practically = occurs when a predictor variable is correlated with the error term (causing bias in our β coefficients)
23
Q

problems with endogeneity

A
  1. can't easily tell if our variables are endogenous (i.e. whether x and the error term are correlated)
  2. even if you successfully identify endogeneity in your model, you must determine why it is there in order to solve the problem
24
Q

sources of endogeneity: simultaneity bias

A

causality goes both ways (x causes y, y causes x)
- solution = use statistical models developed specifically for this (e.g. two-stage least squares regression)

25
Q

sources of endogeneity: omitted/confounding variables

A

when x is correlated with an omitted variable (z), the variance in y explained by z is absorbed into the residual error, biasing the β for x
- solution = ensure all potential confounds are measured and included in the model

26
Q

sources of endogeneity: measurement error

A

instead of measuring x, you measure x* (x with error included)
- solution = careful planning and study design

27
Q

interpolation

A

predicting a value from a model within the range of given data points
e.g. if your data spans 10-50, using the model to predict someone with a value of 35

28
Q

extrapolation

A

using a model to predict a value outside of the range of given data
e.g. if your data spans 10 - 50 and you use it to predict someone with a value of 60 or 5

  • need to take caution when using extrapolation - since we don't have data points on both sides of our predicted values, we don't know for sure that the relationship follows a linear pattern (as we would predict)
29
Q

issues with missing data

A
  • loss of efficiency due to smaller n
  • bias (i.e. incorrect estimates)
30
Q

types of missing data: MAR

A

missing at random
- missingness is related to other observed variables in the model but not to the missing values themselves
“when the probability of missing data on variable X is related to other variables in the model but not the value of X itself”

challenge = no way to confirm there is no relation between the predictors and the missing data

31
Q

types of missing data: MCAR

A

missing completely at random
- genuinely random missingness, no relation between x/any other variable with the missingness of x
- affects all levels of our data equally/without bias

32
Q

types of missing data: MNAR

A

missing not at random
“when the probability of missingness on x is related to the values of x itself”

challenge = no way to verify MNAR without knowledge of the missing values

33
Q

methods of dealing with missing data: deletion methods

A

listwise deletion = delete everyone with missing data from the analysis
- NOT recommended - gives biased results

pairwise deletion = uses the cases available for each analysis, so different cases contribute to different correlation matrices
- NOT recommended (but doesn't reduce power as much as listwise)

34
Q

methods of dealing with missing data: imputation methods

A

mean imputation = replace missing values with the mean of that variable
- NOT recommended - artificially reduces variability and is biased (probably the worst method)

regression imputation = replace missing values with their predicted values from a regression model
- 'normal' vs stochastic (stochastic adds a residual term to overcome the loss of variance)

multiple imputation (MI) = imputes missing data several times to create complete data sets (results are pooled to get parameter estimates and SEs)
- recommended if data is likely to be MAR

35
Q

methods of dealing with missing data: maximum likelihood estimation (MLE)

A

estimation method = makes use of all the model information to arrive at parameter estimates 'as if' the data were complete
- recommended if data likely to be MAR or MCAR

36
Q

methods of dealing with missing data: methods for MNAR

A

selection models = combine a model predicting missingness with the analysis model of interest, adjusting its parameter estimates
- often give worse results than MLE or MI

pattern mixture models = stratify the sample according to different missing data patterns and estimate the substantive model in each subgroup
- rely on strong, untestable assumptions
- good to include as part of a sensitivity analysis, but often better to use MLE or MI

37
Q

exploratory analysis

A

used when we are interested in the relationship between variables but don’t have clear predictions about how they’re related/how to test them.
exploratory analyses can take many forms but all share the fact that the researcher doesn't have specific predictions about the IV and DV

it is just done to learn about your data:
- focus on minimising prediction error
- data sets must be large enough to support training data
- estimate prediction error/assess model performance
- control the bias-variance trade-off

38
Q

overfitting

A

= the tendency for statistical models to fit sample specific noise as if it were signal
since noise is random, fitting a model to noise makes it bad at predicting a new dataset

39
Q

training data

A

= the data we ‘train’ our model with (the data used to fit the model line)

40
Q

test data

A

= data we use to test how well our trained model can predict

41
Q

p-hacking

A

= a special (bad) case of overfitting that takes place prior to or in parallel with model estimation, e.g. choosing which analyses to report (if the data doesn't fit, just remove it / stargazing)

42
Q

what is bias

A

the tendency for a model to consistently produce answers that are wrong in a particular direction

43
Q

what is variance

A

the extent to which a model’s fitted parameters will tend to deviate from their central tendency across different datasets

44
Q

bias-variance trade off

A

ideally we want low variance and low bias, but that is rare in science so we make trade-offs

  • low bias, high variance = flexible data analysis (almost any pattern can be detected which can be risky) = exploratory data analysis
  • high bias, low variance = strict adherence to a fixed set of procedures (limited range of patterns identified which is good) = confirmatory data analysis
45
Q

cross-validation

A

cross validation - various techniques involved in testing and training a model on different samples of data

canonical cross validation = classical replication (where a model is trained on a dataset and tested on a completely different and independent dataset)

46
Q

k-fold cross-validation

A

used to test our model when it is not possible to collect new datasets - we recycle our original dataset.
k = number of folds (typical number is 10)

procedure:
- collect data e.g. for 100 participants
- use 90 people to train your model and then test it by predicting the remaining 10 = one fold
- repeat this until everyone's data has been used to both train and test the model (see the sketch below)
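a minimal R sketch of 10-fold cross-validation, assuming a data frame dat with outcome y and predictor x (names hypothetical):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))  # randomly assign each row to a fold
cv_mse <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]                       # k-1 folds used for training
  test <- dat[folds == i, ]                        # held-out fold used for testing
  m <- lm(y ~ x, data = train)
  mean((test$y - predict(m, test))^2)              # prediction error on unseen data
})
mean(cv_mse)                                       # cross-validated estimate of prediction error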

47
Q

confirmatory research

A

characterised by the fact that you specify prior to data collection the exact statistical analyses you intend to run and your expectations about the relationship between variables

48
Q

mean squared error

A

used to assess model fit

MSE = (1/n) × Σ(observed yi − estimated yi)² - the differences are squared to avoid negative numbers, summed over all observations, and then multiplied by 1/n

the bigger the difference between the model estimate and the observed value, the higher MSE will be, indicating a worse model

HOWEVER, MSE is heavily influenced by outliers in the data, which sometimes leads researchers to choose other measures such as mean absolute error instead
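in R (obs and pred are hypothetical vectors of observed and predicted values):

mean((obs - pred)^2)   # mean squared error
mean(abs(obs - pred))  # mean absolute error - less sensitive to outliers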

49
Q

poisson regression

A

only briefly touched on this
designed specifically for outcome variables that cannot go below 0 and have a count tendency (e.g. number of events)
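in R this uses the same glm() format with a different family (names hypothetical):

m <- glm(count ~ x, family = poisson, data = dat)  # poisson regression for count outcomes
summary(m)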