advanced topics Flashcards

1
Q

logistic regression

A

predicts the probability of the outcome, P(y = 1), from our predictors (xs)

P(yi = 1) = 1 / (1 + e^(−(β0 + β1xi)))
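a minimal R sketch of this formula (the coefficient and predictor values are hypothetical):

b0 <- -1.5; b1 <- 0.8; xi <- 2  # hypothetical intercept, slope, and predictor value
1 / (1 + exp(-(b0 + b1 * xi)))  # inverse logit: linear predictor -> probability (~0.52)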

2
Q

what is probability

A

ranges from 0 to 1 (0 = the event cannot happen, 1 = it is certain)

3
Q

binary outcomes

A

binary variables = type of categorical variable with only two levels

we code them 0 and 1 in terms of whether an event did (1) or did not (0) happen - this is NOT the same as dummy coding

4
Q

what are odds

A

odds of an event occurring = the ratio of it occurring : it not occurring
odds can only ever take positive values (unlike probability, they have no upper bound)
odds = probability / (1 − probability)
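a quick worked example: if P = 0.75, then odds = 0.75 / (1 − 0.75) = 3, i.e. 3:1 in favour of the event occurring.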

5
Q

what are log odds

A

natural log of the odds - when plotted, the log odds are linear and form a continuous DV

logodds = ln[ P(y=1) / (1 − P(y=1)) ]

log odds above +4 (or below −4) correspond to probabilities very close to 1 (or 0) - since 0 is the midpoint, log odds of 0 = 50% probability
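in R, the base functions qlogis() and plogis() convert between the two scales (a small sketch):

p <- 0.75
qlogis(p)          # log odds: ln(0.75/0.25) ~ 1.1
plogis(qlogis(p))  # back to probability, 0.75
plogis(4)          # ~0.98 - log odds of +4 is close to, but not exactly, probability 1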

6
Q

maximum likelihood estimation

A

MLE is used to estimate logistic regression models as MLE finds the logistic regression coefficients that maximise the likelihood of the observed data having occurred.
MLE maximises the log-likelihood (equivalently, it minimises the deviance, −2 × log-likelihood) - a higher log-likelihood indicates a better model

7
Q

evaluating logistic regression models

A

compare our model to a null model (with no predictors) and assess the improvement in fit

we compare our model to our baseline model using deviance
- deviance = -2 * loglikelihood (aka -2LL)
we calculate the difference in deviances between our model and the baseline - this difference follows a χ² distribution (df = number of added predictors), giving a p-value to assess significance
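a minimal R sketch of this comparison, assuming a data frame dat with outcome y and predictor x (names hypothetical):

m0 <- glm(y ~ 1, family = binomial, data = dat)  # baseline model with no predictors
m1 <- glm(y ~ x, family = binomial, data = dat)  # our model
anova(m0, m1, test = "Chisq")                    # difference in deviance and its p-value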

8
Q

generalised linear model

A

in R this is the glm() function, used to conduct logistic regression. It uses the same format as lm() but with the addition of family = "" to determine what kind of regression we want / how the data are assumed to be distributed
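for example (dat, y and x are hypothetical names):

m <- glm(y ~ x, family = binomial, data = dat)  # logistic regression
summary(m)                                      # coefficients are on the log-odds scale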

9
Q

binomial distribution

A

a discrete probability distribution - gives the probability of each possible number of successes in n independent trials, each with success probability p

10
Q

probability mass function

A

probability that a discrete random variable is exactly equal to some value

f(k; n, p) = Pr(X = k) = (n choose k) × p^k × q^(n−k)
where:
- k = number of successes
- n = number of trials
- p = probability of successes
- q = probability of failure (1-p)
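in R, dbinom() computes this directly - e.g. the probability of exactly 3 successes in 10 trials with p = 0.5:

dbinom(3, size = 10, prob = 0.5)  # ~0.117
choose(10, 3) * 0.5^3 * 0.5^7     # same value from the formula above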

11
Q

interpreting glm() output

A

computation of residuals is different now we're dealing with deviance (rather than variance) - a model with less residual deviance is better.
our β coefficients for the IVs are the change in log odds of y for each unit increase in x

12
Q

what is odds ratio

A

log odds don't provide easily interpretable results; therefore, the β coefficients (which are in log odds) are converted to odds ratios, which are easier to interpret.
the odds ratio is obtained by exponentiating the β coefficients
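in R (assuming a fitted glm model m, as in the earlier cards):

exp(coef(m))             # converts the log-odds coefficients into odds ratios
exp(confint.default(m))  # Wald confidence intervals on the odds ratio scale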

13
Q

interpreting odds ratio

A

1 = no effect (the odds are unchanged)
<1 = negative effect - e.g. 0.8 = a 20% decrease in the odds
>1 = positive effect - e.g. 1.2 = a 20% increase in the odds

14
Q

likelihood ratio test

A

method of logistic model comparison = tests whether a more complex model significantly improves the likelihood of the observed data over a simpler one
- alternative to the z-test but can only be used for nested models (non-nested models need AIC/BIC)

15
Q

z-test

A

tests the statistical significance of predictors (can be prone to type 2 errors)

z = β / SE(β)

16
Q

power analysis

A

power is the probability of CORRECTLY detecting an effect that exists - tells us what percentage of the time we would reject the null when it is false

power = 1 - β (NOT THE SAME β AS IN A REGRESSION)

power depends on:
- sample size
- effect size
- significance level

17
Q

conventional value for power

A

0.8

18
Q

power calculations in R

A

use the pwr package

examples:
t-test
pwr.t.test(n = group size, d = effect size, sig.level = 0.05, power = 0.8, type = "two.sample", alternative = "greater")
- this is just an example so values may differ and not all of the above arguments will be included - leave out the one quantity you want pwr to solve for (e.g. omit n to find the required sample size)

correlation
pwr.r.test()
- basically the same as above but d becomes r (the correlation coefficient)

f-tests
pwr.f2.test(u = k, v = n - k - 1, f2 = effect size, sig.level = 0.05, power = 0.8)
- again just an example, so replace the general symbols with actual numbers

19
Q

what is causality?

A

one event directly leads to another

  • this does not have to be a direct 1:1 relationship
20
Q

conditions for causality

A
  • covariance = two variables change together
  • plausibility = does the relationship make sense
  • temporal precedence = if A causes B then A must always occur before B
  • no reasonable alternative other than A causes B
21
Q

testing causality

A

identifying causal relationships is often possible through study design rather than statistical tests - it is harder to do this with observational studies but we can use:
- propensity score matching (simulated control group)
- instrumental variable analysis (simulates the effect of randomly assigning people to groups)
… to make causal claims from observational data

22
Q

endogeneity

A

a condition that affects our ability to make a causal claim.

  • theoretically = occurs when the marginal distribution of a predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable
  • practically = occurs when a predictor variable is correlated with the error term (causing bias in our β coefficients)
23
Q

problems with endogeneity

A
  1. can't easily tell if our variables are endogenous (i.e. whether x and the error term are correlated)
  2. even if you successfully identify endogeneity in your model, you must determine why it is there in order to solve the problem
24
Q

sources of endogeneity: simultaneity bias

A

causality goes both ways (x causes y, y causes x)
- solution = use statistical models developed specifically for this (e.g. two-stage least squares regression)

25
Q

sources of endogeneity: omitted/confounding variables

A

when x is correlated with an omitted variable (z), the variance in y explained by z is absorbed into the residual error, biasing the β for x
- solution = ensure all potential confounds are measured and included in the model

26
Q

sources of endogeneity: measurement error

A

instead of measuring x, you measure x* (x with error included)
- solution = careful planning and study design

27
Q

interpolation

A

predicting a value from a model within the range of given data points
e.g. if your data spans 10-50, using the model to predict someone with a value of 35

28
Q

extrapolation

A

using a model to predict a value outside of the range of given data
e.g. if your data spans 10 - 50 and you use it to predict someone with a value of 60 or 5

  • need to take caution when using extrapolation - since we don't have data points on both sides of our predicted values, we don't know for sure that the relationship follows a linear pattern (as we would predict)
29
Q

issues with missing data

A
  • loss of efficiency due to smaller n
  • bias (i.e. incorrect estimates)
30
Q

types of missing data: MAR

A

missing at random
- missingness is related to other observed variables in the model but not to the missing values themselves
“when the probability of missing data on variable X is related to other variables in the model but not the value of X itself”

challenge = no way to confirm there is no relation between the predictors and the missing data

31
Q

types of missing data: MCAR

A

missing completely at random
- genuinely random missingness, no relation between x/any other variable with the missingness of x
- affects all levels of our data equally/without bias

32
Q

types of missing data: MNAR

A

missing not at random
“when the probability of missingness on x is related to the values of x itself”

challenge = no way to verify MNAR without knowledge of the missing values

33
Q

methods of dealing with missing data: deletion methods

A

listwise deletion = delete everyone with missing data from the analysis
- NOT recommended - gives biased results

pairwise deletion = uses the cases available for each analysis, so different cases contribute to different correlation matrices
- NOT recommended (but doesn't reduce power as much as listwise)

34
Q

methods of dealing with missing data: imputation methods

A

mean imputation = replace missing values with the mean of that variable
- NOT recommended - artificially reduces variability and is biased (probably the worst method)

regression imputation = replace missing values with their predicted values from a regression model
- 'normal' vs stochastic (stochastic adds a residual term to overcome the loss of variance)

multiple imputation (MI) = imputes missing data several times to create complete data sets (results are pooled to get parameter estimates and SEs)
- recommended if data is likely to be MAR

35
Q

methods of dealing with missing data: maximum likelihood estimation (MLE)

A

estimation method = makes use of all the model information to arrive at parameter estimates 'as if' the data were complete
- recommended if data likely to be MAR or MCAR

36
Q

methods of dealing with missing data: methods for MNAR

A

selection models = combine a model predicting missingness with the analysis model of interest, adjusting its parameter estimates
- often give worse results than MLE or MI

pattern mixture models = stratify the sample according to different missing data patterns and estimate the substantive model in each subgroup
- rely on strong, untestable assumptions
- good to include as part of a sensitivity analysis, but often better to use MLE or MI

37
Q

exploratory analysis

A

used when we are interested in the relationship between variables but don’t have clear predictions about how they’re related/how to test them.
exploratory analyses can take many forms but all share the fact that the researcher doesn't have specific predictions about the IV and DV

it is just done to learn about your data:
- focus on minimising prediction error
- data sets must be large enough to support training data
- estimate prediction error/assess model performance
- control the bias-variance trade-off

38
Q

overfitting

A

= the tendency for statistical models to fit sample specific noise as if it were signal
since noise is random, fitting a model to noise makes it bad at predicting a new dataset

39
Q

training data

A

= the data we ‘train’ our model with (the data used to fit the model line)

40
Q

test data

A

= data we use to test how well our trained model can predict

41
Q

p-hacking

A

= a special (bad) case of overfitting that takes place prior to or in parallel with model estimation, e.g. choosing which analyses to report (if the data doesn't fit, just remove it / stargazing)

42
Q

what is bias

A

the tendency for a model to consistently produce answers that are wrong in a particular direction

43
Q

what is variance

A

the extent to which a model’s fitted parameters will tend to deviate from their central tendency across different datasets

44
Q

bias-variance trade off

A

ideally we want low variance and low bias, but that is rare in science so we make trade-offs

  • low bias, high variance = flexible data analysis (almost any pattern can be detected which can be risky) = exploratory data analysis
  • high bias, low variance = strict adherence to a fixed set of procedures (limited range of patterns identified which is good) = confirmatory data analysis
45
Q

cross-validation

A

cross validation - various techniques involved in testing and training a model on different samples of data

canonical cross validation = classical replication (where a model is trained on a dataset and tested on a completely different and independent dataset)

46
Q

k-fold cross-validation

A

used to test our model when it is not possible to collect new datasets - we recycle our original dataset.
k = number of folds (typical number is 10)

procedure:
- collect data e.g. for 100 participants
- use 90 people to train your model and then test it by predicting the remaining 10 = one fold
- repeat this until everyone's data has been used to both train and test the model (see the sketch below)
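a minimal R sketch of 10-fold cross-validation, assuming a data frame dat with outcome y and predictor x (names hypothetical):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))  # randomly assign each row to a fold
cv_mse <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]                       # k-1 folds used for training
  test <- dat[folds == i, ]                        # held-out fold used for testing
  m <- lm(y ~ x, data = train)
  mean((test$y - predict(m, test))^2)              # prediction error on unseen data
})
mean(cv_mse)                                       # cross-validated estimate of prediction error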

47
Q

confirmatory research

A

characterised by the fact that you specify prior to data collection the exact statistical analyses you intend to run and your expectations about the relationship between variables

48
Q

mean squared error

A

used to assess model fit

MSE = (1/n) × Σ(observed yi − estimated yi)² - the differences are squared to avoid negative numbers, summed over all observations, and then multiplied by 1/n

the bigger the difference between the model estimate and the observed value, the higher MSE will be, indicating a worse model

HOWEVER, MSE is heavily influenced by outliers in the data, which sometimes leads researchers to choose other measures such as mean absolute error instead
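in R (obs and pred are hypothetical vectors of observed and predicted values):

mean((obs - pred)^2)   # mean squared error
mean(abs(obs - pred))  # mean absolute error - less sensitive to outliers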

49
Q

poisson regression

A

only briefly touched on this
designed specifically for outcome variables that cannot go below 0 and have a count tendency (e.g. number of events)
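in R this uses the same glm() format with a different family (names hypothetical):

m <- glm(count ~ x, family = poisson, data = dat)  # poisson regression for count outcomes
summary(m)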