assumptions of the GLM Flashcards

1
Q

CLT assumptions

A

when samples are large (n > 30), the distribution of sample means (DSM) will be normal regardless of the distribution of scores in the underlying population

if n < 30 and the distribution of scores in the population does not fit the normal distribution, the DSM might not be normal

application of the CLT where n < 30 therefore requires the assumption of normality (a fundamental assumption of the GLM)

2
Q

5 main assumptions of the GLM

A
  1. normality
  2. homogeneity of variance/ homoscedasticity
  3. linearity
  4. additivity
  5. independence
3
Q

violating assumptions of the GLM (model)

A

linearity: fit a linear model where the relationship is not actually linear

additivity: fit an additive model where relationship is not actually additive

normality: the mean is not an appropriate measure of central tendency (given the distribution of residuals in the model), leading to statistical bias
- statistical bias: where the sample statistic systematically over- or under-estimates the population parameter

4
Q

violating assumptions of the GLM (error)

A
  1. if other assumptions are incorrect, the errors may not fit the assumed distribution of errors, making the method of significance testing inappropriate for the data and causing p values to be unreliable
    - deviations from normality, homogeneity of variance and homoscedasticity can all result in a mismatch between the actual sampling distribution (equivalent to the DSM) and the theoretical distribution

ex. the distribution of t statistics that would be calculated if H0 were true does not fit the theoretical t distribution, so the conversion of the test statistic to a p value may be incorrect

  2. the assumption of independence is important for a slightly different reason
    - violating the assumption of independence results in a sample that is not representative of the underlying population
    - this leads to unreliable conclusions
  3. inferential statistics is formalized guesswork, where we use a defined framework to make inferences about the population from the sample
    - the defined framework contains defined assumptions that allow us to make these inferences
5
Q

how to test the assumptions of the GLM

A

lm() fits a GLM to the data (lm = linear model)
- useful in assessing the assumptions of the GLM
- first use lm() to fit a linear model to two different datasets (see the sketch below)

assessment of GLM assumptions generates results that we typically don’t show
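
a minimal sketch of this first step; two_group, lm_two_group and lm_continuous are names used in later cards, while continuous_data is a hypothetical data frame assumed here:

# assumes data frames two_group (categorical x) and continuous_data (continuous x), each with columns y and x
lm_two_group  <- lm(y ~ x, data = two_group)       # x is a factor (grouping variable)
lm_continuous <- lm(y ~ x, data = continuous_data) # x is a continuous predictor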

6
Q

why are the tests not perfect?

A
  1. several rely on subjective judgements based on visualizations of the data
  2. several are underpowered (small samples give false negatives) or overpowered
  3. in large samples, small and unimportant deviations from an assumption may achieve statistical significance
    - these are true positives, but don't represent problematic deviations
7
Q

as.factor()

A

use to overwrite the two_group$x variable, converting a categorical variable to a factor (grouping variable)

ex. two_group$x <- as.factor(two_group$x)

8
Q

plot()

A

quick visualization of data

9
Q

fitting linear model in R

A

using LSE (least squares error):
lm(outcome ~ predictor, data)

outputs:
regression coefficients (can build the GLM)

write the results to an output object:
- coefficients: regression coefficients (b0, b1)
- fitted.values: predicted values of y (y-hat)
- residuals: values of error for each yi (yi - y-hat)
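
a sketch of pulling these outputs from the fitted object, assuming the lm_two_group model from an earlier card:

lm_two_group$coefficients   # regression coefficients b0 (intercept) and b1 (slope)
lm_two_group$fitted.values  # predicted values of y (y-hat) for each observation
lm_two_group$residuals      # error for each yi (yi - y-hat)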

10
Q

inferential statistics

A

using the output from the lm() function, can run summary() and anova() to obtain results

can see that the results for a two-group analysis using ANOVA match those for a two-group t test

ANOVA: F is the test statistic, where F = t^2 for two groups
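
a sketch, assuming the lm_two_group object fitted earlier:

summary(lm_two_group)  # coefficient table with t statistics and p values
anova(lm_two_group)    # ANOVA table with the F statistic (for two groups, F = t^2)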

11
Q

lm() outputs in assumptions of the GLM

A

tests:
normality
linearity
homogeneity of variance and homoscedasticity

12
Q

normality of residuals

A

GLM assumes residuals have normal distribution

the residual distribution is generally different from the distribution of scores
- can view the distribution of scores to understand why the residuals deviate from normality
- ex. if the residuals have a positive skew and the y variable has a positive skew, it might be possible to correct the skew of the residuals by transforming the y variable

13
Q

why do we assume normality of residuals?

A

the least squares error method (LSE) assumes normality
- if the residuals are normal, the mean will be an appropriate model
- if the residuals are skewed, the mean is not an appropriate measure of central tendency

sig testing assumes normality of residuals
- residuals are used to build the sampling distribution of test statistics, based on the assumption of normality

14
Q

visualizing residuals in R

A
  1. extract residuals from lm output into a vector
    - ex. lm_two_group$residuals
  2. use hist() to visualize the distribution of residuals
    - ex. hist(lm_two_group$residuals, breaks=20)
  3. to better see how the distribution fits with normality, can replot using ggplot and add the normal curve (see the sketch below)
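
a sketch of step 3, assuming ggplot2 is available and using the lm_two_group residuals; the overlaid curve is a normal distribution with the residuals' mean and SD:

library(ggplot2)
res <- lm_two_group$residuals
ggplot(data.frame(res), aes(x = res)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +               # histogram on the density scale
  stat_function(fun = dnorm, args = list(mean = mean(res), sd = sd(res))) # normal curve for comparison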
15
Q

quantile-quantile (QQ) plots

A

allow us to use a straight line to judge fit of two curves

theoretical normal distribution is divided into quantiles
- quantile: subset of defined size
- values at the boundaries of each quantile are then extracted

observed distribution of residuals is divided into quantiles
- values at boundaries of each quantile are extracted

boundaries of the quantiles for the normal distribution (x) are plotted against the boundaries of the quantiles for the observed distribution of residuals (y), forming a quantile-quantile (Q-Q) plot
- if the two distributions are identical, all points fall on the straight line where x = y
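
a base-R sketch of the same idea (the next card shows the shortcut from the lm output), assuming the lm_two_group object:

res <- lm_two_group$residuals
qqnorm(res)  # theoretical normal quantiles (x) against observed residual quantiles (y)
qqline(res)  # reference line; points lie close to it if the residuals are normal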

16
Q

qq plots from lm outputs

A

plot(lm_two_group, 2)
- the 2 selects the Q-Q plot (the second of lm's diagnostic plots)

17
Q

significance testing for normality

A

H0: the distribution is normal
- a significant result means the observed distribution is significantly different from normality

don't rely only on significance testing
- non-significant results may be false negatives
- significant results may not be important if the sample size is large

H0: the observed distribution is the same as the normal distribution
- if the test for normality is significant, the data don't fit the normal distribution
- if the test for normality is not significant, the data are consistent with normality

18
Q

shapiro-wilk test for normality

A

shapiro.test(vector_of_residuals)

ex. shapiro.test(lm_continuous$residuals)

output: get a p value

19
Q

homogeneity of variance

A

if x is categorical (forms groups), the assumption is called homogeneity of variance

violation of this assumption means there is heterogeneity of variance

20
Q

homoscedasticity

A

if x is continuous, the assumption is homoscedasticity

violation of this means there's heteroscedasticity (a cone shape in the residuals, which get larger at higher predicted values)

21
Q

violating homogeneity of variance and homoscedasticity

A

heterogeneity of variance/heteroscedasticity is characterized by having larger residuals for larger (or smaller) values of y-hat
- residuals may still be symmetrical, in which case the regression coefficients will remain unbiased
- the model may still be valid

estimates of population variance may be inaccurate, as estimates generated from sample data will vary depending on value of y-hat
- if estimates of population variance are inaccurate, sampling distributions may be inaccurate, creating error in the estimation of the p value

22
Q

assumption of homogeneity of variance and independent t test

A

if variance estimates are similar for the two groups, generate a single estimate of the population variance (pooled variance)

if variance estimates are different between the two groups, the pooled variance is a poor estimate of the variance in either population
- use Welch's t test, which doesn't assume homogeneity of variance

for two-sample independent-groups data, deviation from homogeneity of variance can be tolerated by simply selecting Welch's t test (see the sketch below)
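
a sketch with the two_group data; in R, t.test() applies Welch's correction by default (var.equal = FALSE):

t.test(y ~ x, data = two_group)                    # Welch's t test (default, var.equal = FALSE)
t.test(y ~ x, data = two_group, var.equal = TRUE)  # Student's t test with pooled variance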

23
Q

Levene’s test for homogeneity of variance

A

H0: there is homogeneity of variance between groups (no difference in the residuals between groups)

if significant: there is a difference in variance (size of residuals) between groups -> heterogeneity of variance

how the test works (see the sketch below):
  1. fit a linear model to calculate residuals
  2. convert residuals to absolute residuals
  3. run an ANOVA on the absolute residuals (works for grouped data with >2 groups)

can be underpowered (small n) and overpowered (large n)
- important to consider the results from significance testing and to visualize the data when making a judgement
- can visualize heterogeneity of variance using pred-resid plots
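
a sketch of those steps done by hand, assuming the two_group data frame; this should match leveneTest() with center = "mean":

two_group$abs_res <- abs(residuals(lm(y ~ x, data = two_group)))  # steps 1-2: residuals from group means, made absolute
anova(lm(abs_res ~ x, data = two_group))                          # step 3: ANOVA on the absolute residuals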

24
Q

levene’s test in R

A

leveneTest(y ~ x, two_group, center = "mean")   (from the car package)
- the default center is the median
- output: a p value

25
Q

if we encounter significant heterogeneity of variance

A

- if using two-group independent data: use Welch's t test
- may be able to correct heterogeneity using data transformation
- unless heterogeneity is dramatic, we can often ignore it (ANOVA is reasonably tolerant to deviations from homogeneity of variance)

26
Q

homoscedasticity

A

the assumption that the residuals do not change as a function of y-hat
- refers to datasets where x is continuous
- to determine whether residuals vary as a function of y-hat, plot the predicted values of y (x axis) against the residuals (y axis): a pred-resid plot
- the vertical distance from 0 shows the magnitude of each residual
- homoscedastic data show no clear change in residuals as a function of x

27
Q

heteroscedasticity

A

the spread of points around the line of best fit increases as y-hat increases

28
Q

grouped data

A

the pred-resid plot is useful when there are multiple groups, as it orders the groups by increasing value of the group mean
- easier to identify whether variance increases as a function of y-hat

29
Q

generating pred-resid plots in R

A

plot(lm_continuous, 1)
- the 1 selects the pred-resid plot

30
Q

loess method

A

a line of best fit where the method takes a subset of the data points and draws a line of best fit through the local points
- often looks wavy

31
Q

zpred-zresid plots

A

a common variant of the pred-resid plot that converts the pred and resid values to z scores, standardizing the scales
- z = 0 represents the mean
- z = 1 represents a score one standard deviation above the mean

useful for identifying outliers
- any score with a residual more than 3 standard deviations from the mean
- with zpred-zresid plots, these scores have residuals < -3 or > 3
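
a sketch, assuming the lm_continuous object; scale() converts values to z scores:

zpred  <- scale(lm_continuous$fitted.values)  # predicted values as z scores
zresid <- scale(lm_continuous$residuals)      # residuals as z scores
plot(zpred, zresid)                           # look for outliers with zresid < -3 or > 3
abline(h = c(-3, 0, 3), lty = 2)              # reference lines at 0 and +/-3
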
32
Q

assumption of linearity

A

assess linearity by visualizing the data

can add a line of best fit using the geom_smooth() function (see the sketch below)
- to specify a linear model, use the argument method = "lm"
- to fit a curve, use the argument method = "loess"
- geom_smooth() adds a 95% CI for the line of best fit; remove it with se = FALSE
- the CI level is adjusted with level = 0.95
- the band can be coloured with fill
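
a sketch of these options, assuming a data frame continuous_data with columns x and y (hypothetical names):

library(ggplot2)
ggplot(continuous_data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight line of best fit, no CI band
# for a curve with a 95% CI band: geom_smooth(method = "loess", level = 0.95, fill = "grey")
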
33
Q

linearity and pred-resid plots

A

deviation from linearity is clear from the scatterplot
- nonlinear relationships are even more obvious in the pred-resid plot
- if the relationship were linear, the line of best fit would approximate a horizontal line

34
Q

correcting deviations from linearity

A

it is possible to linearize the relationship between variables by data transformation
- if the relationship can't be linearized, the GLM can't be applied

35
Q

assumption of additivity

A

only applies to models with multiple predictor variables

assumes that the effects caused by one predictor variable are simply added to the effects caused by a second predictor variable

ex. measuring depression in humans (outcome); some participants experienced early-life stress and some didn't (predictor 1); some participants have a genotype that offers protection against stress, others have a susceptible genotype (predictor 2); we examine the combined effects of stress and genotype on depression (see the sketch below)
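
a sketch of the additive model versus the interaction model for this example; depression, stress, genotype and dep_data are hypothetical names:

lm(depression ~ stress + genotype, data = dep_data)  # additive: effects of the two predictors simply add
lm(depression ~ stress * genotype, data = dep_data)  # also fits the stress x genotype interaction
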
36
Q

spotting additivity in a graph

A

if the lines are parallel, the effect of genotype is the same and doesn't depend on stress

if the effects are not additive, there is an interaction: the effect of genotype changes depending on stress

37
Q

assumption of independence

A

will be met if the value of one score is unaffected by the values of other scores for the same variable
- every score for a variable is independent of every other score
- if the score from one individual is affected by another individual, there's a lack of independence

ex. the light source on a microscope is dying and fluorescence scores get progressively lower

meeting this assumption is largely achieved through careful experimental design

38
Q

testing for independence

A

plot the residuals against the order in which the scores were collected
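
a sketch, assuming the rows of the data are stored in collection order (that ordering is an assumption; lm() does not record it):

res <- lm_two_group$residuals
plot(seq_along(res), res, xlab = "collection order", ylab = "residual")  # look for drift or trends over time
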
39
Q

data transformation

A

mathematical manipulation of a variable

deviations from normality, homogeneity of variance/homoscedasticity and linearity can be corrected, but fixing a deviation from one assumption can cause a deviation from another assumption

40
Q

prioritizing assumptions

A

  1. linearity is most important
    - if the data deviate from linearity, we are fitting the wrong model and every part of the analysis will be incorrect
  2. normality is second
    - if the data deviate from normality, estimates of the regression coefficients may be biased, and the distribution of test statistics may deviate from the theoretical sampling distribution
    - deviations from the theoretical sampling distribution can be mitigated if the sample size is large or if bootstrapping is used (which doesn't assume normality)
  3. heterogeneity of variance/heteroscedasticity is least important
    - the distribution of test statistics may deviate from the theoretical sampling distribution
    - less problematic if we have a large n or bootstrap

41
Q

which data transformation is appropriate?

A

identifying the right transformation can be difficult; often trial and error

use the distribution of residuals as a guide

the distributions of the x and y scores can also be useful in determining which variable may be causing the residuals to deviate from normality

42
Q

transform to remove positive skew

A

  1. square root
  2. cube root (more extreme skew)
  3. log2(y)
  4. log10(y) (most extreme)

43
Q

transform to remove negative skew

A

  1. square
  2. cube
  3. 2^y
  4. 10^y
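
a sketch of applying these transformations (and those from the previous card) to a data frame called data, as in the summary-code card:

data$y_sqrt  <- sqrt(data$y)   # positive skew, mild (y must be >= 0)
data$y_log10 <- log10(data$y)  # positive skew, most extreme (y must be > 0)
data$y_sq    <- data$y^2       # negative skew, mild
data$y_exp10 <- 10^data$y      # negative skew, most extreme
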
44
Q

testing assumptions: summary code

A

transform y (ex. squaring): data$y <- data$y^2
fit the linear model: lm_out <- lm(y ~ x, data = data)
distribution of residuals: hist(lm_out$residuals)
test for normality: shapiro.test(lm_out$residuals)
distribution of x scores: hist(data$x)
distribution of y scores: hist(data$y)
Q-Q plot: plot(lm_out, 2)
pred-resid plot: plot(lm_out, 1)
test for homogeneity of variance (grouped data only): car::leveneTest(y ~ x, data = data)

45
Q

transform x or y?

A

if x is continuous and the relationship between x and y is nonlinear, you can transform x or y

if the residuals deviate from normality and the scores for one of the variables deviate similarly, that variable is a candidate for transformation

heteroscedasticity: transformation of y is most effective

46
Q

arguments for transformation

A

if the data don't fit the assumptions of the GLM, conclusions may not be valid

for some biological relationships, transformation makes sense
- ex. a concentration gradient from a source follows a cube-root distribution

47
Q

arguments against transformation

A

if you transform a measured variable, the meaning of that variable becomes less intuitive

data transformation may not be able to fix everything
- may create a new deviation

data transformation can be used unscrupulously
- p-hacking (avoid by not performing hypothesis testing on the data until after determining the best way to transform the data)