assumptions of the GLM Flashcards

1
Q

CLT assumptions

A

when samples are large (n > 30), the distribution of sample means (DSM) will be normal regardless of the distribution of scores in the underlying population

if n < 30 and the distribution of scores in the population does not fit the normal distribution, the DSM might not be normal

application of the CLT where n < 30 therefore requires the assumption of normality (a fundamental assumption of the GLM)

2
Q

5 main assumptions of the GLM

A
  1. normality
  2. homogeneity of variance/ homoscedasticity
  3. linearity
  4. additivity
  5. independence
3
Q

violating assumptions of the GLM (model)

A

linearity: fit a linear model where the relationship is not actually linear

additivity: fit an additive model where relationship is not actually additive

normality: the mean is not an appropriate measure of central tendency (given the distribution of residuals in the model), leading to statistical bias
- statistical bias: where the sample statistic systematically over- or under-estimates the population parameter

4
Q

violating assumptions of the GLM (error)

A
  1. if other assumptions are incorrect, the errors may not fit the assumed distribution of errors, making the method of significance testing inappropriate for the data and causing p values to be unreliable
    - deviations from normality, homogeneity of variance and homoscedasticity can all result in a mismatch between the actual sampling distribution (equivalent to the DSM) and the theoretical distribution

ex. the distribution of t statistics that would be calculated if H0 were true does not fit the theoretical t distribution, so the conversion of the test statistic to a p value may be incorrect

  2. the assumption of independence is important for a slightly different reason
    - violating the assumption of independence results in a sample that is not representative of the underlying population
    - this leads to unreliable conclusions
  3. inferential statistics is formalized guesswork, where we use a defined framework to make inferences about the population from the sample
    - the defined framework contains defined assumptions that allow us to make these inferences
5
Q

how to test the assumptions of the GLM

A

lm() fits a GLM to the data (lm = linear model)
- useful in assessing the assumptions of the GLM
- first use lm() to fit a linear model to two different datasets (see the sketch below)

assessment of GLM assumptions generates results that we typically don’t show
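
a minimal sketch of this first step; two_group, lm_two_group and lm_continuous are names used in later cards, while continuous_data is a hypothetical data frame assumed here:

# assumes data frames two_group (categorical x) and continuous_data (continuous x), each with columns y and x
lm_two_group  <- lm(y ~ x, data = two_group)       # x is a factor (grouping variable)
lm_continuous <- lm(y ~ x, data = continuous_data) # x is a continuous predictor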

6
Q

why are the tests not perfect?

A
  1. several rely on subjective judgements based on visualizations of the data
  2. several are underpowered (small samples give false negatives) or overpowered
  3. in large samples, small and unimportant deviations from an assumption may achieve statistical significance
    - these are true positives, but don't represent problematic deviations
7
Q

as.factor()

A

use to overwrite the two_group$x variable, converting a categorical variable to a factor (grouping variable)

ex. two_group$x <- as.factor(two_group$x)

8
Q

plot()

A

quick visualization of data

9
Q

fitting linear model in R

A

using LSE (least squares error):
lm(outcome ~ predictor, data)

outputs:
regression coefficients (can build the GLM)

write the results to an output object:
- coefficients: regression coefficients (b0, b1)
- fitted.values: predicted values of y (y-hat)
- residuals: values of error for each yi (yi - y-hat)
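
a sketch of pulling these outputs from the fitted object, assuming the lm_two_group model from an earlier card:

lm_two_group$coefficients   # regression coefficients b0 (intercept) and b1 (slope)
lm_two_group$fitted.values  # predicted values of y (y-hat) for each observation
lm_two_group$residuals      # error for each yi (yi - y-hat)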

10
Q

inferential statistics

A

using the output from the lm() function, can run summary() and anova() to obtain results

can see that the results for a two-group analysis using ANOVA match those for a two-group t test

ANOVA: F is the test statistic, where F = t^2 for two groups
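
a sketch, assuming the lm_two_group object fitted earlier:

summary(lm_two_group)  # coefficient table with t statistics and p values
anova(lm_two_group)    # ANOVA table with the F statistic (for two groups, F = t^2)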

11
Q

lm() outputs in assumptions of the GLM

A

tests:
normality
linearity
homogeneity of variance and homoscedasticity

12
Q

normality of residuals

A

GLM assumes residuals have normal distribution

the residual distribution is generally different from the distribution of scores
- can view the distribution of scores to understand why the residuals deviate from normality
- ex. if the residuals have a positive skew and the y variable has a positive skew, it might be possible to correct the skew of the residuals by transforming the y variable

13
Q

why do we assume normality of residuals?

A

the least squares error method (LSE) assumes normality
- if the residuals are normal, the mean will be an appropriate model
- if the residuals are skewed, the mean is not an appropriate measure of central tendency

sig testing assumes normality of residuals
- residuals are used to build the sampling distribution of test statistics, based on the assumption of normality

14
Q

visualizing residuals in R

A
  1. extract residuals from lm output into a vector
    - ex. lm_two_group$residuals
  2. use hist() to visualize the distribution of residuals
    - ex. hist(lm_two_group$residuals, breaks=20)
  3. to better see how the distribution fits with normality, can replot using ggplot and add the normal curve (see the sketch below)
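
a sketch of step 3, assuming ggplot2 is available and using the lm_two_group residuals; the overlaid curve is a normal distribution with the residuals' mean and SD:

library(ggplot2)
res <- lm_two_group$residuals
ggplot(data.frame(res), aes(x = res)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20) +               # histogram on the density scale
  stat_function(fun = dnorm, args = list(mean = mean(res), sd = sd(res))) # normal curve for comparison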
15
Q

quantile-quantile (QQ) plots

A

allow us to use a straight line to judge fit of two curves

theoretical normal distribution is divided into quantiles
- quantile: subset of defined size
- values at the boundaries of each quantile are then extracted

observed distribution of residuals is divided into quantiles
- values at boundaries of each quantile are extracted

boundaries of the quantiles for the normal distribution (x) are plotted against the boundaries of the quantiles for the observed distribution of residuals (y), forming a quantile-quantile (Q-Q) plot
- if the two distributions are identical, all points fall on the straight line where x = y
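
a base-R sketch of the same idea (the next card shows the shortcut from the lm output), assuming the lm_two_group object:

res <- lm_two_group$residuals
qqnorm(res)  # theoretical normal quantiles (x) against observed residual quantiles (y)
qqline(res)  # reference line; points lie close to it if the residuals are normal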

16
Q

qq plots from lm outputs

A

plot(lm_two_group, 2)
- the 2 selects the Q-Q plot (the second of lm's diagnostic plots)

17
Q

significance testing for normality

A

H0: the distribution is normal
- a significant result means the observed distribution is significantly different from normality

don't rely only on significance testing
- non-significant results may be false negatives
- significant results may not be important if the sample size is large

H0: the observed distribution is the same as the normal distribution
- if the test for normality is significant, the data don't fit the normal distribution
- if the test for normality is not significant, the data are consistent with normality

18
Q

shapiro-wilk test for normality

A

shapiro.test(vector_of_residuals)

ex. shapiro.test(lm_continuous$residuals)

output: get a p value

19
Q

homogeneity of variance

A

if x is categorical (forms groups), the assumption is called homogeneity of variance

violation of this assumption means there is heterogeneity of variance

20
Q

homoscedasticity

A

if x is continuous, the assumption is homoscedasticity

violation of this means there's heteroscedasticity (a cone shape in the residuals, which get larger at higher predicted values)

21
Q

violating homogeneity of variance and homoscedasticity

A

heterogeneity of variance/heteroscedasticity is characterized by having larger residuals for larger (or smaller) values of y-hat
- residuals may still be symmetrical, in which case the regression coefficients will remain unbiased
- the model may still be valid

estimates of population variance may be inaccurate, as estimates generated from sample data will vary depending on value of y-hat
- if estimates of population variance are inaccurate, sampling distributions may be inaccurate, creating error in the estimation of the p value

22
Q

assumption of homogeneity of variance and independent t test

A

if variance estimates are similar for the two groups, generate a single estimate of the population variance (pooled variance)

if variance estimates are different between the two groups, the pooled variance is a poor estimate of the variance in either population
- use Welch's t test, which doesn't assume homogeneity of variance

for two-sample independent-groups data, deviation from homogeneity of variance can be tolerated by simply selecting Welch's t test (see the sketch below)
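
a sketch with the two_group data; in R, t.test() applies Welch's correction by default (var.equal = FALSE):

t.test(y ~ x, data = two_group)                    # Welch's t test (default, var.equal = FALSE)
t.test(y ~ x, data = two_group, var.equal = TRUE)  # Student's t test with pooled variance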

23
Q

Levene’s test for homogeneity of variance

A

H0: there is homogeneity of variance between groups (no difference in the residuals between groups)

if significant: there is a difference in variance (size of residuals) between groups -> heterogeneity of variance

how the test works (see the sketch below):
  1. fit a linear model to calculate residuals
  2. convert residuals to absolute residuals
  3. run an ANOVA on the absolute residuals (works for grouped data with >2 groups)

can be underpowered (small n) and overpowered (large n)
- important to consider the results from significance testing and to visualize the data when making a judgement
- can visualize heterogeneity of variance using pred-resid plots
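
a sketch of those steps done by hand, assuming the two_group data frame; this should match leveneTest() with center = "mean":

two_group$abs_res <- abs(residuals(lm(y ~ x, data = two_group)))  # steps 1-2: residuals from group means, made absolute
anova(lm(abs_res ~ x, data = two_group))                          # step 3: ANOVA on the absolute residuals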

24
Q

levene’s test in R

A

leveneTest(y ~ x, two_group, center = "mean")   (from the car package)
- the default center is the median
- output: a p value

25
Q

if we encounter significant heterogeneity of variance

A

- if using two-group independent data: use Welch's t test
- may be able to correct heterogeneity using data transformation
- unless heterogeneity is dramatic, we can often ignore it (ANOVA is reasonably tolerant to deviations from homogeneity of variance)

26
Q

homoscedasticity

A

the assumption that the residuals do not change as a function of y-hat
- refers to datasets where x is continuous
- to determine whether residuals vary as a function of y-hat, plot the predicted values of y (x axis) against the residuals (y axis): a pred-resid plot
- the vertical distance from 0 shows the magnitude of each residual
- homoscedastic data show no clear change in residuals as a function of x

27
Q

heteroscedasticity

A

the spread of points around the line of best fit increases as y-hat increases

28
Q

grouped data

A

the pred-resid plot is useful when there are multiple groups, as it orders the groups by increasing value of the group mean
- easier to identify whether variance increases as a function of y-hat

29
Q

generating pred-resid plots in R

A

plot(lm_continuous, 1)
- the 1 selects the pred-resid plot

30
Q

loess method

A

a line of best fit where the method takes a subset of the data points and draws a line of best fit through the local points
- often looks wavy

31
Q

zpred-zresid plots

A

a common variant of the pred-resid plot that converts the pred and resid values to z scores, standardizing the scales
- z = 0 represents the mean
- z = 1 represents a score one standard deviation above the mean

useful for identifying outliers
- any score with a residual more than 3 standard deviations from the mean
- with zpred-zresid plots, these scores have residuals < -3 or > 3
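
a sketch, assuming the lm_continuous object; scale() converts values to z scores:

zpred  <- scale(lm_continuous$fitted.values)  # predicted values as z scores
zresid <- scale(lm_continuous$residuals)      # residuals as z scores
plot(zpred, zresid)                           # look for outliers with zresid < -3 or > 3
abline(h = c(-3, 0, 3), lty = 2)              # reference lines at 0 and +/-3
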
32
Q

assumption of linearity

A

assess linearity by visualizing the data

can add a line of best fit using the geom_smooth() function (see the sketch below)
- to specify a linear model, use the argument method = "lm"
- to fit a curve, use the argument method = "loess"
- geom_smooth() adds a 95% CI for the line of best fit; remove it with se = FALSE
- the CI level is adjusted with level = 0.95
- the band can be coloured with fill
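
a sketch of these options, assuming a data frame continuous_data with columns x and y (hypothetical names):

library(ggplot2)
ggplot(continuous_data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # straight line of best fit, no CI band
# for a curve with a 95% CI band: geom_smooth(method = "loess", level = 0.95, fill = "grey")
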
33
Q

linearity and pred-resid plots

A

deviation from linearity is clear from the scatterplot
- nonlinear relationships are even more obvious in the pred-resid plot
- if the relationship were linear, the line of best fit would approximate a horizontal line

34
Q

correcting deviations from linearity

A

it is possible to linearize the relationship between variables by data transformation
- if the relationship can't be linearized, the GLM can't be applied

35
Q

assumption of additivity

A

only applies to models with multiple predictor variables

assumes that the effects caused by one predictor variable are simply added to the effects caused by a second predictor variable

ex. measuring depression in humans (outcome); some participants experienced early-life stress and some didn't (predictor 1); some participants have a genotype that offers protection against stress, others have a susceptible genotype (predictor 2); we examine the combined effects of stress and genotype on depression (see the sketch below)
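
a sketch of the additive model versus the interaction model for this example; depression, stress, genotype and dep_data are hypothetical names:

lm(depression ~ stress + genotype, data = dep_data)  # additive: effects of the two predictors simply add
lm(depression ~ stress * genotype, data = dep_data)  # also fits the stress x genotype interaction
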
36
Q

spotting additivity in a graph

A

if the lines are parallel, the effect of genotype is the same and doesn't depend on stress

if the effects are not additive, there is an interaction: the effect of genotype changes depending on stress

37
Q

assumption of independence

A

will be met if the value of one score is unaffected by the values of other scores for the same variable
- every score for a variable is independent of every other score
- if the score from one individual is affected by another individual, there's a lack of independence

ex. the light source on a microscope is dying and fluorescence scores get progressively lower

meeting this assumption is largely achieved through careful experimental design

38
Q

testing for independence

A

plot the residuals against the order in which the scores were collected
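
a sketch, assuming the rows of the data are stored in collection order (that ordering is an assumption; lm() does not record it):

res <- lm_two_group$residuals
plot(seq_along(res), res, xlab = "collection order", ylab = "residual")  # look for drift or trends over time
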
39
Q

data transformation

A

mathematical manipulation of a variable

deviations from normality, homogeneity of variance/homoscedasticity and linearity can be corrected, but fixing a deviation from one assumption can cause a deviation from another assumption

40
Q

prioritizing assumptions

A

  1. linearity is most important
    - if the data deviate from linearity, we are fitting the wrong model and every part of the analysis will be incorrect
  2. normality is second
    - if the data deviate from normality, estimates of the regression coefficients may be biased, and the distribution of test statistics may deviate from the theoretical sampling distribution
    - deviations from the theoretical sampling distribution can be mitigated if the sample size is large or if bootstrapping is used (which doesn't assume normality)
  3. heterogeneity of variance/heteroscedasticity is least important
    - the distribution of test statistics may deviate from the theoretical sampling distribution
    - less problematic if we have a large n or bootstrap

41
Q

which data transformation is appropriate?

A

identifying the right transformation can be difficult; often trial and error

use the distribution of residuals as a guide

the distributions of the x and y scores can also be useful in determining which variable may be causing the residuals to deviate from normality

42
Q

transform to remove positive skew

A

  1. square root
  2. cube root (more extreme skew)
  3. log2(y)
  4. log10(y) (most extreme)

43
Q

transform to remove negative skew

A

  1. square
  2. cube
  3. 2^y
  4. 10^y
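
a sketch of applying these transformations (and those from the previous card) to a data frame called data, as in the summary-code card:

data$y_sqrt  <- sqrt(data$y)   # positive skew, mild (y must be >= 0)
data$y_log10 <- log10(data$y)  # positive skew, most extreme (y must be > 0)
data$y_sq    <- data$y^2       # negative skew, mild
data$y_exp10 <- 10^data$y      # negative skew, most extreme
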
44
Q

testing assumptions: summary code

A

transform y (ex. squaring): data$y <- data$y^2
fit the linear model: lm_out <- lm(y ~ x, data = data)
distribution of residuals: hist(lm_out$residuals)
test for normality: shapiro.test(lm_out$residuals)
distribution of x scores: hist(data$x)
distribution of y scores: hist(data$y)
Q-Q plot: plot(lm_out, 2)
pred-resid plot: plot(lm_out, 1)
test for homogeneity of variance (grouped data only): car::leveneTest(y ~ x, data = data)

45
Q

transform x or y?

A

if x is continuous and the relationship between x and y is nonlinear, you can transform x or y

if the residuals deviate from normality and the scores for one of the variables deviate similarly, that variable is a candidate for transformation

heteroscedasticity: transformation of y is most effective

46
Q

arguments for transformation

A

if the data don't fit the assumptions of the GLM, conclusions may not be valid

for some biological relationships, transformation makes sense
- ex. a concentration gradient from a source follows a cube-root distribution

47
Q

arguments against transformation

A

if you transform a measured variable, the meaning of that variable becomes less intuitive

data transformation may not be able to fix everything
- may create a new deviation

data transformation can be used unscrupulously
- p-hacking (avoid by not performing hypothesis testing on the data until after determining the best way to transform the data)