GLM 3- Assumptions Flashcards
(39 cards)
What is the first assumption of GLM?
Response-predictor Linearity
How can the first assumption of GLM be diagnosed?
Residual plots help identify non-linearity.
They plot the residuals, ê_i, against the fitted values, ŷ_i.
The plot should show no systematic pattern: the residuals should scatter around a flat, horizontal line at zero. A curved trend suggests a non-linear relationship.
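As a rough illustration of this diagnostic, the sketch below simulates data where the true relationship is quadratic and then fits a straight line. The data, the seed, and the name fit_linear are assumptions made for this example only.

```r
# Hypothetical simulated data: the true relationship is quadratic,
# but we deliberately fit a straight line.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100)
fit_linear <- lm(y ~ x)

# Residuals against fitted values. A curved pattern around the
# zero line signals that the linear fit misses some structure.
plot(fitted(fit_linear), resid(fit_linear),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```

With these simulated data the residuals should trace a U-shape rather than scatter evenly around zero.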
How can the first assumption of GLM be remedied?
If the residual plot indicates that there is a non-linear relationship, one can either:
Transform the predictors, e.g. log X or the square root of X
Use polynomial regression by including X^2, X^3, for instance
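Both remedies can be sketched in R as follows. The simulated data and the model names (fit_linear, fit_quad, fit_log) are hypothetical, and the quadratic term is chosen only because the simulated truth is quadratic.

```r
# Hypothetical simulated data with a quadratic relationship.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x^2 + rnorm(100)

fit_linear <- lm(y ~ x)            # straight-line fit
fit_quad   <- lm(y ~ x + I(x^2))   # polynomial regression of degree 2
fit_log    <- lm(y ~ log(x))       # predictor transformation (requires x > 0)

# F-test comparing the nested linear and quadratic models.
anova(fit_linear, fit_quad)
```

I() is needed so that ^ is read as arithmetic rather than as a formula operator.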
What is assumption 2 of GLM?
Constant Variance of Errors
How can the second assumption of GLM be diagnosed?
Residual plots can, again, enable us to assess whether the variances of the error terms are constant.
The error terms are assumed to be homoscedastic, that is, to have identical variance at different levels of the fitted values, ŷ_i.
Consequently, we should observe a roughly even spread of the residuals across the levels of predicted values, with no funnel shape.
How can the second assumption of GLM be remedied?
We can transform the response, Y, by taking log(Y) or the square root of Y. If such transformations do not work or are impossible, report the violation in your analysis.
We can exploit the source of variability in the responses, if known. The y_i's may be aggregates with associated variances, σ_i^2. In such cases, we can use weighted least squares (WLS), with weights proportional to 1/σ_i^2.
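A minimal WLS sketch, assuming each response is a mean of n_i raw observations so that its variance is known up to a constant. The group sizes, the data, and all variable names are invented for illustration.

```r
# Hypothetical setting: each y is the mean of n_i raw observations,
# so its variance is sigma2_i = 4 / n_i.
set.seed(2)
n_i      <- sample(5:50, 60, replace = TRUE)
x        <- runif(60, 0, 10)
sigma2_i <- 4 / n_i
y        <- 1 + 2 * x + rnorm(60, sd = sqrt(sigma2_i))

fit_ols <- lm(y ~ x)                          # ignores unequal variances
fit_wls <- lm(y ~ x, weights = 1 / sigma2_i)  # weights inversely proportional to variance

# Compare the two sets of coefficient estimates.
rbind(ols = coef(fit_ols), wls = coef(fit_wls))
```

The weights argument of lm() minimises the weighted sum of squared residuals, so observations with smaller variance count for more.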
What is the third assumption of GLM?
Non-correlation of Errors
How can the third assumption of GLM be diagnosed?
Serial residual plots help identify correlation among the errors.
They plot the residuals, e_i, against the observation IDs.
The plot should not show runs of adjacent residuals with similar values. Such long-term dependency would violate the assumption of independent observations.
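One way to draw a serial residual plot in base R is sketched below. The AR(1) error process is simulated purely for illustration, and the lag-1 autocorrelation from acf() is an extra numerical check not mentioned on the card.

```r
# Hypothetical data with AR(1) errors, so adjacent residuals are correlated.
set.seed(3)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 1 + 2 * x + e
fit <- lm(y ~ x)

# Residuals in observation order; long runs above or below zero
# suggest serial correlation.
plot(resid(fit), type = "b",
     xlab = "Observation index", ylab = "Residual")
abline(h = 0, lty = 2)

# Lag-1 sample autocorrelation of the residuals ($acf[1] is lag 0).
r1 <- acf(resid(fit), plot = FALSE)$acf[2]
r1
```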
How can the third assumption of GLM be remedied?
Non-independence typically arises from some structure in your data, such as groups or time.
Model the group structure in your data, using a mixed-effects model, or hierarchical model.
Model the time-lag structure in your data, again using a mixed-effects model, or hierarchical model.
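For the group case, a random-intercept model might look like the sketch below. It assumes the nlme package (a recommended package shipped with standard R installations) is available. The grouping factor g and the simulated data are hypothetical.

```r
library(nlme)  # assumed to be installed; ships with standard R distributions

# Hypothetical data: 10 groups of 20 observations, each group
# with its own random intercept.
set.seed(4)
g <- factor(rep(1:10, each = 20))
u <- rnorm(10, sd = 2)[as.integer(g)]  # group-level intercepts
x <- rnorm(200)
y <- 1 + 2 * x + u + rnorm(200)
d <- data.frame(y, x, g)

# Mixed-effects model with a random intercept per group.
fit_mixed <- lme(y ~ x, random = ~ 1 | g, data = d)
fixef(fit_mixed)  # estimated fixed effects
```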
What is assumption 4 of GLM?
Detecting Outliers
How can assumption 4 of GLM be diagnosed?
An outlier is a point whose observed response is far from the value predicted by the model.
Outliers increase the Residual Sum of Squares (RSS), which is used to compute R^2 and the confidence intervals for each parameter.
The studentized residual plot shows the values of the residuals, e_i, divided by their standard errors.
How can assumption 4 be remedied if violated?
Typically, the studentized residuals should not exceed 3 in absolute value.
If a data point has a studentized residual of 3 or more in absolute value, you may consider removing it, especially if you suspect that this observation is faulty in some way.
However, care should be taken as the presence of an outlier may also indicate a deficiency in your model.
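The check can be sketched in R as below, with one observation deliberately corrupted in simulated data. Note that rstudent() returns externally studentized residuals, where each residual's standard error is estimated without that observation. rstandard() gives the internally studentized version. Which of the two the cards intend is not stated, so treat this choice as an assumption.

```r
# Hypothetical data with one deliberately corrupted response.
set.seed(5)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
y[10] <- y[10] + 10
fit <- lm(y ~ x)

r_stud  <- rstudent(fit)           # externally studentized residuals
flagged <- which(abs(r_stud) > 3)  # candidates for closer inspection
flagged

plot(fitted(fit), r_stud,
     xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = c(-3, 3), lty = 2)      # the +/- 3 guideline
```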
What is assumption 5 of GLM?
High-leverage Points
How can assumption 5 of GLM be diagnosed?
A high-leverage point is a data point with unusual predictor values, whose removal can produce a substantially different set of parameter estimates.
The leverage, or hat-value, is a quantity that measures how unusual that data point's predictor values are with respect to all the others.
We can plot the individual leverages against the values of the studentized residuals to identify points that are outliers and also have high leverage.
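A sketch of this plot using the built-in mtcars data. The choice of dataset and predictors is arbitrary and only for illustration.

```r
# Example model on a built-in dataset (arbitrary choice of predictors).
fit <- lm(mpg ~ wt + hp, data = mtcars)

h      <- hatvalues(fit)  # leverages (hat-values)
r_stud <- rstudent(fit)   # studentized residuals

# Points far to the right have high leverage; points beyond the
# dashed lines are potential outliers.
plot(h, r_stud,
     xlab = "Leverage (hat-value)", ylab = "Studentized residual")
abline(h = c(-3, 3), lty = 2)

# The hat-values sum to the number of fitted coefficients.
sum(h)
```

plot(fit, which=5) draws a similar residuals-versus-leverage display directly from the lm object.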
How can assumption 5 of GLM be remedied?
There is no single agreed cutoff for detecting high leverage, although leverages are often compared with their average, (p + 1)/n, for p predictors and n observations.
As with outliers, however, you may consider removing the point, especially if you suspect that this observation is faulty in some way.
But, again, care should be taken, as the presence of a high-leverage point may also indicate a deficiency in your model.
What is assumption 6 of GLM?
Multicollinearity
How can assumption 6 of GLM be diagnosed?
Two predictors are said to be collinear if they are strongly correlated with each other.
Pairwise collinearity can be assessed by considering the correlation matrix of your predictors.
Multicollinearity, however, may occur in the absence of severe pairwise collinearity. In that case, we use the variance inflation factor (VIF),
$$\mathrm{VIF}(\hat{\beta}_j) := \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$
If the variance in X_j is strongly explained by all the other predictors in the model, then $R^2_{X_j \mid X_{-j}}$ will be close to 1, and therefore its VIF will be very large.
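The VIF can be computed by hand from its definition, as in the sketch below. The three simulated predictors are hypothetical. x3 is built as a near-linear combination of x1 and x2, so multicollinearity is severe even though the pairwise correlations are only moderate.

```r
# Hypothetical predictors: x3 is almost x1 + x2.
set.seed(6)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)
X  <- data.frame(x1, x2, x3)

# For each predictor, regress it on all the others and apply
# VIF = 1 / (1 - R^2).
vif_manual <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})

round(cor(X), 2)  # pairwise correlations
vif_manual        # variance inflation factors
```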
If violated how can assumption 6 of GLM be remedied?
If one variable, say Xj , exhibits a VIF larger than 5, we may resort to one of the following:
1) Drop that variable.
2) Combine the variables that are collinear, by taking the average of the standardized variables.
3) Create a latent variable that combines the multicollinear variables.
Note that removing a collinear variable will usually not substantially compromise the overall fit of your model, because most of the information contained in that variable is also contained in other variables.
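Options 2 and 3 might be sketched as follows. The data are simulated, and using the first principal component as the latent variable is one possible choice, not one named on the card.

```r
# Hypothetical data: x1 and x2 are two noisy measurements of the same z.
set.seed(7)
n  <- 200
z  <- rnorm(n)
x1 <- z + rnorm(n, sd = 0.2)
x2 <- z + rnorm(n, sd = 0.2)
y  <- 1 + 2 * z + rnorm(n)

# Option 2: average of the standardized variables.
x_avg <- rowMeans(scale(cbind(x1, x2)))

# Option 3 (one possible choice): first principal component.
pc1 <- prcomp(cbind(x1, x2), scale. = TRUE)$x[, 1]

fit_both <- lm(y ~ x1 + x2)
fit_avg  <- lm(y ~ x_avg)

# The combined variable retains nearly all of the fit.
c(both = summary(fit_both)$r.squared,
  avg  = summary(fit_avg)$r.squared)
```

With only two variables, the first principal component of the standardized data is proportional to their average, so the two options coincide up to scale and sign.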
What is the default plotting for lm objects in R?
Every lm object can be plotted, using one of the following:
plot(fit)
plot(lm(y ~ x1 + x2))
plot(fit <- lm(y ~ x1 + x2))
Alternatively, we may select which plot we need as follows:
plot(fit, which=c(1,4))
How do we know the options of the plot function in R?
?plot.lm
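A small runnable sketch of these calls, using the built-in cars dataset as a stand-in for your own data (the choice of dataset is an assumption for illustration):

```r
# Fit a simple model on a built-in dataset.
fit <- lm(dist ~ speed, data = cars)

# Show the first four diagnostic plots in a 2 x 2 grid,
# then restore the previous graphics settings.
op <- par(mfrow = c(2, 2))
plot(fit, which = 1:4)
par(op)
```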
- How can a residual plot be created in R for assumption 1?
- What does this do?
- plot(fit, which=1)
- Checks linearity: plots the residuals against the fitted values.
- How can a QQ-plot of standardized residuals be produced in R?
- What does this do?
- plot(fit, which=2)
- Checks normality of the residuals.
- How can a scale-location plot be produced in R to test assumption 2?
- What does this do?
- plot(fit, which=3)
- Checks homoscedasticity (equal variance).
- How can a Cook's distance plot be produced in R to examine assumptions 4 and 5?
- What does this do?
- plot(fit, which=4)
- Identifies influential cases (outliers and high-leverage points).
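The values behind the Cook's distance plot can also be inspected directly, as in this sketch on the built-in cars data. The 4/n cutoff is a common but informal convention and is an assumption here, not something stated on the cards.

```r
# Example model on a built-in dataset.
fit <- lm(dist ~ speed, data = cars)

d <- cooks.distance(fit)  # one value per observation

# Informal cutoff (assumption): flag points with distance above 4 / n.
cutoff <- 4 / nrow(cars)
which(d > cutoff)

# The three most influential observations.
head(sort(d, decreasing = TRUE), 3)
```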