Chapter 18 - GLM Flashcards
Explanatory variables
- inputs into the model that are expected to influence the response variable
- choice of explanatory variables depends on the purpose of the model
Response variables
- outputs from the model that are likely to be affected by explanatory variables
Categorical variables
- explanatory variables
- aka factors
- values of each level are distinct
- often cannot be given natural ordering or score
- continuous numerical variables (e.g. age) are often banded and treated as categorical
Non-categorical variables
- can take numerical values
Interaction terms
- included where pattern of response variable is better modelled by including parameters for each combination of two or more factors
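A minimal sketch of how an interaction term shows up in a design-matrix row, assuming two hypothetical two-level factors (young/not young, urban/not urban) coded against a base level. The factor names and coding are illustrative only.

```python
def design_row(is_young, is_urban, with_interaction=True):
    """Build one design-matrix row for two hypothetical two-level factors."""
    young, urban = int(is_young), int(is_urban)
    row = [1, young, urban]        # intercept plus one main effect per factor
    if with_interaction:
        row.append(young * urban)  # extra parameter for the young-and-urban combination
    return row

# List the row for every combination of the two factors.
for young in (False, True):
    for urban in (False, True):
        print(young, urban, design_row(young, urban))
```

With the interaction column there is one free parameter per combination of levels (four here), rather than three for the main-effects-only model.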
What does a GLM do?
A GLM unpicks relationships and produces estimates of the true values of the relativities. It does this by taking account of correlations between explanatory variables and by allowing any interactions between variables in the model to be investigated.
Assumptions of classic linear model
- the error terms are independent and come from a normal distribution
- the mean is a linear combination of the explanatory variables
- the error terms have constant variance
The parameters B0, B1, B2 can be estimated using the method of maximum likelihood
pg.635
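A minimal sketch of the estimation, simplified to one explanatory variable (B0 and B1 only) so it stays short. Under the normal-error assumptions above, the maximum likelihood estimates coincide with the least squares estimates, which have a closed form in this case. The function name and data are illustrative.

```python
def fit_simple_linear(x, y):
    """Least squares (= maximum likelihood under normal errors) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Illustrative data only.
b0, b1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.1, 4.9, 7.0])
print(b0, b1)
```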
Drawbacks of the normal model for multiple linear regression
- assumes that the response variable has a normal distribution which may not be appropriate for the variable being modelled
- the normal distribution has a constant variance which may not be appropriate for the variable being modelled
- adds together the effects of different explanatory variables, but this is seldom what is observed in practice
- with more than two explanatory variables, a manual solution becomes increasingly long-winded
How do GLMs address these problems?
- the response variable can take any distribution from the exponential family
- a link function is introduced which acts to remove the assumption that the effects of different variables must simply be added together
- allow an offset term to be included within the linear predictor
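A minimal sketch of why a log link gives multiplicative effects, and of how an offset enters the linear predictor. The parameter values, factor names and the choice of log(exposure) as the offset are assumptions for illustration.

```python
import math

def expected_claims(exposure, b0, b_age, b_region):
    """Log-link prediction: the offset and the effects add on the linear-predictor scale."""
    eta = math.log(exposure) + b0 + b_age + b_region  # offset + linear predictor
    return math.exp(eta)                              # inverse of the log link

# Hypothetical parameters: base frequency 0.10, relativities 1.5 and 1.2.
base, young, urban = math.log(0.10), math.log(1.5), math.log(1.2)
print(expected_claims(2.0, base, young, urban))
```

On the original scale the prediction equals exposure x base frequency x age relativity x region relativity, so the effects multiply rather than add.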
GLM form
pg.639
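The referenced page is not reproduced here. As a reminder, the usual textbook form of a GLM is sketched below; the symbols are assumptions about notation (g is the link function, X the design matrix, beta the parameter vector, xi the offset, phi the scale parameter, V the variance function, omega the prior weights).

```latex
\[
Y_i \sim \text{exponential family}, \qquad
\mathbb{E}[Y_i] = \mu_i = g^{-1}\Big(\sum_j X_{ij}\beta_j + \xi_i\Big), \qquad
\operatorname{Var}[Y_i] = \frac{\phi\, V(\mu_i)}{\omega_i}
\]
```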
Properties of members of the exponential family
- the distribution is completely specified in terms of its mean and variance
- the variance of Yi is a function of its mean
Requirements for link function
- differentiable
- monotonic
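A minimal sketch of two commonly used link functions (log and logit) with their inverses, plus a numerical check of monotonicity on a small grid. The function names are illustrative, and the check is a spot check rather than a proof.

```python
import math

def log_link(mu):
    return math.log(mu)

def inverse_log_link(eta):
    return math.exp(eta)

def logit_link(mu):
    return math.log(mu / (1.0 - mu))

def inverse_logit_link(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def is_strictly_increasing(values):
    """True if each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

grid = [i / 10 for i in range(1, 10)]  # means between 0.1 and 0.9
print(is_strictly_increasing([log_link(m) for m in grid]))
print(is_strictly_increasing([logit_link(m) for m in grid]))
```

Monotonicity is what makes the link invertible, so a value of the linear predictor can be mapped back to a unique mean.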
Obtaining the predicted values from a simple GLM
- specify the design matrix X and the vector of parameters B
- choose the distribution for the response variable and the link function
- identify the likelihood function
- take the log to convert the product into a sum
- maximise the log-likelihood function by taking partial derivatives with respect to each parameter, setting them to zero and solving the resulting system of equations
- compute the predicted values
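The steps above can be sketched for a very small case. This assumes a Poisson response, a log link, and a design matrix with an intercept and one 0/1 explanatory variable; the score equations have no convenient general closed form, so they are solved here by Newton-Raphson. The data and function name are illustrative.

```python
import math

def fit_poisson_log_glm(x, y, iterations=25):
    """Newton-Raphson for log(mu_i) = b0 + b1 * x_i with a Poisson response."""
    b0 = math.log(sum(y) / len(y))  # start from the overall mean
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: partial derivatives of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian (information matrix), a symmetric 2x2 matrix.
        a = sum(mu)
        b = sum(xi * mi for xi, mi in zip(x, mu))
        c = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = a * c - b * b
        # Newton step: solve information * step = score.
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    return b0, b1

x = [0, 0, 0, 1, 1, 1]   # illustrative 0/1 factor
y = [2, 3, 1, 5, 7, 6]   # illustrative claim counts
b0, b1 = fit_poisson_log_glm(x, y)
predicted = [math.exp(b0 + b1 * xi) for xi in x]
print(predicted)
```

With a single 0/1 factor the fitted means reproduce the observed average in each group, which gives a quick sanity check on the fit.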
Degrees of freedom
number of observations less the number of parameters
Deviance formula
Compares the observed value Y to the fitted value μ, with allowance for weights
pg.649
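The general formula on the referenced page is not reproduced here. As one concrete instance, the sketch below computes the unscaled deviance for a Poisson model, assuming optional prior weights (defaulting to 1). The function name and example values are illustrative.

```python
import math

def poisson_deviance(y, mu, weights=None):
    """Unscaled Poisson deviance: 2 * sum of w * [y * log(y / mu) - (y - mu)]."""
    if weights is None:
        weights = [1.0] * len(y)
    total = 0.0
    for yi, mi, wi in zip(y, mu, weights):
        # y * log(y / mu) is taken as 0 when y = 0 (its limiting value).
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += wi * (log_term - (yi - mi))
    return 2.0 * total

print(poisson_deviance([2, 3, 1], [2.0, 2.0, 2.0]))
```

The deviance is zero when every fitted value equals the observed value and grows as the fit worsens.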
Nested models
Two models are nested if one model contains explanatory variables that are a subset of the explanatory variables in the other model.
How to compare two nested models
- chi-square test for the change in scaled deviance
- this measures whether the inclusion of one or more additional explanatory variables in a model improves the model fit significantly
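A minimal sketch of the test, assuming the scale parameter is known so that scaled deviance = deviance / scale. To stay within the standard library, the chi-squared tail probability is computed for integer degrees of freedom using a recurrence for the incomplete gamma function. The function names and the deviance values in the example are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-squared variable with integer df."""
    half_x = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half_x)             # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half_x))  # exact tail for df = 1
    while a < df / 2.0:
        # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
        q += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return q

def nested_model_p_value(dev_small, dev_large, extra_parameters, scale=1.0):
    """p-value for the drop in scaled deviance when extra parameters are added."""
    change = (dev_small - dev_large) / scale
    return chi2_sf(change, extra_parameters)

# Hypothetical deviances: the larger model adds 2 parameters.
print(nested_model_p_value(dev_small=120.0, dev_large=110.0, extra_parameters=2))
```

A small p-value suggests the additional explanatory variables improve the fit by more than chance alone would explain.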
F statistics
- where the scale parameter for the model is unknown (e.g. for a gamma model), it has to be estimated
- the estimator of the scale parameter follows a (scaled) chi-squared distribution
- the ratio of the change in deviance (per additional parameter) to the estimated scale parameter follows an F distribution
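In the usual notation the statistic can be written as below. This is a sketch under assumed notation: model 1 is the smaller (nested) model, model 2 the larger one, D is the deviance, df the degrees of freedom, and the scale parameter is estimated as D_2 / df_2 from the larger model.

```latex
\[
F = \frac{(D_1 - D_2)/(df_1 - df_2)}{D_2/df_2} \sim F_{df_1 - df_2,\; df_2}
\]
```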
How to compare models that are not nested?
AIC = -2 x log-likelihood + 2 x number of parameters
looks at the tradeoff of the likelihood of a model against the number of parameters; the lower the AIC, the better the model. If two models fit the data equally well in terms of the log-likelihood, then the model with fewer parameters is better.
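A minimal sketch of the comparison, using hypothetical log-likelihood values and parameter counts.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_parameters

# Two hypothetical models with the same log-likelihood but different sizes.
aic_small = aic(log_likelihood=-100.0, n_parameters=3)
aic_large = aic(log_likelihood=-100.0, n_parameters=5)
print(aic_small, aic_large)
```

With equal log-likelihoods, the model with fewer parameters has the lower AIC and is preferred.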
Use of CRLB (Cramér-Rao lower bound)
- can be used to measure the uncertainty in the parameter estimators used in a GLM. A poorly defined parameter will have a large standard error.
- standard errors can be found from the Hessian matrix G (matrix of second derivatives of the log-likelihood)
- the variances of the parameter estimators are the diagonal entries of -G^(-1); the standard errors are their square roots
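A minimal sketch of the calculation, assuming a Poisson model with log link, an intercept and one explanatory variable, so that the Hessian G is a 2x2 matrix that can be inverted by hand. Standard errors are taken as the square roots of the diagonal entries of -G^(-1). The function name and the fitted means are illustrative.

```python
import math

def poisson_standard_errors(x, mu):
    """Standard errors of (b0, b1) for log(mu_i) = b0 + b1 * x_i, Poisson response."""
    # Hessian G: second derivatives of the log-likelihood at the fitted means.
    g00 = -sum(mu)
    g01 = -sum(xi * mi for xi, mi in zip(x, mu))
    g11 = -sum(xi * xi * mi for xi, mi in zip(x, mu))
    det = g00 * g11 - g01 * g01
    # Inverse of a 2x2 matrix is (1 / det) * [[g11, -g01], [-g01, g00]];
    # negate it to get the diagonal of -G^(-1).
    var_b0 = -g11 / det
    var_b1 = -g00 / det
    return math.sqrt(var_b0), math.sqrt(var_b1)

x = [0, 0, 0, 1, 1, 1]
fitted = [2.0, 2.0, 2.0, 6.0, 6.0, 6.0]  # illustrative fitted means
print(poisson_standard_errors(x, fitted))
```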
Other ways to test significance
- consider spread of relativity values for each level, combined with the standard errors at each level
- comparison over time: analysis of claims frequency by factor by year will indicate whether claims frequencies have been stable over time. A model can be fitted that includes the interaction of a single factor with a measure of time.
- consistency checks with other factors e.g. age and region
Hat matrix
maps the vector of observed values to the vector of fitted values
ith leverage
- ith diagonal element of the hat matrix (lies between 0 and 1)
- measure of how much influence the ith observation has over its own fitted value
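A minimal sketch of leverages, assuming the classic linear model with an intercept and a single explanatory variable. In that case the ith diagonal entry of the hat matrix has the closed form 1/n + (x_i - mean(x))^2 / Sxx, so no matrix algebra is needed. The function name and data are illustrative.

```python
def leverages(x):
    """Diagonal of the hat matrix for simple linear regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Illustrative data: the last point lies far from the others.
h = leverages([1.0, 2.0, 3.0, 4.0, 10.0])
print(h)
print(sum(h))
```

The leverages sum to the number of parameters (two here), and the observation furthest from the mean of x has the highest leverage.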