Chapter 18 Generalised Linear Modelling Flashcards
Describe the principal modelling techniques appropriate to health and care insurance:
What is an explanatory variable?
- An input into a model that is expected to influence the response variable.
- e.g. a rating factor
- It is important that explanatory variables make intuitive sense.
What is a response variable?
- The output variable from a model, which is likely to be influenced by the explanatory variables.
- e.g. price
What is a categorical variable?
- These are explanatory variables that take discrete, distinct values and often cannot be given any natural ordering or score.
- e.g. gender
What is a non-categorical variable?
- An explanatory variable that can take numerical values, e.g. age.
What is an interaction term?
- Used where the pattern in the response variable is better modelled by including an extra parameter for each combination of two or more factors.
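As a rough illustration, the sketch below builds design-matrix rows for two categorical factors, with and without an interaction column. The factor names, levels and base levels (sex with base "F", smoker status with base "N") are assumptions made purely for the example.

```python
def design_row(sex, smoker, with_interaction=False):
    """Return one design-matrix row for two binary factors.

    Assumed base levels: sex = "F", smoker = "N".
    Columns: intercept, sex == "M", smoker == "Y" and, optionally,
    the interaction column (sex == "M" and smoker == "Y").
    """
    is_male = 1 if sex == "M" else 0
    is_smoker = 1 if smoker == "Y" else 0
    row = [1, is_male, is_smoker]
    if with_interaction:
        # Extra parameter that applies only to the male-and-smoker combination
        row.append(is_male * is_smoker)
    return row


for sex in ("F", "M"):
    for smoker in ("N", "Y"):
        print(sex, smoker, design_row(sex, smoker, with_interaction=True))
```

Without the interaction column, the effect of being a smoker is forced to be the same for both sexes on the scale of the linear predictor; the extra column lets the model fit each combination separately.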
One-way analysis merits
- Prior to the use of GLMs, the effect of each rating factor on frequency and severity was considered separately.
- This one-way analysis ignores correlations and interaction effects between variables and so may underestimate or double count the effects of variables.
Uses of GLMs
- A GLM can be used to model the behaviour of a random variable that is believed to depend on the values of several other characteristics, e.g. age, sex, chronic condition.
- It is a generalisation of the normal model for multiple linear regression.
What are the drawbacks of the normal model for multiple linear regression?
- it assumes the response variable has a normal distribution
- it assumes a constant variance, which may not be appropriate
- it adds together the effects of different explanatory variables, but this is often not what is observed
- it becomes long-winded with more than two explanatory variables.
Assumptions of classical linear models
- the error terms are independent and come from a normal distribution
- the mean is a linear combination of the explanatory variables
- the error terms have constant variance (or homoscedasticity)
What are the two properties of any member of the exponential family?
- the distribution is completely specified in terms of its mean and variance.
- the variance is a function of its mean
What is the link function?
- the link function acts to remove the assumption that the effects of different variables must simply be added together.
- it must be both differentiable and monotonic.
- examples include the log, logit and identity functions.
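To make the role of the link function concrete, here is a minimal sketch of the three links named above and their inverses. The dictionary `LINKS`, the helper `predict_mean` and the parameter values are illustrative assumptions, not part of any particular library.

```python
import math

# Each entry maps a link name to (g, g_inverse), where eta = g(mu).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "logit": (
        lambda mu: math.log(mu / (1.0 - mu)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}


def predict_mean(link_name, linear_predictor):
    """Map a linear predictor eta back to the mean, mu = g^{-1}(eta)."""
    _, inverse = LINKS[link_name]
    return inverse(linear_predictor)


# Hypothetical parameter values: a base level and an age effect.
base, age_effect = 0.5, 0.2

# With a log link, effects that add on the linear predictor
# multiply on the scale of the mean: exp(0.5 + 0.2) = exp(0.5) * exp(0.2).
print(predict_mean("log", base + age_effect))
print(predict_mean("identity", base + age_effect))
```

This is why a log link is a common choice where rating factors are expected to act multiplicatively.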
Steps for obtaining predicted values from a single GLM
- Specify design matrix X and the vector of parameters Beta
- Choose a distribution for the response variable and the link function.
- Identify the likelihood function
- Take the logarithm to convert the product of many terms into a sum
- Maximise the log-likelihood function by taking partial derivatives with respect to each parameter and setting them to zero.
- Compute predicted values.
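The steps above can be sketched for one simple case: a Poisson response with a log link, an intercept and a single covariate, maximised by Newton-Raphson. The data, the function name `fit_poisson_log_link` and the fixed iteration count are assumptions made for illustration; in practice a statistical package would do this fitting.

```python
import math


def fit_poisson_log_link(x, y, iterations=25):
    """Fit E[y] = exp(b0 + b1 * x) by maximum likelihood (Newton-Raphson)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: partial derivatives of the log-likelihood
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix: minus the matrix of second derivatives
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} U, using the 2x2 inverse
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (-i01 * u0 + i00 * u1) / det
    return b0, b1


# Hypothetical data: claim counts y observed at covariate values x.
x = [0, 1, 2, 3, 4]
y = [1, 2, 4, 7, 12]

b0, b1 = fit_poisson_log_link(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]  # predicted values
print(b0, b1)
print(fitted)
```

At the maximum the partial derivatives are zero, so with this model the fitted values sum to the observed total.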
What techniques are used to analyse significance of explanatory variables?
- the chi-squared test (if the models are nested and the scale parameter is known)
- the F-statistic (if the models are nested and the scale parameter is unknown)
- the Akaike Information Criterion (AIC), which is appropriate where the models are not nested.
- other methods (comparisons over time or consistency with other factors)
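A minimal sketch of the chi-squared test for two nested models with a known scale parameter (taken as 1 here, as for a Poisson model). The log-likelihood values and the function name are hypothetical; the critical values are the standard upper 5% points of the chi-squared distribution.

```python
# Upper 5% points of the chi-squared distribution for 1 to 3 degrees of freedom
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}


def extra_parameters_significant(loglik_small, loglik_big, extra_params):
    """Likelihood ratio test for nested models (scale parameter assumed to be 1).

    Returns the test statistic and whether it exceeds the 5% critical value
    on `extra_params` degrees of freedom.
    """
    statistic = 2.0 * (loglik_big - loglik_small)
    return statistic, statistic > CHI2_CRITICAL_5PCT[extra_params]


# Hypothetical log-likelihoods: the larger model adds two parameters.
statistic, significant = extra_parameters_significant(-1250.4, -1246.1, 2)
print(statistic, significant)
```

The degrees of freedom for the test equal the number of extra parameters in the larger model.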
Define degrees of freedom
- number of observations minus number of parameters
AIC formula
AIC = -2 * log-likelihood + 2 * number of parameters
- the lower the AIC, the better the model.
- for a similar fit, the model with fewer parameters (the more parsimonious model) is preferred.
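The formula can be applied directly to compare candidate models. The model names, log-likelihoods and parameter counts below are made up for illustration.

```python
def aic(log_likelihood, n_parameters):
    """AIC = -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


# Hypothetical candidates: (log-likelihood, number of parameters)
candidates = {
    "age only": (-1250.4, 3),
    "age + region": (-1246.1, 8),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)  # lowest AIC is preferred
print(scores)
print(best)
```

In this made-up case the larger model has the higher log-likelihood, but the penalty for its five extra parameters outweighs the improvement.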
Measuring uncertainty in the estimators of the model parameters
- The Cramér-Rao lower bound (CRLB) is used.
- the maximum likelihood estimator theta-hat is asymptotically distributed N(theta, CRLB).
- standard errors in a GLM are found using the Hessian matrix.
- this is the matrix of second derivatives of the log-likelihood with respect to the parameters.
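As a sketch of how the Hessian leads to standard errors, consider a Poisson model with a log link, an intercept and one covariate. For that model, minus the Hessian is the 2x2 information matrix with entries sum(mu), sum(x * mu) and sum(x^2 * mu), and the approximate variances of the estimators are the diagonal entries of its inverse. The function name and the input values are assumptions for illustration.

```python
import math


def poisson_standard_errors(x, b0, b1):
    """Approximate standard errors for a log-link Poisson GLM with an
    intercept and one covariate, evaluated at the estimates (b0, b1)."""
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Information matrix = minus the Hessian of the log-likelihood
    i00 = sum(mu)
    i01 = sum(xi * mi for xi, mi in zip(x, mu))
    i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    det = i00 * i11 - i01 * i01
    # Diagonal of the inverse information matrix gives the variances
    var_b0 = i11 / det
    var_b1 = i00 / det
    return math.sqrt(var_b0), math.sqrt(var_b1)


# Hypothetical covariate values and parameter estimates.
se_b0, se_b1 = poisson_standard_errors([0, 1, 2, 3, 4], 0.6, 0.5)
print(se_b0, se_b1)
```

A parameter estimate that is small relative to its standard error is a candidate for removal from the model.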
What other ways can be used to test significance?
- Comparisons with time
- Consistency checks with other factors
Comparisons with time
Consistency checks with other factors
- time is not the only factor that can be used as a consistency check.
- e.g. an explanatory variable like age would be expected to show the same pattern regardless of geographical region.
Testing the appropriateness of models
- The hat matrix is one of the outputs of the model-fitting process.
- It is the matrix H such that y-hat = Hy.
- For the Normal multiple linear regression model, H = X (X^T X)^(-1) X^T, where X is the design matrix.
- The diagonal entries h(i,i) of the matrix are called leverages, and they lie between 0 and 1.
- Leverages measure the influence that each observed value has on the fitted value for that observation.
- Data points with high leverages or large residuals may distort the outcome and accuracy of a model.
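For a Normal linear regression with an intercept and a single covariate, the diagonal of the hat matrix reduces to h(i,i) = 1/n + (x_i - x_bar)^2 / Sxx, where Sxx is the sum of squared deviations of x from its mean. The sketch below uses that closed form; the ages are made-up values chosen so that one observation sits far from the rest.

```python
def leverages(x):
    """Leverages h(i,i) for a regression on an intercept and one covariate."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical ages: the last policyholder is much older than the others.
ages = [20, 25, 30, 35, 80]
h = leverages(ages)
print(h)
print(sum(h))  # the leverages sum to the number of parameters
```

The observation furthest from the mean of the covariate has the highest leverage, so its observed value has the most pull on its own fitted value.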
Deviance residuals
- This is a measure of the distance between the actual observation and the fitted value.
- Deviance residuals correct for the skewness of the distribution.
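As an illustration for one specific distribution, the Poisson deviance residual is sign(y - mu) * sqrt(2 * (y * log(y / mu) - (y - mu))), with the term y * log(y / mu) taken as zero when y = 0. The sketch below assumes a Poisson response; other distributions have their own formulas.

```python
import math


def poisson_deviance_residual(y, mu):
    """Deviance residual for one Poisson observation y with fitted mean mu."""
    term = y * math.log(y / mu) if y > 0 else 0.0
    unit_deviance = 2.0 * (term - (y - mu))
    sign = 1.0 if y >= mu else -1.0
    # max() guards against tiny negative values caused by rounding
    return sign * math.sqrt(max(unit_deviance, 0.0))


# Hypothetical observed counts and fitted means.
for y, mu in [(0, 2.0), (3, 3.0), (5, 2.0)]:
    print(y, mu, poisson_deviance_residual(y, mu))
```

The squared deviance residuals sum to the deviance of the fitted model.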
Standardised Pearson residuals
- A standardised Pearson residual is the difference between the observed response and the predicted value, adjusted for the standard deviation of the predicted value (the y-hat adjustment) and the leverage of the observed response (the y_i adjustment).
- These adjustments make it possible to compare Standardised Pearson residuals even where observations have different means.
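A minimal sketch for a Poisson response, where the variance function is V(mu) = mu: the residual is (y - mu) / sqrt(scale * mu * (1 - leverage)). The function name and input values are illustrative assumptions.

```python
import math


def standardised_pearson_residual(y, mu, leverage, scale=1.0):
    """Standardised Pearson residual for a Poisson observation.

    Dividing by sqrt(scale * mu) adjusts for the standard deviation of the
    response at the fitted mean; dividing by sqrt(1 - leverage) adjusts
    for the leverage of the observation.
    """
    return (y - mu) / math.sqrt(scale * mu * (1.0 - leverage))


# Hypothetical observation: 7 claims observed, 4.0 fitted, leverage 0.19.
print(standardised_pearson_residual(7, 4.0, 0.19))
```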
Residual Plots
- For a particular model, if the distribution chosen for the response variable is appropriate, then the residual plot should show residuals that:
- are symmetrical about the x-axis
- have an average residual of zero
- are fairly constant across the width of the fitted values
Cook’s distance and leverage
- Cook’s distance is used to estimate the influence of a data point on the model results.
- Data points with a Cook’s distance of 1 or more are considered to merit closer examination in the analysis.
- As a result of the investigation into any data points with a high Cook’s distance, a decision might be made to remove the observations altogether.
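To illustrate the threshold of 1, the sketch below computes Cook's distance for an ordinary least-squares regression on an intercept and one covariate, using D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2. This is the Normal linear regression form rather than a general GLM, and the data are made up so that the last point is both high-leverage and off-trend.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a simple least-squares regression."""
    n, p = len(x), 2  # p = number of parameters (intercept and slope)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]  # leverages
    return [e * e / (p * s2) * hi / (1.0 - hi) ** 2
            for e, hi in zip(residuals, h)]


# Hypothetical data: the last observation is far from the others.
x = [1, 2, 3, 4, 10]
y = [1.1, 1.9, 3.2, 3.9, 2.0]

distances = cooks_distances(x, y)
flagged = [i for i, d in enumerate(distances) if d >= 1.0]
print(distances)
print(flagged)  # indices of points meriting closer examination
```

A flagged point is a prompt for investigation, not an automatic exclusion.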