Chapter 16 Generalised Linear Modelling Flashcards
Describe the principal modelling techniques appropriate to health and care insurance:
What is an explanatory variable?
-input into a model that is expected to influence the response variable. -e.g. a rating factor -it is important that explanatory variables make intuitive sense.
What is a response variable?
-output variable from a model that is likely to be influenced by the explanatory variables. -e.g. price
What is a categorical variable?
-These are explanatory variables which are discrete and distinct, and often cannot be given any natural ordering or score. -e.g. gender
What is a non-categorical variable?
-An explanatory variable that can take numerical values, e.g. age.
What is an interaction term?
-Used where the pattern in response variable is better modelled by including an extra parameter for each combination of two or more factors.
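As a minimal sketch of how an interaction term enters a design matrix, the Python below codes two hypothetical binary factors (sex and smoker status, chosen purely for illustration) as 0/1 indicators and adds their product as the interaction column. The function name and the choice of base level are assumptions, not part of the source notes.

```python
from itertools import product

def design_row(sex, smoker, with_interaction=False):
    """Build one design-matrix row for two hypothetical binary factors."""
    # Base level (assumed): female non-smoker, represented by the intercept alone.
    male = 1 if sex == "M" else 0
    smokes = 1 if smoker == "Y" else 0
    row = [1, male, smokes]            # intercept + two main effects
    if with_interaction:
        row.append(male * smokes)      # extra parameter for the male-smoker combination
    return row

# Print the row for each of the four factor combinations.
for sex, smoker in product("FM", "NY"):
    print(sex, smoker, design_row(sex, smoker, with_interaction=True))
```

Only the male-smoker cell has a 1 in the final column, so the extra parameter adjusts that combination beyond what the two main effects imply.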
One-way analysis drawbacks
-Prior to the use of GLMs, the effect of each rating factor on claim frequency and severity was considered separately. -This one-way analysis ignores correlations and interaction effects between variables, and so may underestimate or double count the effects of variables.
Uses of GLMs
-A GLM can be used to model the behaviour of a random variable that is believed to depend on the values of several other characteristics eg age, sex, chronic condition. -It is a generalisation of the normal model for multiple linear regression.
What are the drawbacks of the normal model for multiple linear regression?
-it assumes the response variable has a normal distribution -it assumes a constant variance, which may not be appropriate -it adds together the effects of different explanatory variables, but this is often not what is observed -it becomes long-winded with more than two explanatory variables.
Assumptions of classical linear models
-the error terms are independent and come from a normal distribution -the mean is a linear combination of the explanatory variables -the error terms have constant variance (homoscedasticity)
What are the two properties of any member of the exponential family?
-the distribution is completely specified in terms of its mean and variance. -the variance is a function of its mean
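To illustrate the second property, the sketch below lists the standard variance functions for a few common exponential-family distributions. The dictionary and function names are illustrative, and the dispersion and weight arguments are an assumption about how the scale of the variance is parameterised.

```python
# Variance function V(mu) for some common exponential-family distributions.
VARIANCE_FUNCTIONS = {
    "normal": lambda mu: 1.0,                  # variance does not depend on the mean
    "poisson": lambda mu: mu,                  # variance equals the mean
    "gamma": lambda mu: mu ** 2,               # variance grows with the square of the mean
    "binomial": lambda mu: mu * (1.0 - mu),    # mu is a proportion between 0 and 1
}

def variance(family, mu, dispersion=1.0, weight=1.0):
    """Variance of the response, assumed here to be dispersion * V(mu) / weight."""
    return dispersion * VARIANCE_FUNCTIONS[family](mu) / weight

print(variance("poisson", 3.0))
print(variance("gamma", 2.0, dispersion=0.5))
```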
What is the link function?
-the link function acts to remove the assumption that the effects of different variables must simply be added together. -it must be both differentiable and monotonic. -examples include the log, logit and identity functions.
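A short sketch of the three link functions named above, each paired with its inverse. The dictionary layout and the coefficient values are illustrative assumptions. The last lines show why a log link turns additive effects on the linear predictor into multiplicative effects on the mean.

```python
import math

# Each entry is (link, inverse link): the link maps the mean to the linear predictor.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def predict(link_name, eta):
    """Convert a linear predictor eta into a fitted mean via the inverse link."""
    _, inverse = LINKS[link_name]
    return inverse(eta)

# Hypothetical coefficients: a base level plus one factor effect.
base, effect = 0.5, 0.2
additive = predict("log", base + effect)
multiplicative = math.exp(base) * math.exp(effect)
print(additive, multiplicative)   # the two agree up to floating-point rounding
```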
Steps for obtaining predicted values from a single GLM
-Specify the design matrix X and the vector of parameters beta. -Choose a distribution for the response variable and a link function. -Identify the likelihood function. -Take the logarithm to convert the product of many terms into a sum. -Maximise the log-likelihood by taking partial derivatives with respect to each parameter and setting them to zero. -Compute predicted values.
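The steps above can be sketched for one particular case: a Poisson response with a log link, fitted by Newton-Raphson in plain Python. The data, the single 0/1 factor and the function names are all invented for illustration, and real software typically uses a more robust fitting routine than this fixed-iteration loop.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_poisson_log(X, y, iterations=25):
    """Maximise the Poisson log-likelihood with a log link by Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        # Score: partial derivatives of the log-likelihood with respect to each parameter.
        score = [sum((yi - mi) * row[j] for yi, mi, row in zip(y, mu, X))
                 for j in range(p)]
        # Information matrix: minus the matrix of second derivatives.
        info = [[sum(mi * row[j] * row[k] for mi, row in zip(mu, X))
                 for k in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta

# Design matrix: intercept plus a hypothetical 0/1 indicator (e.g. chronic condition).
X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
y = [1, 2, 3, 4, 6, 8]          # invented claim counts
beta = fit_poisson_log(X, y)

# Predicted values: apply the inverse link to the linear predictor.
predicted = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
print(beta, predicted)
```

With an intercept and a single indicator, the fitted means should reproduce the average count within each group, which gives a simple sanity check on the fit.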
What techniques are used to analyse significance of explanatory variables?
-chi-squared test -the F-statistic (models need to be nested for these tests to work) -Akaike Information Criterion (AIC), which is appropriate where models are not nested -other methods
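One common form of the chi-squared test for nested models compares twice the gain in log-likelihood with a chi-squared distribution whose degrees of freedom equal the number of extra parameters. The sketch below assumes that form and uses closed-form tail probabilities that hold only for 1 or 2 degrees of freedom. The function names and log-likelihood figures are illustrative.

```python
import math

def chi2_tail(x, df):
    """Upper-tail probability of a chi-squared variable (df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

def likelihood_ratio_test(loglik_small, loglik_big, extra_params):
    """Test whether the extra parameters of the bigger nested model are significant."""
    statistic = 2.0 * (loglik_big - loglik_small)
    return statistic, chi2_tail(statistic, extra_params)

# Invented log-likelihoods: the bigger model adds a three-level factor (2 extra parameters).
stat, p_value = likelihood_ratio_test(-120.0, -116.5, extra_params=2)
print(stat, p_value)
```

A small p-value suggests the added factor improves the fit by more than chance alone would explain.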
Define degrees of freedom
-number of observations minus number of parameters
AIC formula
AIC = -2 * log-likelihood + 2 * number of parameters -the lower the AIC, the better the model. -for a similar fit, a model with fewer parameters (a more parsimonious model) is preferred.
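A small worked example of the formula, using invented log-likelihoods for two hypothetical models. The function name is illustrative.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2 * n_params

model_a = aic(-120.0, 3)    # simpler model
model_b = aic(-119.5, 5)    # slightly better fit, two more parameters
print(model_a, model_b)
```

Here the second model fits slightly better but its two extra parameters cost more than the gain in log-likelihood, so the simpler model has the lower AIC.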
Measuring uncertainty in the estimators of the model parameters
-The Cramér-Rao lower bound (CRLB) is used. -the maximum likelihood estimator theta-hat is asymptotically distributed N(theta, CRLB). -standard errors in a GLM are found using the Hessian matrix. -this is the matrix of second partial derivatives of the log-likelihood with respect to the parameters.
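As a one-parameter illustration, the sketch below estimates the standard error of a Poisson mean from the second derivative of the log-likelihood at its maximum. The data are invented, the second derivative is approximated by finite differences, and the function names are assumptions. With several parameters the same idea uses the inverse of the full Hessian matrix.

```python
import math

def poisson_loglik(lam, y):
    """Log-likelihood of independent Poisson counts with common mean lam."""
    return sum(yi * math.log(lam) - lam - math.lgamma(yi + 1) for yi in y)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

y = [1, 2, 3, 4, 6, 8]                  # invented claim counts
lam_hat = sum(y) / len(y)               # the MLE of a Poisson mean is the sample mean
hessian = second_derivative(lambda lam: poisson_loglik(lam, y), lam_hat)
std_error = math.sqrt(-1.0 / hessian)   # variance estimate = -1 / second derivative

print(lam_hat, std_error)
```

For this model the second derivative is known analytically, so the result can be checked against sqrt(lam_hat / n).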
What other ways can be used to test significance?
-Comparisons with time -Consistency checks with other factors
Comparisons with time
-analysis of claims frequency by factor by year will indicate whether claims frequencies have been stable over time.
Consistency checks with other factors
-time is not the only factor that can be used as a consistency check. -e.g. an explanatory variable like age would be expected to show the same pattern regardless of geographical region.
Testing the appropriateness of models
-The hat matrix is one of the outputs of the model-fitting process. -For the Normal multiple linear regression model, it is the matrix H such that y-hat = Hy. -The diagonal entries h(i,i) of the matrix are called leverages and lie between 0 and 1. -Leverages measure the influence that each observed value has on the fitted value for that observation. -Data points with high leverages or large residuals may distort the outcome and accuracy of a model.
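For a straight-line regression with an intercept and one explanatory variable, the leverages have a simple closed form, which the sketch below uses in place of building the full hat matrix. The ages are invented and the function name is illustrative.

```python
def leverages(x):
    """Diagonal of the hat matrix for a simple linear regression on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # h(i,i) = 1/n + (x_i - xbar)^2 / Sxx
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

ages = [20, 25, 30, 35, 80]     # the last age sits far from the others
h = leverages(ages)
print(h)
print(sum(h))                   # leverages sum to the number of parameters (2 here)
```

The observation furthest from the mean of the explanatory variable has the highest leverage, so its observed value largely determines its own fitted value.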
Deviance residuals
-This is a measure of the distance between the actual observation and the fitted value. -deviance residuals correct for the skewness of the distribution.
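A minimal sketch of the deviance residual for one specific case, a Poisson response. Other distributions use a different unit deviance. The function name is illustrative.

```python
import math

def poisson_deviance_residual(y, mu):
    """Signed square root of the Poisson unit deviance for one observation."""
    # y * log(y / mu) is taken as 0 when y == 0 (its limiting value).
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    unit_deviance = 2.0 * (log_term - (y - mu))
    sign = 1.0 if y >= mu else -1.0
    # max(..., 0.0) guards against tiny negative values from rounding.
    return sign * math.sqrt(max(unit_deviance, 0.0))

print(poisson_deviance_residual(6, 4.0))   # observation above the fitted value
print(poisson_deviance_residual(0, 2.0))   # observation below the fitted value
```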
Standardised Pearson residuals
-A standardised residual is the difference between the observed response and the predicted value, adjusted for the standard deviation of the predicted value and the leverage of the observed response. -These adjustments make it possible to compare Standardised Pearson residuals even where observations have different means.
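The adjustment described above can be sketched as follows. The default variance function is the Poisson one, and the function and argument names are assumptions made for illustration.

```python
import math

def standardised_pearson_residual(y, mu, leverage,
                                  variance_fn=lambda m: m, dispersion=1.0):
    """(observed - fitted) scaled by the fitted standard deviation and the leverage."""
    scale = math.sqrt(dispersion * variance_fn(mu) * (1.0 - leverage))
    return (y - mu) / scale

# Same raw difference of 2, but the high-leverage point gets a larger residual.
print(standardised_pearson_residual(6, 4.0, leverage=0.0))
print(standardised_pearson_residual(6, 4.0, leverage=0.75))
```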
Residual Plots
-For a particular model, if the distribution chosen for the response variable is appropriate then the residuals chart should produce residuals that: -are symmetrical about the x-axis -have an average residual of zero -have a fairly constant spread across the range of fitted values
Cook’s distance and leverage
-Cook’s distance is used to estimate the influence of a data point on the model results. -Data points with a Cook’s distance of 1 or more are considered to merit closer examination in the analysis. -As a result of the investigation into any data points with a high Cook’s distance, a decision might be made to remove the observations altogether.
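A hedged sketch of Cook's distance for an ordinary straight-line regression, using the usual formula that combines each residual with its leverage. The data are invented and the function name is illustrative. GLM software applies an analogous calculation to the fitted GLM.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression of y on x."""
    n, p = len(x), 2                       # p = intercept + slope
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    s2 = sum(e * e for e in residuals) / (n - p)    # residual variance estimate
    # D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2
    return [e * e / (p * s2) * h / (1.0 - h) ** 2 for e, h in zip(residuals, lev)]

ages = [20, 25, 30, 35, 80]
costs = [1.0, 1.2, 1.5, 1.7, 1.0]          # the last point breaks the upward trend
d = cooks_distances(ages, costs)
flagged = [i for i, di in enumerate(d) if di >= 1.0]
print(d, flagged)
```

Only the last observation, which combines high leverage with a pattern that disagrees with the other points, exceeds the threshold of 1 and would be singled out for closer examination.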