A.2. Generalized Linear Models for Insurance Rating Flashcards
(49 cards)
GLM random component
Each yi is assumed to be independent and to come from the exponential family of distributions with mean µi and variance Var(yi) = φV(µi)/ωi
- φ is called the dispersion parameter and is a constant used to scale the variance.
- V(µ) is called the variance function and is given for a selected distribution type. It describes the relationship between the variance and mean. Note that the same distribution type (e.g., Poisson) must be assumed for all observations.
- ωi are known as weights and assign a weight to each observation i.
GLM systematic component
g(µi) = β0 + β1xi1 + β2xi2 + · · · + βpxip + offset
• The right hand side is known as the linear predictor.
• The offset term is optional and allows you to manually
specify the estimates for certain variables (usually based on other analyses).
• The x predictor variables can be binary (as for levels
of categorical variables) or continuous, or even
transformations or combinations of other variables.
• g(µ) is called the link function, and allows for
transformations of the linear predictor.
• β0 is called the intercept term, and the other β’s are called the coefficients of the model.
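The systematic component can be sketched numerically. A minimal example (all numbers hypothetical), assuming a log link so that µ = e^(linear predictor):

```python
import numpy as np

# Sketch of scoring a fitted GLM with a log link (hypothetical coefficients).
# g(mu) = ln(mu) = b0 + b1*x1 + b2*x2 + offset, so mu = exp(linear predictor).
beta0 = 0.5                      # intercept
beta = np.array([0.2, -0.1])     # coefficients for x1 and x2
X = np.array([[1.0, 3.0],        # two observations of (x1, x2)
              [2.0, 1.0]])
offset = np.array([0.0, 0.3])    # e.g. a term fixed manually from other analyses

eta = beta0 + X @ beta + offset  # the linear predictor
mu = np.exp(eta)                 # inverse of the log link gives predicted means
print(mu)
```

With a log link, each coefficient acts multiplicatively on the predicted mean, which is why log links pair naturally with multiplicative rating plans.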
Advantages of multiplicative rating plans
- Simple and practical to implement.
- They guarantee positive premiums (not true for additive terms).
- Impact of risk characteristics is more intuitive.
Variance Functions for exponential family distributions
Distribution        Variance Function
Normal              V(µ) = 1
Poisson             V(µ) = µ
Gamma               V(µ) = µ^2
Inverse Gaussian    V(µ) = µ^3
Negative Binomial   V(µ) = µ(1 + κµ)
Binomial            V(µ) = µ(1 − µ)
Tweedie             V(µ) = µ^p
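The table above can be captured as a small lookup; a minimal sketch, where the κ and p defaults are placeholders and `glm_variance` applies the card's Var(yi) = φV(µi)/ωi formula:

```python
# Variance function V(mu) for each exponential-family distribution in the table.
# kappa (Negative Binomial) and p (Tweedie) are the extra parameters noted there.
variance_functions = {
    "normal":           lambda mu: 1.0,
    "poisson":          lambda mu: mu,
    "gamma":            lambda mu: mu ** 2,
    "inverse_gaussian": lambda mu: mu ** 3,
    "neg_binomial":     lambda mu, kappa=1.0: mu * (1 + kappa * mu),
    "binomial":         lambda mu: mu * (1 - mu),
    "tweedie":          lambda mu, p=1.5: mu ** p,
}

def glm_variance(dist, mu, phi=1.0, omega=1.0, **kwargs):
    """Var(y_i) = phi * V(mu_i) / omega_i for the chosen distribution."""
    return phi * variance_functions[dist](mu, **kwargs) / omega
```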
Choices for Severity distributions
In insurance data, claim severity distributions tend to be
right-skewed and have a lower bound at 0. Both the Gamma and Inverse Gaussian distributions exhibit these properties, and as such are common choices for modeling severity. The Gamma distribution is the most commonly used, but the Inverse Gaussian has a sharper peak and wider tail, so it is more appropriate for more skewed severity distributions.
Choices for Frequency distributions
Claim frequency is most often modeled using a Poisson
distribution. The GLM implementation of Poisson allows
for the distribution to be continuous instead of discrete.
Technically, the overdispersed Poisson is recommended, which allows φ to be different from 1, and thus allows the variance to be greater than the mean (instead of being equal to it as with the typical Poisson).
Another choice for frequency modeling is the Negative Binomial distribution, which is really just a Poisson distribution whose parameter itself has a Gamma distribution. With the Negative Binomial, φ is restricted to 1, but its variance function contains a dispersion parameter κ that allows the variance to exceed the mean.
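One common way to check for overdispersion is the Pearson chi-square estimate of φ. A sketch with hypothetical observed counts and fitted means (for Poisson, V(µ) = µ):

```python
import numpy as np

# Estimating the dispersion parameter phi via the Pearson chi-square statistic.
# y and mu below are hypothetical observed counts and fitted means.
y = np.array([0.0, 2.0, 1.0, 4.0, 0.0, 3.0])   # observed claim counts
mu = np.array([1.0, 1.5, 1.2, 2.0, 0.8, 1.5])  # fitted means from the GLM
n_params = 2                                   # parameters estimated by the model

# For Poisson, V(mu) = mu, so each term is the squared Pearson residual.
pearson_chi2 = np.sum((y - mu) ** 2 / mu)
phi_hat = pearson_chi2 / (len(y) - n_params)
print(phi_hat)
```

A value of phi_hat above 1 suggests the variance exceeds the mean, supporting an overdispersed Poisson over the standard Poisson.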
Relationship between Poisson, Gamma, and Tweedie parameters
• Poisson has parameter λ, which equals both its mean and its variance.
• Gamma has mean αθ and variance αθ^2, and thus coefficient of variation 1/√α.
• Tweedie has mean µ = λ × (αθ) and variance φµ^p.
• p = (α + 2)/(α + 1), so it depends entirely on the Gamma coefficient of variation.
• The Tweedie dispersion parameter is φ = [λ^(1−p) × (αθ)^(2−p)] / (2 − p).
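These relationships can be checked numerically: a compound Poisson-Gamma sum has mean λ·(αθ) and variance λ·E[X²] = λα(α + 1)θ², which should match the Tweedie form φµ^p. A sketch with hypothetical parameter values:

```python
# Numeric check of the Tweedie relationships (hypothetical parameters).
lam, alpha, theta = 2.0, 3.0, 0.5   # Poisson frequency; Gamma shape and scale

mu = lam * (alpha * theta)                            # Tweedie mean
p = (alpha + 2) / (alpha + 1)                         # Tweedie power parameter
phi = lam ** (1 - p) * (alpha * theta) ** (2 - p) / (2 - p)

tweedie_var = phi * mu ** p                           # variance in Tweedie form
compound_var = lam * alpha * (alpha + 1) * theta ** 2 # lambda * E[X^2] directly
print(tweedie_var, compound_var)
```

The two variance expressions agree, confirming that the Tweedie parameters (µ, p, φ) are just a reparameterization of the underlying (λ, α, θ).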
Logit and Logistic Functions
Logit: g(µ) = ln[µ/(1 − µ)]. The ratio µ/(1 − µ) is known as the odds (e.g., a thousand to one).
Logistic function (inverse of logit): 1/(1 + e^(−x)).
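A quick sketch of both functions, confirming they are inverses and illustrating the "thousand to one" odds example:

```python
import math

def logit(mu):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(mu / (1 - mu))

def logistic(x):
    """Inverse of the logit: maps the linear predictor back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

# Odds of a thousand to one correspond to mu = 1000/1001.
print(logit(1000 / 1001))
print(logistic(logit(0.3)))
```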
Why continuous predictor variables should usually be logged and exceptions
Continuous variables should usually be logged when a log link function is used to allow GLMs flexibility in fitting
different curve shapes to the data (other than just exponential growth).
Exceptions to the general rule of logging a continuous predictor variable exist, such as using a year variable to pick up trend effects. Also, if the variable contains values of 0, an adjustment such as adding 1 to all observations must first be made, since ln(0) is undefined.
Impact of choosing a level with fewer observations as the base level of a categorical variable
This will still result in the same predicted relativities for that variable (re-based to the chosen base level), but there will be wider confidence intervals around the estimated coefficients.
Matrix form of a GLM
g(µ) = Xβ, where µ is the vector of µi values, β is the vector of β parameters, and X is called the design matrix.
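Building the design matrix also illustrates how a categorical variable enters the model: one 0/1 indicator column per non-base level, with the base level absorbed into the intercept. A sketch with hypothetical territory levels:

```python
import numpy as np

# Design matrix X for one categorical variable (hypothetical levels),
# with level "A" chosen as the base level.
levels = ["A", "B", "C"]
base = "A"
observations = ["B", "A", "C", "B"]

non_base = [lvl for lvl in levels if lvl != base]
# An intercept column of 1s plus a 0/1 indicator column per non-base level.
X = np.array([[1.0] + [1.0 if obs == lvl else 0.0 for lvl in non_base]
              for obs in observations])
print(X)
```

Base-level observations get all-zero indicators, so their prediction comes from the intercept alone; every other level's coefficient is measured relative to the base.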
Degrees of freedom for a model
The degrees of freedom of a model is the number of
parameters that need to be estimated for the model.
GLM outputs for each predicted coefficient
Standard error
p-value: the estimated probability that a coefficient at least as far from 0 (in absolute value) as the estimated β would arise by pure chance if the true coefficient were 0
Confidence interval
How number of observations and dispersion parameter impact p-values
p-values (and standard errors and confidence intervals) will be smaller with larger datasets that have more observations. They will also be smaller with smaller values of φ.
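As a back-of-envelope sketch of this scaling: for an intercept-only model the standard error of the estimated mean behaves like √(φ/n), so it shrinks as n grows and as φ shrinks (this simplified formula is an illustration, not the general GLM standard-error computation):

```python
import math

def approx_se(phi, n):
    """Rough standard error of an estimated mean: sqrt(phi / n)."""
    return math.sqrt(phi / n)

# Quadrupling the data halves the standard error; smaller phi also shrinks it.
print(approx_se(2.0, 100), approx_se(2.0, 400), approx_se(0.5, 100))
```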
Problem and options for GLMs with highly correlated
variables
This can result in an unstable model with erratic coefficients that have high standard errors. Two options for dealing with very high correlation include:
- Removing all highly correlated variables except one. This eliminates the high correlation in the model, but it also potentially loses some unique information contained in the eliminated variables.
- Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a new subset of variables from the correlated variables, and use this subset of variables in the GLM. The downside is the additional time required to do this extra analysis.
Define multicollinearity and give a way to detect it
Multicollinearity occurs when there is a near-perfect linear dependency among 3 or more predictor variables. For example, suppose x1 + x2 ≈ x3. This is more difficult to detect since both x1 and x2 may not be individually highly correlated with x3. When multicollinearity is present in a model, the model may become unstable with erratic coefficients, and it may not converge to a solution. One way to detect multicollinearity is to use the variance inflation factor
(VIF) statistic, which is given for each predictor variable, and measures the impact on the squared standard error for that variable due to collinearity with other predictor variables by seeing how well other predictor variables can predict the variable in question. VIF values of 10 or greater are considered high.
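The VIF can be computed by hand: regress each predictor on the others and take 1/(1 − R²). A sketch with hypothetical data engineered so that x3 ≈ x1 + x2, the multicollinearity case described above:

```python
import numpy as np

def vif(X, j):
    """VIF of column j of X: 1 / (1 - R^2) from regressing it on the rest."""
    n = X.shape[0]
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(n), others])     # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
    resid = y - A @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# Hypothetical predictors with a near-perfect linear dependency x3 ~ x1 + x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
x3 = x1 + x2 + np.array([0.01, -0.01, 0.02, -0.02, 0.01, -0.01])
X = np.column_stack([x1, x2, x3])
print(vif(X, 2))  # far above the rule-of-thumb threshold of 10
```

Note that x3's pairwise correlation with x1 or x2 alone need not look alarming; the VIF catches the joint dependency.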
Define aliasing and how GLM software deals with it
When there is a perfect linear dependency among predictor variables, those variables are aliased. The GLM will not converge in this case, but most GLM software will detect this and automatically remove one of those variables from the model.
2 important limitations of GLMs
- GLMs give full credibility: The estimated coefficients are not credibility-weighted to recognize low volumes of data or high volatility. This concern can be partially addressed by looking at p-values or standard errors.
- GLMs assume that the randomness of outcomes is uncorrelated: Two examples of violations of this are:
• Using a dataset with several renewals of the same policy, since the same insured over different renewals is likely to have correlated outcomes.
• When the data can be affected by weather, since the same weather events are likely to cause similar outcomes for risks in the same areas.
Components of model-building process
- Setting goals and objectives
- Communication with key stakeholders
- Collecting and processing the data
- Conducting exploratory data analysis
- Specifying the form of the model
- Evaluating the model output
- Validating the model
- Translating the model results into a product
- Maintaining and rebuilding the model
Considerations in merging policy and claim data
• Matching claims to specific vehicles/drivers (for auto) or specific coverages.
• Are there timing differences between the datasets? How often is each updated? Timing differences can cause record matching problems.
• Is there a unique key to merge the data (e.g., policy
number)? There is the potential for orphaned claims if there is no matching policy record, or duplicating claims if there are multiple policy records.
• Level of aggregation before merging? Time dimension (e.g., CY)? Policy level versus claimant/coverage level? For commercial, location level or policy level?
• Are there fields not needed? Are there fields desired that are not present?
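The orphaned-claims check above can be done with a merge indicator. A sketch with hypothetical policy and claim tables, merging on policy number:

```python
import pandas as pd

# Hypothetical policy and claim data merged on a policy-number key.
policies = pd.DataFrame({"policy_no": [101, 102, 103],
                         "premium":   [500.0, 750.0, 600.0]})
claims = pd.DataFrame({"policy_no": [101, 101, 999],
                       "loss":      [1000.0, 250.0, 400.0]})

merged = claims.merge(policies, on="policy_no", how="left", indicator=True)
# "left_only" rows are orphaned claims with no matching policy record.
orphans = merged[merged["_merge"] == "left_only"]
print(orphans)
```

The same indicator (`right_only` after a full outer merge) would surface policies with no claims, and duplicate policy records would show up as an inflated claim row count after the merge.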
Considerations in Modifying the Data
- Check for duplicate records and remove them
- Check categorical field values against documentation (i.e., are there code values not in the documentation, and are these new codes or errors?)
• Check reasonability of numerical fields (e.g., negative
premiums, significant outliers)
• Decide how to handle errors and missing values (e.g., how much time to investigate, anything systematic about these records such as a specific location, maybe discard these records or replace the bad values with average values or an error flag)
• Convert continuous variables into categorical (called
binning)? Group levels in categorical variables? Combine or separate variables?
Other possible data adjustments before modeling
- Capping large losses
- Removing catastrophe (cat) losses or giving them less weight
- Developing losses
- On-leveling premiums for LR models
- Trending exposures and losses
Purpose of using a separate dataset for testing
After we build a model on a set of data, it would be inappropriate to test the model on that same data, since that would give us biased results of the model's performance. Adding more variables will always make the model fit the training data better, but it may not fit other datasets better, because the model implicitly begins treating the random noise in the training data as part of the systematic signal. We want to pick up as much signal as possible with minimal noise. As such, before we build our model, we will want to split the data into at least 2 parts: the training set and the test (aka holdout) set.
List 3 Model Testing Strategies
• Train and test: Split the data into a single training set and a single test set, either randomly or on a time basis. The advantage of splitting by time is that a random split would place the same weather events in both datasets, which can result in over-optimistic validation results.
- Train, validate, and test: Split data into 3 - a training set, a validation set, and a test set. The validation set can be used to refine the model and make tweaks, but the test set should still be left until the model is final.
- Cross-validation: This is less common in insurance since variables are often hand-picked. There are different ways to do cross-validation, but the most common is called k-fold cross-validation.
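The index bookkeeping behind k-fold cross-validation can be sketched as follows (no shuffling here; in practice the observations would usually be randomized first):

```python
import numpy as np

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds of n rows."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Each of the 10 observations serves as test data in exactly one fold.
for train_idx, test_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(test_idx))
```

The model is refit k times, each time on the training indices, and performance is averaged over the k held-out folds.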