A2. GLM Flashcards
(31 cards)
Advantages of using a log link function for ratemaking
- Simple and practical to implement
- Guarantees positive premiums
- Impact of risk characteristics is more intuitive (multiplicative rating structure)
2 uses of offset terms
- Incorporate pre-determined values for certain variables
- When the target variable varies directly with a particular measure (e.g., claim counts with exposure)
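A minimal sketch (not from the source) using statsmodels with hypothetical data: a Poisson claim-count GLM with the default log link, where log(exposure) enters as an offset with a fixed coefficient of one, and exponentiated coefficients read as multiplicative relativities.
```python
# Minimal sketch with hypothetical data: Poisson claim-count GLM where
# log(exposure) is an offset, so counts vary directly with exposure.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "claims":   [0, 1, 0, 2, 1, 0],
    "exposure": [0.5, 1.0, 0.8, 1.0, 1.0, 0.25],
    "age":      [25, 40, 33, 22, 55, 30],
})

X = sm.add_constant(df[["age"]])
fit = sm.GLM(
    df["claims"], X,
    family=sm.families.Poisson(),     # log link is the Poisson default
    offset=np.log(df["exposure"]),    # pre-determined coefficient of 1, not estimated
).fit()

print(np.exp(fit.params))             # under the log link these are multiplicative relativities
```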
2 solutions for dealing with correlated variables
- Remove all but one of the correlated variables (but could lose unique information)
- Use a dimensionality reduction technique such as PCA or factor analysis (but takes additional time)
Problem with correlations among variables
Could produce an unstable model with erratic coefficients that have high standard errors
2 uses of weights assigned to each observation
- When an observation contains grouped information, e.g., a row is the average of several claims (see the sketch below)
- When different observations represent different time periods
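A minimal sketch of the grouped-data case, again with statsmodels and hypothetical data: each row is an average severity over several claims, and the claim count is passed as the weight.
```python
# Minimal sketch with hypothetical data: Gamma severity GLM where each row is
# an average over a group of claims, so the claim count is used as a weight.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "avg_severity": [1200.0, 950.0, 1800.0, 700.0, 1100.0],
    "claim_count":  [10, 4, 1, 25, 7],    # grouped information behind each row
    "vehicle_age":  [2, 7, 12, 4, 9],
})

X = sm.add_constant(df[["vehicle_age"]])
fit = sm.GLM(
    df["avg_severity"], X,
    family=sm.families.Gamma(link=sm.families.links.Log()),
    var_weights=df["claim_count"],        # rows backed by more claims get more weight
).fit()
```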
Define multicollinearity
Nearly perfect linear dependency among two or more predictors
How to detect multicollinearity
Use the VIF (variance inflation factor)
A VIF >= 10 is considered high
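A minimal sketch using statsmodels (not from the source): compute a VIF per column of a design matrix and compare against the rule-of-thumb threshold of 10.
```python
# Minimal sketch: VIF per predictor; X is a pandas DataFrame of predictors
# (assumed to already include a constant column).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return the VIF for each column; values >= 10 flag likely multicollinearity."""
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
```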
Define aliasing
Perfect linear dependency between predictors
The GLM will not converge
2 GLM limitations
- GLMs give full credibility to the data, even with low volume or high volatility
- GLMs assume the randomness of outcomes is uncorrelated (e.g., renewals of the same policy, weather events)
4 Advantages of modeling frequency/severity over pure premium
- Gain more insight and intuition about the impact of each predictor variable
- Frequency and severity are each more stable when modeled separately
- Pure premium modeling can lead to overfitting if a predictor variable impacts only frequency or only severity, not both
- The Tweedie distribution used for pure premium models assumes frequency and severity move in the same direction (which may not be true)
2 disadvantages of modeling frequency/severity over pure premium
- Requires more data
- Takes more time to build two models
4 ways to transform variables in a GLM
- Bin the variable (increases degrees of freedom, so there is more to estimate, which may lead to overfitting; may result in inconsistent or impractical patterns; variation within bins is ignored)
- Add polynomial terms (loss of interpretability without a graph; higher-order polynomials can behave erratically at the edges of the data)
- Add piecewise linear functions (add a hinge function max(0, Xj - c) at each break point c; break points must be chosen manually; see the sketch after this list)
- Natural cubic splines (combine piecewise functions and polynomials; result in a continuous curve that fits the edges of the data better, but a graph is needed to interpret the model)
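A minimal sketch of the hinge-function transform for piecewise linear terms; the column name and break points below are hypothetical choices the modeler must make.
```python
# Minimal sketch: add hinge terms max(0, x - c) so the slope of `col` can
# change at each manually chosen break point c.
import numpy as np
import pandas as pd

def add_hinges(df: pd.DataFrame, col: str, break_points: list[float]) -> pd.DataFrame:
    out = df.copy()
    for c in break_points:
        out[f"{col}_hinge_{c}"] = np.maximum(0.0, out[col] - c)
    return out

# e.g., add_hinges(df, "age", [25, 60]) lets the fitted slope of age change at 25 and 60
```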
Why is model selection different from model refinement
- Some candidate models may be proprietary
- The decision on the final model may be a business decision, not a technical one
3 methods to test model stability
- Cook's distance for individual records (records with high Cook's distance should be given additional scrutiny as to whether to include them; see the sketch after this list)
- Cross-validation, comparing parameter estimates across folds
- Bootstrapping to compare the mean and variance of parameter estimates
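A minimal sketch of the Cook's distance check with statsmodels, assuming `fit` is a fitted GLMResults object (e.g., from the sketches above); the 4/n threshold is a common rule of thumb, not from the source.
```python
# Minimal sketch: flag influential records for extra scrutiny.
import numpy as np

influence = fit.get_influence()
cooks_d = influence.cooks_distance[0]       # statsmodels returns (distances, p-values)
threshold = 4 / len(cooks_d)                # rule-of-thumb cutoff (assumption)
flagged = np.where(cooks_d > threshold)[0]  # indices of records to review
```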
4 lift-based measures
- Simple quantile plots
- Double lift charts
- Loss ratio charts
- Gini index
Describe double lift charts
Calculate the sort ratio (sort ratio = model 1 predicted loss cost / model 2 predicted loss cost)
Sort by the sort ratio and bucket into quantiles
Calculate the average predicted loss cost for each model and the average actual loss cost in each quantile. Divide each by the overall average loss cost and plot.
Winning model: the one whose predictions best match the actual loss cost in each quantile
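A minimal sketch of the double lift calculation with pandas, assuming a holdout DataFrame `df` with hypothetical columns `pred1`, `pred2` (predicted loss costs from the two models) and `actual`; buckets here have equal record counts rather than equal exposures, for brevity.
```python
# Minimal sketch: double lift chart data.
import pandas as pd

df["sort_ratio"] = df["pred1"] / df["pred2"]
df["bucket"] = pd.qcut(df["sort_ratio"], q=10, labels=False)   # deciles by sort ratio

lift = df.groupby("bucket")[["pred1", "pred2", "actual"]].mean()
lift = lift / df["actual"].mean()   # scale by the overall average actual loss cost
# plot the three columns of `lift`; the model that tracks `actual` most closely wins
```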
Describe simple quantile plots
Sort data based on predicted loss costs
Then bucket into quantiles with equal exposures
Calculate average predicted loss cost & average actual loss cost for each bucket and graph
Winning model: judged on predictive accuracy (the gap between actual and predicted) and monotonicity (the actual pure premium should increase across quantiles)
The vertical distance in actual loss cost between the first and last quantiles should be large (lift)
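A minimal sketch of the simple quantile plot data with pandas, assuming a holdout DataFrame `df` with hypothetical columns `pred` and `actual`; ranking before qcut avoids duplicate bin edges, and buckets have equal record counts rather than equal exposures.
```python
# Minimal sketch: simple quantile plot data.
import pandas as pd

df["bucket"] = pd.qcut(df["pred"].rank(method="first"), q=5, labels=False)
quantile_plot = df.groupby("bucket")[["pred", "actual"]].mean()
# plot both columns: look for small gaps (accuracy), a rising actual line
# (monotonicity), and a large spread between the first and last buckets (lift)
```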
Describe loss ratio charts
Sort based on predicted LRs
Bucket into quantiles with equal exposures
Calculate the actual LRs for each quantile & plot
The greater the vertical distance between the lowest and highest loss ratios, the better the model is at identifying segmentation opportunities not captured by the current rating plan
The LR should increase monotonically across quantiles
This is the easiest one to understand
Describe Gini index
Measures the model's ability to identify the best and worst risks
Sort holdout dataset based on the predicted loss cost.
Plot the cumulative percent of exposures on the x-axis and the cumulative percent of actual losses on the y-axis
The curve formed is the Lorenz Curve.
Compare it with the line of equality
The Gini index is twice the area between the Lorenz curve and the line of equality
The higher the Gini index, the better
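A minimal sketch of the Gini index from the Lorenz curve, assuming numpy arrays of predicted loss costs, actual losses, and exposures for a holdout set.
```python
# Minimal sketch: Lorenz curve and Gini index.
import numpy as np

def gini_index(pred, actual, exposure):
    order = np.argsort(pred)                                                  # sort by predicted loss cost
    x = np.concatenate(([0.0], np.cumsum(exposure[order]) / exposure.sum()))  # cumulative % of exposures
    y = np.concatenate(([0.0], np.cumsum(actual[order]) / actual.sum()))      # cumulative % of actual loss
    lorenz_area = np.trapz(y, x)                                              # area under the Lorenz curve
    return 2 * (0.5 - lorenz_area)                                            # twice the area to the line of equality
```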
Sensitivity formula
True positives / total event occurrences
Specificity formula
True negatives / total event non-occurrences
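A minimal sketch of both formulas with numpy, assuming 0/1 arrays `y` (actual events) and `y_hat` (predicted events); the names are placeholders.
```python
# Minimal sketch: sensitivity and specificity from binary actuals and predictions.
import numpy as np

def sensitivity(y, y_hat):
    return ((y == 1) & (y_hat == 1)).sum() / (y == 1).sum()    # true positives / all events

def specificity(y, y_hat):
    return ((y == 0) & (y_hat == 0)).sum() / (y == 0).sum()    # true negatives / all non-events
```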
Partial Residual formula
r_i = (y_i - mu_i) * g'(mu_i) + beta_j * x_ij
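A minimal sketch of the partial residual formula for a log-link GLM (where g'(mu) = 1/mu), assuming `fit` is a fitted statsmodels GLMResults, `X` the numpy design matrix, and `y` the response used in the fit.
```python
# Minimal sketch: partial residuals for the j-th predictor of a log-link GLM.
import numpy as np

def partial_residuals(fit, X, y, j):
    mu = np.asarray(fit.fittedvalues)               # mu_i
    beta_j = np.asarray(fit.params)[j]
    return (np.asarray(y) - mu) / mu + beta_j * np.asarray(X)[:, j]   # (y - mu)*g'(mu) + beta_j*x_ij
```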
Scaled Deviance formula
2 * (log-likelihood of the saturated model - log-likelihood of the model)
Unscaled Deviance formula
Dispersion parameter * scaled deviance
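A minimal sketch relating the two cards above to statsmodels output, assuming `fit` is a fitted GLMResults: `fit.deviance` is the unscaled deviance and `fit.scale` the estimated dispersion.
```python
# Minimal sketch: unscaled and scaled deviance from a fitted statsmodels GLM.
unscaled_deviance = fit.deviance
scaled_deviance = fit.deviance / fit.scale   # = 2 * (ll_saturated - ll_model)
```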