Anki Flashcards

1
Q

How to check whether the model is underfitting

A

Compare the model performance against a simple model such as the average target value or a GLM with only a few predictors
✔ Underfitting if performance is the same or worse
✔ It is not sufficient to just look at the training and testing error

2
Q

BIC

A

Bayesian Information Criterion
Used to compare GLMs
Lower is better
Minimize error and maximize likelihood

p*log(nrow(train)) - 2log(likelihood)
p = # parameters
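
A minimal R sketch of how this comparison looks in practice; the data frame train and its columns y, x1, x2 are hypothetical:

```
# Hypothetical data frame `train` with target y and predictors x1, x2
fit1 <- glm(y ~ x1, data = train, family = Gamma(link = "log"))
fit2 <- glm(y ~ x1 + x2, data = train, family = Gamma(link = "log"))

AIC(fit1); AIC(fit2)  # lower is better
BIC(fit1); BIC(fit2)  # lower is better

# BIC by hand (assuming no rows were dropped for missing values):
ll <- logLik(fit1)
attr(ll, "df") * log(nrow(train)) - 2 * as.numeric(ll)  # matches BIC(fit1)
```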

3
Q

__ is when there is no pattern between the missingness and the value of the variable

A

Missing at random (MAR)

4
Q

When predicting whether or not a policy will file claims, True Negatives (TN) are policies which

A

When a policy is predicted to not have a claim and does not have a claim

5
Q

Describe the bias-variance-tradeoff

A
  • the tradeoff between bias (underfitting) and variance (overfitting)
  • increasing the bias will often decrease the variance
  • increasing the variance will often decrease the bias

Mean Squared Error = variance + bias^2 + irreducible error

6
Q

When the distribution of a predictor variable is right-skewed, you should ___

A

apply a log transform
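
A minimal sketch in R, assuming a hypothetical right-skewed column income in the data frame train:

```
hist(train$income)                     # right-skewed
train$log_income <- log(train$income)  # values must be positive; use log1p() if zeros are present
hist(train$log_income)                 # closer to symmetric
```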

7
Q

True/False: The goal of feature selection is to choose the features which are most predictive and discard those which are not

A
False. Features may be predictive but excluded because of:
✔ Racial or ethical concerns
✔ Limitations of future availability
✔ Stability of the data over time
✔ Inexplicability
8
Q

Decision trees identify the optimal variable-split-point combination by measuring ____ or ____

A

Entropy or Gini

9
Q

When predicting whether or not a policy will file claims, True Positives (TP) are policies which

A

When a policy is predicted to have a claim and actually does have a claim

10
Q

The set of simplifying assumptions made by the model is called

A

bias

11
Q

GLM response distributions that are strictly positive

A

Poisson (discrete)
gamma (continuous)
inverse gaussian (continuous)

12
Q

For the regression metric RMSE, is higher or lower better?

A

Lower

“Minimize error and maximize likelihood”
RMSE = root mean squared error

13
Q

How to check whether the model is overfitting

A

Train error is much better than the test error

14
Q

One disadvantage of ___ models is that the predictor variables need to be uncorrelated.

A

GLM

15
Q

Penalized regression model(s) where variables are removed by having their coefficients set to zero

A

LASSO and Elastic Net

16
Q

When fitting a GLM, if the distribution of the target variable is right-skewed, you should ____

A

use a log link function

17
Q

What is the objective of the k-means algorithm?

A

To partition the observations into k groups such that the sum of squares from points to the assigned cluster centers is minimized

18
Q

The variable “body mass index” contains missing values because the laptop that they were stored on had coffee spilled on it. This is an example of

A

Missing at random (MAR).

✔ There is no pattern between whether the value is missing and the target value.
✔ Observations can safely be omitted from the data with no loss in predictive power besides the smaller sample size.
✔ If > 20% of records are missing, consider removing the variable altogether.

19
Q

When running k-means clustering, it is best to use multiple starting configurations (n.starts between 10-50) and then keep the configuration with the lowest total within-cluster sum of squares because this reduces the likelihood of ____

A

Getting stuck at a local minimum as opposed to the global minimum of the sum of squared errors between the cluster centers and each of the points

20
Q

When a hierarchical clustering algorithm uses single linkage, the distances between two clusters are computed by

A

Computing the distances between all points between clusters A and B and then using the smallest

21
Q

Define an interaction effect

A

When the impact of a predictor variable on the target variable differs based on the value of another predictor variable

22
Q

One disadvantage of ____ models is that they are unable to detect non-linear relationships between the predictor variables and the target.

A

GLMs

23
Q

When a hierarchical clustering algorithm uses complete linkage (the default), the distances between two clusters are computed by

A

Computing the distances between all points between clusters A and B and using the largest.
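
A minimal R sketch of both linkage options; the numeric data frame df is hypothetical (note that complete linkage is hclust's default in R):

```
d <- dist(scale(df))                           # pairwise Euclidean distances
hc_complete <- hclust(d, method = "complete")  # complete linkage (the default)
hc_single   <- hclust(d, method = "single")    # single linkage
plot(hc_complete)                              # dendrogram
```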

24
Q

AIC

A

Akaike Information Criterion
Used to compare GLMs
Lower is better

2p - 2log(likelihood)
p = # parameters

25
Q

When fitting a Decision Tree, if the distribution of a predictor variable is right-skewed, you should ____

A

Do nothing, because tree splits are based on the rank ordering of the predictor, so applying a log transform would make no difference in performance

26
Q

Not being complex enough to capture the signal in the data is called

A

The bias of the model
✔ High bias is the same as underfitting

27
Q

One of the assumptions of a GLM is that the _____ is related to the linear predictor through a link function

A

mean of the target distribution

28
Q

You should combine observations from factor levels with few observations into new groups that have more observations because doing so ____

A

reduces the dimension of the data set and increases predictive power

29
Q

The amount by which the model will change given different training data is also called

A

Variance
✔ High variance is the same as overfitting

30
Q

Pearson's Goodness of Fit Statistic

A

Used to measure the fit of Poisson (counting) models
Lower is better

31
Q

In GLMs, we set the base factor levels to be the ones with the most observations because

A

This makes the GLM coefficients more stable because the intercept term is estimated with the largest sample size

32
Q

The expected loss (error) from the model being too complex and sensitive to random noise is called _

A

The variance of the model

33
Q

During data preparation, the only times that you should not combine factor levels with few observations together are when

A

✔ The mean of the target is not similar between the levels
✔ It would make the results less interpretable
✔ The project statement says not to

34
Q

Advantages of single decision trees

A

✔ Easy to interpret
✔ Performs variable selection
✔ Categorical variables do not require binarization for each level to be used as a separate predictor
✔ Captures interactions
✔ Captures non-linearities
✔ Handles missing values

35
Q

Describe imbalanced data

A

The target is a binary outcome with more observations of one class (the majority) than the other (the minority)

36
Q

In binary classification, what is the interpretation of the model metric AUC when it is close to 0.5?

A

The model is doing no better than random guessing

37
Q

A "drug use" variable has missing values because some respondents were reluctant to admit that they have broken the law. This is an example of ___

A

Missing not at random (MNAR)

38
Q

When using this penalized regression model, the sizes of coefficients are reduced (shrunk) but never set to zero

A

Ridge Regression

39
Q

When using a ___ link function, the coefficients can be explained as the impact on a z-score for a Normal distribution

A

probit

40
Q

One of the assumptions of ____ is that the target variable has a specific distribution

A

GLM

41
Q

For the metric log-likelihood, is higher or lower better?

A

Higher

42
Q

Lambda (Elastic Net)

A

✔ Determines the strength of regularization to use
✔ R tests a sequence of lambda values using cross-validation and then chooses the one with the lowest test error

(1/2)MSE + λ(penalty)
?glmnet

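A minimal R sketch with glmnet; the predictor matrix x and target vector y are hypothetical:

```
library(glmnet)
cv <- cv.glmnet(x, y, alpha = 0.5)   # cross-validates a sequence of lambda values
cv$lambda.min                        # the lambda with the lowest CV error
coef(cv, s = "lambda.min")           # coefficients at that lambda
```
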
43
Q

For the regression metric Mean Absolute Error (MAE), is higher or lower better?

A

Lower

44
Q

Using a ___ link function with a GLM results in a multiplicative model

A

log

45
Q

How do GLMs handle missing values?

A

They get removed automatically by most software. This can result in a loss of useful information that could be predictive if there is any pattern in the missing values.

46
Q

When the target distribution is strictly positive and continuous, the best GLM response distributions are

A

Gamma
Inverse Gaussian

47
Q

How do decision trees handle interactions?

A

Because decision trees use a series of conditional yes/no questions, the impact that a predictor has can be different depending on which previous splits were used

48
Q

When predicting whether or not a policy will file claims, False Positives (FP) are policies which

A

Predicted to have a claim but did not actually have a claim

49
Q

Formula: Sensitivity or True Positive Rate (TPR)

A

TP / (TP + FN)

50
Q

When predicting whether or not a policy will file claims, False Negatives (FN) are policies which

A

Predicted to not have a claim but actually did have a claim

51
Q

When fitting a Bagged Tree, if the distribution of the target variable is right-skewed, you should ____

A

Do nothing, because tree-based models automatically capture monotonic transformations

52
Q

In binary classification, a logit has an AUC of 0.99. What are the approximate sensitivity (TPR) and specificity (TNR) values?

A

Both are close to 1

53
Q

Disadvantages of hierarchical clustering

A

✔ Does not work well on large data sets because it can be difficult to determine the correct number of clusters from the dendrogram
✔ Computational complexity can result in very long run times (as opposed to k-means, which is faster)

54
Q

GLM output: Normal Q-Q graph

A

✔ The normal quantile-quantile graph shows the theoretical quantiles against the observed quantiles of the deviance residuals
✔ The residuals should always be normally distributed regardless of the GLM response family (except for binomial)
✔ Some deviation along the upper and lower quantiles is acceptable; this indicates that the residuals have a "fat tail"

55
Q

Describe the Tweedie Distribution

A

✔ A GLM response distribution which is a good fit for insurance claims data when there is over-dispersion, such as having many values of zero
✔ Models frequency as well as severity at the same time

56
Q

GLM offset

A

✔ A constant term that is added to the linear predictor
✔ The same as including a variable which has a coefficient equal to 1
✔ On Exam PA, offsets only appear 1. with Poisson regression, 2. with a log link function, and 3. as a measure of exposure, such as the length of the policy period
✔ Remember to apply a log to the offset when using a log link function

57
Q

In binary classification, what is the interpretation of the model metric AUC when it is close to 1.0?

A

The model predicts the target perfectly

58
Q

Advantages of hierarchical clustering

A

✔ The dendrogram helps to understand the data
✔ Is the best fit for hierarchical data (i.e., geography such as city, state, country)
✔ Shows how much clusters differ based on the height of the dendrogram branches
✔ No input parameters

59
Q

In a GLM, what does the p-value of a coefficient represent?

A

For a given coefficient estimate, the p-value is an estimate of the probability of a value of that magnitude (or higher) arising by pure chance

60
Q

How do GLMs handle interactions?

A

They need to be added manually

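A minimal R sketch of adding an interaction manually; the data frame train and the predictors age and region are hypothetical:

```
fit <- glm(y ~ age * region, data = train, family = Gamma(link = "log"))
# age * region expands to age + region + age:region (the interaction term)
```
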
61
Q

In logistic regression, what is the formula to convert the linear predictor, z, to the probability, p?

A

p = e^z / (1 + e^z)

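A quick check of the formula in R; plogis() is base R's built-in inverse logit:

```
z <- c(-2, 0, 2)           # example linear predictor values
exp(z) / (1 + exp(z))      # 0.1192, 0.5000, 0.8808
plogis(z)                  # same values
```
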
62
Q

The process of k-means

A

1. Select the number of clusters, k
2. Randomly assign cluster centers
3. Put each point into the cluster whose center is closest
4. Move each cluster center to the mean of the points assigned to it
5. Repeat steps 3-4 until the centers stop moving
6. Repeat the whole process n.starts times to reduce the randomness of choosing the initial cluster centers

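A minimal R sketch of the algorithm; the numeric data frame df is hypothetical and k = 3 is arbitrary:

```
set.seed(42)
km <- kmeans(scale(df), centers = 3, nstart = 25)  # 25 random starting configurations
km$centers        # final cluster centers
km$tot.withinss   # total within-cluster sum of squares (the quantity minimized)
```
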
63
Q

Define the curse of dimensionality

A

1. When there are more features than observations (p > n), we run the risk of overfitting the model. Using a dimensionality reduction method (PCA) or a model which performs feature selection can help
2. When there are too many features, observations become harder to cluster because every observation in the data appears equidistant from the others. If the distances are all approximately equal, then all the observations appear equally alike

64
Q

Describe Principal Component Analysis (PCA)

A

A dimensionality reduction method which converts potentially correlated variables into a smaller set of linearly independent new variables called principal components (PCs)
✔ PCs are ordered so that the earliest PCs retain as much information from the original data as possible
✔ Scaling is applied to each variable prior to fitting
✔ The size and sign of the PC loadings are useful for interpretation

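A minimal R sketch using prcomp(); the numeric data frame df is hypothetical:

```
pca <- prcomp(df, center = TRUE, scale. = TRUE)  # scale each variable before fitting
summary(pca)      # proportion of variance explained by each PC
pca$rotation      # loadings; their size and sign aid interpretation
```
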
65
Q

Advantages of boosted trees

A

✔ High accuracy
✔ Effective in a wide range of applications
✔ Handles non-linearities, interaction effects, and missing data

66
Q

Claim frequency model
```
Target: ?
Distribution: ?
Link: ?
Weight: ?
Offset: ?
```

A

```
Target: Counting variable
Distribution: Poisson
Link: Log
Weight: None
Offset: log(# of exposures)
```

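A minimal R sketch of this model; the data frame train and its columns claim_count, exposure, age, and region are hypothetical:

```
freq <- glm(claim_count ~ age + region,
            family = poisson(link = "log"),
            offset = log(exposure),  # log the exposure offset under a log link
            data = train)
```
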
67
Q

GLM output: Residuals vs. fitted

A

Good fit:
✔ All points are centered near zero on the y-axis and spread out symmetrically along the x-axis
✔ This indicates that the variance is constant
✔ The mean of the residuals is near zero

68
Q

Disadvantages of single decision trees

A

✔ Lacks predictive power
✔ Can easily overfit to the data
✔ Often oversimplifies the underlying process because all observations at a terminal node receive the same predicted value

69
Q

Tweedie distribution power variance parameter

A

```
The power variance parameter, p, specifies the distribution:
p = 0: Gaussian
p = 1: Poisson
p = 2: Gamma
p = 3: Inverse Gaussian
```

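A minimal R sketch, assuming the statmod package for its tweedie() family; the data frame train and its columns are hypothetical:

```
library(statmod)
fit <- glm(loss ~ age + region,
           family = tweedie(var.power = 1.5, link.power = 0),  # 1 < p < 2; log link
           data = train)
```
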
70
Q

Disadvantages of bagged trees

A

✔ High complexity
✔ Difficult to interpret
✔ Requires a lot of computation power

71
Q

Advantages of GLMs

A

✔ Easy to interpret
✔ Can easily be deployed to spreadsheet format
✔ Handles different response distributions
✔ Commonly used in insurance ratemaking

72
Q

Bagging vs Boosting: Predictions are made? Easy to overfit? Improves predictive power by?

A

Bagging:
✔ Predictions are made in parallel
✔ Not easy to overfit
✔ Improves predictive power by reducing variance

Boosting:
✔ Predictions are made sequentially
✔ Easy to overfit
✔ Improves predictive power by reducing variance and bias

73
Q

Disadvantages of boosted trees

A

✔ High complexity
✔ Hard to interpret
✔ Easy to overfit if not tuned correctly
✔ Requires a lot of computation power

74
Q

Advantages of bagged trees

A

✔ High accuracy
✔ Resilient to overfitting due to bagging
✔ Only two parameters to tune (mtry, ntrees)
✔ Handles non-linearities, interaction effects, and missing data

75
Q

Disadvantages of GLMs

A

✔ Does not select features without additional techniques
✔ Strict assumptions around the distribution shape and the randomness of the error terms
✔ Predictors need to be uncorrelated
✔ Unable to detect non-linearity (without manual adjustments)
✔ Sensitive to outliers
✔ Low predictive power

76
Q

Formula: Specificity or True Negative Rate (TNR)

A

TN / (TN + FP)

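A minimal R sketch computing both rates from scratch; the actual and pred vectors are hypothetical 0/1 classes:

```
actual <- c(1, 1, 0, 0, 1, 0)
pred   <- c(1, 0, 0, 1, 1, 0)
TP <- sum(pred == 1 & actual == 1); FN <- sum(pred == 0 & actual == 1)
TN <- sum(pred == 0 & actual == 0); FP <- sum(pred == 1 & actual == 0)
TP / (TP + FN)   # sensitivity (TPR)
TN / (TN + FP)   # specificity (TNR)
```
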
77
Q

How to interpret the coefficients of a probit model

A

✔ Positive coefficients for an input variable increase the linear predictor, which is a z-score
✔ Negative coefficients decrease it
✔ Coefficients further from zero have larger effects

78
Q

Define data leakage

A

When the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict

79
Q

Both AIC and BIC penalize the log-likelihood based on ____

A

the number of parameters

80
Q

Why is binarization of dummy (indicator) variables performed?

A

For stepwise selection in GLMs, in order to remove individual factor levels rather than keep or remove the whole variable

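A minimal R sketch using model.matrix(); the factor region in the data frame train is hypothetical:

```
dummies <- model.matrix(~ region - 1, data = train)  # one 0/1 column per level
train2  <- cbind(train, dummies)
# stepwise selection can now keep or drop each level's column individually
```
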
81
Q

Define multicollinearity in GLMs

A

✔ The correlation between any two predictors is large
✔ Any predictor is a linear combination of the others

Solutions:
✔ Remove all but one of the correlated predictors
✔ Preprocess the data using PCA
✔ Use a tree-based model

82
Q

Accuracy

A

✔ The percentage of observations which are classified correctly
✔ Fails when we have imbalanced classes; in those cases AUC is more appropriate

83
Q

Decision Tree Complexity Parameter (CP)

A

?rpart
The CP value represents the "minimum benefit" that a split must add to the tree
```
cp = 0: no restrictions -> results in a tall tree -> high complexity -> high variance
cp = 0.01 (default): each split must improve the relative fit by at least 0.01
```

84
Q

BIC favors models with fewer parameters than AIC does when ____

A

log(nrow(train_data)) > 2, i.e., when the training set has more than e^2 ≈ 7.4 observations

85
Q

Correlation Key Points

A

✔ Measures the linear association between two variables
✔ Positive correlation is when increasing one variable tends to increase the other; negative correlation is when increasing one tends to decrease the other
✔ Correlation does not equal causation

86
Q

Decision tree cost-complexity pruning

A

✔ Choose a tree that strikes a balance between having a low error and having few splits so that it can be interpreted
✔ Adjusts for overfitting (tree too complex) or underfitting (tree too simple)

Steps:
1. A decision tree with many leaves is created
2. The complexity is calculated for all subtrees using cross-validation
3. The least important branches are pruned

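A minimal R sketch of these steps with rpart; the data frame train and target y are hypothetical:

```
library(rpart)
fit <- rpart(y ~ ., data = train, control = rpart.control(cp = 0))  # grow a tall tree
printcp(fit)  # cross-validated error (xerror) for each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)  # prune the least important branches
```
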
87
Q

Area Under the Curve (AUC) probability interpretation

A

The probability that a randomly chosen observation from the positive class is ranked higher than a randomly chosen observation from the negative class

88
Q

Alpha (elastic net)

A

The elastic net mixing parameter: alpha = 0 gives ridge regression, alpha = 1 gives the LASSO, and values in between mix the two penalties

89
Q

GLM: Claim Frequency Model
```
Target Variable: Average number of claims per policy period
Response Family: ?
Link Function: ?
Offset: ?
Weight: ?
```

A

```
Response Family: Poisson
Link Function: Log
Offset: None
Weight: Policy period (or other units of exposure)
```
Results in the same predictions as the claim count model