Anki Flashcards

1
Q

How to check whether the model is underfitting

A

Compare the model performance against a simple model such as the average target value or a GLM with only a few predictors
✔ Underfitting if performance is the same or worse
✔ It is not sufficient to just look at the training and testing error

2
Q

BIC

A

Bayesian Information Criterion
Used to compare GLMs
Lower is better
Minimize error and maximize likelihood

p*log(nrow(train)) - 2log(likelihood)
p = # parameters
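
A minimal R sketch of how this comparison looks in practice; the data frame train and its columns y, x1, x2 are hypothetical:

```
# Hypothetical data frame `train` with target y and predictors x1, x2
fit1 <- glm(y ~ x1, data = train, family = Gamma(link = "log"))
fit2 <- glm(y ~ x1 + x2, data = train, family = Gamma(link = "log"))

AIC(fit1); AIC(fit2)  # lower is better
BIC(fit1); BIC(fit2)  # lower is better

# BIC by hand (assuming no rows were dropped for missing values):
ll <- logLik(fit1)
attr(ll, "df") * log(nrow(train)) - 2 * as.numeric(ll)  # matches BIC(fit1)
```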

3
Q

__ is when there is no pattern between the missingness and the value of the variable

A

Missing at random (MAR)

4
Q

When predicting whether or not a policy will file claims, True Negatives (TN) are policies which

A

When a policy is predicted to not have a claim and does not have a claim

5
Q

Describe the bias-variance-tradeoff

A
  • the tradeoff between bias (underfitting) and variance (overfitting)
  • increasing the bias will often decrease the variance
  • increasing the variance will often decrease the bias

Mean Squared Error = variance + bias^2 + irreducible error

6
Q

When the distribution of a predictor variable is right-skewed, you should ___

A

apply a log transform
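
A minimal sketch in R, assuming a hypothetical right-skewed column income in the data frame train:

```
hist(train$income)                     # right-skewed
train$log_income <- log(train$income)  # values must be positive; use log1p() if zeros are present
hist(train$log_income)                 # closer to symmetric
```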

7
Q

True/False: The goal of feature selection is to choose the features which are most predictive and discard those which are not

A
False. Features may be predictive but excluded because of:
✔ Racial or ethical concerns
✔ Limitations of future availability
✔ Stability of the data over time
✔ Inexplicability
8
Q

Decision trees identify the optimal variable-split-point combination by measuring ____ or ____

A

Entropy or Gini

9
Q

When predicting whether or not a policy will file claims, True Positives (TP) are policies which

A

When a policy is predicted to have a claim and actually does have a claim

10
Q

The set of simplifying assumptions made by the model is called

A

bias

11
Q

GLM response distributions that are strictly positive

A

Poisson (discrete)
gamma (continuous)
inverse gaussian (continuous)

12
Q

For the regression metric RMSE, is higher or lower better?

A

Lower

“Minimize error and maximize likelihood”
RMSE = root mean squared error

13
Q

How to check whether the model is overfitting

A

Train error is much better than the test error

14
Q

One disadvantage of ___ models is that the predictor variables need to be uncorrelated.

A

GLM

15
Q

Penalized regression model(s) where variables are removed by having their coefficients set to zero

A

LASSO and Elastic Net

16
Q

When fitting a GLM, if the distribution of the target variable is right-skewed, you should ____

A

use a log link function

17
Q

What is the objective of the k-means algorithm?

A

To partition the observations into k groups such that the sum of squares from points to the assigned cluster centers is minimized

18
Q

The variable “body mass index” contains missing values because the laptop that they were stored on had coffee spilled on it. This is an example of

A

Missing at random (MAR).

✔ There is no pattern between whether the value is missing and the target value.
✔ Observations can safely be omitted from the data with no loss in predictive power besides the smaller sample size.
✔ If > 20% of records are missing, consider removing the variable altogether.

19
Q

When running k-means clustering, it is best to use multiple starting configurations (n.starts between 10-50) and then keep the configuration with the lowest total within-cluster sum of squares because this reduces the likelihood of ____

A

Getting stuck at a local minimum as opposed to the global minimum of the sum of squared errors between the cluster centers and each of the points

20
Q

When a hierarchical clustering algorithm uses single linkage, the distances between two clusters are computed by

A

Computing the distances between all points between clusters A and B and then using the smallest

21
Q

Define an interaction effect

A

When the impact of a predictor variable on the target variable differs based on the value of another predictor variable

22
Q

One disadvantage of ____ models is that they are unable to detect non-linear relationships between the predictor variables and the target.

A

GLMs

23
Q

When a hierarchical clustering algorithm uses complete linkage (the default), the distances between two clusters are computed by

A

Computing the distances between all points between clusters A and B and using the largest.
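
A minimal R sketch of both linkage options; the numeric data frame df is hypothetical (note that complete linkage is hclust's default in R):

```
d <- dist(scale(df))                           # pairwise Euclidean distances
hc_complete <- hclust(d, method = "complete")  # complete linkage (the default)
hc_single   <- hclust(d, method = "single")    # single linkage
plot(hc_complete)                              # dendrogram
```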

24
Q

AIC

A

Akaike Information Criterion
Used to compare GLMs
Lower is better

2p - 2log(likelihood)
p = # parameters

25
Q

When fitting a Decision Tree, if the distribution of a predictor variable is right-skewed, you should ____

A

Do nothing, because tree splits are based on the rank ordering of the predictor, so applying a log transform would make no difference in performance

26
Q

Not being complex enough to capture the signal in the data is called

A

The bias of the model
✔ High bias is the same as underfitting

27
Q

One of the assumptions of a GLM is that the _____ is related to the linear predictor through a link function

A

mean of the target distribution

28
Q

You should combine observations from factor levels with few observations into new groups that have more observations because doing so ____

A

reduces the dimension of the data set and increases predictive power

29
Q

The amount by which the model will change given different training data is also called

A

Variance
✔ High variance is the same as overfitting

30
Q

Pearson's Goodness of Fit Statistic

A

Used to measure the fit of Poisson (counting) models
Lower is better

31
Q

In GLMs, we set the base factor levels to be the ones with the most observations because

A

This makes the GLM coefficients more stable because the intercept term is estimated with the largest sample size

32
Q

The expected loss (error) from the model being too complex and sensitive to random noise is called _

A

The variance of the model

33
Q

During data preparation, the only times that you should not combine factor levels with few observations together are when

A

✔ The mean of the target is not similar between the levels
✔ It would make the results less interpretable
✔ The project statement says not to

34
Q

Advantages of single decision trees

A

✔ Easy to interpret
✔ Performs variable selection
✔ Categorical variables do not require binarization for each level to be used as a separate predictor
✔ Captures interactions
✔ Captures non-linearities
✔ Handles missing values

35
Q

Describe imbalanced data

A

The target is a binary outcome with more observations of one class (the majority) than the other (the minority)

36
Q

In binary classification, what is the interpretation of the model metric AUC when it is close to 0.5?

A

The model is doing no better than random guessing

37
Q

A "drug use" variable has missing values because some respondents were reluctant to admit that they have broken the law. This is an example of ___

A

Missing not at random (MNAR)

38
Q

When using this penalized regression model, the sizes of coefficients are reduced (shrunk) but never set to zero

A

Ridge Regression

39
Q

When using a ___ link function, the coefficients can be explained as the impact on a z-score for a Normal distribution

A

probit

40
Q

One of the assumptions of ____ is that the target variable has a specific distribution

A

GLM

41
Q

For the metric log-likelihood, is higher or lower better?

A

Higher

42
Q

Lambda (Elastic Net)

A

✔ Determines the strength of regularization to use
✔ R tests a sequence of lambda values using cross-validation and then chooses the one with the lowest test error

(1/2)MSE + λ(penalty)
?glmnet

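A minimal R sketch with glmnet; the predictor matrix x and target vector y are hypothetical:

```
library(glmnet)
cv <- cv.glmnet(x, y, alpha = 0.5)   # cross-validates a sequence of lambda values
cv$lambda.min                        # the lambda with the lowest CV error
coef(cv, s = "lambda.min")           # coefficients at that lambda
```
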
43
Q

For the regression metric Mean Absolute Error (MAE), is higher or lower better?

A

Lower

44
Q

Using a ___ link function with a GLM results in a multiplicative model

A

log

45
Q

How do GLMs handle missing values?

A

They get removed automatically by most software. This can result in a loss of useful information that could be predictive if there is any pattern in the missing values.

46
Q

When the target distribution is strictly positive and continuous, the best GLM response distributions are

A

Gamma
Inverse Gaussian

47
Q

How do decision trees handle interactions?

A

Because decision trees use a series of conditional yes/no questions, the impact that a predictor has can be different depending on which previous splits were used

48
Q

When predicting whether or not a policy will file claims, False Positives (FP) are policies which

A

Predicted to have a claim but did not actually have a claim

49
Q

Formula: Sensitivity or True Positive Rate (TPR)

A

TP / (TP + FN)

50
Q

When predicting whether or not a policy will file claims, False Negatives (FN) are policies which

A

Predicted to not have a claim but actually did have a claim

51
Q

When fitting a Bagged Tree, if the distribution of the target variable is right-skewed, you should ____

A

Do nothing, because tree-based models automatically capture monotonic transformations

52
Q

In binary classification, a logit has an AUC of 0.99. What are the approximate sensitivity (TPR) and specificity (TNR) values?

A

Both are close to 1

53
Q

Disadvantages of hierarchical clustering

A

✔ Does not work well on large data sets because it can be difficult to determine the correct number of clusters from the dendrogram
✔ Computational complexity can result in very long run times (as opposed to k-means, which is faster)

54
Q

GLM output: Normal Q-Q graph

A

✔ The normal quantile-quantile graph shows the theoretical quantiles against the observed quantiles of the deviance residuals
✔ The residuals should always be normally distributed regardless of the GLM response family (except for binomial)
✔ Some deviation along the upper and lower quantiles is acceptable; this indicates that the residuals have a "fat tail"

55
Q

Describe the Tweedie Distribution

A

✔ A GLM response distribution which is a good fit for insurance claims data when there is over-dispersion, such as having many values of zero
✔ Models frequency as well as severity at the same time

56
Q

GLM offset

A

✔ A constant term that is added to the linear predictor
✔ The same as including a variable which has a coefficient equal to 1
✔ On Exam PA, offsets only appear 1. with Poisson regression, 2. with a log link function, and 3. as a measure of exposure, such as the length of the policy period
✔ Remember to apply a log to the offset when using a log link function

57
Q

In binary classification, what is the interpretation of the model metric AUC when it is close to 1.0?

A

The model predicts the target perfectly

58
Q

Advantages of hierarchical clustering

A

✔ The dendrogram helps to understand the data
✔ Is the best fit for hierarchical data (i.e., geography such as city, state, country)
✔ Shows how much clusters differ based on the height of the dendrogram branches
✔ No input parameters

59
Q

In a GLM, what does the p-value of a coefficient represent?

A

For a given coefficient estimate, the p-value is an estimate of the probability of a value of that magnitude (or higher) arising by pure chance

60
Q

How do GLMs handle interactions?

A

They need to be added manually

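A minimal R sketch of adding an interaction manually; the data frame train and the predictors age and region are hypothetical:

```
fit <- glm(y ~ age * region, data = train, family = Gamma(link = "log"))
# age * region expands to age + region + age:region (the interaction term)
```
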
61
Q

In logistic regression, what is the formula to convert the linear predictor, z, to the probability, p?

A

p = e^z / (1 + e^z)

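A quick check of the formula in R; plogis() is base R's built-in inverse logit:

```
z <- c(-2, 0, 2)           # example linear predictor values
exp(z) / (1 + exp(z))      # 0.1192, 0.5000, 0.8808
plogis(z)                  # same values
```
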
62
Q

The process of k-means

A

1. Select the number of clusters, k
2. Randomly assign cluster centers
3. Put each point into the cluster whose center is closest
4. Move each cluster center to the mean of the points assigned to it
5. Repeat steps 3-4 until the centers stop moving
6. Repeat the whole process n.starts times to reduce the randomness of choosing the initial cluster centers

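A minimal R sketch of the algorithm; the numeric data frame df is hypothetical and k = 3 is arbitrary:

```
set.seed(42)
km <- kmeans(scale(df), centers = 3, nstart = 25)  # 25 random starting configurations
km$centers        # final cluster centers
km$tot.withinss   # total within-cluster sum of squares (the quantity minimized)
```
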
63
Q

Define the curse of dimensionality

A

1. When there are more features than observations (p > n), we run the risk of overfitting the model. Using a dimensionality reduction method (PCA) or a model which performs feature selection can help
2. When there are too many features, observations become harder to cluster because every observation in the data appears equidistant from the others. If the distances are all approximately equal, then all the observations appear equally alike

64
Q

Describe Principal Component Analysis (PCA)

A

A dimensionality reduction method which converts potentially correlated variables into a smaller set of linearly independent new variables called principal components (PCs)
✔ PCs are ordered so that the earliest PCs retain as much information from the original data as possible
✔ Scaling is applied to each variable prior to fitting
✔ The size and sign of the PC loadings are useful for interpretation

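A minimal R sketch using prcomp(); the numeric data frame df is hypothetical:

```
pca <- prcomp(df, center = TRUE, scale. = TRUE)  # scale each variable before fitting
summary(pca)      # proportion of variance explained by each PC
pca$rotation      # loadings; their size and sign aid interpretation
```
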
65
Q

Advantages of boosted trees

A

✔ High accuracy
✔ Effective in a wide range of applications
✔ Handles non-linearities, interaction effects, and missing data

66
Q

Claim frequency model
```
Target: ?
Distribution: ?
Link: ?
Weight: ?
Offset: ?
```

A

```
Target: Counting variable
Distribution: Poisson
Link: Log
Weight: None
Offset: log(# of exposures)
```

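A minimal R sketch of this model; the data frame train and its columns claim_count, exposure, age, and region are hypothetical:

```
freq <- glm(claim_count ~ age + region,
            family = poisson(link = "log"),
            offset = log(exposure),  # log the exposure offset under a log link
            data = train)
```
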
67
Q

GLM output: Residuals vs. fitted

A

Good fit:
✔ All points are centered near zero on the y-axis and spread out symmetrically along the x-axis
✔ This indicates that the variance is constant
✔ The mean of the residuals is near zero

68
Q

Disadvantages of single decision trees

A

✔ Lacks predictive power
✔ Can easily overfit to the data
✔ Often oversimplifies the underlying process because all observations at a terminal node receive the same predicted value

69
Q

Tweedie distribution power variance parameter

A

```
The power variance parameter, p, specifies the distribution:
p = 0: Gaussian
p = 1: Poisson
p = 2: Gamma
p = 3: Inverse Gaussian
```

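A minimal R sketch, assuming the statmod package for its tweedie() family; the data frame train and its columns are hypothetical:

```
library(statmod)
fit <- glm(loss ~ age + region,
           family = tweedie(var.power = 1.5, link.power = 0),  # 1 < p < 2; log link
           data = train)
```
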
70
Q

Disadvantages of bagged trees

A

✔ High complexity
✔ Difficult to interpret
✔ Requires a lot of computation power

71
Q

Advantages of GLMs

A

✔ Easy to interpret
✔ Can easily be deployed to spreadsheet format
✔ Handles different response distributions
✔ Commonly used in insurance ratemaking

72
Q

Bagging vs Boosting: Predictions are made? Easy to overfit? Improves predictive power by?

A

Bagging:
✔ Predictions are made in parallel
✔ Not easy to overfit
✔ Improves predictive power by reducing variance

Boosting:
✔ Predictions are made sequentially
✔ Easy to overfit
✔ Improves predictive power by reducing variance and bias

73
Q

Disadvantages of boosted trees

A

✔ High complexity
✔ Hard to interpret
✔ Easy to overfit if not tuned correctly
✔ Requires a lot of computation power

74
Q

Advantages of bagged trees

A

✔ High accuracy
✔ Resilient to overfitting due to bagging
✔ Only two parameters to tune (mtry, ntrees)
✔ Handles non-linearities, interaction effects, and missing data

75
Q

Disadvantages of GLMs

A

✔ Does not select features without additional techniques
✔ Strict assumptions around the distribution shape and the randomness of the error terms
✔ Predictors need to be uncorrelated
✔ Unable to detect non-linearity (without manual adjustments)
✔ Sensitive to outliers
✔ Low predictive power

76
Q

Formula: Specificity or True Negative Rate (TNR)

A

TN / (TN + FP)

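A minimal R sketch computing both rates from scratch; the actual and pred vectors are hypothetical 0/1 classes:

```
actual <- c(1, 1, 0, 0, 1, 0)
pred   <- c(1, 0, 0, 1, 1, 0)
TP <- sum(pred == 1 & actual == 1); FN <- sum(pred == 0 & actual == 1)
TN <- sum(pred == 0 & actual == 0); FP <- sum(pred == 1 & actual == 0)
TP / (TP + FN)   # sensitivity (TPR)
TN / (TN + FP)   # specificity (TNR)
```
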
77
Q

How to interpret the coefficients of a probit model

A

✔ Positive coefficients for an input variable increase the linear predictor, which is a z-score
✔ Negative coefficients decrease it
✔ Coefficients further from zero have larger effects

78
Q

Define data leakage

A

When the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict

79
Q

Both AIC and BIC penalize the log-likelihood based on ____

A

the number of parameters

80
Q

Why is binarization of dummy (indicator) variables performed?

A

For stepwise selection in GLMs, in order to remove individual factor levels rather than keep or remove the whole variable

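A minimal R sketch using model.matrix(); the factor region in the data frame train is hypothetical:

```
dummies <- model.matrix(~ region - 1, data = train)  # one 0/1 column per level
train2  <- cbind(train, dummies)
# stepwise selection can now keep or drop each level's column individually
```
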
81
Q

Define multicollinearity in GLMs

A

✔ The correlation between any two predictors is large
✔ Any predictor is a linear combination of the others

Solutions:
✔ Remove all but one of the correlated predictors
✔ Preprocess the data using PCA
✔ Use a tree-based model

82
Q

Accuracy

A

✔ The percentage of observations which are classified correctly
✔ Fails when we have imbalanced classes; in those cases AUC is more appropriate

83
Q

Decision Tree Complexity Parameter (CP)

A

?rpart
The CP value represents the "minimum benefit" that a split must add to the tree
```
cp = 0: no restrictions -> results in a tall tree -> high complexity -> high variance
cp = 0.01 (default): each split must improve the relative fit by at least 0.01
```

84
Q

BIC favors models with fewer parameters than AIC does when ____

A

log(nrow(train_data)) > 2, i.e., when the training set has more than e^2 ≈ 7.4 observations

85
Q

Correlation Key Points

A

✔ Measures the linear association between two variables
✔ Positive correlation is when increasing one variable tends to increase the other; negative correlation is when increasing one tends to decrease the other
✔ Correlation does not equal causation

86
Q

Decision tree cost-complexity pruning

A

✔ Choose a tree that strikes a balance between having a low error and having few splits so that it can be interpreted
✔ Adjusts for overfitting (tree too complex) or underfitting (tree too simple)

Steps:
1. A decision tree with many leaves is created
2. The complexity is calculated for all subtrees using cross-validation
3. The least important branches are pruned

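A minimal R sketch of these steps with rpart; the data frame train and target y are hypothetical:

```
library(rpart)
fit <- rpart(y ~ ., data = train, control = rpart.control(cp = 0))  # grow a tall tree
printcp(fit)  # cross-validated error (xerror) for each cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)  # prune the least important branches
```
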
87
Q

Area Under the Curve (AUC) probability interpretation

A

The probability that a randomly chosen observation from the positive class is ranked higher than a randomly chosen observation from the negative class

88
Q

Alpha (elastic net)

A

The elastic net mixing parameter: alpha = 0 gives ridge regression, alpha = 1 gives the LASSO, and values in between mix the two penalties

89
Q

GLM: Claim Frequency Model
```
Target Variable: Average number of claims per policy period
Response Family: ?
Link Function: ?
Offset: ?
Weight: ?
```

A

```
Response Family: Poisson
Link Function: Log
Offset: None
Weight: Policy period (or other units of exposure)
```
Results in the same predictions as the claim count model