Anki Flashcards
How to check to see if the model is underfitting
Compare the model performance against a simple model such as the average target value or a GLM with only a few predictors
✔ Underfitting if performance is the same or worse
✔ It is not sufficient to just look at the training and testing error
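A minimal sketch of the baseline comparison, assuming RMSE as the metric. The helper names `rmse` and `is_underfitting` and all the numbers are hypothetical, chosen only for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def is_underfitting(y_train, y_test, model_preds):
    """Flag underfitting when the model is no better than predicting the training mean."""
    baseline = sum(y_train) / len(y_train)  # simple model: average target value
    baseline_rmse = rmse(y_test, [baseline] * len(y_test))
    model_rmse = rmse(y_test, model_preds)
    return model_rmse >= baseline_rmse

# Hypothetical data for illustration only
y_train = [10.0, 12.0, 14.0, 16.0]
y_test = [11.0, 15.0]
good_preds = [11.5, 14.5]  # tracks the test targets closely
flat_preds = [20.0, 20.0]  # worse than the mean baseline
```

Here `is_underfitting(y_train, y_test, flat_preds)` flags the model, while `good_preds` passes the check.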
BIC
Bayesian Information Criterion
Used to compare GLMs
Lower is better
Minimize error and maximize likelihood
p*log(nrow(train)) - 2log(likelihood)
p = # parameters
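As a rough illustration, BIC in its standard form (number of parameters times log of the number of training rows, minus twice the log-likelihood) can be computed as below. The function name `bic` and the log-likelihood values are made up for the example.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: p * log(n) - 2 * log-likelihood (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Two hypothetical GLMs fitted to the same 1,000 training rows
small_model = bic(log_likelihood=-520.0, n_params=3, n_obs=1000)
large_model = bic(log_likelihood=-515.0, n_params=10, n_obs=1000)
```

The larger model has a slightly higher likelihood, but its extra parameters cost more than the fit gains, so the smaller model has the lower BIC.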
__ is when there is no pattern between the missingness and the value of the variable
Missing completely at random (MCAR)
When predicting whether or not a policy will file claims, True Negatives (TN) are policies which
Predicted to not have a claim and actually do not have a claim
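A small sketch of how the four confusion-matrix cells could be counted, assuming labels are coded 1 = claim and 0 = no claim. The function name and the data are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels where 1 = claim and 0 = no claim."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1  # predicted claim, had a claim
        elif p == 0 and a == 0:
            counts["TN"] += 1  # predicted no claim, had no claim
        elif p == 1 and a == 0:
            counts["FP"] += 1  # predicted claim, had no claim
        else:
            counts["FN"] += 1  # predicted no claim, had a claim
    return counts

# Hypothetical policies for illustration only
actual = [1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 0]
counts = confusion_counts(actual, predicted)
```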
Describe the bias-variance-tradeoff
- the tradeoff between bias (underfitting) and variance (overfitting)
- increasing the bias will often decrease the variance
- increasing the variance will often decrease the bias
Mean Squared Error = variance + bias^2 + irreducible error
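The decomposition can be checked numerically. The sketch below assumes a deliberately biased estimator (a shrunken sample mean) predicting a new noisy observation. The setup, parameter values, and function name are illustrative assumptions, not part of the original card.

```python
import random

def simulate_mse(mu, sigma, n, shrink, trials, seed=0):
    """Average squared error of a shrunken sample mean when predicting a new observation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimate = shrink * sum(sample) / n  # shrinking toward zero adds bias, cuts variance
        new_y = rng.gauss(mu, sigma)         # fresh observation with irreducible noise
        total += (new_y - estimate) ** 2
    return total / trials

mu, sigma, n, shrink = 5.0, 2.0, 10, 0.8
variance = shrink ** 2 * sigma ** 2 / n    # variance of the estimator
bias_sq = ((shrink - 1.0) * mu) ** 2       # squared bias of the estimator
irreducible = sigma ** 2                   # noise in the new observation
theoretical_mse = variance + bias_sq + irreducible
simulated_mse = simulate_mse(mu, sigma, n, shrink, trials=20000)
```

The simulated value should land close to `theoretical_mse`, up to sampling noise.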
When the distribution of a predictor variable is right-skewed you should ___
apply a log transform
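A quick sketch of the effect, using a hypothetical right-skewed sample and a hand-rolled moment-based skewness. Note the log transform assumes strictly positive values.

```python
import math

def skewness(values):
    """Moment-based sample skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed, strictly positive predictor
claims = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(v) for v in claims]

skew_before = skewness(claims)
skew_after = skewness(logged)
```

On this sample the skewness drops after the transform, though it does not vanish.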
True/False: The goal of feature selection is to choose the features which are most predictive and discard those which are not
False. Features may be predictive but excluded because of
✔ Racial or ethical concerns
✔ Limitations of future availability
✔ Stability of the data over time
✔ Inexplicability
Decision trees identify the optimal variable-split-point combination by measuring ____ or ____
Entropy or Gini
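For reference, a minimal sketch of the two impurity measures and how a candidate split might be scored by the size-weighted impurity of its two child nodes. The function names are hypothetical, and real tree libraries differ in details.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)
```

A split that separates the classes perfectly, such as `weighted_impurity([0, 0], [1, 1], gini)`, scores 0. The tree picks the variable and split point with the lowest score.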
When predicting whether or not a policy will file claims, True Positives (TP) are policies which
Predicted to have a claim and actually do have a claim
The set of simplifying assumptions made by the model is called
bias
GLM response distributions that are strictly positive
Poisson (discrete; non-negative, since zero counts are allowed)
gamma (continuous)
inverse gaussian (continuous)
For the regression metric RMSE, is higher or lower better?
Lower
“Minimize error and maximize likelihood”
RMSE = root mean squared error
How to check to see if the model is overfitting
Training error is much lower than the test error
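To make the gap concrete, here is a sketch using a 1-nearest-neighbor "memorizer" as a stand-in for an overfit model. The model choice and the data are assumptions for illustration.

```python
def one_nn_predict(train_x, train_y, x):
    """Predict with the target of the closest training point (memorizes the training set)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical noisy training data and held-out test data
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.5, 3.2, 6.9, 7.1, 10.8]
test_x = [1.5, 2.5, 3.5, 4.5]
test_y = [3.0, 5.0, 7.0, 9.0]

train_mse = mse(train_y, [one_nn_predict(train_x, train_y, x) for x in train_x])
test_mse = mse(test_y, [one_nn_predict(train_x, train_y, x) for x in test_x])
```

The training error is exactly zero because every training point is its own nearest neighbor, while the test error is clearly positive. That gap is the warning sign.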
One disadvantage of ___ models is that the predictor variables need to be uncorrelated.
GLM
Penalized regression model(s) where variables are removed by having their coefficients set to zero
LASSO and Elastic Net
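One way to see why the L1 penalty removes variables: in the special case of orthonormal predictors, the LASSO estimate is the least-squares coefficient soft-thresholded by the penalty, while a ridge penalty only rescales it (the exact constants depend on how the penalty is scaled). The sketch below assumes that special case, and the coefficient values are made up.

```python
def soft_threshold(beta, lam):
    """Shrink beta toward zero by lam; anything within [-lam, lam] becomes exactly zero."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical least-squares coefficients under an orthonormal design
ols_coefs = [3.0, -0.4, 0.2, -2.5]
lam = 0.5

lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]  # small coefficients drop to 0
ridge_coefs = [b / (1.0 + lam) for b in ols_coefs]         # shrunk, but never exactly 0
```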
When fitting a GLM, if the distribution of the target variable is right-skewed, you should ____
use a log link function
What is the objective of the k-means algorithm?
To partition the observations into k groups such that the sum of squared distances from the points to their assigned cluster centers is minimized
The variable “body mass index” contains missing values because the laptop that they were stored on had coffee spilled on it. This is an example of
Missing completely at random (MCAR).
✔ There is no pattern between whether the value is missing and the target value.
✔ Observations can safely be omitted from the data with no loss in predictive power besides the smaller sample size.
✔ If > 20% of records are missing, consider removing the variable altogether.
When running k-means clustering, it is best to use multiple starting configurations (e.g., `nstart` between 10 and 50 in R's `kmeans`) and keep the run with the lowest total within-cluster sum of squares because this reduces the likelihood of ____
Getting stuck at a local minimum as opposed to the global minimum of the sum of squared errors between the cluster centers and each of the points
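A compact sketch of k-means (Lloyd's algorithm) with several random starts, keeping the run with the lowest within-cluster sum of squares. The function names, the `n_starts` parameter, and the toy points are hypothetical, and this is not how any particular library implements it.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, iters=100):
    """One run of Lloyd's algorithm from a random start; returns (wcss, centers)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [  # update step: mean of each cluster (keep old center if empty)
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return wcss, centers

def kmeans(points, k, n_starts=20, seed=0):
    """Run several random starts and keep the one with the lowest objective."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[0])

# Toy data: two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_wcss, best_centers = kmeans(points, k=2)
```

Note that the best run is selected, not averaged: averaging centers across runs would mix solutions whose cluster labels need not correspond.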
When a hierarchical clustering algorithm uses single linkage, the distances between two clusters are computed by
Computing the distances between all points between clusters A and B and using the smallest
Define an interaction effect
When the impact of a predictor variable on the target variable differs based on the value of another predictor variable
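A toy linear predictor with an interaction term illustrates the definition. The variable names and coefficients below are invented for the example.

```python
def predict(age, smoker, b0=100.0, b_age=2.0, b_smoker=50.0, b_interaction=3.0):
    """Linear predictor with an age-by-smoker interaction term (hypothetical coefficients)."""
    return b0 + b_age * age + b_smoker * smoker + b_interaction * age * smoker

# Effect of one extra year of age, for non-smokers versus smokers
effect_nonsmoker = predict(41, 0) - predict(40, 0)
effect_smoker = predict(41, 1) - predict(40, 1)
```

The age effect is `b_age` for non-smokers but `b_age + b_interaction` for smokers, so the impact of age depends on the value of the smoker variable.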
One disadvantage of ____ models is that they do not automatically capture non-linear relationships between the predictor variables and the target variable.
GLMs
When a hierarchical clustering algorithm uses complete linkage (the default), the distances between two clusters are computed by
Computing the distances between all points between clusters A and B and using the largest.
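For comparison, a short sketch computing both single and complete linkage distances between two small clusters, assuming Euclidean distance. The helper names and points are hypothetical.

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distances between every point in A and every point in B."""
    return [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    """Cluster distance = smallest pairwise distance."""
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    """Cluster distance = largest pairwise distance."""
    return max(pairwise_distances(cluster_a, cluster_b))

cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (6, 0)]
single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
```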
AIC
Akaike Information Criterion
Used to compare GLMs
Lower is better
2p - 2log(likelihood)
p = # parameters
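A minimal sketch of the AIC formula above, with made-up log-likelihoods for two hypothetical GLMs. The function name `aic` is an assumption for the example.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2p - 2 * log-likelihood (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

# Two hypothetical GLMs fitted to the same data
small_model = aic(log_likelihood=-520.0, n_params=3)
large_model = aic(log_likelihood=-515.0, n_params=10)

# AIC charges 2 per parameter; a log(n) penalty is larger once n exceeds e^2 (about 7.4)
aic_penalty_per_param = 2.0
log_n_penalty_per_param = math.log(1000)
```

Here the smaller model is preferred because the seven extra parameters cost more than the improvement in likelihood.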