Chapter 4 Tour of Model Evaluation Metrics Flashcards
(30 cards)
Choosing model evaluation metrics is challenging. This challenge is made even more difficult when there is a skew in the class distribution. Why? P 54
The reason for this is that many of the standard metrics become unreliable or even misleading when classes are imbalanced, or severely imbalanced, such as a 1:100 or 1:1000 ratio between the minority and majority classes.
We can divide evaluation metrics into three useful groups. What are they? P 55
- Threshold Metrics (e.g., accuracy and F-measure)
- Ranking Metrics (e.g., receiver operating characteristics (ROC) analysis and AUC)
- Probability Metrics (e.g., mean-squared error)
When are threshold metrics used? P 55
Threshold metrics are those that quantify the classification prediction errors. That is, they are designed to summarize the fraction, ratio, or rate of when a predicted class does not match the expected class in a holdout dataset. These measures are used when we want a model to minimize the number of errors.
Perhaps the most widely used threshold metric is … and the complement of classification accuracy called …. P 56
Classification accuracy, classification error
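As a minimal sketch (assuming scikit-learn and a toy set of labels), accuracy and the classification error can be computed like this:

    # Classification accuracy and its complement, the classification error.
    from sklearn.metrics import accuracy_score

    y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

    accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
    error = 1.0 - accuracy                     # fraction of incorrect predictions
    print(accuracy, error)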
There are two groups of metrics that may be useful for imbalanced classification because they focus on one class; they are … and … P 56
Sensitivity-Specificity (TPR and TNR), and Precision-Recall
Sensitivity and Specificity can be combined into a single score that balances both concerns, called the G-mean. How is G-mean calculated? P 57
G-mean = √(Sensitivity × Specificity)
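A minimal sketch of how the G-mean could be computed from a confusion matrix, assuming scikit-learn and a toy binary problem (labels 0 = majority class, 1 = minority class):

    from math import sqrt
    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)  # true positive rate (recall of the minority class)
    specificity = tn / (tn + fp)  # true negative rate
    print(sqrt(sensitivity * specificity))  # G-mean

The imbalanced-learn library also provides geometric_mean_score for the same calculation.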
Precision and recall can be combined into a single score that seeks to balance both concerns, called the F-score or the F-measure. How is it calculated? P 57
F-measure = 2 × Precision × Recall / (Precision + Recall)
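A minimal sketch, assuming scikit-learn, showing the harmonic-mean calculation and that it matches the built-in f1_score:

    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f_measure = 2 * precision * recall / (precision + recall)
    print(f_measure, f1_score(y_true, y_pred))  # the two values match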
The F-measure is a popular metric for imbalanced classification. True/False P 57
True
What is F Beta -measure? How is it calculated? P 57
The Fbeta-measure (or Fβ-measure) is a generalization of the F-measure in which the balance of precision and recall in the harmonic mean is controlled by a coefficient called beta.
Fbeta-measure = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
The value chosen for β is used in the name of the Fbeta-measure. For example, a β value of 2 is referred to as the F2-measure or F2-score, and a β value of 1 is referred to as the F1-measure or F1-score. Three common values for the beta parameter are as follows (see the code sketch after this list):
F0.5-measure (β = 0.5): More weight on precision, less weight on recall.
F1-measure (β = 1): Balance the weight on precision and recall.
F2-measure (β = 2): Less weight on precision, more weight on recall.
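A minimal sketch of the three common cases, assuming scikit-learn's fbeta_score and the same toy predictions as above:

    from sklearn.metrics import fbeta_score

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

    print(fbeta_score(y_true, y_pred, beta=0.5))  # F0.5: favors precision
    print(fbeta_score(y_true, y_pred, beta=1.0))  # F1: balanced, equals f1_score
    print(fbeta_score(y_true, y_pred, beta=2.0))  # F2: favors recall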
What is one limitation of the threshold metrics (besides not specifically focusing on the minority class)? P 57
One limitation of threshold metrics is that they assume that the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions. This is often the case, but when it is not, performance can be quite misleading. In particular, they assume that the class imbalance present in the training set is the one that will be encountered throughout the operating life of the classifier.
Ranking metrics don’t make any assumptions about class distributions. True/False? P 57
True
Unlike threshold metrics, ranking metrics do not assume that the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions.
Rank metrics are more concerned with evaluating classifiers based on how effective they are at … P 58
Separating classes
How do rank metrics evaluate a model’s performance? P 58
These metrics require that a classifier predicts a score or a probability of class membership. From this score, different thresholds can be applied to test the effectiveness of classifiers. Those models that maintain a good score across a range of thresholds will have good class separation and will be ranked higher.
What is the most commonly used ranking metric? P 58
The ROC Curve or ROC Analysis
How’s a no skill model shown as ROC-AUC curve ?
A classifier that has no skill (a no skill model is one that can’t discriminate between the classes and would predict a random class or a constant class in all cases) will be represented by a
diagonal line from the bottom left to the top right.
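A minimal sketch, assuming scikit-learn and made-up predicted probabilities, of computing the ROC curve and ROC AUC; a no-skill model that predicts a constant probability scores an AUC of 0.5 (the diagonal line):

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_prob = [0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.8, 0.9]  # predicted P(class = 1)

    fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # points for plotting the curve
    print(roc_auc_score(y_true, y_prob))               # skilled model: close to 1.0

    no_skill = [0.5] * len(y_true)                     # constant probability
    print(roc_auc_score(y_true, no_skill))             # no-skill model: 0.5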
Although generally effective, the ROC Curve and ROC AUC can be optimistic under a severe class imbalance, especially when the number of examples in the minority class is small. True/False? P 59
True
An alternative to the ROC Curve is the …curve that can be used in a similar way, although focuses on the performance of the classifier on the minority class. P 59
Precision-recall
Again, different thresholds are used on a set of predictions by a model, and in this case, the precision and recall are calculated.
How’s a no skill curve on precision-recall metric? P 59
A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5.
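A minimal sketch, assuming scikit-learn and made-up predicted probabilities for an imbalanced dataset, of computing the precision-recall curve, its area, and the no-skill baseline:

    from sklearn.metrics import precision_recall_curve, auc

    y_true = [0, 0, 0, 0, 0, 0, 1, 1]                  # 25% positive examples
    y_prob = [0.1, 0.3, 0.2, 0.6, 0.4, 0.2, 0.7, 0.9]  # predicted P(class = 1)

    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    print(auc(recall, precision))                      # area under the PR curve

    no_skill = sum(y_true) / len(y_true)               # horizontal baseline = 0.25
    print(no_skill)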
What are probabilistic metrics designed for specifically? P 60
Probabilistic metrics are designed specifically to quantify the uncertainty in a classifier’s predictions.
When are probabilistic metrics used? P 60
These are useful for problems where we are less interested in incorrect vs. correct class predictions and more interested in the uncertainty the model has in predictions and penalizing those predictions that are wrong but highly confident.
These measures are especially useful when we want an assessment of the reliability of the classifiers, not only measuring when they fail but whether they have selected the wrong class with a high or low probability.
Perhaps the most common metric for evaluating predicted probabilities is … for binary classification (or the …), known more generally as …. Another popular score for predicted probabilities is the …. P 61
log loss, negative log likelihood, cross-entropy, Brier score
What’s the benefit of the Brier score compared to log loss when dealing with imbalanced datases? P 61
Brier score is focused on the positive class, which for imbalanced classification is the minority class. This makes it more preferable than log loss, which is focused on the entire probability distribution.
What is log loss and Brier score formula? P 61
LogLoss = −((1 − y) × log(1 − yhat) + y × log(yhat))
BrierScore = (1/N) × Σ(yhat_i − y_i)²
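A minimal sketch, assuming scikit-learn and made-up predicted probabilities, of computing both metrics:

    from sklearn.metrics import log_loss, brier_score_loss

    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_prob = [0.1, 0.3, 0.2, 0.6, 0.7, 0.4, 0.8, 0.9]  # predicted P(class = 1)

    print(log_loss(y_true, y_prob))          # heavily penalizes confident wrong predictions
    print(brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities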
Can Brier score be used for multi-class problems? P 61
Although typically described in terms of binary classification tasks, the Brier score can also be calculated for multiclass classification problems.
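As a minimal sketch (an assumption about one common multiclass generalization, not necessarily the book's exact formulation), the multiclass Brier score can be computed as the mean squared difference between the predicted probability vector and the one-hot encoded true class:

    import numpy as np

    y_true = np.array([0, 1, 2, 1])              # three classes
    y_prob = np.array([[0.8, 0.1, 0.1],          # predicted class probabilities
                       [0.2, 0.6, 0.2],
                       [0.1, 0.2, 0.7],
                       [0.3, 0.4, 0.3]])

    onehot = np.eye(y_prob.shape[1])[y_true]     # one-hot encode the true labels
    print(np.mean(np.sum((y_prob - onehot) ** 2, axis=1)))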