Week 6 - Machine Learning I Flashcards

Question

How do you split the dataset into train and test sets?

Answer 1

If we have a large dataset (e.g., 500,000 instances), we can hold out a smaller proportion for the test set, but with smaller datasets we may need to hold out a larger proportion for the test set
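
As a rough illustration of the split described here, the sketch below shuffles a dataset and holds out a fraction for testing. The function name `holdout_split`, the 30% and 1% fractions, and the dataset sizes are assumptions chosen for the example, not values from the course.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical sizes: a small dataset with a larger test proportion,
# and a large dataset where a small proportion still gives many test rows.
small_train, small_test = holdout_split(range(1_000), test_fraction=0.3)
large_train, large_test = holdout_split(range(500_000), test_fraction=0.01)
print(len(small_train), len(small_test))
print(len(large_train), len(large_test))
```

With 500,000 rows, even a 1% test set contains 5,000 instances, which is why a smaller proportion can be enough for large datasets.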

Answer 2

Testing the model to determine how well it generalises to new cases using test data

Answer 3

1. Model performance can depend heavily on the particular training data.
2. We would like to be more careful about how the model is evaluated.
3. Cross-validation techniques can be used to validate the generalisation ability of machine learning models.
4. Cross-validation can be used on binary and multi-class classification tasks.

Answer 4

K-fold cross-validation is one of the most widely used cross-validation techniques for classification models. It splits the data randomly into k folds and, at each iteration (from 1 to k), uses k-1 folds for training and the remaining fold for testing. Normally this is done on the training data, and a final evaluation is then made on an unseen test set.
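
The fold rotation can be sketched in a few lines. This is a minimal illustration that only produces index lists; the name `k_fold_indices`, the seed, and the choice of 10 samples with k = 5 are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test fold exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        # Training indices are everything in the other k-1 folds.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```

In practice a model would be trained and scored once per pair, and the k scores averaged.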

Answer 5

You can use the following performance metrics:
1. Precision/recall trade-off
2. F1 score (F-score/F-measure)
3. Recall metric
4. Precision metric
5. Accuracy metric
6. Confusion matrix
7. ROC (receiver operating characteristic) curve
8. AUC (area under the curve)

Answer 6

The metrics used for measuring and evaluating the performance of machine learning models

Answer 7

The confusion matrix is a table that helps to visualise the performance of classifiers. More concise metrics can be computed from the confusion matrix (accuracy, precision, recall and F1 score). Look at the notebook for the matrix.
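
To make the table concrete, here is a small sketch that counts the four cells of a binary confusion matrix. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
print(f"{'':<12}{'pred pos':>10}{'pred neg':>10}")
print(f"{'actual pos':<12}{tp:>10}{fn:>10}")
print(f"{'actual neg':<12}{fp:>10}{tn:>10}")
```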

Answer 8

Using the confusion matrix, you can calculate the accuracy of a classifier as: accuracy = (TP + TN) / (TP + FP + FN + TN). This is a widely used metric, but it can be misleading on an imbalanced dataset (a common situation in real-world problems). Note: avoid using the accuracy metric alone for classification problems.
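
A quick worked example shows why accuracy alone can mislead on imbalanced data. The 990/10 class split and the always-negative "classifier" are assumptions made up for the illustration.

```python
# Hypothetical imbalanced dataset: 990 negatives and 10 positives.
y_true = [0] * 990 + [1] * 10
# A trivial "classifier" that always predicts the negative class.
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.99
print(tp, fn)    # 0 10
```

The accuracy is 99% even though not a single positive instance is detected, which is why it should be read alongside other metrics.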

Answer 9

The ratio of correctly classified positive observations to the total observations predicted to be positive. It emphasises the accuracy of positive predictions. precision = TP / (TP + FP)

Answer 10

The recall metric is the ratio of correctly classified positive observations to the total observations that are in fact positive. This is also known as sensitivity. recall = TP / (TP + FN)

Answer 11

The F1 score incorporates both recall and precision. It is the harmonic mean of precision and recall, which (1) gives more weight to lower values than the arithmetic mean does, so (2) F1 is high if both the recall and precision are high. F1 score = 2 * (precision * recall) / (precision + recall)
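
The effect of the harmonic mean is easiest to see with numbers. The sketch below uses made-up counts (TP = 10, FP = 0, FN = 90) and a hypothetical helper `precision_recall_f1`. Returning 0.0 when a denominator is zero is an assumed convention, not something the cards specify.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Made-up counts: every positive prediction is right, but most positives are missed.
precision, recall, f1 = precision_recall_f1(tp=10, fp=0, fn=90)
arithmetic_mean = (precision + recall) / 2

print(precision, recall)
print(round(f1, 3), round(arithmetic_mean, 3))
```

Here precision is 1.0 and recall is 0.1. The arithmetic mean is 0.55, while F1 is about 0.18, so the low recall dominates the score.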

Answer 12

Increasing precision generally reduces recall, and vice versa; this is known as the precision/recall trade-off. Hence, we can denote the relationship as: Classifier A: low recall, higher precision. Classifier B: high recall, lower precision.
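
One way to see the trade-off is to apply two different decision thresholds to the same scores. The scores, labels, and the thresholds 0.75 and 0.35 below are all made up for illustration, and `precision_recall_at` is a hypothetical helper.

```python
# Made-up classifier scores (higher means "more likely positive") and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute precision and recall."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

classifier_a = precision_recall_at(0.75)  # strict threshold
classifier_b = precision_recall_at(0.35)  # lenient threshold
print("A:", classifier_a)
print("B:", classifier_b)
```

The strict threshold behaves like Classifier A (precision 1.0, recall 0.6), and the lenient one like Classifier B (precision about 0.71, recall 1.0).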

Answer 13

This is used to show the performance of a binary classifier. It displays the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR) for various thresholds.

Answer 14

TPR (sensitivity, recall): the proportion of positive instances that were classified correctly. sensitivity = TP / (TP + FN)
FPR (1 - specificity): specificity is the proportion of negative instances that were classified correctly. specificity = TN / (TN + FP), so FPR = FP / (FP + TN)
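
These two rates can be computed at every threshold to obtain the points of a ROC curve. The sketch below is a minimal illustration with made-up labels and scores. `roc_points` is a hypothetical helper and assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the most lenient."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos   # assumes n_pos > 0 and n_neg > 0
    points = [(0.0, 0.0)]         # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        tpr = tp / n_pos          # TP / (TP + FN)
        fpr = fp / n_neg          # FP / (FP + TN) = 1 - specificity
        points.append((fpr, tpr))
    return points

# Made-up data: 3 positives and 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting TPR against FPR for these points gives the ROC curve, which runs from (0, 0) to (1, 1).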

Answer 15

A classifier can output a class probability (a score between 0 and 1) for each instance. A threshold is applied to these probabilities to assign instances to classes, and the probabilities provide the scores needed for the ROC curve. A visual representation is a good way to show performance across all possible threshold values.

Answer 16

Measure the area under the curve (AUC), which is a value between 0 and 1. A perfect classifier will have an AUC of 1. A purely random classifier will have an AUC of 0.5.
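
The two reference values can be checked with a small sketch that integrates a ROC curve using the trapezoidal rule. The function name `auc_trapezoid` and the two idealised curves are assumptions for illustration. The points are (FPR, TPR) pairs sorted by increasing FPR.

```python
def auc_trapezoid(points):
    """Area under a curve given (x, y) points sorted by increasing x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # area of one trapezoid
    return area

# Idealised ROC curves as (FPR, TPR) points.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # reaches TPR = 1 with FPR = 0
random_guess = [(0.0, 0.0), (1.0, 1.0)]          # the diagonal line

print(auc_trapezoid(perfect))       # 1.0
print(auc_trapezoid(random_guess))  # 0.5
```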