Evaluation Flashcards

1
Q

Training dataset

A

= attributes + labels

2
Q

Input

A

set of annotated training instances

3
Q

Output

A

an accurate estimate of the target function underlying the training instances, which can then be applied to unseen/test instances

4
Q

Inductive Learning Hypothesis

A

any hypothesis found to approximate/generalize the target function well over a sufficiently large training data set will also approximate the target function well over held-out/unseen test examples

5
Q

Three points of interest in evaluating a classifier

A
  • Overfitting
  • Consistency
  • Generalization
6
Q

Overfitting

A

the model fits the training data set too well and does a poor job of generalizing the concept

7
Q

Consistency

A

how well the model/classifier performs on the training data; does it predict all the training labels correctly?

8
Q

Generalization

A

the opposite of overfitting; how well the classifier generalizes from the training instances to predict the target function

9
Q

Classification evaluation aims to?

A

find evidence of consistency and non-overfitting, and evidence that supports the inductive learning hypothesis (generalization)

10
Q

Generalization

A

the proportion of test instances for which the class label is correctly predicted

11
Q

A good model should?

A

fit the training data well and generalise well to unseen data

An overfitted model can generalise more poorly than a model with higher training error

12
Q

Learning Curves

A

Plot of test accuracy against the percentage of training data used

Represents the performance of a fixed learning strategy over different sizes of training data, for a fixed evaluation metric; it can also show how much data needs to be used in order to achieve a certain degree of accuracy

Allows us to visualise the data trade-off
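A minimal sketch of how a learning curve could be computed with scikit-learn (the choice of GaussianNB and the iris dataset is an illustrative assumption, not part of the card):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import learning_curve
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # train on 10% up to 100% of the available training data, with 5-fold CV at each size
  sizes, train_scores, test_scores = learning_curve(
      GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5),
      cv=5, shuffle=True, random_state=0)
  for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
      print(f"{n:3d} training instances: train acc {tr:.2f}, test acc {te:.2f}")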

13
Q

Learning Curves Trend Summary

A
  • Too few training instances = poor performance on both training and test data
  • The peak of the training accuracy = overfitting, because the model fits the training data too well but performs poorly on the test data
  • Training accuracy starts to drop as the model generalizes the concept better, and the test accuracy starts to rise (the better the model generalizes, the narrower the gap between training and test accuracy)
14
Q

Relationship between the size of training data and accuracy

A

Generally, the more training instances are used, the better the accuracy on the test data, because there are more examples from which to learn/generalize the concept and thus predict better

15
Q

Apparent error rate

A

the error rate obtained by evaluating the model on the training data set

16
Q

True error rate

A

the error rate the model would obtain on unseen/real instances; in practice it is estimated by evaluating on a test data set

With unlimited samples used as training instances, the apparent error rate will eventually approach the true error rate

The true error rate is almost always much higher than the training (apparent) error rate because of overfitting

17
Q

difference between true error rate & error rate of the test dataset?

A

The error rate on a test dataset is only an estimate of the true error rate, and it carries a risk of overfitting the evaluation:
• the model may be tuned too closely to the development data, or
• it may only have good accuracy on that one test dataset and not on others

18
Q

Why we want to know the true error?

A

because it tells us how well the model generalizes

19
Q

Possible evidence of overfitting?

A

• Large gap between training and test accuracy in the learning curve
• Complex decision boundary (which has been distorted by noisy data)
• Lack of coverage of the population in the sample data, due to either
o a small number of samples, or
o non-randomness in the sample dataset (sampling bias)

20
Q

Bias and Variance

A

model bias
evaluation bias
sampling bias

model variance
evaluation variance

It’s rather hard to tell evaluation and model bias & variance apart

These are informal definitions and they cannot be measured quantitatively.

A biased classifier is guaranteed to be making errors; an unbiased classifier might or might not be making errors

Although high bias and high variance are often bad, that does not mean low bias and low variance are automatically good; they are generally desirable, all else being equal.

Bias is generally binary (black-or-white) whilst variance is generally relative (to other classifiers)

21
Q

model bias

A

wrong predictions due to the propensity of the classifier; relates to accuracy

• In terms of regression problems:
o bias = the average of the errors
o a model is biased if the predictions are systematically higher or lower than the true values
o a model is not biased if the predictions are correct, OR some predictions are higher and some are lower than the true values

• In terms of classification problems:
o a model is biased if the class distribution of the predictions is not the same as that of the test dataset
o a model is not biased if the class distribution of the predictions is the same as that of the test dataset
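A tiny regression sketch of the "bias = average of the errors" idea (the numbers are invented):

  import numpy as np

  y_true = np.array([3.0, 5.0, 7.0, 9.0])
  y_pred = np.array([3.5, 5.5, 7.5, 9.5])   # systematically too high
  bias = np.mean(y_pred - y_true)           # average of the errors
  print("bias:", bias)                      # 0.5 -> the model is biased upward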
22
Q

evaluation bias

A

over- or under-estimating the effectiveness of the classifier due to the propensity of the evaluation strategy

o the estimate of the effectiveness of a model is systematically too low or too high

23
Q

sampling bias

A

the training dataset does not fully represent the population ( -> inductive learning hypothesis broken)

24
Q

model variance

A

• In terms of regression problems:
o variance = the average of the squared predictions minus the square of the average prediction; it measures how much the model's predictions fluctuate around their mean when the model is trained on different training samples

• In terms of classification problems:
o a model has low variance when different randomly sampled training datasets lead to similar predictions/models (independently of whether the predictions are correct or not)
o a model has high variance when different randomly sampled training datasets lead to very different predictions/models

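A rough sketch of how model variance could be probed empirically, by training the same learner on different resamples of the data and measuring how much its predictions disagree (the use of a decision tree and bootstrap resamples here is an assumption for illustration):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  rng = np.random.default_rng(0)
  preds = []
  for _ in range(20):
      # train on a different random resample of the data each time
      idx = rng.choice(len(X), size=len(X), replace=True)
      preds.append(DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X))
  preds = np.array(preds)
  # high variance = the 20 models often disagree with each other on the same instances
  print("fraction of predictions differing from the first model:",
        (preds != preds[0]).mean().round(2))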
25
Q

evaluation variance

A

o the estimate of the effectiveness of a model changes a lot when we alter the dataset used for evaluation (e.g. a different train/test split)

26
Q

How to control bias and variance?

A

• Holdout partition size
o More training data: less model variance, more evaluation variance
o Less training data: the other way around

• Repeated random subsampling & M-fold cross-validation
o Less variance than holdout

• Stratification
o Less variance than holdout

• Leave-one-out cross-validation
o No sampling bias at all
o Lowest bias/variance in general

27
Q

Low Bias Classifier

A

weighted random classifier, polynomial/RBF kernel SVM

28
Q

Low variance classifier

A

0-R, naïve Bayes

29
Q

True Positive and True Negative

A

If the predicted and actual results match:
• if both True -> TP
• if both False -> TN

30
Q

False Positive and False Negative

A

If the predicted and actual results don't match:
• if test = False, reality=True -> False Negative
• if test=True, reality= False -> False Positive
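A small sketch of how these four counts can be tallied with scikit-learn (the label vectors below are made-up examples):

  from sklearn.metrics import confusion_matrix

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = positive/interesting class)
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted labels
  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  print(f"TP={tp} TN={tn} FP={fp} FN={fn}")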

31
Q

Classification Accuracy (ACC)

A

the proportion of instances for which we have correctly predicted the label

32
Q

Error rate (ER)

A

the proportion of instances for which we have incorrectly predicted the label

33
Q

Error rate reduction (ERR)

A

compare the ER of a given method with that of an alternative method (ER0)
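One common way of writing this (assuming ER0 denotes the error rate of the alternative/reference method): ERR = (ER0 - ER) / ER0, i.e. the relative reduction in error compared with the reference method.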

34
Q

Metrics related to interesting class only

A

precision and recall

35
Q

Precision/positive predictive value

A

when we predict an instance is interesting, how often are we correct?

36
Q

Recall/sensitivity

A

Out of all the instances that are truly interesting, how many did we correctly predict?

37
Q

Why is there a trade-off between precision and recall?

A

because there is a direct trade-off between FPs and FNs: predicting the interesting class more liberally reduces FNs (higher recall) but produces more FPs (lower precision), and vice versa

38
Q

What is F-score?

A

the weighted harmonic mean of precision and recall
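A brief sketch computing precision, recall and F1 (plus accuracy) with scikit-learn, using the same invented labels as above:

  from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
  print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / all instances
  print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
  print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
  print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of P and R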

39
Q

F-1 score interpretation

A

When the F1 score = 1 (its maximum), precision and recall are both perfect

When the F1 score = 0 (its minimum), precision and/or recall is at its worst (zero)

40
Q

Multi-class classification

A

There is no uninteresting class in multi-class classification.

So we use a confusion matrix, because the technical definition of accuracy behaves strangely

41
Q

Precision & Recall for multiple categories problem

A
  • They are calculated per class
  • Micro-averaging: combine all test instances into a single pool and compute P, R over that pool
  • Macro-averaging: calculate P, R per class and then average (dividing by the number of classes)
  • Weighted averaging: calculate P, R per class and then average, weighting each class by the proportion of instances in that class

If there is a small class, i.e. a class that has very few instances compared to the other classes, micro-averaging will make the small class effectively invisible, while macro-averaging will give the small class equal weight to the big classes

People often calculate both so that they can see the results from different perspectives

Problems:
• when to do the averaging?
• the "?" class

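A sketch of the three averaging options in scikit-learn (the three-class label vectors are invented):

  from sklearn.metrics import precision_score, recall_score

  y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
  y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2, 0]
  for avg in ("micro", "macro", "weighted"):
      p = precision_score(y_true, y_pred, average=avg)
      r = recall_score(y_true, y_pred, average=avg)
      print(f"{avg:8s} P={p:.2f} R={r:.2f}")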
42
Q

Clustering Accuracy

A

Highest accuracy = fairest: when evaluating clusters against class labels, the cluster-to-class mapping that yields the highest accuracy is taken as the fairest one

43
Q

Holdout

A

train a classifier over a fixed training dataset and evaluate it over a fixed held-out test dataset

Each instance is randomly assigned to either the test or training dataset; there is no overlapping data between the two datasets (partitioned)
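A minimal holdout sketch (the 80/20 split and the choice of GaussianNB are illustrative assumptions):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # randomly partition the data: 80% training, 20% held-out test, no overlap
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
  model = GaussianNB().fit(X_train, y_train)
  print("holdout accuracy:", model.score(X_test, y_test))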

44
Q

Pros and Cons of holdout

A

Pros:
• Simple to work with
• High reproducibility (same split ratio)

Cons:
• Trade-off between more training and test data (variance vs. bias)
• Representativeness of the training and test data: something might appear only in the test data and not in the training data, so the model never gets to learn it; this mismatch between the training and test data sets leads to high bias. Solution: random subsampling

45
Q

Random Subsampling

A

perform holdout over multiple iterations, randomly selecting the training and test data while maintaining a fixed size for each dataset on each iteration. Evaluate by taking the average across the iterations.
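A sketch of repeated random subsampling via scikit-learn's ShuffleSplit (10 iterations and an 80/20 split are assumed for illustration):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import ShuffleSplit, cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # 10 independent random 80/20 splits; the score is averaged across iterations
  splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
  scores = cross_val_score(GaussianNB(), X, y, cv=splitter)
  print("mean accuracy over 10 random subsamples:", np.mean(scores).round(3))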

46
Q

Pros and Cons of random subsampling

A

Pros:
• Reduces variance and bias compared with the holdout method -> more reliable results

Cons:
• Lower reproducibility (because of the randomness)
• Slower than holdout
• A badly chosen training-set / test-set size might lead to misleading results, as with holdout

47
Q

M-fold Cross Validation

A

Split the data into M equal partitions. Take one partition as the test data and the rest as the training data. Train the system M times and the average performance is computed across the M runs.

M typically = 5 or 10.

The evaluation is calculated based on the entire dataset (every instance is used as a test instance exactly once)

Better than random subsampling & holdout

The number of folds directly impacts runtime and the size of the partitions:
• Fewer folds: more instances per partition, more variance in the performance estimates
• More folds: fewer instances per partition, less variance, but slower

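A minimal M-fold cross-validation sketch with M = 10 (the estimator choice is an assumption):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  scores = cross_val_score(GaussianNB(), X, y, cv=10)  # 10 folds (stratified by default for classifiers)
  print("per-fold accuracy:", np.round(scores, 2))
  print("mean accuracy    :", scores.mean().round(3))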
48
Q

Pros and Cons of M-fold CV

A

Pros:
• Trains the system only M times instead of N times
• Can measure the stability of the system across different training/test combinations
• Very reproducible
• Minimises bias and variance in the estimate of the classifier's performance

Cons:
• The value of M is subjective, so it might lead to bias (as the training data might differ from the test data)
• The result will not be unique unless we always partition the data identically
• Gives slightly lower accuracy than the leave-one-out method

49
Q

Leave–One–Out Cross–Validation

A

• M = N (train on all of the data except one instance, test on that instance, and repeat for every instance)
• Maximises the training data and mimics the actual testing behaviour (every test instance is independent)
• Too computationally expensive for most datasets

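A leave-one-out sketch (a small dataset is assumed, since LOOCV trains N separate models):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import LeaveOneOut, cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())  # one model per instance
  print("LOOCV accuracy:", scores.mean())  # each fold's score is 0 or 1 (one test instance per fold)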
50
Q

Stratification

A

the process of rearranging the data so as to ensure that each fold is a good representative of the whole

51
Q

Inductive Learning Hypothesis

A

Any hypothesis found to approximate the target function well over (a sufficiently large) training data set will also approximate the target function well over unseen test examples

52
Q

Inductive bias (assumptions must be made about the data to build a model and make predictions):

A

o Different assumptions will lead to different predictions

o In order to optimize performance, we need some sort of a priori knowledge, as there is no free lunch in ML

53
Q

Stratification

A

• Assumes that the class distribution of unseen instances will be the same as the distribution of the seen instances

• When constructing holdout/CV partitions, ensure that the training data and test data both have the same class distribution as the dataset as a whole

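A sketch of a stratified holdout split and stratified cross-validation (the estimator and split ratio are illustrative assumptions):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # stratified holdout: class proportions in train and test match the full dataset
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
  # stratified 10-fold cross-validation
  scores = cross_val_score(GaussianNB(), X, y, cv=StratifiedKFold(n_splits=10))
  print("stratified 10-fold mean accuracy:", scores.mean())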
54
Q

Result Comparison

A

Baseline
Benchmark
Random Baseline

55
Q

Baseline

A

naïve method which we would expect any reasonably well-developed method to beat (i.e. a minimal, dumb and simple method)

  • Sometimes out-performs the complicated methods
  • Valuable in getting a feel for how difficult the classification task is
  • In formulating a baseline for a medical task, we need to be sensitive to the importance of positives and negatives in the classification task
56
Q

Benchmark

A

an established rival technique which we are pitching our method against (i.e. a reasonable point of comparison), such as past performance

57
Q

Random Baseline

A

• Randomly assign a class to each test instance

• Randomly assign a class to each test instance, weighting the class assignment according to the class distribution of the training data (if we know the prior probabilities)

• Zero-R/majority class baseline: zero attributes are used and it is based only on the class labels (not suitable for "needle in a haystack" tasks)

• One-R: based on one attribute only; for every value of each attribute apply Zero-R, then pick the attribute that has the lowest total error rate
o Pros: easy and simple to understand and implement; gives good results
o Cons: unable to capture attribute interactions; biased toward attributes that have many values
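A sketch of the random and Zero-R baselines using scikit-learn's DummyClassifier (One-R has no built-in scikit-learn implementation, so it is not shown; the dataset and split are illustrative assumptions):

  from sklearn.datasets import load_iris
  from sklearn.dummy import DummyClassifier
  from sklearn.model_selection import train_test_split

  X, y = load_iris(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
  for name, strategy in [("random baseline", "uniform"),
                         ("weighted random baseline", "stratified"),
                         ("Zero-R (majority class)", "most_frequent")]:
      clf = DummyClassifier(strategy=strategy, random_state=0).fit(X_tr, y_tr)
      print(f"{name:25s} accuracy: {clf.score(X_te, y_te):.2f}")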