Evaluation Flashcards

1
Q

Training dataset

A

= attributes + labels

2
Q

Input

A

set of annotated training instances

3
Q

Output

A

an accurate estimate of the target function underlying the training instances, which can then be applied to unseen/test instances

4
Q

Inductive Learning Hypothesis

A

any hypothesis found to approximate/generalize the target function well over a sufficiently large training data set will also approximate the target function well over held-out/unseen test examples

5
Q

Three points of interest in evaluating a classifier

A
  • Overfitting
  • Consistency
  • Generalization
6
Q

Overfitting

A

the model fits the training data set too well and does a poor job of generalizing the concept

7
Q

Consistency

A

how well the model/classifier performs on the training data; does it predict all the training labels correctly?

8
Q

Generalization

A

the opposite of overfitting; how well the classifier generalizes from the training instances to predict the target function

9
Q

Classification evaluation aims to?

A

find evidence of consistency and non-overfitting, and evidence that supports the inductive learning hypothesis (generalization)

10
Q

Generalization

A

the proportion of test instances for which the class label is correctly predicted

11
Q

A good model should?

A

fit the training data well and generalise well to unseen data

An overfitted model can generalise more poorly than a model with higher training error

12
Q

Learning Curves

A

Plot of test accuracy against the percentage of training data used

Represents the performance of a fixed learning strategy over different sizes of training data, for a fixed evaluation metric; it can also show how much data needs to be used in order to achieve a certain degree of accuracy

Allows us to visualise the data trade-off
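A minimal sketch of how a learning curve could be computed with scikit-learn (the choice of GaussianNB and the iris dataset is an illustrative assumption, not part of the card):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import learning_curve
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # train on 10% up to 100% of the available training data, with 5-fold CV at each size
  sizes, train_scores, test_scores = learning_curve(
      GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5),
      cv=5, shuffle=True, random_state=0)
  for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
      print(f"{n:3d} training instances: train acc {tr:.2f}, test acc {te:.2f}")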

13
Q

Learning Curves Trend Summary

A
  • Too few training instances = poor performance on both training and test data
  • The peak of the training accuracy = overfitting, because the model fits the training data too well but performs poorly on the test data
  • Training accuracy starts to drop as the model generalizes the concept better, and the test accuracy starts to rise (the better the model generalizes, the narrower the gap between training and test accuracy)
14
Q

Relationship between the size of training data and accuracy

A

Generally, the more training instances are used, the better the accuracy on the test data, because there are more examples from which to learn/generalize the concept and thus predict better

15
Q

Apparent error rate

A

the error rate obtained by evaluating the model on the training data set

16
Q

True error rate

A

the error rate the model would obtain on unseen/real instances; in practice it is estimated by evaluating on a test data set

With unlimited samples used as training instances, the apparent error rate will eventually approach the true error rate

The true error rate is almost always much higher than the training (apparent) error rate because of overfitting

17
Q

difference between true error rate & error rate of the test dataset?

A

The error rate on a test dataset is only an estimate of the true error rate, and it carries a risk of overfitting the evaluation:
• the model may be tuned too closely to the development data, or
• it may only have good accuracy on that one test dataset and not on others

18
Q

Why we want to know the true error?

A

because it tells us how well the model generalizes

19
Q

Possible evidence of overfitting?

A

• Large gap between training and test accuracy in the learning curve
• Complex decision boundary (which has been distorted by noisy data)
• Lack of coverage of the population in the sample data, due to either
o a small number of samples, or
o non-randomness in the sample dataset (sampling bias)

20
Q

Bias and Variance

A

model bias
evaluation bias
sampling bias

model variance
evaluation variance

It’s rather hard to tell evaluation and model bias & variance apart

These are informal definitions and they cannot be measured quantitatively.

A biased classifier is guaranteed to be making errors; an unbiased classifier might or might not be making errors

Although high bias and high variance are often bad, that does not mean low bias and low variance are automatically good; they are generally desirable, all else being equal.

Bias is generally binary (black-or-white) whilst variance is generally relative (to other classifiers)

21
Q

model bias

A

wrong predictions due to the propensity of the classifier; relates to accuracy

• In terms of regression problems:
o bias = the average of the errors
o a model is biased if the predictions are systematically higher or lower than the true values
o a model is not biased if the predictions are correct, OR some predictions are higher and some are lower than the true values

• In terms of classification problems:
o a model is biased if the class distribution of the predictions is not the same as that of the test dataset
o a model is not biased if the class distribution of the predictions is the same as that of the test dataset
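A tiny regression sketch of the "bias = average of the errors" idea (the numbers are invented):

  import numpy as np

  y_true = np.array([3.0, 5.0, 7.0, 9.0])
  y_pred = np.array([3.5, 5.5, 7.5, 9.5])   # systematically too high
  bias = np.mean(y_pred - y_true)           # average of the errors
  print("bias:", bias)                      # 0.5 -> the model is biased upward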
22
Q

evaluation bias

A

over- or under-estimating the effectiveness of the classifier due to the propensity of the evaluation strategy

o the estimate of the effectiveness of a model is systematically too low or too high

23
Q

sampling bias

A

the training dataset does not fully represent the population ( -> inductive learning hypothesis broken)

24
Q

model variance

A

• In terms of regression problems:
o variance = the average of the squared predictions minus the square of the average prediction; it measures how much the model's predictions fluctuate around their mean when the model is trained on different training samples

• In terms of classification problems:
o a model has low variance when different randomly sampled training datasets lead to similar predictions/models (independently of whether the predictions are correct or not)
o a model has high variance when different randomly sampled training datasets lead to very different predictions/models

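A rough sketch of how model variance could be probed empirically, by training the same learner on different resamples of the data and measuring how much its predictions disagree (the use of a decision tree and bootstrap resamples here is an assumption for illustration):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  rng = np.random.default_rng(0)
  preds = []
  for _ in range(20):
      # train on a different random resample of the data each time
      idx = rng.choice(len(X), size=len(X), replace=True)
      preds.append(DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X))
  preds = np.array(preds)
  # high variance = the 20 models often disagree with each other on the same instances
  print("fraction of predictions differing from the first model:",
        (preds != preds[0]).mean().round(2))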
25
Q

evaluation variance

A

o the estimate of the effectiveness of a model changes a lot when we alter the dataset used for evaluation (e.g. a different train/test split)

26
Q

How to control bias and variance?

A

• Holdout partition size
o More training data: less model variance, more evaluation variance
o Less training data: the other way around

• Repeated random subsampling & M-fold cross-validation
o Less variance than holdout

• Stratification
o Less variance than holdout

• Leave-one-out cross-validation
o No sampling bias at all
o Lowest bias/variance in general

27
Q

Low Bias Classifier

A

weighted random classifier, polynomial/RBF kernel SVM

28
Q

Low variance classifier

A

0-R, naïve Bayes

29
Q

True Positive and True Negative

A

If the predicted and actual results match:
• if both True -> TP
• if both False -> TN

30
Q

False Positive and False Negative

A

If the predicted and actual results don't match:
• if test = False, reality=True -> False Negative
• if test=True, reality= False -> False Positive
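A small sketch of how these four counts can be tallied with scikit-learn (the label vectors below are made-up examples):

  from sklearn.metrics import confusion_matrix

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = positive/interesting class)
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted labels
  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  print(f"TP={tp} TN={tn} FP={fp} FN={fn}")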

31
Q

Classification Accuracy (ACC)

A

the proportion of instances for which we have correctly predicted the label

32
Q

Error rate (ER)

A

the proportion of instances for which we have incorrectly predicted the label

33
Q

Error rate reduction (ERR)

A

compare the ER of a given method with that of an alternative method (ER0)
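One common way of writing this (assuming ER0 denotes the error rate of the alternative/reference method): ERR = (ER0 - ER) / ER0, i.e. the relative reduction in error compared with the reference method.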

34
Q

Metrics related to interesting class only

A

precision and recall

35
Q

Precision/positive predictive value

A

when we predict an instance is interesting, how often are we correct?

36
Q

Recall/sensitivity

A

Out of all the instances that are truly interesting, how many did we correctly predict?

37
Q

Why is there a trade-off between precision and recall?

A

because there is a direct trade-off between FPs and FNs: predicting the interesting class more liberally reduces FNs (higher recall) but produces more FPs (lower precision), and vice versa

38
Q

What is F-score?

A

the weighted harmonic mean of precision and recall
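A brief sketch computing precision, recall and F1 (plus accuracy) with scikit-learn, using the same invented labels as above:

  from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

  y_true = [1, 0, 1, 1, 0, 0, 1, 0]
  y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
  print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / all instances
  print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
  print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
  print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of P and R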

39
Q

F-1 score interpretation

A

When the F1 score = 1 (its maximum), precision and recall are both perfect

When the F1 score = 0 (its minimum), precision and/or recall is at its worst (zero)

40
Q

Multi-class classification

A

There is no uninteresting class in multi-class classification.

So we use a confusion matrix, because the technical definition of accuracy behaves strangely

41
Q

Precision & Recall for multiple categories problem

A
  • They are calculated per class
  • Micro-averaging: combine all test instances into a single pool and compute P, R over that pool
  • Macro-averaging: calculate P, R per class and then average (dividing by the number of classes)
  • Weighted averaging: calculate P, R per class and then average, weighting each class by the proportion of instances in that class

If there is a small class, i.e. a class that has very few instances compared to the other classes, micro-averaging will make the small class effectively invisible, while macro-averaging will give the small class equal weight to the big classes

People often calculate both so that they can see the results from different perspectives

Problems:
• when to do the averaging?
• the "?" class

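A sketch of the three averaging options in scikit-learn (the three-class label vectors are invented):

  from sklearn.metrics import precision_score, recall_score

  y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
  y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2, 0]
  for avg in ("micro", "macro", "weighted"):
      p = precision_score(y_true, y_pred, average=avg)
      r = recall_score(y_true, y_pred, average=avg)
      print(f"{avg:8s} P={p:.2f} R={r:.2f}")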
42
Q

Clustering Accuracy

A

Highest accuracy = fairest: when evaluating clusters against class labels, the cluster-to-class mapping that yields the highest accuracy is taken as the fairest one

43
Q

Holdout

A

train a classifier over a fixed training dataset and evaluate it over a fixed held-out test dataset

Each instance is randomly assigned to either the test or training dataset; there is no overlapping data between the two datasets (partitioned)
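A minimal holdout sketch (the 80/20 split and the choice of GaussianNB are illustrative assumptions):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # randomly partition the data: 80% training, 20% held-out test, no overlap
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
  model = GaussianNB().fit(X_train, y_train)
  print("holdout accuracy:", model.score(X_test, y_test))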

44
Q

Pros and Cons of holdout

A

Pros:
• Simple to work with
• High reproducibility (same split ratio)

Cons:
• Trade-off between more training and test data (variance vs. bias)
• Representativeness of the training and test data: something might appear only in the test data and not in the training data, so the model never gets to learn it; this mismatch between the training and test data sets leads to high bias. Solution: random subsampling

45
Q

Random Subsampling

A

perform holdout over multiple iterations, randomly selecting the training and test data while maintaining a fixed size for each dataset on each iteration. Evaluate by taking the average across the iterations.
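A sketch of repeated random subsampling via scikit-learn's ShuffleSplit (10 iterations and an 80/20 split are assumed for illustration):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import ShuffleSplit, cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # 10 independent random 80/20 splits; the score is averaged across iterations
  splitter = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
  scores = cross_val_score(GaussianNB(), X, y, cv=splitter)
  print("mean accuracy over 10 random subsamples:", np.mean(scores).round(3))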

46
Q

Pros and Cons of random subsampling

A

Pros:
• Reduces variance and bias compared with the holdout method -> more reliable results

Cons:
• Lower reproducibility (because of the randomness)
• Slower than holdout
• A badly chosen training-set / test-set size might lead to misleading results, as with holdout

47
Q

M-fold Cross Validation

A

Split the data into M equal partitions. Take one partition as the test data and the rest as the training data. Train the system M times and the average performance is computed across the M runs.

M typically = 5 or 10.

The evaluation is calculated based on the entire dataset (every instance is used as a test instance exactly once)

Better than random subsampling & holdout

The number of folds directly impacts runtime and the size of the partitions:
• Fewer folds: more instances per partition, more variance in the performance estimates
• More folds: fewer instances per partition, less variance, but slower

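A minimal M-fold cross-validation sketch with M = 10 (the estimator choice is an assumption):

  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  scores = cross_val_score(GaussianNB(), X, y, cv=10)  # 10 folds (stratified by default for classifiers)
  print("per-fold accuracy:", np.round(scores, 2))
  print("mean accuracy    :", scores.mean().round(3))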
48
Q

Pros and Cons of M-fold CV

A

Pros:
• Trains the system only M times instead of N times
• Can measure the stability of the system across different training/test combinations
• Very reproducible
• Minimises bias and variance in the estimate of the classifier's performance

Cons:
• The value of M is subjective, so it might lead to bias (as the training data might differ from the test data)
• The result will not be unique unless we always partition the data identically
• Gives slightly lower accuracy than the leave-one-out method

49
Q

Leave–One–Out Cross–Validation

A

• M = N (train on all of the data except one instance, test on that instance, and repeat for every instance)
• Maximises the training data and mimics the actual testing behaviour (every test instance is independent)
• Too computationally expensive for most datasets

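A leave-one-out sketch (a small dataset is assumed, since LOOCV trains N separate models):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import LeaveOneOut, cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())  # one model per instance
  print("LOOCV accuracy:", scores.mean())  # each fold's score is 0 or 1 (one test instance per fold)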
50
Q

Stratification

A

the process of rearranging the data so as to ensure that each fold is a good representative of the whole

51
Q

Inductive Learning Hypothesis

A

Any hypothesis found to approximate the target function well over (a sufficiently large) training data set will also approximate the target function well over unseen test examples

52
Q

Inductive bias (assumptions must be made about the data to build a model and make predictions):

A

o Different assumptions will lead to different predictions

o In order to optimize performance, we need some sort of a priori knowledge, as there is no free lunch in ML

53
Q

Stratification

A

• Assumes that the class distribution of unseen instances will be the same as the distribution of the seen instances

• When constructing holdout/CV partitions, ensure that the training data and test data both have the same class distribution as the dataset as a whole

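A sketch of a stratified holdout split and stratified cross-validation (the estimator and split ratio are illustrative assumptions):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
  from sklearn.naive_bayes import GaussianNB

  X, y = load_iris(return_X_y=True)
  # stratified holdout: class proportions in train and test match the full dataset
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
  # stratified 10-fold cross-validation
  scores = cross_val_score(GaussianNB(), X, y, cv=StratifiedKFold(n_splits=10))
  print("stratified 10-fold mean accuracy:", scores.mean())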
54
Q

Result Comparison

A

Baseline
Benchmark
Random Baseline

55
Q

Baseline

A

naïve method which we would expect any reasonably well-developed method to beat (i.e. a minimal, dumb and simple method)

  • Sometimes out-performs the complicated methods
  • Valuable in getting a feel for how difficult the classification task is
  • In formulating a baseline for a medical task, we need to be sensitive to the importance of positives and negatives in the classification task
56
Q

Benchmark

A

an established rival technique which we are pitching our method against (i.e. a reasonable point of comparison), such as past performance

57
Q

Random Baseline

A

• Randomly assign a class to each test instance

• Randomly assign a class to each test instance, weighting the class assignment according to the class distribution of the training data (if we know the prior probabilities)

• Zero-R/majority class baseline: zero attributes are used and it is based only on the class labels (not suitable for "needle in a haystack" tasks)

• One-R: based on one attribute only; for every value of each attribute apply Zero-R, then pick the attribute that has the lowest total error rate
o Pros: easy and simple to understand and implement; gives good results
o Cons: unable to capture attribute interactions; biased toward attributes that have many values
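A sketch of the random and Zero-R baselines using scikit-learn's DummyClassifier (One-R has no built-in scikit-learn implementation, so it is not shown; the dataset and split are illustrative assumptions):

  from sklearn.datasets import load_iris
  from sklearn.dummy import DummyClassifier
  from sklearn.model_selection import train_test_split

  X, y = load_iris(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
  for name, strategy in [("random baseline", "uniform"),
                         ("weighted random baseline", "stratified"),
                         ("Zero-R (majority class)", "most_frequent")]:
      clf = DummyClassifier(strategy=strategy, random_state=0).fit(X_tr, y_tr)
      print(f"{name:25s} accuracy: {clf.score(X_te, y_te):.2f}")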