Topic 5: Ensemble Methods Flashcards

1
Q

What does ensemble mean

A

Having multiple predictive models and combining their predictions
-> treating them as a “committee”

2
Q

What is a Committee

A

A group of predictive models

3
Q

What is the underlying principle of ensemble methods

A

That in real-world situations, every model has limitations and will make some errors

4
Q

What models can the ensemble algorithm be applied to

A

Most models, from simple decision trees to deep CNNs

5
Q

What is Jensen’s Inequality

A

If X is a random variable and f(x) is a convex function, then:

E [ f(X) ] ≥ f (E [X])

Essentially, the distance of the mean guess x̄ from the truth is less than or equal to the average of the distances of the individual guesses xi

So the error of the combined prediction is less than or equal to the average error of the individual predictions

6
Q

What is Jensen’s inequality applied to ensemble methods

A

The squared error of the ensemble is guaranteed to be less than or equal to the average squared error of the individual estimators
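A minimal numerical sketch of this guarantee (hypothetical numbers, with squared error as the convex f):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = 2.0                                # the target for one example
preds = y_true + rng.normal(0, 1, size=5)   # five noisy individual predictions

ensemble_error = (preds.mean() - y_true) ** 2           # f(E[X]): error of the averaged guess
avg_individual_error = ((preds - y_true) ** 2).mean()   # E[f(X)]: average of the individual errors

# Jensen's inequality guarantees this never fails for a convex f such as squared error
assert ensemble_error <= avg_individual_error
print(ensemble_error, avg_individual_error)
```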

7
Q

What is the relationship between ensemble size and probability of ensemble error

A

If we fix the individual error at ε = 0.3 (the probability that one model makes an error)
and then increase the ensemble size,
the probability of ensemble error decreases

If we decrease ε, the whole graph (ensemble size vs error) is squashed along the y axis (pushed down)
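A minimal sketch of this relationship, assuming independent classifiers combined by majority vote (my own illustration, not from the slides):

```python
from math import comb

def ensemble_error(m: int, eps: float = 0.3) -> float:
    """Probability that a majority of m independent classifiers,
    each wrong with probability eps, are wrong on a given example."""
    return sum(comb(m, k) * eps**k * (1 - eps)**(m - k)
               for k in range(m // 2 + 1, m + 1))

# The ensemble error drops as the ensemble grows (odd sizes avoid ties):
for m in (1, 5, 11, 21):
    print(m, round(ensemble_error(m), 4))   # roughly 0.3, 0.163, 0.078, 0.026
```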

8
Q

What is the assumption we are making in ensemble probability models

A

Each model is independent

9
Q

What is the issue with increasing the number of models in an ensemble with the same training dataset

A

In order for each model to be independent (a necessary requirement), the training data has to be split n ways (for n models), using the assumption that all data is IID

However, the shrinking training set sizes mean each model becomes less and less accurate on the test data

10
Q

Why do we need parallel algorithms

A

To try and create “diversity” between classifiers in Ensembles, whilst maintaining a reasonable degree of accuracy
Eg Bagging and random forests

11
Q

What is the bagging algorithm

A

“Bootstrap AGGregating”

Given a dataset of size n, randomly select n examples WITH replacement
Eg from (n = 6, examples 1-6): 1 1 2 4 5 5
This is a ‘bootstrap’
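A minimal sketch of drawing a bootstrap sample (hypothetical indices, not the lecture’s example):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_indices(n: int) -> np.ndarray:
    """Draw n indices uniformly WITH replacement from 0..n-1."""
    return rng.integers(0, n, size=n)

# For a 6-example dataset: some examples repeat, some are left out entirely.
print(bootstrap_indices(6))
```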

12
Q

What is a data bootstrap

A

A random sample of n examples drawn with replacement from our original dataset (itself of size n)

13
Q

For a classification problem, how do you combine model results

A

Majority vote

14
Q

For a regression problem, how do you combine model results

A

Take the average
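A minimal sketch of both combination rules (hypothetical predictions from five models):

```python
import numpy as np

class_preds = np.array([1, 0, 1, 1, 0])           # class labels from 5 classifiers
reg_preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])   # values from 5 regressors

majority_vote = np.bincount(class_preds).argmax()  # classification -> 1
average = reg_preds.mean()                         # regression -> 2.12

print(majority_vote, average)
```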

15
Q

How can you use non-uniform weights in the bagging algorithm

A

Requires an extra holdout dataset, separate from the train and test data
Instead of simply averaging the predictions from each model, different weights are assigned to each model’s prediction based on its performance
This holdout data is used to assess performance under the different assigned weights

16
Q

What is the risk of using non-uniform weights in the bagging algorithm

A

Overfitting
The weights may become overly tuned to specific characteristics of the data used to set them

17
Q

Bagging: are we actually discarding a lot of training data in our random sampling?

A

No. Our classifiers are statistically dependent on each other (their bootstraps overlap), and almost all of the dataset is certain to be used somewhere across the ensemble

18
Q

What is the probability of a piece of data being included in a bootstrap

A

p = 1 - (1 - 1/n)^n

Because 1/n probability of selecting it
1 - 1/n probability of not selecting it
(1 - 1/n)^n probability of it never being selected in the whole n size bootstrap

So 1 - (1 - 1/n)^n is the probability of it being included in the bootstrap
Which converges to roughly 0.6321 (i.e. 1 - 1/e) as n grows
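A quick numerical check of that limit:

```python
import math

# p = 1 - (1 - 1/n)^n approaches 1 - 1/e ≈ 0.6321 as n grows
for n in (5, 10, 100, 10_000):
    print(n, round(1 - (1 - 1/n) ** n, 4))

print(round(1 - 1/math.e, 4))  # the limit
```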

19
Q

Relationship of ensemble size to probability of data being excluded

A

By ensemble size 10 there is a <1% chance that a given piece of data is excluded from every bootstrap in the ensemble

20
Q

Why are classifiers more dependent on each other for Naive Bayes than for decision trees

A

Because Naive Bayes has a higher bias
Thus the number of possible models that can be generated is highly restricted
Lots of the models are practically identical

21
Q

How is overfitting helpful in bootstrapping

A

A single tree trained on a bootstrap overfits more than a single tree trained on the whole dataset
This means the bootstrap trees are more different from each other and therefore less dependent -> which is helping us

(The ensemble then cancels out these individual overfitting errors)

22
Q

What are random forests

A

An ensemble algorithm only for decision trees
“a forest of randomised trees”

23
Q

How does random forests work

A

Bootstrap, then a random selection of features at each split point

For each tree:
- Take a bootstrap S′n from Sn
- Build a tree using S′n, but at every split point:
  - choose a random fraction K of the remaining features
  - pick the best feature from that subset
- Add the tree to the set

Return the set of trees

Essentially each bootstrap tree is prevented from simply choosing the overall best feature at every split; it must pick the best feature from a random subset instead (see the sketch below)
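A minimal usage sketch, assuming scikit-learn is available (not part of the lecture): its RandomForestClassifier follows this same recipe of bootstraps plus a random feature subset at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 200 examples, 20 features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of randomised trees in the forest
    max_features="sqrt",  # K = sqrt(d) features considered at each split
    bootstrap=True,       # each tree trains on a bootstrap of the data
    random_state=0,
).fit(X, y)

print(forest.score(X, y))  # training accuracy of the whole forest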

24
Q

What is the common value of K (random forests)

A

K = √d
where d is the total number of features

25
Q

Random forests vs bagging

A

RF helps reduce the test error further than bagging (as number of trees increases)

26
Q

Why is RF not a precise algorithm

A

It is a “methodology”
there are many ways to randomise the tree structures

27
Q

What is a sequential algorithm

A

Where each classifier aims to correct the errors of its predecessors
Eg Boosting

28
Q

What is Boosting

A

Each stage involves:
- training one new model
- identifying which mistakes it made on the training data
- emphasising the misclassified training examples
- decreasing the emphasis on correctly classified examples

This generates a new (reweighted) dataset, which is used to train the next model in the boosting chain

29
Q

How do we model “emphasis”

A

Represented by a probability distribution P over the training examples

30
Q

What is Adaboost

A

Adaptive Boosting
The most well known of the boosting family of procedures
Incredibly fast

31
Q

How does Adaboost work

A

Each classifier is:
ht(x) ∈ {-1, +1}

(Builds one classifier at a time)
For t = 1 to M:
- build ht
- calculate the error (εt) that ht makes on the dataset
- assign it an importance (αt) (higher importance if it performs better)
- update the distribution to Pt+1 (a new probability distribution), which gives more weight to misclassified samples (and less to correctly classified ones)
- renormalise the distribution to ensure that the weights sum up to 1

Finally AdaBoost combines all the classifiers, weighting each one by its αt (a sketch follows below)
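A minimal sketch of these steps, assuming 1-D threshold "stumps" as the weak classifiers (my own illustration, not the lecture’s code):

```python
import numpy as np

def fit_adaboost(x, y, rounds=3):
    """Toy AdaBoost on 1-D data with threshold stumps ht(x) in {-1, +1}."""
    n = len(x)
    p = np.full(n, 1 / n)                 # P1: start with a uniform distribution
    stumps, alphas = [], []
    for t in range(rounds):
        # Build ht: pick the threshold/sign with the smallest weighted error under p.
        best = None
        for thr in x:
            for sign in (1, -1):
                pred = np.where(x > thr, sign, -sign)
                err = p[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, sign, pred)
        err, thr, sign, pred = best
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # importance: low error -> high alpha
        p = p * np.exp(-alpha * y * pred)                # emphasise misclassified points
        p = p / p.sum()                                  # renormalise so weights sum to 1
        stumps.append((thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(x, stumps, alphas):
    """Weighted vote of all classifiers, each weighted by its alpha."""
    votes = sum(a * np.where(x > thr, s, -s) for (thr, s), a in zip(stumps, alphas))
    return np.sign(votes)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1, 1])       # not separable by any single stump
print(predict(x, *fit_adaboost(x, y)))   # the boosted committee recovers y
```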

32
Q

What is αt in Adaboost

A

The importance (αt) of the classifier, based on its error
If the error is low, the importance will be high, and vice versa
The formula is αt = ½ ln((1 - εt) / εt), i.e. the natural logarithm of the ratio of correct to incorrect (weighted) predictions
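A quick worked example of that formula: with εt = 0.3, αt = ½ ln(0.7 / 0.3) ≈ 0.42, while a classifier that is no better than chance (εt = 0.5) gets αt = ½ ln(1) = 0.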

33
Q

Where do bagging and boosting overlap

A

In the first iteration of boosting, when P1 is uniform, this is exactly equivalent to Bagging

34
Q

Why does AdaBoost minimise an exponential loss function rather than the classification error itself

A

Because the 0/1 classification error is discontinuous, it is hard to minimise directly
Instead AdaBoost minimises it indirectly, by minimising an upper bound on it (an exponential function)

35
Q

How is the exp loss function applied in adaboost

A

The exponential loss function is decomposed into individual terms for each data point

Recursively, the relative weight of each data point at time step t+1 is updated based on the previous weights and the performance of the current classifier: wt+1(i) ∝ wt(i) · e^(-αt yi ht(xi))

36
Q

what is wt(i)

A

the relative importance of each data point i at time step t

37
Q

What is the “Ambiguity decomposition”

A

The ensemble will always have an error less than or equal to the average error of its individual members
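In its standard (Krogh and Vedelsby) regression form, if that is the version the slides use, the decomposition for a weighted ensemble f̄(x) = Σt wt ft(x) reads: (f̄ - y)² = Σt wt (ft - y)² - Σt wt (ft - f̄)². The second term (the "ambiguity", or diversity) is never negative, which is why the ensemble error cannot exceed the weighted average member error.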

38
Q

How do we minimise adaboost classification error

A

We cannot directly minimise the error because the error function is discontinuous
Instead, create an upper bound (on the classification error) that is exponential and minimise that
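A brief check (in my notation) of why the exponential function really is an upper bound: writing the margin as m = yi · f(xi), the 0/1 error is 1 when m ≤ 0 and 0 when m > 0, while e^(-m) ≥ 1 whenever m ≤ 0 and e^(-m) > 0 always, so 1[m ≤ 0] ≤ e^(-m) for every example.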

39
Q

How does the exponential bound on adaboost work

A

It assigns a higher penalty when the ensemble makes incorrect predictions
As predictions become more positive (more confidently correct), the loss (y axis) decreases and the curve approaches the x axis asymptotically
Conversely, as predictions become more negative (more confidently incorrect), the loss grows exponentially
(the gradient is negative throughout)

40
Q

What does ‘parallel’ mean

A

Ability to train multiple base models concurrently or independently

41
Q

When does RF work best

A

When there are a large number of features — allowing for lots of diversity in the forest

42
Q

what does overfitting look like in training and testing errors

A

training error significantly < testing error

43
Q

what does underfitting look like in training and testing errors

A

high training and testing errors

44
Q

What is Et in adaboost

A

The loss function at each time step t
Et = 1/n ∑ e^(-αtyiht(xi))