Ensembles Flashcards

Flashcards in Ensembles Deck (44)
1

Two characteristics of good ensembles

1. Individual models should be strong
2. The correlation between the models in the ensemble should be weak (diversity)

2

What is bagging?

- Trains N models in parallel, each on a bootstrapped data sample drawn from the overall training set
- Aggregates their predictions using majority voting
- Bootstrap aggregating = bagging
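
A minimal sketch of this procedure, assuming NumPy arrays, non-negative integer-encoded class labels, and scikit-learn decision trees as the base learner (these choices are illustrative, not part of the card):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, seed=0):
    """Train n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                  # n rows sampled WITH replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate the individual predictions by majority vote."""
    preds = np.stack([m.predict(X) for m in models]).astype(int)   # (n_models, n_samples)
    # most common label in each column (assumes non-negative integer labels)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```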

3

Bayes Optimal Ensemble

- An ensemble of all hypotheses in the hypothesis space
- On average, no other ensemble can outperform it
- Not practical to implement
- Tom Mitchell book (1997)

4

What are some practical approaches to ensembling?

1. Bagging
2. Random Forests
3. Boosting
4. Gradient Boosting
5. Stacking

5

Subagging

Bagging, but where we sample without replacement

- Used for large datasets, where we want the samples to be smaller than the original dataset
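
A minimal sketch of the sampling difference, assuming NumPy; the 0.5 sampling fraction is an illustrative choice, not something the card specifies:

```python
import numpy as np

def subagging_indices(n, fraction=0.5, seed=0):
    """Subagging: draw floor(fraction * n) row indices WITHOUT replacement,
    in contrast to bagging's n draws with replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(n, size=int(fraction * n), replace=False)
```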

6

What type of learning algorithms are suited to bagging?

- Decision Trees

- DTs are very sensitive to changes in the training data: a small change can result in a different feature being selected to split the dataset at the root (or high up in the tree), and this has a ripple effect throughout the subtrees under that node

7

What is subspace sampling?

- A bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset
- Encourages further diversity of the trees within the ensemble
- Has the advantage of reducing training time for each tree
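
A minimal sketch of one tree's training data under subspace sampling, assuming NumPy arrays and scikit-learn trees (the per-tree feature-subset size is an illustrative parameter):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_subspace_tree(X, y, n_features, seed=0):
    """Bootstrap the rows AND restrict the tree to a random subset of the columns."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, len(X), size=len(X))                     # bootstrap sample of rows
    cols = rng.choice(X.shape[1], size=n_features, replace=False)   # random feature subset
    tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
    return tree, cols   # cols must be reused to select the same features at prediction time
```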

8

Random Forests

= Bagging + Subspace sampling
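
scikit-learn's RandomForestClassifier packages these two ideas together: bootstrap samples of the rows plus a random feature subset considered at each split. A small usage example (hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,     # number of bagged trees
    bootstrap=True,       # sample rows with replacement (bagging)
    max_features="sqrt",  # random feature subset considered at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```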

9

Advantages of Bagging over Boosting

1. Simpler to implement + parallelize
2. Ease of use
3. Reduced training time

However, Caruana et al. (2008) showed that boosted DT ensembles performed best for datasets with < 4,000 descriptive features.
- For > 4,000 descriptive features, random forests (based on bagging) performed better
- Boosted ensembles may be prone to overfitting, and in domains with large numbers of features, overfitting becomes a serious problem

10

Costs of using ensembles

1. Increased model complexity
2. Increased learning time

11

What types of algorithms does Bagging work well for?

Unstable algos - algos whose output classifier undergoes major changes in response to small changes in the training data

12

Examples of unstable classifiers

DTs, NNs and rule-learning algos are all unstable

13

Examples of stable classifiers

- Linear regression
- Nearest neighbour
- Linear threshold algorithm

14

Bootstrap replicate

Training set created from bagging procedure

Contains on average 63.2% of the original training set, with several training examples appearing multiple times (Dietterich paper)
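
Where the 63.2% figure comes from: the chance that a given example is never drawn in n draws with replacement is (1 - 1/n)^n, which tends to 1/e as n grows, so the expected fraction of distinct original examples in a bootstrap replicate is 1 - 1/e ≈ 0.632. A quick numerical check:

```python
import math

n = 10_000                                # training-set size (illustrative)
p_appears = 1 - (1 - 1 / n) ** n          # P(a given example appears at least once)
print(p_appears, 1 - 1 / math.e)          # both ≈ 0.632
```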

15

Methods for manipulating training datasets

1. Bagging
2. K-fold cross-validation (ensembles constructed in this way are called 'cross-validation committees')
3. Boosting

16

AdaBoost

Freund & Schapire (1995-8)

- manipulates the training examples to generate multiple hypotheses
- maintains a set of weights over the training examples
- the effect of the change in weights is to place more weight on training examples that were misclassified by h(l) and less weight on examples that were correctly classified
- in subsequent iterations, therefore, AdaBoost constructs progressively more difficult learning problems
- the final classifier is constructed by a weighted vote of the individual classifiers; each classifier is weighted (by w(l)) according to its accuracy on the weighted training set that it was trained on
- can be viewed as trying to maximise the margin (confidence of accuracy) on the training data
- constructs each new DT to eliminate 'residual errors' that have not been properly handled by the weighted vote of the previously constructed trees
- thus, it is directly trying to optimise the weighted vote and is therefore making a direct attack on the representational problem
- directly optimising an ensemble can increase the risk of overfitting
- in high-noise cases, AdaBoost puts a large amount of weight on the mislabelled examples and this leads to it overfitting badly
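
A minimal sketch of this loop, assuming binary labels in {-1, +1} and decision stumps as the weak learner (both are illustrative assumptions; this is the discrete-AdaBoost form rather than the exact pseudocode of any one paper):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """y must be in {-1, +1}. Returns the hypotheses and their vote weights."""
    n = len(X)
    w = np.full(n, 1 / n)                        # start with uniform example weights
    hypotheses, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = max(w[pred != y].sum(), 1e-12)     # weighted error (floored to avoid log(0))
        if err >= 0.5:                           # no better than chance: stop adding models
            break
        alpha = 0.5 * np.log((1 - err) / err)    # this hypothesis's vote weight w(l)
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified, down-weight correct
        w /= w.sum()                             # renormalise to a distribution
        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    """Weighted vote of the individual classifiers."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(hypotheses, alphas)))
```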

17

Bagging vs AdaBoost

Dietterich paper: AdaBoost typically outperforms Bagging, but when 20% artificial classification noise was added, AdaBoost overfit the data badly while Bagging did very well in the presence of noise.

18

Why doesn't AdaBoost overfit more often?

- Stage-wise nature of AdaBoost
- In each iteration, it reweights the training examples, constructs a new hypothesis, and chooses a weight for that hypothesis.
- HOWEVER, it never backs up and modifies the previous choices of hypotheses or weights that it has made to compensate for this new hypothesis

19

What is an ensemble?

A prediction model that is composed of a set of models

20

What is the motivation behind ensembles?

- A committee of experts working together on a problem is more likely to solve it successfully than a single expert working alone
- however, should still avoid GROUPTHINK (i.e. each model should make predictions independently of the other models in the ensemble)

21

Two defining characteristics of ensembles

1. They build multiple models on modified versions of the dataset
2. They aggregate the predictions made by those models

22

How can ensembles lead to good predictions from base learners that are only marginally better than random guessing?

- Given a large population of independent models, an ensemble can be very accurate even if the individual models in the ensemble perform only marginally better than random guessing
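
A quick numerical illustration, under the strong assumption that the models' errors are independent: the probability that a majority of N classifiers, each correct with probability p, is right follows a binomial tail, and it climbs well above p as N grows. The 0.55 accuracy and the ensemble sizes below are illustrative choices:

```python
from math import comb

def majority_accuracy(n_models, p):
    """P(more than half of n_models independent classifiers are correct)."""
    return sum(comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
               for k in range(n_models // 2 + 1, n_models + 1))

# each model only marginally better than random guessing (p = 0.55)
for n in (1, 11, 101, 1001):
    print(n, round(majority_accuracy(n, 0.55), 3))
# roughly: 1 -> 0.55, 11 -> 0.63, 101 -> 0.84, 1001 -> 0.999
```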

23

What are the 2 standard approaches to ensembling?

1. Bagging
2. Boosting

24

Bagging

Bootstrap aggregation

25

Bagging

Each model in the ensemble is trained on a random sample of the dataset

- each random sample is the same size as the original dataset (unless subagging is used)

- sampling with replacement is used

26

Boosting

Each new model added to an ensemble is biased to pay more attention to instances that previous models misclassified

27

How does boosting work?

By incrementally adapting the dataset used to train the models

- uses a weighted dataset
- each instance has an associated weight (w(i) >= 0)
- weights are initially set to 1/n (n=number of examples)
- these weights are used as a distribution over which the dataset is sampled to create a replicated training set
- number of times that an instance is replicated is proportional to its weight
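
A minimal sketch of this weighted-resampling step, assuming NumPy; the toy data and the decision to quadruple one weight are purely illustrative:

```python
import numpy as np

def weighted_replicate(X, y, w, seed=0):
    """Sample len(X) rows with probability proportional to the instance weights w."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=len(X), replace=True, p=w / w.sum())
    return X[idx], y[idx]

# toy data just to exercise the function
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1] * 5)
w = np.full(len(X), 1 / len(X))   # weights start at 1/n
w[3] *= 4                         # pretend instance 3 was misclassified: raise its weight
X_rep, y_rep = weighted_replicate(X, y, w)   # instance 3 now appears ~4x as often on average
```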

28

How does boosting work?

Iteratively creating models and adding them to the ensemble

29

When do the iterations stop in boosting?

- a predefined number of models has been added
- the most recently added model's accuracy drops below 0.5

30

Assumptions of boosting algorithm

1. Accuracy of models > 0.5
2. Assumes it's a binary classification problem