Ensembles Flashcards

Two characteristics of good ensembles

1. Individual models should be strong
2. The correlation between the models in the ensemble should be weak (diversity)


What is bagging?

- Trains N models in parallel on bootstrap samples drawn from an overall training set
- Aggregates predictions using majority voting
- Bootstrap aggregating = bagging
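
A minimal Python sketch of the procedure, assuming `learner` is any function that trains on a sample and returns a classifier (an illustrative placeholder, not a specific library API):

```python
import random
from collections import Counter

def bagging_predict(train, x, learner, n_models=10):
    """Train n_models on bootstrap samples and aggregate by majority vote."""
    votes = []
    for _ in range(n_models):
        # bootstrap sample: same size as the training set, with replacement
        sample = [random.choice(train) for _ in train]
        model = learner(sample)        # learner is a placeholder training routine
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]   # majority vote
```

Because each model sees a different bootstrap sample, the N training runs are independent and can be parallelised.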


Bayes Optimal Ensemble

- An ensemble of all hypotheses in the hypothesis space
- On average, no other ensemble can outperform it
- Not practical to implement
- Tom Mitchell book (1997)


What are some practical approaches to ensembling?

1. Bagging
2. Random Forests
3. Boosting
4. Gradient Boosting
5. Stacking



What is subagging?

- Bagging, but where we sample without replacement
- Used with large datasets, where we want to create bootstrap samples that are smaller than the original dataset
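
A hypothetical sketch of the sampling step, assuming each sample is half the size of the original dataset (the fraction is illustrative):

```python
import random

def subagging_samples(dataset, n_models=10, frac=0.5):
    """Subagging: each sample is drawn WITHOUT replacement and is
    smaller than the original dataset (here, frac of its size)."""
    k = int(len(dataset) * frac)
    return [random.sample(dataset, k) for _ in range(n_models)]
```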


What type of learning algorithms are suited to bagging?

- Decision Trees

- DTs are very sensitive to small changes in the dataset: a small change can cause a different feature to be selected to split the data at the root (or high up in the tree), and this has a ripple effect throughout the subtrees under that node


What is subspace sampling?

- A bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset
- Encourages further diversity of the trees within the ensemble
- Has the advantage of reducing training time for each tree


Random Forests

= Bagging + Subspace sampling
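
A minimal sketch of how the two sampling steps combine, with subspace sampling done per tree as described above (the sqrt(d) default for the number of features is a common convention, not from the deck):

```python
import random

def random_forest_samples(X, y, n_trees=10, n_features=None):
    """For each tree: a bootstrap sample of rows (bagging) plus a
    random subset of feature indices (subspace sampling)."""
    d = len(X[0])
    k = n_features or max(1, int(d ** 0.5))  # common default: sqrt(d) features
    per_tree = []
    for _ in range(n_trees):
        rows = [random.randrange(len(X)) for _ in X]   # with replacement
        feats = random.sample(range(d), k)             # without replacement
        Xs = [[X[i][j] for j in feats] for i in rows]
        ys = [y[i] for i in rows]
        per_tree.append((Xs, ys, feats))
    return per_tree
```

Each (Xs, ys, feats) triple would then be handed to a decision-tree learner.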


Advantages of Bagging over Boosting

1. Simpler to implement + parallelize
2. Ease of use
3. Reduced training time

However, Caruana et al. (2008) showed that boosted DT ensembles performed best for datasets with < 4,000 descriptive features.
- With > 4,000 descriptive features, random forests (based on bagging) performed better
- Boosted ensembles may be prone to overfitting, and in domains with large numbers of features overfitting becomes a serious problem


Costs of using ensembles

1. Increased model complexity
2. Increased learning time


What types of algorithms does Bagging work well for?

Unstable algos - algos whose output classifier undergoes major changes in response to small changes in the training data


Examples of unstable classifiers

DTs, NNs and rule-learning algos are all unstable


Examples of stable classifiers

-Linear regression
- Nearest neighbour
- Linear threshold algorithm


Bootstrap replicate

Training set created from bagging procedure

Contains on average 63.2% of the original training set, with several training examples appearing multiple times (Dietterich paper)
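
The 63.2% figure follows from the probability that an example appears at least once in n draws with replacement, 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632. A quick check:

```python
import random

n = 10_000
random.seed(1)
sample = [random.randrange(n) for _ in range(n)]   # one bootstrap replicate
unique_frac = len(set(sample)) / n                 # fraction of originals included
theory = 1 - (1 - 1 / n) ** n                      # -> 1 - 1/e ≈ 0.632 for large n
```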


Methods for manipulating training datasets

1. Bagging
2. K-fold cross-validation (ensembles constructed in this way are called 'cross-validation committees')
3. Boosting



AdaBoost (Freund & Schapire, 1995-98)

- manipulates the training examples to generate multiple hypotheses
- maintains a set of weights over the training examples
- Effect of the change in weights is to place more weight on training examples that were misclassified by h(l) and less weight on examples that were correctly classified
- In subsequent iterations, therefore, Adaboost constructs progressively more difficult learning problems
- final classifier is constructed by a weighted vote of the individual classifiers. Each classifier is weighted (by w(l)) according to its accuracy on the weighted training set that it was trained on
- can be viewed as trying to maximise the margin (confidence of accuracy) on the training data
- constructs each new DT to eliminate 'residual errors' that have not been properly handled by the weighted vote of the previously-constructed trees.
- thus, it is directly trying to optimise the weighted vote, and is therefore making a direct attack on the representational problem
- directly optimising an ensemble can increase the risk of overfitting
- in high-noise cases, Adaboost puts a large amount of weight on the mislabelled examples and this leads to it overfitting badly
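
A minimal sketch of one reweighting iteration, using the standard alpha = ½ ln((1 - err)/err) vote weight (the deck's w(l)); this is illustrative, not the full algorithm:

```python
import math

def adaboost_round(weights, preds, labels):
    """One AdaBoost iteration: weighted error of the hypothesis, its vote
    weight alpha, and the reweighted (normalised) example weights."""
    err = sum(w for w, p, t in zip(weights, preds, labels) if p != t)
    alpha = 0.5 * math.log((1 - err) / err)   # assumes 0 < err < 0.5
    # misclassified examples are up-weighted, correct ones down-weighted
    new = [w * math.exp(alpha if p != t else -alpha)
           for w, p, t in zip(weights, preds, labels)]
    z = sum(new)
    return alpha, [w / z for w in new]
```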


Bagging vs AdaBoost

Dietterich paper: AdaBoost typically outperforms Bagging, but when 20% artificial classification noise was added, AdaBoost overfit the data badly while Bagging did very well in the presence of noise.


Why doesn't Adaboost overfit more often?

- Stage-wise nature of AdaBoost
- In each iteration, it reweights the training examples, constructs a new hypothesis, and chooses a weight for that hypothesis.
- HOWEVER, it never backs up and modifies the previous choices of hypotheses or weights that it has made to compensate for this new hypothesis


What is an ensemble?

A prediction model that is composed of a set of models


What is the motivation behind ensembles?

- A committee of experts working together on a problem are more likely to solve it successfully than a single expert working alone
- however, should still avoid GROUPTHINK (i.e. each model should make predictions independently of the other models in the ensemble)


Two defining characteristics of ensembles

1. Builds multiple models using modified versions of the dataset
2. Aggregates the predictions of those models into a single prediction


How can ensembles lead to good predictions from base learners that are only marginally better than random guessing?

- Given a large population of independent models, an ensemble can be very accurate even if the individual models in the ensemble perform only marginally better than random guessing
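
This can be made concrete with the binomial distribution: if n independent models each have accuracy p, the majority vote is correct with probability given by the upper binomial tail (p = 0.51 below is illustrative):

```python
from math import comb

def ensemble_accuracy(p, n):
    """P(majority of n independent models is correct), each with accuracy p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Each model is only marginally better than random guessing (p = 0.51),
# yet the majority vote is substantially more accurate as n grows.
acc = ensemble_accuracy(0.51, 1001)
```

The independence assumption is the key caveat: correlated models gain far less from voting.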


What are the 2 standard approaches to ensembling?

1. Bagging
2. Boosting



Bootstrap aggregation (bagging)

Each model in the ensemble is trained on a random sample of the dataset

- each random sample is the same size as the original dataset (unless subagging is used)
- sampling with replacement is used



Boosting

Each new model added to an ensemble is biased to pay more attention to instances that previous models misclassified


How does boosting work?

By incrementally adapting the dataset used to train the models

- uses a weighted dataset
- each instance has an associated weight (w(i) >= 0)
- weights are initially set to 1/n (n=number of examples)
- these weights are used as a distribution over which the dataset is sampled to create a replicated training set
- number of times that an instance is replicated is proportional to its weight
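
A sketch of the replication step, using the weights as a sampling distribution (`random.choices` accepts per-item weights; the example weights are illustrative):

```python
import random

def weighted_replicate(dataset, weights, seed=0):
    """Sample len(dataset) instances using the weights as a distribution;
    heavily weighted instances are replicated more often on average."""
    rng = random.Random(seed)
    return rng.choices(dataset, weights=weights, k=len(dataset))
```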


How are models added in boosting?

By iteratively creating models and adding them to the ensemble


When do the iterations stop in boosting?

- predefined number of models have been added
- model's accuracy dips below 0.5


Assumptions of boosting algorithm

1. Accuracy of models > 0.5
2. The problem is binary classification