Flashcards in Ensembles Deck (44)
Two characteristics of good ensembles
1. Individual models should be strong
2. The correlation between the models in the ensemble should be weak (diversity)
What is bagging?
- Trains N models in parallel using bootstrapped data samples from an overall training set
- Aggregates using majority voting
- Bootstrapped aggregating = bagging
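The two steps above (bootstrap sampling, then majority voting) can be sketched in a few lines of pure Python; the `model` callables here are hypothetical stand-ins for trained classifiers, not any particular library's API:

```python
import random
from collections import Counter

def bootstrap_sample(data):
    """Draw len(data) items with replacement (a bootstrap sample)."""
    return [random.choice(data) for _ in data]

def bagged_predict(models, x):
    """Aggregate the models' predictions on x by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Each of the N models would be trained on its own `bootstrap_sample` of the training set, and `bagged_predict` combines them at prediction time.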
Bayes Optimal Ensemble
- An ensemble of all hypotheses in the hypothesis space
- On average, no other ensemble can outperform it
- Not possible to practically implement
- Tom Mitchell book (1997)
What are some practical approaches to ensembling?
1. Bagging
2. Random Forests
3. Boosting
4. Gradient Boosting
What is subagging?
- Bagging, but where we sample without replacement
- Used in large datasets where we want to create bootstrap samples that are smaller than the original dataset
What type of learning algorithms are suited to bagging?
- Decision Trees
- DTs are very sensitive to changes in the data: a small change in the dataset can result in a different feature being selected to split the data at the root (or high up in the tree), and this has a ripple effect throughout the subtrees under that node
What is subspace sampling?
- A bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset
- Encourages further diversity of the trees within the ensemble
- Has the advantage of reducing training time for each tree
What is a Random Forest?
= Bagging + subspace sampling
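Combining the two ideas, a Random Forest-style training sample bootstraps the rows and keeps only a random subset of the features. A minimal sketch (the helper and dict-of-features representation are illustrative, not a library API):

```python
import random

def subspace_bootstrap(data, feature_names, k):
    """Bootstrap the rows AND randomly select k descriptive features."""
    rows = [random.choice(data) for _ in data]      # sampling with replacement
    features = random.sample(feature_names, k)      # subspace sampling
    return [{f: row[f] for f in features} for row in rows], features
```

One tree in the ensemble would be trained on each such (rows, features) pair, which encourages diversity and shortens per-tree training time.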
Advantages of Bagging over Boosting
1. Simpler to implement + parallelize
2. Ease of use
3. Reduced training time
However, Caruana et al. (2008) showed that boosted DT ensembles performed best on datasets with fewer than 4,000 descriptive features.
> 4,000 descriptive features -> random forests (based on bagging) performed better
> Boosted ensembles may be prone to overfitting, and in domains with large numbers of features, overfitting becomes a serious problem
Costs of using ensembles
1. Increased model complexity
2. Increased learning (training) time
What types of algorithms does Bagging work well for?
Unstable algos - algos whose output classifier undergoes major changes in response to small changes in the training data
Examples of unstable classifiers
DTs, NNs and rule-learning algos are all unstable
Examples of stable classifiers
- Nearest neighbour
- Linear threshold algorithm
Training set created from bagging procedure
Contains on average 63.2% of the original training set, with several training examples appearing multiple times (Dietterich paper)
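The 63.2% figure follows from the probability that a given instance is picked at least once in n draws with replacement, which tends to 1 - 1/e as n grows. A quick check:

```python
import math

def expected_unique_fraction(n):
    """P(a given instance appears at least once in n draws with replacement)."""
    return 1 - (1 - 1 / n) ** n

# For large n this approaches 1 - 1/e, i.e. about 0.632
```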
Methods for manipulating training datasets
1. Bagging (bootstrap sampling)
2. K-fold cross-validation (ensembles constructed in this way are called 'cross-validation committees')
What is AdaBoost? (Freund & Schapire, 1995-8)
- manipulates the training examples to generate multiple hypotheses
- maintains a set of weights over the training examples
- Effect of the change in weights is to place more weight on training examples that were misclassified by h(l) and less weight on examples that were correctly classified
- In subsequent iterations, therefore, AdaBoost constructs progressively more difficult learning problems
- final classifier is constructed by a weighted vote of the individual classifiers. Each classifier is weighted (by w(l)) according to its accuracy on the weighted training set that it was trained on
- can be viewed as trying to maximise the margin (confidence of accuracy) on the training data
- constructs each new DT to eliminate 'residual errors' that have not been properly handled by the weighted vote of the previously-constructed trees.
- thus, it is directly trying to optimise the weighted vote
and is therefore making a direct attack on the representational problem
- directly optimising an ensemble can increase the risk of overfitting
- in high-noise cases, AdaBoost puts a large amount of weight on the mislabelled examples, and this leads to it overfitting badly
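One round of the weight updates described above can be sketched as follows. This is an illustrative pure-Python version of the standard AdaBoost update (binary case), not a specific library's implementation:

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: weight the hypothesis by its weighted error,
    then up-weight misclassified examples and down-weight correct ones."""
    err = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)       # hypothesis weight w(l)
    new = [w * math.exp(alpha if p != y else -alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)
    return alpha, [w / z for w in new]            # renormalise to a distribution
```

After the update, the misclassified examples carry a larger share of the total weight, which is what makes each subsequent learning problem "more difficult".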
Bagging vs AdaBoost
Dietterich paper: AdaBoost typically outperforms Bagging, but when 20% artificial classification noise was added, AdaBoost overfit the data badly while Bagging did very well in the presence of noise.
Why doesn't AdaBoost overfit more often?
- Stage-wise nature of AdaBoost
- In each iteration, it reweights the training examples, constructs a new hypothesis, and chooses a weight for that hypothesis.
- HOWEVER, it never backs up and modifies the previous choices of hypotheses or weights that it has made to compensate for this new hypothesis
What is an ensemble?
A prediction model that is composed of a set of models
What is the motivation behind ensembles?
- A committee of experts working together on a problem are more likely to solve it successfully than a single expert working alone
- however, should still avoid GROUPTHINK (i.e. each model should make predictions independently of the other models in the ensemble)
Two defining characteristics of ensembles
1. Each model is built from a modified version of the dataset
2. Aggregates predictions from many models
How can ensembles lead to good predictions from base learners that are only marginally better than random guessing?
- Given a large population of independent models, an ensemble can be very accurate even if the individual models in the ensemble perform only marginally better than random guessing
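This claim can be checked directly for independent models with a binomial calculation: the ensemble is correct whenever a majority of its n models are correct. A small sketch (odd n assumed, so there are no ties):

```python
from math import comb

def majority_accuracy(n, p):
    """P(a majority of n independent models is correct),
    where each model is correct with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
```

For example, with p = 0.6 (only marginally better than random guessing on a binary task), 101 independent models already give an ensemble accuracy well above 0.95.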
What are the 2 standard approaches to ensembling?
1. Bagging
- each model in the ensemble is trained on a random sample of the dataset
- each random sample is the same size as the original dataset (unless subagging is used)
- sampling with replacement is used
2. Boosting
- each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified
How does boosting work?
By incrementally adapting the dataset used to train the models
- uses a weighted dataset
- each instance has an associated weight (w(i) >= 0)
- weights are initially set to 1/n (n=number of examples)
- these weights are used as a distribution over which the dataset is sampled to create a replicated training set
- number of times that an instance is replicated is proportional to its weight
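The weighted replication step above maps directly onto weighted sampling; a minimal sketch (the helper name is illustrative):

```python
import random

def replicate_by_weight(instances, weights, n=None):
    """Draw a replicated training set in which each instance's expected
    replication count is proportional to its weight."""
    n = n if n is not None else len(instances)
    return random.choices(instances, weights=weights, k=n)
```

An instance with weight 0.9 will, on average, appear nine times as often in the replicated set as one with weight 0.1.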
How does boosting build up the ensemble?
Iteratively creating models and adding them to the ensemble
When do the iterations stop in boosting?
- predefined number of models have been added
- a model's accuracy dips below 0.5