Bagging and Random Forest Flashcards
Advanced Model Learning techniques (13 cards)
Base Estimator
A base estimator is a single model (for example, one decision tree) within a random forest of models. Its predictions are aggregated with, and compared against, those of the larger sample of randomized models.
For the multiple models in an ensemble technique, the outcome is predicted by:
Voting for a Classification outcome
or
Averaging for a Regression outcome
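A minimal sketch of both aggregation rules using plain NumPy; the prediction arrays are made-up illustrations, not output from any real models:

```python
import numpy as np

# Hypothetical predictions from three models in an ensemble.
clf_preds = np.array([[1, 0, 1],   # model 1's class labels for 3 samples
                      [1, 1, 1],   # model 2
                      [0, 0, 1]])  # model 3

reg_preds = np.array([[2.1, 3.0],  # model 1's regression outputs for 2 samples
                      [1.9, 3.4],  # model 2
                      [2.0, 2.9]]) # model 3

# Classification: majority vote across the models (axis 0 = models).
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, clf_preds)

# Regression: average across the models.
average = reg_preds.mean(axis=0)

print(votes)    # -> [1 0 1]
print(average)  # -> ~[2.0, 3.1]
```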
What is leveraged and perhaps inverted for models within an ensemble technique?
“Low computational time” and “high error rates” of the individual weak models are leveraged by ensemble techniques: combining many such models trades a higher, more complex overall computation for a lower error rate.
Why is Independence Important?
Reduces Variance: When the trees are independent, the errors of the individual trees are less likely to be correlated. This means that when the predictions are aggregated, the errors tend to cancel each other out, leading to lower variance and a more stable and reliable final prediction.
Improves Generalization: Diversity among trees allows the Random Forest to learn a more robust and generalizable model that is less prone to overfitting the training data.
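A small illustration of the variance-reduction claim, assuming each model's prediction is an independent noisy estimate of the same true value (synthetic numbers, not a real forest):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0
n_models, n_trials = 50, 10_000

# Each row is one trial; each column is one independent model's noisy prediction.
preds = true_value + rng.normal(0.0, 2.0, size=(n_trials, n_models))

single_model_var = preds[:, 0].var()     # variance of one model's prediction
ensemble_var = preds.mean(axis=1).var()  # variance of the averaged prediction

print(single_model_var)  # ~4.0
print(ensemble_var)      # ~0.08, roughly 4.0 / 50, because the errors are independent
```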
What does it mean for an Ensemble Technique to have independence in its models?
Independence of model outcomes refers to the idea that each model within the forest makes its predictions without relying on or being influenced by the predictions of other models.
How is independence built into model sampling for an Ensemble Technique?
Bootstrapping - Data points (rows) are randomly sampled with replacement, so each model is trained on its own random sample while every data point remains available to the other models, which increases independence.
Google answer: During the training phase, each decision tree is built using a different random subset of the training data, drawn with replacement (a process called bootstrap sampling or bagging).
Feature Randomness - Each tree also pulls from only a subset of the features (columns), further randomizing the data set each model sees.
Google answer: In addition to random row sampling, Random Forests also introduce feature randomness; each tree only considers a random subset of features when deciding on the best split at each node.
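A rough sketch of how one tree's training data could be drawn, assuming a NumPy feature matrix X and label vector y (the names and sizes are illustrative, not a library API). Note that scikit-learn-style forests actually resample the feature subset at each split; sampling features once per tree here is a simplification of the same idea:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))        # 100 rows, 8 features (made-up data)
y = rng.integers(0, 2, size=100)

n_rows, n_features = X.shape
max_features = int(np.sqrt(n_features))  # common heuristic for classification

# Bootstrapping: sample row indices WITH replacement (same size as the data set).
row_idx = rng.choice(n_rows, size=n_rows, replace=True)

# Feature randomness: sample a subset of column indices WITHOUT replacement.
col_idx = rng.choice(n_features, size=max_features, replace=False)

X_tree, y_tree = X[np.ix_(row_idx, col_idx)], y[row_idx]
print(X_tree.shape)  # (100, 2) -- this tree trains on its own random view of the data
```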
Sampling with Replacement
When drawing a sample of data for each model, a data point can be selected multiple times because it is returned to the pool after being drawn rather than being removed from further use.
Cross Validation
An evaluation technique for estimating how well a model will generalize to unseen data.
To conduct Cross Validation, split the data into multiple subsets (folds), hold each subset out in turn as the test set while training on the remaining subsets, and average the performance scores across the folds to get a more robust estimate of model performance.
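A minimal sketch of k-fold cross validation, assuming scikit-learn and its built-in iris data set are available:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross validation: each fold is held out once while the model trains on the rest.
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged estimate of generalization performance
```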
Aggregation
Pooling the predictions across all sub-models in an ensemble technique and averaging or voting on them to arrive at the best-performing final outcome.
Averaging is for Regression models and Voting is for Classification models.
Bagging
Also known as Bootstrap Aggregation.
Ensemble Machine Learning technique that uses random sampling and aggregation to improve accuracy and stability of regression and classification models.
Weak models are built in parallel.
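A minimal sketch of bagging with scikit-learn, where the default base estimator (a decision tree) serves as the weak model; the data set and parameter values are just illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 weak models, each trained on a bootstrap sample; n_jobs=-1 builds them in parallel.
bagging = BaggingClassifier(   # default base estimator is a decision tree
    n_estimators=100,
    bootstrap=True,            # sample rows with replacement
    n_jobs=-1,
    random_state=0,
)
bagging.fit(X_train, y_train)
print(bagging.score(X_test, y_test))  # accuracy of the aggregated predictions
```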
What are some benefits of Bagging?
Makes the model more Robust - the final prediction is based on the outputs of a large number of independent models
Protects the model from overfitting to the original data set - individual models are not trained on the full original data set; each is built only on a sample randomly drawn from it with replacement
Builds models in parallel - the output of individual models is independent of each other
How does sampling with replacement affect the probability of choosing an observation?
It keeps the probability of selecting any observation constant throughout the sampling process because the chosen item is returned to the population before the next draw.
What is the percentage of samples that get selected (on average) with Sampling with Replacement?
~63% (more precisely, 1 - 1/e ≈ 63.2% of the distinct observations appear, on average, in a bootstrap sample the same size as the original data set)
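The ~63% figure can be checked with a quick NumPy simulation; the closed form is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # size of the data set and of each bootstrap sample

# Draw one bootstrap sample and count how many distinct observations it contains.
sample = rng.choice(n, size=n, replace=True)
fraction_selected = np.unique(sample).size / n

print(fraction_selected)     # ~0.632
print(1 - (1 - 1 / n) ** n)  # closed form, also ~0.632
print(1 - np.exp(-1))        # limit as n -> infinity: 0.632...
```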