Topic 19 Flashcards
(8 cards)
Decision Trees Bias and Variance
High bias: the model is too simple. High variance: the model fits the training data too well and performs poorly on test data.
Pre Pruning
Stops the tree from growing once it reaches a certain condition
Post Pruning
Removes less significant branches once the tree is fully grown
Purpose of Pruning Methods
Prevents overfitting (reduces complexity), Improves efficiency, enhances interpretability (by reducing unnecessary splits)
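A minimal sketch of both pruning styles, assuming scikit-learn and the iris dataset purely for illustration (max_depth, min_samples_leaf, and ccp_alpha are scikit-learn parameter names, not notation from the cards):

```python
# Hypothetical sketch: pre-pruning vs. post-pruning with scikit-learn decision trees.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing once a condition is reached (here: depth and leaf size).
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the tree fully, then prune back less significant branches
# via cost-complexity pruning (ccp_alpha > 0 removes weak splits after fitting).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

print("pre-pruned accuracy: ", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```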
Dealing With Continuous Predictors
Sort the training instances by the predictor's value and create a candidate split midway between each adjacent pair of instances where the classification changes; choose the candidate boundary with the maximum information gain.
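A small sketch of this idea, assuming a single numeric predictor and Shannon entropy (the function names and the example values are made up for illustration):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    """Pick the midpoint split with maximum information gain,
    considering only boundaries where the class label changes."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    base = entropy(y)
    best_gain, best_t = -1.0, None
    for i in range(len(v) - 1):
        if y[i] == y[i + 1]:            # skip boundaries with no class change
            continue
        t = (v[i] + v[i + 1]) / 2.0     # candidate split midway between adjacent instances
        left, right = y[v <= t], y[v > t]
        expected = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = base - expected
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Example: a continuous predictor whose class label changes a few times
print(best_threshold([64, 65, 68, 69, 70, 71, 72, 75],
                     ["no", "no", "yes", "yes", "yes", "no", "no", "yes"]))
```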
Decision Trees Bagging
Obtain N training datasets from our population, and fit a decision tree on each training dataset separately (some of the fitted trees might be quite different).
Then for a new test item, we run it through each of our N decision trees, and record the predicted class labels.
Each decision tree "votes" on what it thinks the class label should be. We pick the class label that gets the most votes across all the trees.
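A minimal sketch of bagging by hand, assuming scikit-learn trees and the iris dataset just for illustration (N and the variable names are arbitrary choices):

```python
# Hypothetical sketch: fit N trees on N bootstrap samples, then majority-vote per test item.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
N = 25
trees = []
for _ in range(N):
    # bootstrap sample: draw len(X_train) items with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# each tree votes; the label with the most votes across the N trees wins
votes = np.array([t.predict(X_test) for t in trees])            # shape (N, n_test)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("bagged accuracy:", np.mean(majority == y_test))
```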
Estimating Accuracy Using Bootstrap Sampling
Because not all of the training items are used in each bag (we sample with replacement), we can evaluate the accuracy of each bagged model using the out-of-bag training items. For each training item di, select all tree models B~di that were not trained using di, and take the majority vote for di across the models in B~di. Do this for every di to estimate the total accuracy across all the training items.
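A sketch of the out-of-bag estimate under the same assumptions as the bagging sketch above (scikit-learn trees and the iris dataset are stand-ins; in_bag and the loop structure are illustrative):

```python
# Hypothetical sketch: for each training item di, majority-vote only over the
# trees (B~di) whose bootstrap sample did not contain di.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
N, n = 25, len(X)

trees, in_bag = [], []
for _ in range(N):
    idx = rng.integers(0, n, size=n)                 # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    in_bag.append(np.zeros(n, dtype=bool))
    in_bag[-1][idx] = True

correct, counted = 0, 0
for i in range(n):
    # the models that never saw item i during training
    oob_trees = [t for t, bag in zip(trees, in_bag) if not bag[i]]
    if not oob_trees:
        continue                                     # item i appeared in every bag
    votes = [t.predict(X[i:i + 1])[0] for t in oob_trees]
    counted += 1
    correct += (np.bincount(votes).argmax() == y[i])

print("OOB accuracy estimate:", correct / counted)
```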
Random Forest Algorithm
* For each feature in a random subset of the features, perform a decision split using that feature and calculate the resulting expected entropy using the current training examples
* Pick the feature, Fbest, that gives the maximum information gain for that subset of features
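A small sketch of a single such split, assuming binary categorical features and synthetic data made up for illustration (entropy, n_subset, and the data-generating step are my own choices, not from the cards):

```python
# Hypothetical sketch: consider only a random subset of the features and pick
# the one, Fbest, that gives the maximum information gain.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_feature_from_subset(X, y, n_subset, rng):
    """Return Fbest among a random subset of n_subset features."""
    base = entropy(y)
    subset = rng.choice(X.shape[1], size=n_subset, replace=False)
    best_gain, best_f = -1.0, None
    for f in subset:
        # expected entropy after splitting the current training examples on feature f
        expected = sum(
            np.mean(X[:, f] == v) * entropy(y[X[:, f] == v])
            for v in np.unique(X[:, f])
        )
        gain = base - expected
        if gain > best_gain:
            best_gain, best_f = gain, f
    return best_f, best_gain

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 8))                      # 100 items, 8 binary features
y = np.where(rng.random(100) < 0.9, X[:, 3], 1 - X[:, 3])  # label mostly follows feature 3
print(best_feature_from_subset(X, y, n_subset=4, rng=rng))
```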