Decision Trees Flashcards
(15 cards)
Root Node
Decision Tree
Topmost node; it represents the entire population or sample and is split on the best predictor into two or more homogeneous sets.
Splitting
Decision Tree
The process of dividing a node into two or more sub-nodes.
Decision node
Decision Tree
A sub-node that splits further into subsequent sub-nodes (child nodes).
Leaf node
Decision Tree
A node that does not have any children; it carries the final prediction.
Pruning
Decision Tree
Trimming nodes to reduce the number of nodes and the size of the tree, typically to combat overfitting.
Branch
Decision Tree
sub-section of the tree
Parent and child nodes
Decision Tree
A node that is divided into sub-nodes is known as the parent;
the sub-nodes are its children.
What is the decision tree algorithm?
1) Start with an empty tree
2) Select the best attribute to split on at the current node (e.g., the one with the highest information gain)
3) Recurse on each resulting sub-node, stopping when a node is pure, no attributes remain, or another stopping criterion is met
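The steps above can be sketched as a greedy recursion. This is a minimal illustration, not a full implementation: it assumes categorical attributes stored in dicts, uses entropy to pick the split, and all function names (`entropy`, `build_tree`) are hypothetical.

```python
from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # 1) Stop (make a leaf) when the node is pure or no attributes remain
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    # 2) Greedily pick the attribute whose split minimizes weighted child entropy
    def weighted_entropy(attr):
        total = 0.0
        for v in set(r[attr] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[attr] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total

    best = min(attributes, key=weighted_entropy)

    # 3) Recurse on each sub-node, with the used attribute removed
    branches = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        branches[v] = build_tree(sub_rows, sub_labels,
                                 [a for a in attributes if a != best])
    return {best: branches}
```

The returned tree is a nested dict: internal nodes are `{attribute: {value: subtree}}`, leaves are class labels.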
Entropy
A measure of disorder or impurity.
Entropy = -p1*log2(p1) - p0*log2(p0), where p1 and p0 are the proportions of each class (binary case)
Information Gain
Tells us how much information an attribute gives us about the class.
IG = entropy(parent) - weighted average of the children's entropies
Weighted average = (# samples in left child / # samples in parent) * entropy(left child) + (# samples in right child / # samples in parent) * entropy(right child)
Usually want to select the attribute with the highest information gain
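The IG formula on this card can be sketched for a binary split with binary class counts. All names and the count-based parameterization are illustrative assumptions.

```python
import math

def entropy(pos, total):
    # Entropy of a node holding `pos` positives out of `total` samples
    out = 0.0
    for c in (pos, total - pos):
        if c:
            p = c / total
            out -= p * math.log2(p)
    return out

def information_gain(parent_pos, parent_total, left_pos, left_total):
    # IG = entropy(parent) - weighted average entropy of the two children
    right_pos = parent_pos - left_pos
    right_total = parent_total - left_total
    weighted = (left_total / parent_total) * entropy(left_pos, left_total) \
             + (right_total / parent_total) * entropy(right_pos, right_total)
    return entropy(parent_pos, parent_total) - weighted
```

A split that produces two pure children recovers the full parent entropy (IG = 1 bit for a 50/50 parent); a split that leaves both children at the parent's mix gains nothing.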
Can a decision tree be too large?
Yes, a big tree can affect computational efficiency and can lead to overfitting.
Describe the pruning process for a decision tree.
1) Use a validation set to measure the effect of post-pruning: remove a subtree when doing so does not hurt validation accuracy
2) Use statistical tests to estimate whether a split is likely to improve performance beyond the training set
3) Minimum Description Length principle: measure the combined cost of encoding the decision tree and its training-set errors, and stop once that encoding is minimized
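Step 1 (validation-set pruning, often called reduced-error pruning) can be sketched as follows. This is a simplified illustration with hypothetical names: a tree is either a class label (leaf) or a nested dict `{attribute: {value: subtree}}`, and a subtree is collapsed to the majority validation label whenever that does not reduce validation accuracy.

```python
from collections import Counter

def predict(tree, row, default):
    # Walk the nested dict until a leaf (non-dict) is reached
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(row.get(attr), default)
    return tree

def accuracy(tree, rows, labels, default):
    hits = sum(predict(tree, r, default) == l for r, l in zip(rows, labels))
    return hits / len(labels)

def prune(tree, rows, labels, default):
    # Bottom-up reduced-error pruning on a validation set
    if not isinstance(tree, dict) or not rows:
        return tree
    attr, branches = next(iter(tree.items()))
    new_branches = {}
    for v, sub in branches.items():
        # Route only the validation rows that reach this branch
        idx = [i for i, r in enumerate(rows) if r.get(attr) == v]
        new_branches[v] = prune(sub, [rows[i] for i in idx],
                                [labels[i] for i in idx], default)
    pruned = {attr: new_branches}
    # Candidate: replace this whole subtree with the majority validation label
    leaf = Counter(labels).most_common(1)[0][0]
    if accuracy(leaf, rows, labels, default) >= accuracy(pruned, rows, labels, default):
        return leaf
    return pruned
```

A subtree whose branches all agree collapses to a single leaf; a subtree the validation set still needs is kept.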
Explain how to create a random forest.
Decision trees work great on the data they were trained on, but often generalize poorly to new data; a random forest combines many trees to reduce this overfitting.
Step 1: Create a bootstrap dataset by randomly selecting samples from the original dataset with replacement (the same sample can appear more than once)
Step 2: Create a decision tree from the bootstrap dataset, but at each split consider only a random subset of the variables
Step 3: Build the tree as usual
Step 4: Go back to step 1 and repeat hundreds of times
When running new data through the model, pass it through all the trees and take the majority vote.
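The bootstrap/random-subset/majority-vote loop above can be sketched in miniature. To stay short, each "tree" here is a one-split decision stump rather than a full tree; everything else (names, the stump learner, parameter defaults) is an illustrative assumption, not a standard API.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_subset):
    # Tiny stand-in for "build a tree": pick the single (feature, threshold)
    # split with the fewest misclassifications, searching only the given
    # random subset of features (Step 2)
    if len(set(y)) == 1:
        c = y[0]
        return lambda x: c
    best = None
    for f in feature_subset:
        for t in sorted(set(x[f] for x in X)):
            left = [yi for x, yi in zip(X, y) if x[f] <= t]
            right = [yi for x, yi in zip(X, y) if x[f] > t]
            if not left or not right:
                continue
            err = (len(left) - Counter(left).most_common(1)[0][1]
                   + len(right) - Counter(right).most_common(1)[0][1])
            if best is None or err < best[0]:
                best = (err, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split in this feature subset
        c = Counter(y).most_common(1)[0][0]
        return lambda x: c
    _, f, t, l, r = best
    return lambda x: l if x[f] <= t else r

def fit_forest(X, y, n_trees=100, n_features=1, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample — draw with replacement
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        bx, by = [X[i] for i in idx], [y[i] for i in idx]
        # Step 2: random feature subset for this tree
        feats = rng.sample(range(len(X[0])), n_features)
        trees.append(fit_stump(bx, by, feats))  # Step 3; loop = Step 4
    return trees

def predict_forest(trees, x):
    # Run x through every tree and take the majority vote
    return Counter(t(x) for t in trees).most_common(1)[0][0]
```

In practice a library such as scikit-learn's `RandomForestClassifier` handles all of this; the sketch only mirrors the four steps on the card.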
Gini Impurity
Measures how mixed the classes at a node are: Gini = 1 - sum(p_i^2) over the class proportions p_i.
Example:
Suppose we have a dataset split as follows:
10 samples total
4 are Class A
6 are Class B
So:
p_A = 4/10 = 0.4
p_B = 6/10 = 0.6
Now plug into the formula:
Gini = 1 - (0.4^2 + 0.6^2) = 1 - (0.16 + 0.36)
     = 1 - 0.52 = 0.48
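The worked example translates directly to code; the function name and count-based input are illustrative.

```python
def gini_impurity(counts):
    # Gini = 1 - sum(p_i^2), where p_i is each class's share of the node
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
```

With the card's counts of 4 Class A and 6 Class B, this returns 0.48; a pure node returns 0.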
What are the advantages and disadvantages of random forest?
Adv
1) Versatile algorithm
2) Predictions are more accurate than those of a single decision tree
Disadv
1) Computationally slow, since many trees must be built and evaluated
2) As a predictive modelling tool, it does not describe or explain the data