classification (decision trees) Flashcards
What is the goal of decision trees?
To create a model that predicts the value of a target variable by learning simple decision rules inferred from features
What are decision trees?
A non-parametric supervised learning method used for classification
What are benefits of decision trees?
- don’t need as much data and gives rules that it learned
- no underlying assumptions
- handles multidimensional data
- achieves good accuracy
What is the splitting criterion?
Tells us which attribute to test at node N by determining the “best” way to separate or partition tuples into individual classes
What does it mean for a partition to be pure?
If all tuples belong to the same class
What does information gain do?
Minimizes the info needed to classify tuples in resulting partitions and reflects least randomness or “impurity”
(T/F): Info gain guarantees that a simple tree is found
True
What does Gain(A) tell us?
Tells us how much would be gained by branching on A. It is the difference between original info required and new info required.
How can you determine the “best” split point for A if it is CONTINUOUS-VALUED?
- sort of values of A in increasing order
- consider midpoint as split point ( ( ai + ai + 1 ) / 2 )
(T/F): For info gain, pick the feature and split that MAXIMIZES weighted entropy
False; pick the feature and split that MINIMIZES weighted entropy
What kind of DT is info gain used in?
ID3
What kind of DT is the Gini index used in?
CART
What is Gini Index?
An alternative measure of impurity of class labels among a particular set of tuples
How many possible ways are there to form 2 partitions of the data based on a binary split on A?
2^v - 2 ways
(T/F): We want to MINIMIZE the gini index
True
How do you pick the splitting subset for DISCRETE-VALUED attribute using Gini index?
Subset that gives minimum Gini index for that attribute
How do you pick the splitting subset for CONTINUOUS-VALUED attribute using Gini index?
- Consider each possible split-point
- midpoint bw each pair of sorted adjacent values is taken as a possible split point
What is the range of the Gini index for a BINARY label?
[0, 0.50]
What does maximum purity mean?
Every instance has the same label (in the binary case)
What does maximum impurity mean?
50% of data are one label and 50% are another
Explain the bias of info gain.
Biased towards multivalued attributes
Explain the bias of the gain ratio.
adjusts for bias, but prefers unbalanced splits in which one partition is much smaller than others
Explain the bias of Gini index.
biased towards multivalued attributes and has difficulty when the number of classes is large. It also favors tests that result in equal size partitions and purity in both partitions.
What are the 3 partitioning scenarios?
1) A is discrete-valued
2) A is continuous-valued
3) A is discrete-valued and a binary tree