Decision Trees
 An internal node is a test on an attribute.
 A branch represents an outcome of the test, e.g., Color=red.
 A leaf node represents a class label or class label distribution.
 At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
 A new case is classified by following a matching path to a leaf node.
Building a Decision Tree

Top-down tree construction (recursive algorithm ⇒ can end up with a huge tree; if the model is overfitted, it does not generalize well)

At start, all training examples are at the root.

Partition the examples recursively by choosing one attribute each time.

Bottom-up tree pruning

Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
Choosing the Splitting Attribute
 At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
 A “goodness” function is used for this purpose. Typical goodness functions:
 information gain (ID3/C4.5)
 information gain ratio
 gini index (CART)

Perfect split criterion: splits the data into two sides, Yes and No ⇒ clean subnodes
A Criterion for Attribute Selection
 Which is the best attribute?
 The one which will result in the smallest tree
 Heuristic: choose the attribute that produces the “purest” nodes
 Popular impurity criterion: information gain
 Information gain increases with the average purity of the subsets that an attribute produces
 Strategy: choose attribute that results in greatest information gain
Computing Information
 Information is measured in bits
 Given a probability distribution, the info required to predict an event, i.e. if play is yes or no, is the distribution’s entropy
 Entropy gives the information required in bits (this can involve fractions of bits!)
 Formula for computing the information entropy: entropy(p₁, …, pₙ) = −p₁ log₂ p₁ − p₂ log₂ p₂ − … − pₙ log₂ pₙ
 Defines how ordered a system is
Example: attribute “Outlook”
Computing the Information Gain
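To make the computation concrete, here is a minimal Python sketch, assuming the standard 14-instance weather dataset (9 yes / 5 no for play), that computes the entropy of the class distribution and the information gain of splitting on "Outlook":

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count distribution, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Class counts before splitting: 9 "yes" and 5 "no" instances.
info_all = entropy([9, 5])                                   # ~0.940 bits

# "Outlook" partitions the 14 instances into sunny [2 yes, 3 no],
# overcast [4 yes, 0 no] and rainy [3 yes, 2 no].
subsets = [[2, 3], [4, 0], [3, 2]]
info_after = sum(sum(s) / 14 * entropy(s) for s in subsets)  # ~0.693 bits

gain_outlook = info_all - info_after                         # ~0.247 bits
print(f"Gain(Outlook) = {gain_outlook:.3f} bits")
```

The resulting Gain(Outlook) ≈ 0.247 bits is the largest among the weather attributes, which is why "Outlook" is chosen for the first split.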
Continuing to Split
The Final Decision Tree
Wish List for a Purity Measure
 Properties we require from a purity measure:
 When node is pure, measure should be zero (=0)
 When impurity is maximal (i.e. all classes equally likely), measure should be maximal (e.g., 1 for boolean values)
 Multistage property: info([2,3,4]) = info([2,7]) + (7/9) × info([3,4])
 Entropy is a function that satisfies these properties
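A quick numeric check of the multistage property above (values in bits, rounded to three decimals):

```latex
\mathrm{info}([2,3,4]) \approx 1.530, \qquad
\mathrm{info}([2,7]) + \tfrac{7}{9}\,\mathrm{info}([3,4]) \approx 0.764 + \tfrac{7}{9}\cdot 0.985 \approx 1.530
```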
Entropy
 Entropy, in general, describes the randomness of a system
 Entropy = 0 describes a perfectly ordered system
 The term was originally used in thermodynamics and statistical mechanics; "A Mathematical Theory of Communication" by Claude Shannon (1948) defines entropy and information theory
 "uncertainty about the contents of a message"
Expected Information Gain
Gain(S, a) is the information gained by adding a subtree that splits on attribute a (the reduction in the number of bits needed to classify an instance)
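Written out in the usual notation (where Values(a) is the set of values of attribute a and S_v is the subset of S for which a = v):

```latex
\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(a)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)
```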
Problems?
Highly-Branching Attributes
 Problematic: attributes with a large number of values (extreme case: ID code)
 Subsets are more likely to be pure if there is a large number of values
 Information gain is biased towards choosing attributes with a large number of values
 This may result in overfitting (selection of an attribute that is nonoptimal for prediction)
Split for ID Code Attribute
Entropy of the split = 0, since each leaf node is "pure", containing only one case. Information gain is therefore maximal for the ID code attribute.
Gain Ratio
 Gain ratio: a modification of the information gain that reduces its bias towards high-branching attributes ⇒ corrects the ID-code problem
 Gain ratio takes number and size of branches into account when choosing an attribute
 It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to)
Computing the Gain Ratio
Example: intrinsic information for ID code
intrinsic_info([1,1,…,1]) = 14 × (−1/14 × log₂(1/14)) = 3.807 bits. The importance of an attribute decreases as its intrinsic information grows.
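Putting this together, C4.5's gain ratio divides the information gain by the intrinsic information of the split; for the ID code attribute on the 14-instance weather data this gives 0.940 / 3.807 ≈ 0.246, the value quoted below:

```latex
\mathrm{IntrinsicInfo}(S, a) = -\sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|},
\qquad
\mathrm{GainRatio}(S, a) = \frac{\mathrm{Gain}(S, a)}{\mathrm{IntrinsicInfo}(S, a)}
```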
More on the Gain Ratio
 "Outlook" still comes out top
 However:
 "ID code" still has a greater gain ratio (0.246)
 Standard fix: ad hoc test to prevent splitting on that type of attribute
 Problem with gain ratio: it may overcompensate
 May choose an attribute just because its intrinsic information is very low
 Standard fix:
 First, only consider attributes with greater than average information gain
 Then, compare them on gain ratio
The Splitting Criterion in CART
 Classification and Regression Tree (CART)
 Developed 1974–1984 by four statistics professors
 Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
 Gini Index is used as a splitting criterion
 Both C4.5 and CART are robust tools
 No method is always superior – experiment!
Gini Index for 2 Attribute Values

For example: two classes, Pos(itive) and Neg(ative), and a dataset S with p Pos elements and n Neg elements.

fp = p / (p+n)

fn = n / (p+n)

Gini(S) = 1 − fp² − fn²

If dataset S is split into S₁ and S₂, then Gini_split(S₁, S₂) = (p₁+n₁)/(p+n) · Gini(S₁) + (p₂+n₂)/(p+n) · Gini(S₂)
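A minimal Python sketch of these two formulas (the class counts in the example calls are made up for illustration):

```python
def gini(p, n):
    """Gini index of a set with p positive and n negative elements."""
    total = p + n
    if total == 0:
        return 0.0
    fp, fn = p / total, n / total
    return 1.0 - fp**2 - fn**2

def gini_split(p1, n1, p2, n2):
    """Weighted Gini index after splitting S into S1 = (p1, n1) and S2 = (p2, n2)."""
    total = p1 + n1 + p2 + n2
    return (p1 + n1) / total * gini(p1, n1) + (p2 + n2) / total * gini(p2, n2)

# Hypothetical counts: S has 6 Pos and 6 Neg elements.
print(gini(6, 6))              # 0.5, maximal impurity for two classes
print(gini_split(5, 1, 1, 5))  # ~0.278, i.e. the split reduces impurity
```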
Gini Index for 2 Attribute Values Example
Gini Index 2 Attributes Example 2
Select the split that decreases the Gini Index the most. This search is performed over all possible split points and all possible variables to split on.
Generate_DT(samples, attribute_list)
 Create a new node N;
 If samples are all of class C then label N with C and exit;
 If attribute_list is empty then label N with majority_class(samples) and exit;
 Select best_split from attribute_list;
 For each value v of attribute best_split:
 Let S_v = set of samples with best_split=v ;
 Let N_v = Generate_DT(S_v, attribute_list \ best_split) ;
 Create a branch from N to N_v labeled with the test best_split=v;
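Below is a minimal runnable Python sketch of this pseudocode, assuming nominal attributes, samples stored as dictionaries with a "class" key, and information gain as the criterion for best_split (the helper names are illustrative, not from any particular library):

```python
import math
from collections import Counter

class Node:
    def __init__(self, label=None, attribute=None):
        self.label = label          # class label for leaf nodes
        self.attribute = attribute  # attribute tested at internal nodes
        self.branches = {}          # attribute value -> child Node

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def info_gain(samples, attr):
    labels = [s["class"] for s in samples]
    before = entropy(labels)
    after = 0.0
    for v in {s[attr] for s in samples}:
        subset = [s["class"] for s in samples if s[attr] == v]
        after += len(subset) / len(samples) * entropy(subset)
    return before - after

def generate_dt(samples, attribute_list):
    labels = [s["class"] for s in samples]
    if len(set(labels)) == 1:                       # all samples of one class
        return Node(label=labels[0])
    if not attribute_list:                          # no attributes left
        return Node(label=Counter(labels).most_common(1)[0][0])
    best = max(attribute_list, key=lambda a: info_gain(samples, a))
    node = Node(attribute=best)
    for v in {s[best] for s in samples}:            # one branch per observed value
        subset = [s for s in samples if s[best] == v]
        node.branches[v] = generate_dt(subset, [a for a in attribute_list if a != best])
    return node
```

A sample here would look like {"Outlook": "sunny", "Windy": False, "class": "no"}; branches are created only for attribute values seen in the training subset, so unseen values would need extra handling in practice.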
Time Complexity of Basic Algorithm

Let m be the number of attributes

Let n be the number of instances

Assumption: Depth of tree is O(log n)

Usual assumption if the tree is not degenerate

 For each level of the tree, all n instances are considered (testing best_split = v_i)

O(n log n) work for a single attribute over the entire tree

Total cost is O(mn log n) since all attributes are eventually considered.

Without pruning (see next class)
Scalability of DT Algorithms
 Need to design for large amounts of data. Two things to worry about:
 Large number of attributes
 Leads to a large tree
 Takes a long time
 Large amounts of data
 Can the data be kept in memory?
 Some new algorithms do not require all the data to be memory resident
Relational Rules (Shapes Problem)
If width > height then lying
If height > width then standing

Rules comparing attributes to constants are called propositional rules (propositional data mining)

Relational rules are more expressive in some cases (new column: height / width)

They define relations between attributes (relational data mining)

Most DM techniques do not consider relational rules


As a workaround for some cases, one can introduce additional attributes describing whether width > height

This allows using conventional "propositional" learners, as sketched below
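For instance, a small sketch of that workaround with hypothetical shape records (width, height, class):

```python
# Hypothetical shapes data: (width, height, class)
shapes = [(4, 2, "lying"), (2, 5, "standing"), (7, 3, "lying")]

# Add a derived propositional attribute so a conventional learner can use the relation.
augmented = [
    {"width": w, "height": h, "wider_than_tall": w > h, "class": c}
    for (w, h, c) in shapes
]
# A propositional learner can now split on the single boolean attribute
# wider_than_tall instead of needing the relational rule width > height.
```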

Propositional Logic
 Essentially, decision trees can represent any function in propositional logic
 A, B, C: propositional variables
 and, or, not, => (implies), <=> (equivalent): connectives
 A proposition is a statement that is either true or false, e.g., "The sky is blue." (color of sky = blue)
 Decision trees are an example of a propositional learner.
C4.5 – An Industrial-Strength Algorithm
 For an algorithm to be useful in a wide range of real-world applications, it must:
 Permit numeric attributes
 Allow missing values
 Be robust in the presence of noise
 Basic algorithm needs to be extended to fulfill these requirements
Numeric Attributes

Unlike nominal attributes, a numeric attribute has many possible split points

E.g. temp < 45

Standard method: binary splits

Solution is straightforward extension:

Evaluate info gain (or other measure) for every possible split point of attribute

Choose “best” split point

The info gain of the best split point is taken as the info gain for the attribute (see the sketch after this list)

Numerical attributes can be used several times in a decision tree, nominal attributes only once
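A minimal Python sketch of this extension: sort the instances by the numeric attribute, evaluate the information gain at each candidate threshold (midpoints between adjacent distinct values), and keep the best one (the temperature values below are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, gain) of the best binary split 'value < threshold'."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v < thr]
        right = [l for v, l in pairs if v >= thr]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - after > best[1]:
            best = (thr, base - after)
    return best

# Hypothetical example: temperature vs. play
temps = [40, 48, 60, 72, 80, 90]
play  = ["no", "no", "yes", "yes", "yes", "no"]
print(best_numeric_split(temps, play))    # best threshold with its info gain
```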
Binary Splits on Numeric Attributes

Splitting (multiway) on a nominal attribute exhausts all information in that attribute

Nominal attribute is tested (at most) once on any path in the tree

Not so for binary splits on numeric attributes!

Numeric attributes may be tested several times along a path in the tree

Disadvantage: tree is hard to read

Remedy:

pre-discretize numeric attributes (covered in a separate class), or

allow multi-way splits instead of binary ones using the information gain criterion
Handling Missing Values / Training
 Ignore instances with missing values
 Pretty harsh, and missing value might not be important
 Ignore attributes with missing values
 Again, may not be feasible
 Treat missing value as another nominal value
 Fine if missing a value has a significant meaning
 Estimate missing values
 Data imputation: regression, nearest neighbor, mean, mode, etc. (a small sketch follows this list)
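As a small sketch of the imputation option, assuming missing values are marked with None in plain Python lists (mean imputation for a numeric attribute, mode imputation for a nominal one):

```python
from collections import Counter

def impute_mean(values):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_mode(values):
    """Replace None in a nominal column with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(impute_mean([70, None, 80, 90]))                  # [70, 80.0, 80, 90]
print(impute_mode(["sunny", None, "sunny", "rainy"]))   # mode is "sunny"
```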
Overfitting
 Two sources of abnormalities
 Noise (randomness)
 Outliers (measurement errors)
 Chasing every abnormality causes overfitting
 Decision tree gets too large and complex
 Good accuracy on training set, poor accuracy on test set
 Does not generalize to new data any more
 Solution: prune the tree
Decision Trees – Summary

Decision trees are a classification technique

The output of decision trees can be used for descriptive as well as predictive purposes

They can represent any function representable with propositional logic

Heuristics such as information gain are used to select relevant attributes

Pruning is used to avoid overfitting