Chapter 6: Decision Trees Flashcards

Flashcards in this deck (32):
1

Decision Trees 

  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, e.g., Color=red.
  • A leaf node represents a class label or class label distribution.
  • At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
  • A new case is classified by following the matching path from the root down to a leaf node (a small traversal sketch follows below)
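
A minimal sketch (not part of the deck) of how classification follows a matching path from the root to a leaf; the tree, attribute names and values below are illustrative assumptions based on the weather example used in later cards.

```python
# A tiny hand-built tree: an inner node is {"attribute": ..., "branches": {value: subtree}},
# a leaf is simply a class label (illustrative attributes and values).
tree = {
    "attribute": "Outlook",
    "branches": {
        "sunny":    {"attribute": "Humidity", "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "Windy", "branches": {"true": "no", "false": "yes"}},
    },
}

def classify(tree, case):
    """Follow the branch matching the case's attribute value at each node until a leaf is reached."""
    node = tree
    while isinstance(node, dict):          # inner node: test an attribute
        value = case[node["attribute"]]    # outcome of the test, e.g. Outlook=sunny
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf: class label

print(classify(tree, {"Outlook": "sunny", "Humidity": "normal", "Windy": "false"}))  # -> yes
```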

2

Building Decision Tree 

  • Top-down tree construction (a recursive algorithm; grown to full depth it can produce a huge tree, and an overfitted model does not generalize well)

    • At start, all training examples are at the root.

    • Partition the examples recursively by choosing one attribute each time.

  • Bottom-up tree pruning

    • Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases 

3

Choosing the Splitting Attribute 

  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
  • A “goodness” function is used for this purpose. Typical goodness functions:
    • information gain (ID3/C4.5)
    • information gain ratio
    • gini index (CART)
  • Perfect split criterion: splits the data into two sides, Yes and No ⇒ clean subnodes

4

A Criterion for Attribute Selection 

  • Which is the best attribute?
    • The one which will result in the smallest tree
    • Heuristic: choose the attribute that produces the “purest” nodes
  • Popular impurity criterion: information gain
    • Information gain increases with the average purity of the subsets that an attribute produces
  • Strategy: choose attribute that results in greatest information gain 

5

Computing Information 

  • Information is measured in bits
    • Given a probability distribution, the information required to predict an event, e.g. whether play is yes or no, is the distribution’s entropy
    • Entropy gives the information required in bits (this can involve fractions of a bit!)
  • Formula for the information entropy of a distribution (p1, ..., pn):
    • entropy(p1, ..., pn) = −p1·log2(p1) − p2·log2(p2) − ... − pn·log2(pn)
    • Entropy defines how ordered a system is (a small computational sketch follows below)

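A small Python sketch of the entropy formula above; the [9, 5] class counts are an assumption taken from the classic 14-day weather (“play yes/no”) data used in the following cards.

```python
from math import log2

def entropy(counts):
    """Bits needed to predict the class, given class counts: sum of p * log2(1/p)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(p * log2(1 / p) for p in probs)

print(round(entropy([9, 5]), 3))  # info([9,5]) for the 14-day weather data: ~0.940 bits
print(round(entropy([2, 2]), 3))  # maximal impurity for two equally likely classes: 1.0 bit
print(round(entropy([4, 0]), 3))  # pure node: 0.0 bits
```
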
6

Example: attribute “Outlook” 

[image]
7

Computing the Information Gain 

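A hedged worked example: assuming the standard 14-day weather data (Outlook splits the days, 9 yes and 5 no overall, into sunny [2 yes, 3 no], overcast [4, 0] and rainy [3, 2]), the gain is the information before the split minus the weighted average information of the subsets.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, subsets):
    """gain = info(parent) - weighted average info of the subsets produced by the split."""
    n = sum(parent_counts)
    weighted_after = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted_after

print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # gain(Outlook) ~ 0.247 bits
```
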
8

Continuing to Split 

[image]
9

The Final Decision Tree 

[image]
10

Wish List for a Purity Measure 

  • Properties we require from a purity measure:
    • When node is pure, measure should be zero (=0)
    • When impurity is maximal (i.e. all classes equally likely), measure should be maximal (e.g., 1 for boolean values)
    • Multistage property: info([2,3,4]) = info([2,7]) + (7/9) · info([3,4]) (verified numerically below)
  •  Entropy is a function that satisfies these properties 

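A quick numerical check (not part of the deck) of the multistage property stated above.

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Multistage property: info([2,3,4]) = info([2,7]) + (7/9) * info([3,4])
lhs = info([2, 3, 4])
rhs = info([2, 7]) + (7 / 9) * info([3, 4])
print(round(lhs, 4), round(rhs, 4))  # both ~1.5305 bits
```
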
11

Entropy 

  • Entropy, in general, describes the randomness of a system
    • Entropy = 0 describes a perfectly ordered system
  • The term was first used in thermodynamics and statistical mechanics; “A Mathematical Theory of Communication” by Claude Shannon (1948) defines entropy for information theory
    • “uncertainty about the contents of a message”

12

Expected Information Gain 

Gain(S, a) is the information gained by splitting S on attribute a, i.e. the reduction in the number of bits needed to classify an instance:

Gain(S, a) = entropy(S) − Σ_v (|S_v| / |S|) · entropy(S_v), where S_v is the subset of S for which attribute a has value v

Problems? 

13

Highly-Branching Attributes 

  • Problematic: attributes with a large number of values (extreme case: an ID code). Subsets are more likely to be pure if there is a large number of values
    • Information gain is biased towards choosing attributes with a large number of values
    • This may result in overfitting (selection of an attribute that is non-optimal for prediction) 

14

Split for ID Code Attribute 

Entropy of the split = 0, since each leaf node is “pure”, containing only one case. Information gain is therefore maximal for the ID code attribute.

15

Gain Ratio 

  • Gain ratio: a modification of the information gain that reduces its bias towards highly-branching attributes ⇒ corrects the ID-code problem
  • Gain ratio takes number and size of branches into account when choosing an attribute
  • It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to) 

16

Computing the Gain Ratio 

Example: intrinsic information for ID code

intrinsic_info([1,1,...,1]) = 14 × (−1/14 × log2(1/14)) = 3.807 bits

The importance of an attribute decreases as its intrinsic information grows, since gain_ratio(A) = gain(A) / intrinsic_info(A) (a small sketch follows below).

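A small sketch, assuming the same 14-day weather data and the standard C4.5 definition gain_ratio(A) = gain(A) / intrinsic_info(A); it reproduces the 3.807-bit figure above and the ID-code gain ratio quoted on the next card.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Correct the gain by the intrinsic information of the split itself."""
    intrinsic = entropy(branch_sizes)  # bits needed to tell which branch an instance goes to
    return gain / intrinsic

# ID code: 14 singleton branches; its gain equals info([9,5]) ~ 0.940 bits
print(round(entropy([1] * 14), 3))                      # intrinsic info ~ 3.807 bits
print(round(gain_ratio(entropy([9, 5]), [1] * 14), 3))  # ~0.247 (the deck quotes 0.246)

# Outlook: gain ~ 0.247 bits, branch sizes 5 / 4 / 5
print(round(gain_ratio(0.247, [5, 4, 5]), 3))           # ~0.157
```
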
17

More on the Gain Ratio 

  • “Outlook” still comes out on top among the real attributes
  • However:
    • “ID code” still has a greater gain ratio (0.246)
    • Standard fix: ad hoc test to prevent splitting on that type of attribute
  • Problem with gain ratio: it may overcompensate
    • May choose an attribute just because its intrinsic information is very low
    • Standard fix:
      • First, only consider attributes with greater than average information gain
      • Then, compare them on gain ratio 

18

The Splitting Criterion in CART 

  • Classification and Regression Tree (CART)
  • Developed 1974-1984 by 4 statistics professors
    • Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
  • Gini Index is used as a splitting criterion
  • Both C4.5 and CART are robust tools
  • No method is always superior – experiment! 

19

Gini Index for 2 Attribute Values 

  • For example, two classes, Pos(itive) and Neg(ative), and dataset S with p Pos-elements and n Neg-elements. 

    • fp = p / (p+n)

    • fn = n / (p+n)

    • Gini(S) = 1 − fp² − fn²

  • If dataset S is split into S1 and S2, then Gini_split(S1, S2) = (p1+n1)/(p+n) · Gini(S1) + (p2+n2)/(p+n) · Gini(S2) (a small sketch follows below)
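
A minimal sketch of the two formulas above; the counts in the last line are illustrative.

```python
def gini(pos, neg):
    """Gini(S) = 1 - fp^2 - fn^2 for a two-class dataset with pos/neg counts."""
    total = pos + neg
    fp, fn = pos / total, neg / total
    return 1 - fp ** 2 - fn ** 2

def gini_split(s1, s2):
    """Weighted Gini of a binary split; s1 and s2 are (pos, neg) count pairs."""
    n1, n2 = sum(s1), sum(s2)
    n = n1 + n2
    return n1 / n * gini(*s1) + n2 / n * gini(*s2)

print(round(gini(9, 5), 3))                  # Gini of a 9-vs-5 dataset: ~0.459
print(round(gini_split((2, 3), (7, 2)), 3))  # an example split of it: ~0.394
```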

20

Gini Index for 2 Attribute Values Example

[image]
21

Gini Index 2 Attributes Example 2

Select the split that decreases the Gini index the most. This search is done over all possible split points and all possible variables to split on (a threshold-search sketch follows below).

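A minimal sketch (illustrative values and labels, one numeric variable, two classes) of trying every possible split point and keeping the one with the lowest weighted Gini index.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_gini_split(values, labels):
    """Try every threshold between adjacent sorted values; return (threshold, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

print(best_gini_split([10, 20, 30, 40, 50, 60], ["no", "no", "no", "yes", "yes", "yes"]))
# -> (35.0, 0.0): splitting at 35 produces two pure subnodes
```
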
22

Generate_DT(samples, attribute_list) 

  • Create a new node N;
  •  If samples are all of class C then label N with C and exit;
  • If attribute_list is empty then label N with majority_class(samples) and exit;
  • Select best_split from attribute_list;
  • For each value v of attribute best_split:
    • Let S_v = set of samples with best_split=v ;
    • Let N_v = Generate_DT(S_v, attribute_list \ best_split) ;
    • Create a branch from N to N_v labeled with the test best_split=v (a runnable version of this pseudocode follows below)
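
A runnable sketch of the Generate_DT pseudocode above; it assumes nominal attributes, information gain as the best_split criterion, and samples stored as dicts with a "class" key (all illustrative choices).

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr):
    labels = [s["class"] for s in samples]
    after = 0.0
    for v in set(s[attr] for s in samples):
        subset = [s["class"] for s in samples if s[attr] == v]
        after += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - after

def generate_dt(samples, attribute_list):
    labels = [s["class"] for s in samples]
    if len(set(labels)) == 1:                        # all samples of one class -> leaf
        return labels[0]
    if not attribute_list:                           # attribute list empty -> majority class leaf
        return Counter(labels).most_common(1)[0][0]
    best_split = max(attribute_list, key=lambda a: info_gain(samples, a))
    node = {"attribute": best_split, "branches": {}}
    remaining = [a for a in attribute_list if a != best_split]
    for v in set(s[best_split] for s in samples):    # one branch per value of best_split
        s_v = [s for s in samples if s[best_split] == v]
        node["branches"][v] = generate_dt(s_v, remaining)
    return node

# Tiny illustrative dataset
data = [
    {"Outlook": "sunny", "Windy": "false", "class": "no"},
    {"Outlook": "sunny", "Windy": "true", "class": "no"},
    {"Outlook": "overcast", "Windy": "false", "class": "yes"},
    {"Outlook": "rainy", "Windy": "false", "class": "yes"},
    {"Outlook": "rainy", "Windy": "true", "class": "no"},
]
print(generate_dt(data, ["Outlook", "Windy"]))
```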

23

Time Complexity of Basic Algorithm 

  • Let m be the number of attributes

  • Let n be the number of instances

  • Assumption: Depth of tree is O(log n)

    • Usual assumption if the tree is not degenerate

  • For each level of the tree, all n instances are considered (testing best_split = v for each instance)

    • O(n log n) work for a single attribute over the entire tree

  • Total cost is O(mn log n) since all attributes are eventually considered.

    • Without pruning (see next class) 

24

Scalability of DT Algorithms 

  • Need to design for large amounts of data. Two things to worry about:
  • Large number of attributes
    • Leads to a large tree
    • Takes a long time
  • Large amounts of data
    • Can the data be kept in memory?
    • Some new algorithms do not require all the data to be memory resident 

25

Relational Rules (Shapes Problem)

If width > height then lying
If height > width then standing

  • Rules comparing attributes to constants are called propositional rules (propositional data mining)

  • Relational rules are more expressive in some cases (new column: height / width)

    • define relations between attributes (relational data mining)

    • most DM techniques do not consider relational rules

  • As a workaround for some cases, one can introduce additional attributes describing whether width > height (see the sketch after this list)

    • allows using conventional “propositional” learners 
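
A minimal sketch of that workaround, assuming pandas is available; the column names follow the shapes example above.

```python
import pandas as pd

# Shapes data: width and height of each object plus its class (illustrative values)
df = pd.DataFrame({
    "width":  [4.0, 1.5, 6.0, 2.0],
    "height": [2.0, 5.0, 1.0, 7.0],
    "class":  ["lying", "standing", "lying", "standing"],
})

# Extra propositional attribute encoding the relational rule width > height,
# so a conventional "propositional" learner can use it as an ordinary column
df["width_gt_height"] = df["width"] > df["height"]
print(df)
```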

26

Propositional Logic 

  • Essentially, decision trees can represent any function in propositional logic
    • A, B, C: propositional variables
    • and, or, not, => (implies), <=> (equivalent): connectives
  • A proposition is a statement that is either true or false, e.g. “The sky is blue.” (color of sky = blue)
  • Decision trees are an example of a propositional learner. 

27

C4.5 – An Industrial-Strength Algorithm 

  • For an algorithm to be useful in a wide range of real-world applications it must:
    • Permit numeric attributes
    • Allow missing values
    • Be robust in the presence of noise
  • Basic algorithm needs to be extended to fulfill these requirements 

28

Numeric Attributes 

  • Unlike a nominal attribute, a numeric attribute has many possible split points

    • E.g. temp < 45

    • Standard method: binary splits

  • Solution is straightforward extension:

    • Evaluate info gain (or other measure) for every possible split point of attribute

    • Choose “best” split point

    • The info gain of the best split point is used as the info gain of the attribute (a split-point search sketch follows below)

  • Numerical attributes can be used several times in a decision tree, nominal attributes only once 
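
A small sketch (illustrative temperatures and play labels) of evaluating the information gain of every candidate split point of a numeric attribute and picking the best one.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain_at(threshold, values, labels):
    """Info gain of the binary split value < threshold vs. value >= threshold."""
    classes = set(labels)
    left = [label for v, label in zip(values, labels) if v < threshold]
    right = [label for v, label in zip(values, labels) if v >= threshold]
    counts = lambda subset: [subset.count(c) for c in classes]
    n = len(labels)
    after = len(left) / n * entropy(counts(left)) + len(right) / n * entropy(counts(right))
    return entropy(counts(labels)) - after

temps = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
# Candidate split points lie between adjacent sorted values
for t in (44, 54, 66, 76, 85):
    print(t, round(info_gain_at(t, temps, play), 3))  # threshold 54 has the highest gain (~0.459)
```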

29

Binary Splits on Numeric Attributes 

  • Splitting (multi-way) on a nominal attribute exhausts all information in that attribute

    • Nominal attribute is tested (at most) once on any path in the tree

  • Not so for binary splits on numeric attributes!

    • Numeric attributes may be tested several times along a path in the tree

  • Disadvantage: tree is hard to read

  • Remedy:

    • pre-discretize numeric attributes, or

    • allow multi-way splits instead of binary ones using the information gain criterion

30

Handling Missing Values / Training 

  • Ignore instances with missing values
    • Pretty harsh, and missing value might not be important
  • Ignore attributes with missing values
    • Again, may not be feasible
  • Treat missing value as another nominal value
    • Fine if missing a value has a significant meaning
  • Estimate missing values
    • Data imputation: regression, nearest neighbor, mean, mode, etc. (a small imputation sketch follows below)
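
A minimal imputation sketch, assuming scikit-learn's SimpleImputer is available, showing mean imputation on a single numeric column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric attribute with missing values encoded as np.nan
X = np.array([[64.0], [np.nan], [72.0], [np.nan], [80.0]])

# strategy could also be "median" or "most_frequent" (mode)
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # missing entries replaced by the column mean (72.0)
```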

31

Overfitting 

  • Two sources of abnormalities
    • Noise (randomness)
    • Outliers (measurement errors)
  • Chasing every abnormality causes overfitting
    • Decision tree gets too large and complex
    • Good accuracy on training set, poor accuracy on test set
    • Does not generalize to new data any more
  • Solution: prune the tree (a pruning sketch follows below)
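
A hedged sketch, assuming scikit-learn is available: its DecisionTreeClassifier uses cost-complexity pruning (ccp_alpha) rather than the bottom-up accuracy-based pruning described above, but it illustrates how pruning shrinks the tree and affects test accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree versus a pruned one
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_train, y_train)

for name, model in [("full", full), ("pruned", pruned)]:
    print(name,
          "leaves:", model.get_n_leaves(),
          "train acc:", round(model.score(X_train, y_train), 3),
          "test acc:", round(model.score(X_test, y_test), 3))
```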

32

Decision Trees - Summary 

  • Decision trees are a classification technique

  • The output of decision trees can be used for descriptive as well as predictive purposes

  • They can represent any function representable with propositional logic

  • Heuristics such as information gain are used to select relevant attributes

  • Pruning is used to avoid overfitting