Chapter 6: Decision Trees Flashcards

Flashcards in this deck (32):
1

Decision Trees 

  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, e.g., Color=red.
  • A leaf node represents a class label or class label distribution.
  • At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
  • A new case is classified by following the matching path from the root down to a leaf node (a small traversal sketch follows below)
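
A minimal sketch (not part of the deck) of how classification follows a matching path from the root to a leaf; the tree, attribute names and values below are illustrative assumptions based on the weather example used in later cards.

```python
# A tiny hand-built tree: an inner node is {"attribute": ..., "branches": {value: subtree}},
# a leaf is simply a class label (illustrative attributes and values).
tree = {
    "attribute": "Outlook",
    "branches": {
        "sunny":    {"attribute": "Humidity", "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "Windy", "branches": {"true": "no", "false": "yes"}},
    },
}

def classify(tree, case):
    """Follow the branch matching the case's attribute value at each node until a leaf is reached."""
    node = tree
    while isinstance(node, dict):          # inner node: test an attribute
        value = case[node["attribute"]]    # outcome of the test, e.g. Outlook=sunny
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf: class label

print(classify(tree, {"Outlook": "sunny", "Humidity": "normal", "Windy": "false"}))  # -> yes
```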

2

Building Decision Tree 

  • Top-down tree construction (a recursive algorithm; grown to full depth it can produce a huge tree, and an overfitted model does not generalize well)

    • At start, all training examples are at the root.

    • Partition the examples recursively by choosing one attribute each time.

  • Bottom-up tree pruning

    • Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases 

3

Choosing the Splitting Attribute 

  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
  • A “goodness” function is used for this purpose. Typical goodness functions:
    • information gain (ID3/C4.5)
    • information gain ratio
    • gini index (CART)
  • Perfect split criterion: splits the data into two sides, Yes and No ⇒ clean subnodes

4

A Criterion for Attribute Selection 

  • Which is the best attribute?
    • The one which will result in the smallest tree
    • Heuristic: choose the attribute that produces the “purest” nodes
  • Popular impurity criterion: information gain
    • Information gain increases with the average purity of the subsets that an attribute produces
  • Strategy: choose attribute that results in greatest information gain 

5

Computing Information 

  • Information is measured in bits
    • Given a probability distribution, the information required to predict an event, e.g. whether play is yes or no, is the distribution’s entropy
    • Entropy gives the information required in bits (this can involve fractions of a bit!)
  • Formula for the information entropy of a distribution (p1, ..., pn):
    • entropy(p1, ..., pn) = −p1·log2(p1) − p2·log2(p2) − ... − pn·log2(pn)
    • Entropy defines how ordered a system is (a small computational sketch follows below)

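A small Python sketch of the entropy formula above; the [9, 5] class counts are an assumption taken from the classic 14-day weather (“play yes/no”) data used in the following cards.

```python
from math import log2

def entropy(counts):
    """Bits needed to predict the class, given class counts: sum of p * log2(1/p)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(p * log2(1 / p) for p in probs)

print(round(entropy([9, 5]), 3))  # info([9,5]) for the 14-day weather data: ~0.940 bits
print(round(entropy([2, 2]), 3))  # maximal impurity for two equally likely classes: 1.0 bit
print(round(entropy([4, 0]), 3))  # pure node: 0.0 bits
```
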
6

Example: attribute “Outlook” 

[image]
7

Computing the Information Gain 

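A hedged worked example: assuming the standard 14-day weather data (Outlook splits the days, 9 yes and 5 no overall, into sunny [2 yes, 3 no], overcast [4, 0] and rainy [3, 2]), the gain is the information before the split minus the weighted average information of the subsets.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, subsets):
    """gain = info(parent) - weighted average info of the subsets produced by the split."""
    n = sum(parent_counts)
    weighted_after = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted_after

print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # gain(Outlook) ~ 0.247 bits
```
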
8

Continuing to Split 

[image]
9

The Final Decision Tree 

[image]
10

Wish List for a Purity Measure 

  • Properties we require from a purity measure:
    • When node is pure, measure should be zero (=0)
    • When impurity is maximal (i.e. all classes equally likely), measure should be maximal (e.g., 1 for boolean values)
    • Multistage property: info([2,3,4]) = info([2,7]) + (7/9) · info([3,4]) (verified numerically below)
  •  Entropy is a function that satisfies these properties 

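A quick numerical check (not part of the deck) of the multistage property stated above.

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Multistage property: info([2,3,4]) = info([2,7]) + (7/9) * info([3,4])
lhs = info([2, 3, 4])
rhs = info([2, 7]) + (7 / 9) * info([3, 4])
print(round(lhs, 4), round(rhs, 4))  # both ~1.5305 bits
```
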
11

Entropy 

  • Entropy, in general, describes the randomness of a system
    • Entropy = 0 describes a perfectly ordered system
  • The term was first used in thermodynamics and statistical mechanics; “A Mathematical Theory of Communication” by Claude Shannon (1948) defines entropy for information theory
    • “uncertainty about the contents of a message”

12

Expected Information Gain 

Gain(S, a) is the information gained by splitting S on attribute a, i.e. the reduction in the number of bits needed to classify an instance:

Gain(S, a) = entropy(S) − Σ_v (|S_v| / |S|) · entropy(S_v), where S_v is the subset of S for which attribute a has value v

Problems? 

13

Highly-Branching Attributes 

  • Problematic: attributes with a large number of values (extreme case: an ID code). Subsets are more likely to be pure if there is a large number of values
    • Information gain is biased towards choosing attributes with a large number of values
    • This may result in overfitting (selection of an attribute that is non-optimal for prediction) 

14

Split for ID Code Attribute 

Entropy of the split = 0, since each leaf node is “pure”, containing only one case. Information gain is therefore maximal for the ID code attribute.

15

Gain Ratio 

  • Gain ratio: a modification of the information gain that reduces its bias towards highly-branching attributes ⇒ corrects the ID-code problem
  • Gain ratio takes number and size of branches into account when choosing an attribute
  • It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to) 

16

Computing the Gain Ratio 

Example: intrinsic information for ID code

intrinsic_info([1,1,...,1]) = 14 × (−1/14 × log2(1/14)) = 3.807 bits

The importance of an attribute decreases as its intrinsic information grows, since gain_ratio(A) = gain(A) / intrinsic_info(A) (a small sketch follows below).

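A small sketch, assuming the same 14-day weather data and the standard C4.5 definition gain_ratio(A) = gain(A) / intrinsic_info(A); it reproduces the 3.807-bit figure above and the ID-code gain ratio quoted on the next card.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Correct the gain by the intrinsic information of the split itself."""
    intrinsic = entropy(branch_sizes)  # bits needed to tell which branch an instance goes to
    return gain / intrinsic

# ID code: 14 singleton branches; its gain equals info([9,5]) ~ 0.940 bits
print(round(entropy([1] * 14), 3))                      # intrinsic info ~ 3.807 bits
print(round(gain_ratio(entropy([9, 5]), [1] * 14), 3))  # ~0.247 (the deck quotes 0.246)

# Outlook: gain ~ 0.247 bits, branch sizes 5 / 4 / 5
print(round(gain_ratio(0.247, [5, 4, 5]), 3))           # ~0.157
```
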
17

More on the Gain Ratio 

  • “Outlook” still comes out on top among the real attributes
  • However:
    • “ID code” still has a greater gain ratio (0.246)
    • Standard fix: ad hoc test to prevent splitting on that type of attribute
  • Problem with gain ratio: it may overcompensate
    • May choose an attribute just because its intrinsic information is very low
    • Standard fix:
      • First, only consider attributes with greater than average information gain
      • Then, compare them on gain ratio 

18

The Splitting Criterion in CART 

  • Classification and Regression Tree (CART)
  • Developed 1974-1984 by 4 statistics professors
    • Leo Breiman (Berkeley), Jerry Friedman (Stanford), Charles Stone (Berkeley), Richard Olshen (Stanford)
  • Gini Index is used as a splitting criterion
  • Both C4.5 and CART are robust tools
  • No method is always superior – experiment! 

19

Gini Index for 2 Attribute Values 

  • For example, two classes, Pos(itive) and Neg(ative), and dataset S with p Pos-elements and n Neg-elements. 

    • fp = p / (p+n)

    • fn = n / (p+n)

    • Gini(S) = 1 − fp² − fn²

  • If dataset S is split into S1 and S2, then Gini_split(S1, S2) = (p1+n1)/(p+n) · Gini(S1) + (p2+n2)/(p+n) · Gini(S2) (a small sketch follows below)
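
A minimal sketch of the two formulas above; the counts in the last line are illustrative.

```python
def gini(pos, neg):
    """Gini(S) = 1 - fp^2 - fn^2 for a two-class dataset with pos/neg counts."""
    total = pos + neg
    fp, fn = pos / total, neg / total
    return 1 - fp ** 2 - fn ** 2

def gini_split(s1, s2):
    """Weighted Gini of a binary split; s1 and s2 are (pos, neg) count pairs."""
    n1, n2 = sum(s1), sum(s2)
    n = n1 + n2
    return n1 / n * gini(*s1) + n2 / n * gini(*s2)

print(round(gini(9, 5), 3))                  # Gini of a 9-vs-5 dataset: ~0.459
print(round(gini_split((2, 3), (7, 2)), 3))  # an example split of it: ~0.394
```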

20

Gini Index for 2 Attribute Values Example

[image]
21

Gini Index 2 Attributes Example 2

Select the split that decreases the Gini index the most. This search is done over all possible split points and all possible variables to split on (a threshold-search sketch follows below).

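A minimal sketch (illustrative values and labels, one numeric variable, two classes) of trying every possible split point and keeping the one with the lowest weighted Gini index.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_gini_split(values, labels):
    """Try every threshold between adjacent sorted values; return (threshold, weighted Gini)."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

print(best_gini_split([10, 20, 30, 40, 50, 60], ["no", "no", "no", "yes", "yes", "yes"]))
# -> (35.0, 0.0): splitting at 35 produces two pure subnodes
```
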
22

Generate_DT(samples, attribute_list) 

  • Create a new node N;
  •  If samples are all of class C then label N with C and exit;
  • If attribute_list is empty then label N with majority_class(samples) and exit;
  • Select best_split from attribute_list;
  • For each value v of attribute best_split:
    • Let S_v = set of samples with best_split=v ;
    • Let N_v = Generate_DT(S_v, attribute_list \ best_split) ;
    • Create a branch from N to N_v labeled with the test best_split=v (a runnable version of this pseudocode follows below)
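
A runnable sketch of the Generate_DT pseudocode above; it assumes nominal attributes, information gain as the best_split criterion, and samples stored as dicts with a "class" key (all illustrative choices).

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr):
    labels = [s["class"] for s in samples]
    after = 0.0
    for v in set(s[attr] for s in samples):
        subset = [s["class"] for s in samples if s[attr] == v]
        after += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - after

def generate_dt(samples, attribute_list):
    labels = [s["class"] for s in samples]
    if len(set(labels)) == 1:                        # all samples of one class -> leaf
        return labels[0]
    if not attribute_list:                           # attribute list empty -> majority class leaf
        return Counter(labels).most_common(1)[0][0]
    best_split = max(attribute_list, key=lambda a: info_gain(samples, a))
    node = {"attribute": best_split, "branches": {}}
    remaining = [a for a in attribute_list if a != best_split]
    for v in set(s[best_split] for s in samples):    # one branch per value of best_split
        s_v = [s for s in samples if s[best_split] == v]
        node["branches"][v] = generate_dt(s_v, remaining)
    return node

# Tiny illustrative dataset
data = [
    {"Outlook": "sunny", "Windy": "false", "class": "no"},
    {"Outlook": "sunny", "Windy": "true", "class": "no"},
    {"Outlook": "overcast", "Windy": "false", "class": "yes"},
    {"Outlook": "rainy", "Windy": "false", "class": "yes"},
    {"Outlook": "rainy", "Windy": "true", "class": "no"},
]
print(generate_dt(data, ["Outlook", "Windy"]))
```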

23

Time Complexity of Basic Algorithm 

  • Let m be the number of attributes

  • Let n be the number of instances

  • Assumption: Depth of tree is O(log n)

    • Usual assumption if the tree is not degenerate

  • For each level of the tree, all n instances are considered (testing best_split = v for each instance)

    • O(n log n) work for a single attribute over the entire tree

  • Total cost is O(mn log n) since all attributes are eventually considered.

    • Without pruning (see next class) 

24

Scalability of DT Algorithms 

  • Need to design for large amounts of data. Two things to worry about:
  • Large number of attributes
    • Leads to a large tree
    • Takes a long time
  • Large amounts of data
    • Can the data be kept in memory?
    • Some new algorithms do not require all the data to be memory resident 

25

Relational Rules (Shapes Problem)

If width > height then lying
If height > width then standing

  • Rules comparing attributes to constants are called propositional rules (propositional data mining)

  • Relational rules are more expressive in some cases (new column: height / width)

    • define relations between attributes (relational data mining)

    • most DM techniques do not consider relational rules

  • As a workaround for some cases, one can introduce additional attributes describing whether width > height (see the sketch after this list)

    • allows using conventional “propositional” learners 
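
A minimal sketch of that workaround, assuming pandas is available; the column names follow the shapes example above.

```python
import pandas as pd

# Shapes data: width and height of each object plus its class (illustrative values)
df = pd.DataFrame({
    "width":  [4.0, 1.5, 6.0, 2.0],
    "height": [2.0, 5.0, 1.0, 7.0],
    "class":  ["lying", "standing", "lying", "standing"],
})

# Extra propositional attribute encoding the relational rule width > height,
# so a conventional "propositional" learner can use it as an ordinary column
df["width_gt_height"] = df["width"] > df["height"]
print(df)
```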

26

Propositional Logic 

  • Essentially, decision trees can represent any function in propositional logic
    • A, B, C: propositional variables
    • and, or, not, => (implies), <=> (equivalent): connectives
  • A proposition is a statement that is either true or false, e.g. “The sky is blue.” (color of sky = blue)
  • Decision trees are an example of a propositional learner. 

27

C4.5 – An Industrial-Strength Algorithm 

  • For an algorithm to be useful in a wide range of real-world applications it must:
    • Permit numeric attributes
    • Allow missing values
    • Be robust in the presence of noise
  • Basic algorithm needs to be extended to fulfill these requirements 

28

Numeric Attributes 

  • Unlike a nominal attribute, a numeric attribute has many possible split points

    • E.g. temp < 45

    • Standard method: binary splits

  • Solution is straightforward extension:

    • Evaluate info gain (or other measure) for every possible split point of attribute

    • Choose “best” split point

    • The info gain of the best split point is used as the info gain of the attribute (a split-point search sketch follows below)

  • Numerical attributes can be used several times in a decision tree, nominal attributes only once 
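
A small sketch (illustrative temperatures and play labels) of evaluating the information gain of every candidate split point of a numeric attribute and picking the best one.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain_at(threshold, values, labels):
    """Info gain of the binary split value < threshold vs. value >= threshold."""
    classes = set(labels)
    left = [label for v, label in zip(values, labels) if v < threshold]
    right = [label for v, label in zip(values, labels) if v >= threshold]
    counts = lambda subset: [subset.count(c) for c in classes]
    n = len(labels)
    after = len(left) / n * entropy(counts(left)) + len(right) / n * entropy(counts(right))
    return entropy(counts(labels)) - after

temps = [40, 48, 60, 72, 80, 90]
play = ["no", "no", "yes", "yes", "yes", "no"]
# Candidate split points lie between adjacent sorted values
for t in (44, 54, 66, 76, 85):
    print(t, round(info_gain_at(t, temps, play), 3))  # threshold 54 has the highest gain (~0.459)
```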

29

Binary Splits on Numeric Attributes 

  • Splitting (multi-way) on a nominal attribute exhausts all information in that attribute

    • Nominal attribute is tested (at most) once on any path in the tree

  • Not so for binary splits on numeric attributes!

    • Numeric attributes may be tested several times along a path in the tree

  • Disadvantage: tree is hard to read

  • Remedy:

    • pre-discretize numeric attributes, or

    • allow multi-way splits instead of binary ones using the information gain criterion

30

Handling Missing Values / Training 

  • Ignore instances with missing values
    • Pretty harsh, and missing value might not be important
  • Ignore attributes with missing values
    • Again, may not be feasible
  • Treat missing value as another nominal value
    • Fine if missing a value has a significant meaning
  • Estimate missing values
    • Data imputation: regression, nearest neighbor, mean, mode, etc. (a small imputation sketch follows below)
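
A minimal imputation sketch, assuming scikit-learn's SimpleImputer is available, showing mean imputation on a single numeric column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric attribute with missing values encoded as np.nan
X = np.array([[64.0], [np.nan], [72.0], [np.nan], [80.0]])

# strategy could also be "median" or "most_frequent" (mode)
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # missing entries replaced by the column mean (72.0)
```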

31

Overfitting 

  • Two sources of abnormalities
    • Noise (randomness)
    • Outliers (measurement errors)
  • Chasing every abnormality causes overfitting
    • Decision tree gets too large and complex
    • Good accuracy on training set, poor accuracy on test set
    • Does not generalize to new data any more
  • Solution: prune the tree (a pruning sketch follows below)
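
A hedged sketch, assuming scikit-learn is available: its DecisionTreeClassifier uses cost-complexity pruning (ccp_alpha) rather than the bottom-up accuracy-based pruning described above, but it illustrates how pruning shrinks the tree and affects test accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree versus a pruned one
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_train, y_train)

for name, model in [("full", full), ("pruned", pruned)]:
    print(name,
          "leaves:", model.get_n_leaves(),
          "train acc:", round(model.score(X_train, y_train), 3),
          "test acc:", round(model.score(X_test, y_test), 3))
```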

32

Decision Trees - Summary 

  • Decision trees are a classification technique

  • The output of decision trees can be used for descriptive as well as predictive purposes

  • They can represent any function representable with propositional logic

  • Heuristics such as information gain are used to select relevant attributes

  • Pruning is used to avoid overfitting