Chapter 7: Pruning in Decision Trees Data Preparation Flashcards Preview

Business Analytics > Chapter 7: Pruning in Decision Trees Data Preparation > Flashcards

Flashcards in Chapter 7: Pruning in Decision Trees Data Preparation Deck (40):

Decision Tree Algorithms 

  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
  • A goodness function is used for this purpose.
  • Typical goodness functions:
    • information gain (ID3/C4.5)
    • information gain ratio
    • gini index (CART) 


Computing Information 

  • Information is measured in bits
    • Given a probability distribution, the info required to predict an event, i.e. if play is yes or no, is the distribution’s entropy
    • Entropy gives the information required in bits (this can involve fractions of bits!)
  • Formula for computing the entropy: 
    • entropy( = - p1 logp1 -...- np logpn


Industrial-Strength Algorithms 

  •  For an algorithm to be useful in a wide range of real- world applications it must:
    • Permit numeric attributes
    • Allow missing values
    • Be robust in the presence of noise
  •  Basic schemes need to be extended to fulfill these requirements 



  •  Prepruning tries to decide a priori when to stop creating subtrees
    • Halt construction of decision tree early
    • Use same measure as in determining attributes, e.g., halt if InfoGain < threshold
    • Most frequent class becomes the leaf node
    • This turns out to be fairly difficult to do well in practice
      • due to “combination-lock effect”, i.e. two attributes individually have nothing to contribute but are powerful predictors when combined
  • Postpruning simplifies an existing decision tree
    • Construct complete decision tree
    • Then prune it back
    • Used in C4.5, CART
    • Needs more runtime than prepruning 



Subtree replacement replaces a subtree with a single leaf node (main method).

Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent 


When to Prune a Tree? 

  • To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of the children.
  • Replace the node if its error rate is less than combined rates of its children. 
  • Prune only if it reduces the estimated error

    • Error on the training data is NOT a useful estimator

  • Use a hold-out set for pruning (“reduced-error pruning”)

    • Limits data you can use for training

  • C4.5’s method

    • Derive confidence interval from training data

    • Use a heuristic limit for the error rate, derived from the confidence interval for pruning

    • Shaky statistical assumptions (because it is based on training data), but works well in practice 


Bernoulli Process 

  •  The Bernulli process is a discrete-time stochastic process consisting of a sequence of independent random variables (i.e., a sequence of Bernulli trials)
  •  Applied to situations where an event does either occur (success) or not occur (failure)
    • Let the probability of occurrences be 𝑝, the probability of no occurrence be 𝑞 = 1 − 𝑝.
    • Let the total number of independent trials be 𝑛, and the number of successes be 𝑘.
  •  The probability of 𝑘 successes in n trials is the probability mass function (pmf) of the Binomial Distribution B: 


Central Limit Theorem Revisited 

  • The central limit theorem states that the standardized average of any population of i.i.d. random variables 𝑋𝑖 with mean 𝜇𝑋 and variance 𝜎2 is asymptotically ~𝑁(0,1), or
  • Asymptotic Normality implies that 𝑃(𝑍 < 𝑧 )-->  Φ(𝑧) a - n -> infinity, 𝑜𝑟 𝑃(𝑍 < 𝑧) equal  Φ(𝑧) 


Using the Confidence Interval of a Normal Distribution 

  • C4.5 uses a heuristic limit for the error rate, derived from the confidence interval of the error rate for pruning
  •  𝑥% confidence interval [– 𝑧<=  𝑋<=  𝑧] for random variable with 0 mean is given by: Pr[-z
  •  With a symmetric distribution:
  • Pr[z <= <=X  z]=1-2xPr[X >= z] 


Confidence Limits 

  • Confidence limits c for the standard normal distribution with 0 mean and a variance of 1: 

  • There is a 25% probability of X being > 0.69 Pr[0.69  X  0.69]

  •  To use this we have to reduce our random variable f to have 0 mean and unit variance 


Transforming f 

  •  You prune the tree stronger
    • If c goes down=>z goes up and also p goes up
    • If n goes down=>p goes up 
    • with p as an estimator for the error rate 


C4.5’s Method 

  •  Error estimate e for a node (:= upper bound of confidence interval): 
    • see pic
    • If c = 25% then z = 0.69 (from normal distribution) 
    • f is the error on the training data 

    • n is the number of instances covered by the node 

  • Even if information gain increases, e might increase as well

  •  Error estimate for subtree is weighted sum of error estimates for all its leaves 


C4.5 Example


From Trees to Rules 

  • Simple way: one rule for each leaf
    • Often these rules are more complex than necessary
  • C4.5rules:
    • greedily prune conditions from each rule if this reduces its estimated error (as discussed above)
  • Then
    • look at each class in turn
    • consider the rules for that class
    • find a “good” subset (guided by Occam’s Razor)
    • Finally, remove rules (greedily) if this decreases error on the training data 


C4.5 and C4.5rules: Summary 

  • C4.5 decision tree algorithm has two important parameters
    • Confidence value (default 25%): lower values incur heavier pruning
    • Minimum number of instances in the two most popular branches (default 2)
  • Classification rules
    • C4.5rules slow for large and noisy datasets
    • Commercial version C5.0rules uses a different pruning technique
      • Much faster and a bit more accurate 


Knowledge Discovery Process 


Data Understanding: Quantity 

  • Number of instances (records)
    • Rule of thumb: 5,000 or more desired (nice to have)
    • if less, results are less reliable; use special methods (boosting, ...)
  • Number of attributes
    • Rule of thumb: Start out with less than 50 attributes
    • If many attributes, use attribute selection
  • Number of targets
    • Rule of thumb: >100 for each class
    • if very unbalanced, use stratified sampling 


Data Understanding 

  • Visualization
  • Data summaries
    • Attribute means
    • Attribute variation
    • Attribute relationships 


Data Cleaning: Outline 

  • Convert data to a standard format (e.g., arff or csv)
  • Unified date format
  • Missing values
  • Fix errors and outliers
  • Convert nominal fields whose values have order to numeric
  • Binning (i.e. discretization) of numeric data 


Data Cleaning: Missing Values 

  • Missing data can appear in several forms:
    • “0” “.” “999” “NA” ...
  • Standardize missing value code(s) Dealing with missing values:
    • Ignore records with missing values
    • Treat missing value as a separate value 
    • Imputation: fill in with mean or median values 


Conversion: Ordered to Numeric 

  • Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
    • A 4.0
    • A- 3.7
    • B+3.3
    • B 3.0
  • Why is it important to preserve natural order?
    • To allow meaningful comparisons, e.g. Grade > 3.5 


Conversion: Nominal, Few Values 

  • Multi-valued, unordered attributes with small no. of values
    • e.g. Color=Red, Orange, Yellow, ...
    • for each value v create a binary “flag” variable C_v , which is 1 if Color=v, 0 otherwise 

Nominal, many values: Ignore ID-like fields whose values are unique for each record 


Data Cleaning: Discretization 

  • Discretization reduces the number of values for a continuous attribute
  • Why?
    • Some methods can only use nominal data
      • E.g., in ID3, Apriori, most versions of Naïve Bayes, CHAID
    • Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)
    • Some methods that handle numerical attributes assume normal distribution which is not always appropriate
  • Discretization is useful for generating a summary of data
  • Also called “binning” 


Discretization: Equal-Width 


Discretization: Equal-Width May Produce Clumping 


Discretization: Equal-Frequency 


Discretization: Class Dependent (Supervised Discretization) 

  •  Treating numerical attributes as nominal discards the potentially valuable ordering information
  •  Alternative: Transform the 𝑘 nominal values to 𝑘 − 1 binary attributes. The (𝑖 − 1)-th binary attribute indicates whether the discretized attribute is less than 𝑖 


Discretization Considerations 

  • Equal Width is simplest, good for many classes
    • can fail miserably for unequal distributions
  • Equal Frequency gives better results
    • In practice, “almost-equal” height binning is used which avoids clumping and gives more intuitive breakpoints
    • Also used for clustering (when there is no “class”)
  • Class-dependent can be better for classification
    • Decision trees build discretization on the fly
    • Naïve Bayes requires initial discretization
  • Other methods exist ... 


Unbalanced Target Distribution 

  • Sometimes, classes have very unequal frequency
    • Churn prediction: 97% stay, 3% churn (in a month)
    • Medical diagnosis: 90% healthy, 10% disease
    • eCommerce: 99% don’t buy, 1% buy
    • Security: >99.99% of Germans are not terrorists
  • Similar situation with multiple classes
  • Majority class classifier can be 97% correct, but useless 


Building Balanced Train Sets 


Attribute Selection 

  • Aka feature / variable / field selection
  • If there are too many attributes, select a subset that is most relevant.
  • Remove redundant and/or irrelevant attributes
    • Rule of thumb -- keep top 50 attributes 


Reasons for Attribute Selection 

  • Simpler model
    • More transparent
    • Easier to interpret
  • Faster model induction
    • What about overall time?
  • What about the accuracy?
    • The addition of irrelevant attributes can negatively impact the performance of kNN, DT, regressions, clustering, etc.
      • Experiments: Adding a random attribute to a DT learner can decrease the accuracy by 5-10% (Witten & Frank, 2005)
      • Instance-based methods are particularly weak in the presence of irrelevant attributes (compared with only a few instances) 


Attribute Selection Heuristics 

  • Stepwise forward selection
    • Start with empty attribute set
    • Add “best” of attributes (e.g. using entropy)
    • Add “best” of remaining attributes
    • Repeat. Take the top n (a certain threshold value)
  • Stepwise backward selection
    • Start with entire attribute set
    • Remove “worst” of attributes
    • Repeat until n are left. 


Attribute Selection 

  • Using entropy for attribute selection
    • Calculate information gain of each attribute
    • Select the n attributes with the highest information gain
  • Experiences
    • Lead to local, not necessarily global optima
    • Nevertheless, they perform reasonably well
    • Backward selection performed better than forward selection
    • Forward selection leads to smaller attribute sets and easier models 
  • Attribute selection is actually a search problem

    • Want to select subset of attributes giving most accurate model 


Basic Approaches to Attribute Selection 

  • Remove attributes with no or little variability
    • Examine the number of distinct attribute values
    • Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp % or less of all records.
  • Remove false predictors (“leakers”)
    • False predictors are fields correlated to target behavior, which describe events that happen at the same time or after
    • E.g. student final grade, for the task of predicting whether the student passed the course 


Text Mining 

  • Many text databases exist in practice
    • News articles
    • Research papers
    • Books
    • Digital libraries
    • E-mail messages
    • Web pages 


Text Mining Process 

  • Text preprocessing
    • Syntactic/Semantic text analysis
  • Features Generation
    • Bag of words
  • Features Selection
    • Simple counting
    • Statistics
  •  Text/Data Mining
    • Classification of documents
    • Clustering of documents
  •   Analyzing result 


Text Attributes 

  • For each word, generate a boolean attribute indicating whether the text of the instance contains the word
    • Alternatively, the attribute may indicate the number of the word’s occurrences
    • If “too many” words, keep only the more frequent ones
    • Stemming algorithms reduce words to their root, e.g. guarantee, guaranteeing, guaranteed 


Data Mining: Technical Memo Example: Titles 

  • c1  Human machine interface for Lab ABC computer applications

  • c2  A survey of user opinion of computer system response time

  • c3  The EPS user interface management system

  • c4  System and human system engineering testing of EPS

  • c5  Relation of user-perceived response time to error measurement

  • m1  The generation of random, binary, unordered trees

  • m2  The intersection graph of paths in trees

  • m3  Graph minors IV: Widths of trees and well-quasi-ordering

  • m4  Graph minors: A survey 


Data mining: “Search” versus “Discover”