Decision Tree Algorithms
 At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples.
 A goodness function is used for this purpose.
 Typical goodness functions:
 information gain (ID3/C4.5)
 information gain ratio
 gini index (CART)
Computing Information
 Information is measured in bits
 Given a probability distribution, the info required to predict an event (e.g. whether play is yes or no) is the distribution’s entropy
 Entropy gives the information required in bits (this can involve fractions of bits!)
 Formula for computing the entropy:

entropy(p_1, ..., p_n) = −p_1 log_2 p_1 − p_2 log_2 p_2 − ... − p_n log_2 p_n
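A minimal sketch of this computation in Python; the 9 "yes" vs. 5 "no" class counts are only an illustrative distribution, not data defined in these slides:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# e.g. predicting play = yes/no with 9 "yes" and 5 "no" training instances
print(entropy([9/14, 5/14]))  # ~0.940 bits
```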
Industrial-Strength Algorithms
 For an algorithm to be useful in a wide range of real world applications it must:
 Permit numeric attributes
 Allow missing values
 Be robust in the presence of noise
 Basic schemes need to be extended to fulfill these requirements
Pruning
 Prepruning tries to decide a priori when to stop creating subtrees
 Halt construction of decision tree early
 Use same measure as in determining attributes, e.g., halt if InfoGain < threshold
 Most frequent class becomes the leaf node
 This turns out to be fairly difficult to do well in practice
 due to the “combination-lock effect”, i.e. two attributes that individually have nothing to contribute may be powerful predictors when combined
 Postpruning simplifies an existing decision tree
 Construct complete decision tree
 Then prune it back
 Used in C4.5, CART
 Needs more runtime than prepruning
Postpruning
 Subtree replacement replaces a subtree with a single leaf node (main method).
 Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent.
When to Prune a Tree?
 To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of the children.
 Replace the node if its error rate is less than combined rates of its children.

 Prune only if it reduces the estimated error
 Error on the training data is NOT a useful estimator
 Use a holdout set for pruning (“reduced-error pruning”)
 Limits data you can use for training
 C4.5’s method
 Derive confidence interval from training data
 Use a heuristic limit for the error rate, derived from the confidence interval, for pruning
 Shaky statistical assumptions (because it is based on training data), but works well in practice
Bernoulli Process
 The Bernoulli process is a discrete-time stochastic process consisting of a sequence of independent random variables (i.e., a sequence of Bernoulli trials)
 Applied to situations where an event either occurs (success) or does not occur (failure)
 Let the probability of occurrence be p, and the probability of no occurrence be q = 1 − p.
 Let the total number of independent trials be n, and the number of successes be k.
 The probability of k successes in n trials is given by the probability mass function (pmf) of the binomial distribution B(n, p):

P(k successes in n trials) = C(n, k) p^k q^(n−k), where C(n, k) = n! / (k! (n−k)!)
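A small Python check of this pmf; the numbers (k = 3 successes in n = 10 trials with p = 0.2) are only an illustrative example:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# e.g. 3 successes in 10 trials with success probability 0.2
print(binomial_pmf(3, 10, 0.2))  # ~0.201
```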
Central Limit Theorem Revisited
 The central limit theorem states that the standardized average of any population of i.i.d. random variables X_i with mean μ_X and variance σ² is asymptotically ~N(0,1):

Z = (X̄ − μ_X) / (σ/√n) → N(0,1) as n → ∞

 Asymptotic normality implies that P(Z < z) → Φ(z) as n → ∞, i.e. P(Z < z) ≈ Φ(z) for large n
Using the Confidence Interval of a Normal Distribution
 C4.5 uses a heuristic limit for the error rate, derived from the confidence interval of the error rate for pruning
 The x% confidence interval [−z ≤ X ≤ z] for a random variable X with mean 0 is given by: Pr[−z ≤ X ≤ z] = x%
 With a symmetric distribution:
 Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
Confidence Limits

 Confidence limits c for the standard normal distribution with mean 0 and variance 1:
 There is a 25% probability of X being ≥ 0.69: Pr[X ≥ 0.69] = 25%, so Pr[−0.69 ≤ X ≤ 0.69] = 50%
 To use this we have to reduce our random variable f to have mean 0 and unit variance
Transforming f
 You prune the tree more aggressively, with p as an estimator for the error rate:
 If c goes down => z goes up, and p goes up as well
 If n goes down => p goes up
C4.5’s Method
 Error estimate e for a node (:= upper bound of confidence interval):
e = (f + z²/2n + z·√(f/n − f²/n + z²/4n²)) / (1 + z²/n)
 If c = 25% then z = 0.69 (from normal distribution)

 f is the error on the training data
 n is the number of instances covered by the node
 Even if information gain increases, e might increase as well
 Error estimate for subtree is weighted sum of error estimates for all its leaves
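A minimal Python sketch of this error estimate; the node with 2 training errors out of 6 covered instances is only an illustrative input:

```python
import math

def pessimistic_error(f, n, z=0.69):
    """Upper confidence bound e on the error rate of a node:
    f = observed error rate on the training data,
    n = number of instances covered by the node,
    z = confidence limit (0.69 for c = 25%)."""
    return (f + z*z/(2*n)
            + z * math.sqrt(f/n - f*f/n + z*z/(4*n*n))) / (1 + z*z/n)

# e.g. a node covering 6 instances with 2 training errors (f = 2/6)
print(pessimistic_error(2/6, 6))  # ~0.47
```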
C4.5 Example
From Trees to Rules
 Simple way: one rule for each leaf
 Often these rules are more complex than necessary
 C4.5rules:
 greedily prune conditions from each rule if this reduces its estimated error (as discussed above)
 Then
 look at each class in turn
 consider the rules for that class
 find a “good” subset (guided by Occam’s Razor)
 Finally, remove rules (greedily) if this decreases error on the training data
C4.5 and C4.5rules: Summary
 C4.5 decision tree algorithm has two important parameters
 Confidence value (default 25%): lower values incur heavier pruning
 Minimum number of instances in the two most popular branches (default 2)
 Classification rules
 C4.5rules slow for large and noisy datasets
 Commercial version C5.0rules uses a different pruning technique
 Much faster and a bit more accurate
Knowledge Discovery Process
Data Understanding: Quantity
 Number of instances (records)
 Rule of thumb: 5,000 or more desired (nice to have)
 if less, results are less reliable; use special methods (boosting, ...)
 Number of attributes
 Rule of thumb: Start out with less than 50 attributes
 If many attributes, use attribute selection
 Number of targets
 Rule of thumb: >100 for each class
 if very unbalanced, use stratified sampling
Data Understanding
 Visualization
 Data summaries
 Attribute means
 Attribute variation
 Attribute relationships
Data Cleaning: Outline
 Convert data to a standard format (e.g., arff or csv)
 Unified date format
 Missing values
 Fix errors and outliers
 Convert nominal fields whose values have order to numeric
 Binning (i.e. discretization) of numeric data
Data Cleaning: Missing Values
 Missing data can appear in several forms:
“0” “.” “999” “NA” ...
 Standardize missing value code(s)
 Dealing with missing values:
 Ignore records with missing values
 Treat missing value as a separate value
 Imputation: fill in with mean or median values
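A minimal pandas sketch of these steps; the data frame, the missing-value codes 999 and ".", and the column names are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical records where 999 and "." encode missing values
df = pd.DataFrame({"age": [25, 999, 41, 33],
                   "income": ["50000", ".", "72000", "61000"]})

# Standardize the missing-value codes to NaN
df = df.replace({999: np.nan, ".": np.nan})
df["income"] = pd.to_numeric(df["income"])

# Imputation: fill in missing numeric values with the column median
df = df.fillna(df.median(numeric_only=True))
print(df)
```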
Conversion: Ordered to Numeric
 Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
 A → 4.0
 A− → 3.7
 B+ → 3.3
 B → 3.0
 Why is it important to preserve natural order?
 To allow meaningful comparisons, e.g. Grade > 3.5
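A minimal Python sketch of such an order-preserving conversion, using the grade mapping above; the list of grades is made up for illustration:

```python
# Order-preserving mapping from letter grades to numbers (values from the example above)
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}

grades = ["B+", "A", "B"]
numeric = [grade_points[g] for g in grades]

# The numeric form makes comparisons like Grade > 3.5 meaningful
print([g > 3.5 for g in numeric])  # [False, True, False]
```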
Conversion: Nominal, Few Values
 Multivalued, unordered attributes with small no. of values
 e.g. Color=Red, Orange, Yellow, ...
 for each value v create a binary “flag” variable C_v , which is 1 if Color=v, 0 otherwise
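A minimal pandas sketch of creating these binary "flag" variables; the Color values are just the example above:

```python
import pandas as pd

# Nominal attribute with a small number of values
df = pd.DataFrame({"Color": ["Red", "Orange", "Yellow", "Red"]})

# One binary "flag" column C_v per value v: 1 if Color = v, 0 otherwise
flags = pd.get_dummies(df["Color"], prefix="C").astype(int)
print(flags)
```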
Nominal, many values: Ignore ID-like fields whose values are unique for each record
Data Cleaning: Discretization
 Discretization reduces the number of values for a continuous attribute
 Why?
 Some methods can only use nominal data
 E.g., in ID3, Apriori, most versions of Naïve Bayes, CHAID
 Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)
 Some methods that handle numerical attributes assume normal distribution which is not always appropriate
 Discretization is useful for generating a summary of data
 Also called “binning”
Discretization: Equal-Width
Discretization: Equal-Width May Produce Clumping
Discretization: Equal-Frequency
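A minimal pandas sketch contrasting the two unsupervised binning schemes above; the skewed value list is made up for illustration:

```python
import pandas as pd

# Hypothetical skewed numeric attribute
values = pd.Series([1, 2, 2, 3, 4, 5, 6, 40, 45, 50])

# Equal-width: bins of equal range; skewed data clumps into few bins
equal_width = pd.cut(values, bins=3)

# Equal-frequency: each bin receives (roughly) the same number of instances
equal_freq = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```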
Discretization: Class Dependent (Supervised Discretization)
 Treating numerical attributes as nominal discards the potentially valuable ordering information
 Alternative: Transform the 𝑘 nominal values to 𝑘 − 1 binary attributes. The (𝑖 − 1)th binary attribute indicates whether the discretized attribute is less than 𝑖
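A minimal sketch of that transformation, following the slide's definition literally; the discretized values 1..k are made up:

```python
# Hypothetical discretized attribute with k = 4 ordered values 1..k
k = 4
values = [1, 3, 4, 2]

# k-1 binary attributes: attribute (i-1) indicates whether the value is less than i
encoded = [[int(v < i) for i in range(2, k + 1)] for v in values]
print(encoded)  # value 1 -> [1, 1, 1], value 4 -> [0, 0, 0]
```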
Discretization Considerations
 Equal Width is simplest, good for many classes
 can fail miserably for unequal distributions
 Equal Frequency gives better results
 In practice, “almost-equal” height binning is used, which avoids clumping and gives more intuitive breakpoints
 Also used for clustering (when there is no “class”)
 Classdependent can be better for classification
 Decision trees build discretization on the fly
 Naïve Bayes requires initial discretization
 Other methods exist ...
Unbalanced Target Distribution
 Sometimes, classes have very unequal frequency
 Churn prediction: 97% stay, 3% churn (in a month)
 Medical diagnosis: 90% healthy, 10% disease
 eCommerce: 99% don’t buy, 1% buy
 Security: >99.99% of Germans are not terrorists
 Similar situation with multiple classes
 Majority class classifier can be 97% correct, but useless
Building Balanced Train Sets
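One common way to build a balanced train set is to downsample the majority class; a minimal pandas sketch using the 97% stay / 3% churn example from the previous slide (the data frame itself is hypothetical):

```python
import pandas as pd

# Hypothetical unbalanced data set: 97% "stay", 3% "churn"
df = pd.DataFrame({"target": ["stay"] * 97 + ["churn"] * 3})

# Downsample the majority class to the size of the minority class
minority = df[df["target"] == "churn"]
majority = df[df["target"] == "stay"].sample(n=len(minority), random_state=1)
balanced = pd.concat([minority, majority])

print(balanced["target"].value_counts())  # both classes equally frequent
```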
Attribute Selection
 Aka feature / variable / field selection
 If there are too many attributes, select a subset that is most relevant.
 Remove redundant and/or irrelevant attributes
 Rule of thumb: keep the top 50 attributes
Reasons for Attribute Selection
 Simpler model
 More transparent
 Easier to interpret
 Faster model induction
 What about overall time?
 What about the accuracy?
 The addition of irrelevant attributes can negatively impact the performance of kNN, DT, regressions, clustering, etc.
 Experiments: Adding a random attribute to a DT learner can decrease the accuracy by 5-10% (Witten & Frank, 2005)
 Instance-based methods are particularly weak in the presence of irrelevant attributes (compared with only a few instances)
Attribute Selection Heuristics
 Stepwise forward selection
 Start with empty attribute set
 Add “best” of attributes (e.g. using entropy)
 Add “best” of remaining attributes
 Repeat. Take the top n (a certain threshold value)
 Stepwise backward selection
 Start with entire attribute set
 Remove “worst” of attributes
 Repeat until n are left.
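A minimal generic sketch of the stepwise forward selection heuristic above; the score function and the toy attribute list are made-up placeholders (in practice the score would be e.g. information gain or model accuracy):

```python
def forward_selection(attributes, score, n_keep):
    """Greedy stepwise forward selection: start with an empty set and
    repeatedly add the attribute that gives the best-scoring subset,
    stopping once n_keep attributes have been selected."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < n_keep:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with a made-up score (here: prefer subsets with short attribute names)
attrs = ["age", "income", "zip_code", "clicks"]
print(forward_selection(attrs, score=lambda s: -sum(len(a) for a in s), n_keep=2))
```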
Attribute Selection
 Using entropy for attribute selection
 Calculate information gain of each attribute
 Select the n attributes with the highest information gain
 Experiences
 Lead to local, not necessarily global optima
 Nevertheless, they perform reasonably well
 Backward selection performed better than forward selection
 Forward selection leads to smaller attribute sets and easier models

 Attribute selection is actually a search problem
 Want to select the subset of attributes giving the most accurate model
Basic Approaches to Attribute Selection
 Remove attributes with no or little variability
 Examine the number of distinct attribute values
 Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp % or less of all records.
 Remove false predictors (“leakers”)
 False predictors are fields correlated to target behavior, which describe events that happen at the same time or after
 E.g. student final grade, for the task of predicting whether the student passed the course
Text Mining
 Many text databases exist in practice
 News articles
 Research papers
 Books
 Digital libraries
 Email messages
 Web pages
Text Mining Process
 Text preprocessing
 Syntactic/Semantic text analysis
 Feature Generation
 Bag of words
 Feature Selection
 Simple counting
 Statistics
 Text/Data Mining
 Classification of documents
 Clustering of documents
 Analyzing result
Text Attributes
 For each word, generate a boolean attribute indicating whether the text of the instance contains the word
 Alternatively, the attribute may indicate the number of the word’s occurrences
 If “too many” words, keep only the more frequent ones
 Stemming algorithms reduce words to their root, e.g. guarantee, guaranteeing, guaranteed
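A minimal Python sketch of boolean bag-of-words attribute generation, using two of the example titles from the next slide:

```python
from collections import Counter
import re

docs = [
    "Human machine interface for Lab ABC computer applications",
    "A survey of user opinion of computer system response time",
]

# Bag of words: count how often each word occurs in a document
def bag_of_words(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

counts = [bag_of_words(d) for d in docs]
vocab = sorted(set().union(*counts))

# Boolean attributes: 1 if the document contains the word, 0 otherwise
boolean_features = [[int(word in c) for word in vocab] for c in counts]
print(vocab)
print(boolean_features)
```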
Data Mining: Technical Memo Example: Titles

 c1 Human machine interface for Lab ABC computer applications
 c2 A survey of user opinion of computer system response time
 c3 The EPS user interface management system
 c4 System and human system engineering testing of EPS
 c5 Relation of user-perceived response time to error measurement
 m1 The generation of random, binary, unordered trees
 m2 The intersection graph of paths in trees
 m3 Graph minors IV: Widths of trees and well-quasi-ordering
 m4 Graph minors: A survey
Data mining: “Search” versus “Discover”