Chapter 7: Pruning in Decision Trees and Data Preparation Flashcards

1
Q

Decision Tree Algorithms

A
  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
  • A goodness function is used for this purpose.
  • Typical goodness functions:
    • information gain (ID3/C4.5)
    • information gain ratio
    • gini index (CART)
2
Q

Computing Information

A
  • Information is measured in bits
    • Given a probability distribution, the information required to predict an event (e.g., whether Play is yes or no) is the distribution’s entropy
    • Entropy gives the information required in bits (this can involve fractions of bits!)
  • Formula for computing the entropy:
    • entropy(p1…pn) = - p1 logp1 -…- np logpn
3
Q

Industrial-Strength Algorithms

A
  • For an algorithm to be useful in a wide range of real-world applications it must:
    • Permit numeric attributes
    • Allow missing values
    • Be robust in the presence of noise
  • Basic schemes need to be extended to fulfill these requirements
4
Q

Pruning

A
  •  Prepruning tries to decide a priori when to stop creating subtrees
    • Halt construction of decision tree early
    • Use the same measure as for determining attributes, e.g., halt if InfoGain < threshold (see the sketch after this card)
    • Most frequent class becomes the leaf node
    • This turns out to be fairly difficult to do well in practice
      • due to the “combination-lock effect”: two attributes that individually contribute nothing can be powerful predictors when combined
  • Postpruning simplifies an existing decision tree
    • Construct complete decision tree
    • Then prune it back
    • Used in C4.5, CART
    • Needs more runtime than prepruning
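A self-contained sketch (not C4.5 itself) of an ID3-style builder with the prepruning test described above: growth stops when the best information gain falls below a threshold and the most frequent class becomes the leaf. The XOR-style toy data at the end illustrates the combination-lock effect. All names and the min_gain value are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain of splitting (rows, labels) on attribute attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, min_gain=0.1):
    """ID3-style builder with prepruning: stop when the best gain is below
    min_gain and return the most frequent class as a leaf."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    if info_gain(rows, labels, best) < min_gain:       # prepruning test
        return Counter(labels).most_common(1)[0][0]    # most frequent class
    tree = {}
    for value in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        tree[(best, value)] = build_tree(sub_rows, sub_labels,
                                         [a for a in attrs if a != best], min_gain)
    return tree

# Toy XOR data illustrating the "combination-lock effect": neither a nor b
# helps alone, so prepruning with min_gain > 0 collapses the tree to one leaf.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "yes", "yes", "no"]
print(build_tree(rows, labels, ["a", "b"], min_gain=0.1))  # single leaf
print(build_tree(rows, labels, ["a", "b"], min_gain=0.0))  # full XOR tree
```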
5
Q

Postpruning

A

Subtree replacement replaces a subtree with a single leaf node (main method).

Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent.

6
Q

When to Prune a Tree?

A
  • To determine whether a node should be replaced, compare the error rate estimate for the node with the combined (weighted) error rate of its children.
  • Replace the node with a leaf if its estimated error rate is less than the combined rate of its children (see the sketch after this card).
  • Prune only if it reduces the estimated error
    • Error on the training data is NOT a useful estimator
  • Use a hold-out set for pruning (“reduced-error pruning”)
    • Limits data you can use for training
  • C4.5’s method
    • Derive confidence interval from training data
    • Use a heuristic limit for the error rate, derived from this confidence interval, for pruning
    • Shaky statistical assumptions (because it is based on training data), but works well in practice
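A minimal sketch of the pruning decision on a hold-out set (“reduced-error pruning”); the function names and counts are hypothetical. The card says “less than”; the sketch uses “not worse than” (<=), which is how the test is often implemented.

```python
def combined_error_rate(child_errors, child_counts):
    """Weighted error rate of a subtree's children on the hold-out set."""
    return sum(child_errors) / sum(child_counts)

def should_replace_with_leaf(node_errors, node_count, child_errors, child_counts):
    """Reduced-error pruning test: replace the subtree by a leaf (majority class)
    if the leaf's hold-out error rate is not worse than the children's combined rate."""
    return node_errors / node_count <= combined_error_rate(child_errors, child_counts)

# Hypothetical hold-out counts: the leaf would misclassify 8 of 40 instances,
# while the two children together misclassify 5 + 6 of 25 + 15.
print(should_replace_with_leaf(8, 40, [5, 6], [25, 15]))   # 0.20 <= 0.275 -> True
```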
7
Q

Bernoulli Process

A
  •  The Bernoulli process is a discrete-time stochastic process consisting of a sequence of independent random variables (i.e., a sequence of Bernoulli trials)
  •  Applied to situations where an event either occurs (success) or does not occur (failure)
    • Let the probability of occurrence be p and the probability of no occurrence be q = 1 − p.
    • Let the total number of independent trials be n, and the number of successes be k.
  •  The probability of k successes in n trials is given by the probability mass function (pmf) of the binomial distribution B(n, p): P(X = k) = C(n, k) · p^k · q^(n−k) (see the sketch below)
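A small sketch of the binomial pmf using only the standard library; the example numbers are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent
    Bernoulli trials with success probability p."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# E.g. the chance of exactly 3 successes in 10 trials with p = 0.25:
print(binomial_pmf(3, 10, 0.25))   # ~0.2503
```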
8
Q

Central Limit Theorem Revisited

A
  • The central limit theorem states that the standardized average of any population of i.i.d. random variables Xi with mean μX and variance σX² is asymptotically ~ N(0, 1): Zn = (X̄n − μX) / (σX / √n)
  • Asymptotic normality implies that P(Zn < z) → Φ(z) as n → ∞, or P(Zn < z) ≈ Φ(z) for large n (a short simulation follows this card)
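A short simulation sketch (assuming NumPy is available) of the statement above: standardized means of i.i.d. non-normal variables behave approximately like N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20_000

# Heavily skewed i.i.d. population: exponential with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(reps, n))

# Standardized sample means: Z = (mean - mu) / (sigma / sqrt(n)) with mu = sigma = 1.
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

# For N(0, 1), P(Z < 1) = Phi(1) ~ 0.8413; the empirical fraction should be close.
print(np.mean(z < 1.0))
```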
9
Q

Using the Confidence Interval of a Normal Distribution

A
  • C4.5 uses a heuristic limit for the error rate, derived from the confidence interval of the error rate for pruning
  •  The x% confidence interval [−z ≤ X ≤ z] for a random variable X with zero mean is given by: Pr[−z ≤ X ≤ z] = x
    • With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]

10
Q

Confidence Limits

A
  • Confidence limits c for the standard normal distribution with 0 mean and a variance of 1:
  • There is a 25% probability of X being greater than 0.69: Pr[X ≥ 0.69] = 25%, hence Pr[−0.69 ≤ X ≤ 0.69] = 50%
  • To use this we have to transform our random variable f to have zero mean and unit variance
11
Q

Transforming f

A
  •  The tree is pruned more aggressively when the error estimate p goes up:
    • If c goes down => z goes up, and therefore p goes up
    • If n goes down => p goes up
    • with p as the (pessimistic) estimate of the error rate
12
Q

C4.5’s Method

A
  •  Error estimate e for a node (:= upper bound of confidence interval):
    • see pic (the standard formula is sketched in code after this card)
    • If c = 25% then z = 0.69 (from normal distribution)
    • f is the error on the training data 
    • n is the number of instances covered by the node
  • Even if information gain increases, e might increase as well
  • Error estimate for subtree is weighted sum of error estimates for all its leaves
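The “pic” above presumably shows C4.5’s upper confidence limit on the error rate; below is a sketch of the formula as it is usually quoted (e.g. in Witten & Frank), with z defaulting to the 0.69 used on the card. Function names and the example leaf statistics are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_from_confidence(c):
    """z such that Pr[X >= z] = c for a standard normal X
    (for c = 25% this gives ~0.67; the slides quote the rounded 0.69)."""
    return NormalDist().inv_cdf(1 - c)

def pessimistic_error(f, n, z=0.69):
    """Upper confidence limit e on the true error rate, given the observed
    training error rate f on n instances (C4.5 default: c = 25%, z = 0.69)."""
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

def subtree_error(leaf_stats):
    """Weighted sum of the leaves' error estimates (leaf_stats = list of (f, n))."""
    total = sum(n for _, n in leaf_stats)
    return sum(n * pessimistic_error(f, n) for f, n in leaf_stats) / total

print(round(z_from_confidence(0.25), 2))                 # ~0.67
print(pessimistic_error(2 / 6, 6))                       # a leaf with 2 errors in 6 -> ~0.47
# Subtree with three leaves vs. the node that would replace it (5 errors in 14):
print(subtree_error([(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]))   # ~0.51
print(pessimistic_error(5 / 14, 14))                         # ~0.45, so the subtree is pruned
```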
13
Q

C4.5 Example

A
14
Q

From Trees to Rules

A
  • Simple way: one rule for each leaf
    • Often these rules are more complex than necessary
  • C4.5rules:
    • greedily prune conditions from each rule if this reduces its estimated error (as discussed above)
  • Then
    • look at each class in turn
    • consider the rules for that class
    • find a “good” subset (guided by Occam’s Razor)
    • Finally, remove rules (greedily) if this decreases error on the training data
15
Q

C4.5 and C4.5rules: Summary

A
  • C4.5 decision tree algorithm has two important parameters
    • Confidence value (default 25%): lower values incur heavier pruning
    • Minimum number of instances in the two most popular branches (default 2)
  • Classification rules
    • C4.5rules slow for large and noisy datasets
    • Commercial version C5.0rules uses a different pruning technique
      • Much faster and a bit more accurate
16
Q

Knowledge Discovery Process

A
17
Q

Data Understanding: Quantity

A
  • Number of instances (records)
    • Rule of thumb: 5,000 or more desired (nice to have)
    • if less, results are less reliable; use special methods (boosting, …)
  • Number of attributes
    • Rule of thumb: Start out with less than 50 attributes
    • If many attributes, use attribute selection
  • Number of targets (class values)
    • Rule of thumb: >100 instances for each class
    • if very unbalanced, use stratified sampling
18
Q

Data Understanding

A
  • Visualization
  • Data summaries
    • Attribute means
    • Attribute variation
    • Attribute relationships
19
Q

Data Cleaning: Outline

A
  • Convert data to a standard format (e.g., arff or csv)
  • Unified date format
  • Missing values
  • Fix errors and outliers
  • Convert nominal fields whose values have order to numeric
  • Binning (i.e. discretization) of numeric data
20
Q

Data Cleaning: Missing Values

A
  • Missing data can appear in several forms:
    • blank/empty, “0”, “.”, “999”, “NA”, ...
  • Standardize missing value code(s)
  • Dealing with missing values:
    • Ignore records with missing values
    • Treat missing value as a separate value
    • Imputation: fill in with mean or median values (see the sketch below)
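A minimal pandas sketch (assuming pandas/NumPy are available; the columns and missing-value codes are made up) of the steps on this card: standardize the missing-value codes, then ignore, flag, or impute.

```python
import numpy as np
import pandas as pd

# Toy data with several different missing-value codes.
df = pd.DataFrame({"age":    [23, 999, 31, None, 45],
                   "income": ["50k", "NA", ".", "72k", "61k"]})

# 1. Standardize the missing-value codes to a single representation (NaN).
df = df.replace({999: np.nan, "NA": np.nan, ".": np.nan, "": np.nan})

# 2a. Ignore records with missing values ...
complete_rows = df.dropna()

# 2b. ... or treat "missing" as a separate value for nominal attributes ...
df["income_filled"] = df["income"].fillna("missing")

# 2c. ... or impute numeric attributes with the mean (or median).
df["age_imputed"] = df["age"].fillna(df["age"].mean())

print(df)
```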
21
Q

Conversion: Ordered to Numeric

A
  • Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
    • A 4.0
    • A- 3.7
    • B+3.3
    • B 3.0
  • Why is it important to preserve natural order?
    • To allow meaningful comparisons, e.g. Grade > 3.5
22
Q

Conversion: Nominal, Few Values

A
  • Multi-valued, unordered attributes with small no. of values
    • e.g. Color=Red, Orange, Yellow, …
    • for each value v create a binary “flag” variable C_v, which is 1 if Color = v and 0 otherwise (see the sketch below)

Nominal, many values: Ignore ID-like fields whose values are unique for each record
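A small sketch of the binary “flag” encoding for a low-cardinality nominal attribute; the Color values follow the card, everything else is illustrative.

```python
colors = ["Red", "Orange", "Yellow", "Red"]

# One binary "flag" attribute C_v per observed value v: 1 if Color == v else 0.
values = sorted(set(colors))
flags = [{f"C_{v}": int(c == v) for v in values} for c in colors]
for row in flags:
    print(row)
# {'C_Orange': 0, 'C_Red': 1, 'C_Yellow': 0} ... etc.
# (With pandas, pd.get_dummies(pd.Series(colors), prefix="C") does the same.)
```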

23
Q

Data Cleaning: Discretization

A
  • Discretization reduces the number of values for a continuous attribute
  • Why?
    • Some methods can only use nominal data
      • E.g., in ID3, Apriori, most versions of Naïve Bayes, CHAID
    • Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)
    • Some methods that handle numerical attributes assume normal distribution which is not always appropriate
  • Discretization is useful for generating a summary of data
  • Also called “binning” (see the sketch below)
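A sketch (assuming NumPy) contrasting equal-width and equal-frequency binning on a skewed toy sample; the pictures on the following cards presumably illustrate the same contrast.

```python
import numpy as np

values = np.array([1, 1, 2, 2, 3, 4, 5, 7, 9, 30, 45, 80])
k = 3  # number of bins

# Equal-width binning: bins of identical width over [min, max];
# skewed data tends to clump into the first bin.
width_edges = np.linspace(values.min(), values.max(), k + 1)
equal_width = np.digitize(values, width_edges[1:-1], right=True)

# Equal-frequency binning: bin edges at quantiles, so each bin gets
# (roughly) the same number of values.
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))
equal_freq = np.digitize(values, freq_edges[1:-1], right=True)

print("equal width edges:", width_edges, "->", equal_width)
print("equal freq  edges:", freq_edges, "->", equal_freq)
```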
24
Q

Discretization: Equal-Width

A
25
Q

Discretization: Equal-Width May Produce Clumping

A
26
Q

Discretization: Equal-Frequency

A
27
Q

Discretization: Class Dependent (Supervised Discretization)

A
  •  Treating numerical attributes as nominal discards the potentially valuable ordering information
  •  Alternative: Transform the k nominal values to k − 1 binary attributes. The (i − 1)-th binary attribute indicates whether the discretized value is less than i (see the sketch below)
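A tiny sketch of the recoding described above; the value range 1..k and the attribute names are illustrative.

```python
def ordered_to_binary(value, k):
    """Recode a discretized value in {1, ..., k} as k-1 binary attributes.
    The (i-1)-th binary attribute is 1 if the value is less than i."""
    return [int(value < i) for i in range(2, k + 1)]

k = 4
for v in range(1, k + 1):
    print(v, "->", ordered_to_binary(v, k))
# 1 -> [1, 1, 1]
# 2 -> [0, 1, 1]
# 3 -> [0, 0, 1]
# 4 -> [0, 0, 0]
```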
28
Q

Discretization Considerations

A
  • Equal Width is simplest, good for many classes
    • can fail miserably for unequal distributions
  • Equal Frequency gives better results
    • In practice, “almost-equal” height binning is used which avoids clumping and gives more intuitive breakpoints
    • Also used for clustering (when there is no “class”)
  • Class-dependent can be better for classification
    • Decision trees build discretization on the fly
    • Naïve Bayes requires initial discretization
  • Other methods exist …
29
Q

Unbalanced Target Distribution

A
  • Sometimes, classes have very unequal frequency
    • Churn prediction: 97% stay, 3% churn (in a month)
    • Medical diagnosis: 90% healthy, 10% disease
    • eCommerce: 99% don’t buy, 1% buy
    • Security: >99.99% of Germans are not terrorists
  • Similar situation with multiple classes
  • A majority-class classifier can be 97% correct, but useless (a balancing sketch follows this card)
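The next card’s picture presumably shows how to build a balanced training set; below is a minimal sketch of one common approach, down-sampling the majority class (the helper name, ratio, and toy churn data are made up). The evaluation/test set should keep the original class distribution.

```python
import random

def downsample_majority(records, label_of, ratio=1.0, seed=0):
    """Build a more balanced training set by keeping all minority-class
    records and a random sample of the majority class (ratio = majority
    records kept per minority record)."""
    random.seed(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(label_of(r), []).append(r)
    minority = min(by_class.values(), key=len)
    majority = max(by_class.values(), key=len)
    kept = random.sample(majority, min(len(majority), int(ratio * len(minority))))
    return minority + kept

# Roughly 97% "stay" / 3% "churn" toy data:
records = [{"id": i, "churn": i % 33 == 0} for i in range(1000)]
balanced = downsample_majority(records, label_of=lambda r: r["churn"])
print(sum(r["churn"] for r in balanced), "churn vs",
      sum(not r["churn"] for r in balanced), "stay")
```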
30
Q

Building Balanced Train Sets

A
31
Q

Attribute Selection

A
  • Aka feature / variable / field selection
  • If there are too many attributes, select a subset that is most relevant.
  • Remove redundant and/or irrelevant attributes
    • Rule of thumb – keep top 50 attributes
32
Q

Reasons for Attribute Selection

A
  • Simpler model
    • More transparent
    • Easier to interpret
  • Faster model induction
    • What about overall time?
  • What about the accuracy?
    • The addition of irrelevant attributes can negatively impact the performance of kNN, DT, regressions, clustering, etc.
      • Experiments: Adding a random attribute to a DT learner can decrease the accuracy by 5-10% (Witten & Frank, 2005)
      • Instance-based methods are particularly susceptible to irrelevant attributes, especially when only a few training instances are available
33
Q

Attribute Selection Heuristics

A
  • Stepwise forward selection
    • Start with empty attribute set
    • Add “best” of attributes (e.g. using entropy)
    • Add “best” of remaining attributes
    • Repeat; stop once the top n attributes (a chosen threshold) have been selected (see the sketch after this card)
  • Stepwise backward selection
    • Start with entire attribute set
    • Remove “worst” of attributes
    • Repeat until n are left.
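A sketch of greedy stepwise forward selection; evaluate() is a placeholder for whatever “goodness” score is used (information gain of the subset, accuracy of a model built on it, etc.), and the toy gains are made up.

```python
def forward_selection(attributes, evaluate, n):
    """Greedy stepwise forward selection: start from the empty set and
    repeatedly add the attribute that most improves evaluate(subset),
    until n attributes have been selected."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda a: evaluate(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy evaluation: pretend each attribute has a fixed "gain" and a subset scores
# the sum of its gains (a real evaluate() would score a model or compute gain
# on the training data).
gains = {"outlook": 0.25, "humidity": 0.15, "windy": 0.05, "temperature": 0.03}
print(forward_selection(gains, lambda subset: sum(gains[a] for a in subset), n=2))
# ['outlook', 'humidity']
```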
34
Q

Attribute Selection

A
  • Using entropy for attribute selection
    • Calculate information gain of each attribute
    • Select the n attributes with the highest information gain
  • Experiences
    • These heuristics lead to local, not necessarily global, optima
    • Nevertheless, they perform reasonably well
    • Backward selection performed better than forward selection
    • Forward selection leads to smaller attribute sets and easier models
  • Attribute selection is actually a search problem
    • Want to select subset of attributes giving most accurate model
35
Q

Basic Approaches to Attribute Selection

A
  • Remove attributes with no or little variability
    • Examine the number of distinct attribute values
    • Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp % or less of all records (see the sketch after this card)
  • Remove false predictors (“leakers”)
    • False predictors are fields correlated to target behavior, which describe events that happen at the same time or after
    • E.g. student final grade, for the task of predicting whether the student passed the course
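A pandas sketch (assuming pandas; minp, the column names, and the toy data are illustrative) of the two checks on this card: drop near-constant fields, and drop analyst-identified false predictors such as the final grade.

```python
import pandas as pd

def drop_low_variability(df, minp=1.0):
    """Drop columns where the most frequent value (including nulls) covers
    all but at most minp percent of the records."""
    keep = []
    for col in df.columns:
        top_share = df[col].value_counts(dropna=False, normalize=True).iloc[0]
        if (1 - top_share) * 100 > minp:
            keep.append(col)
    return df[keep]

# False predictors ("leakers") must be identified from domain knowledge,
# e.g. the final grade when predicting whether a student passed:
leakers = ["final_grade"]

df = pd.DataFrame({"hours_studied": [5, 2, 8, 1, 7, 3],
                   "campus": ["main"] * 6,               # near-constant field
                   "final_grade": [3.7, 1.0, 4.0, 0.7, 3.3, 1.3],
                   "passed": [1, 0, 1, 0, 1, 1]})
cleaned = drop_low_variability(df.drop(columns=leakers), minp=1.0)
print(list(cleaned.columns))   # campus (near-constant) and final_grade (leaker) are gone
```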
36
Q

Text Mining

A
  • Many text databases exist in practice
    • News articles
    • Research papers
    • Books
    • Digital libraries
    • E-mail messages
    • Web pages
37
Q

Text Mining Process

A
  • Text preprocessing
    • Syntactic/Semantic text analysis
  • Features Generation
    • Bag of words
  • Features Selection
    • Simple counting
    • Statistics
  •  Text/Data Mining
    • Classification of documents
    • Clustering of documents
  •  Analyzing result
38
Q

Text Attributes

A
  • For each word, generate a boolean attribute indicating whether the text of the instance contains the word
    • Alternatively, the attribute may indicate the number of the word’s occurrences
    • If “too many” words, keep only the more frequent ones (see the sketch after this card)
    • Stemming algorithms reduce words to their root, e.g. guarantee, guaranteeing, and guaranteed all map to the same root
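A sketch that builds boolean word attributes from the example titles on card 39 below, keeping only words that appear in at least two documents; the tokenizer and cutoff are crude placeholders (real pipelines add stemming and stop-word removal).

```python
import re
from collections import Counter

docs = ["Human machine interface for Lab ABC computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system"]

def tokenize(text):
    """Very crude tokenizer/normalizer."""
    return re.findall(r"[a-z]+", text.lower())

# Keep only the more frequent words (here: those appearing in at least 2 documents).
doc_tokens = [set(tokenize(d)) for d in docs]
df_counts = Counter(w for tokens in doc_tokens for w in tokens)
vocab = sorted(w for w, c in df_counts.items() if c >= 2)

# One boolean attribute per kept word: does the document contain it?
# (Word counts could be used instead of booleans.)
for tokens in doc_tokens:
    print({w: int(w in tokens) for w in vocab})
```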
39
Q

Data Mining: Technical Memo Example: Titles

  • c1 Human machine interface for Lab ABC computer applications
  • c2 A survey of user opinion of computer system response time
  • c3 The EPS user interface management system
  • c4 System and human system engineering testing of EPS
  • c5 Relation of user-perceived response time to error measurement
  • m1 The generation of random, binary, unordered trees
  • m2 The intersection graph of paths in trees
  • m3 Graph minors IV: Widths of trees and well-quasi-ordering
  • m4 Graph minors: A survey
A
40
Q

Data mining: “Search” versus “Discover”

A