Chapter 7: Pruning in Decision Trees and Data Preparation Flashcards

1
Q

Decision Tree Algorithms

A
  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples.
  • A goodness function is used for this purpose.
  • Typical goodness functions:
    • information gain (ID3/C4.5)
    • information gain ratio
    • gini index (CART)
2
Q

Computing Information

A
  • Information is measured in bits
    • Given a probability distribution, the information required to predict an event (e.g., whether Play is yes or no) is the distribution’s entropy
    • Entropy gives the information required in bits (this can involve fractions of bits!)
  • Formula for computing the entropy:
    • entropy(p1…pn) = - p1 logp1 -…- np logpn
3
Q

Industrial-Strength Algorithms

A
  • For an algorithm to be useful in a wide range of real-world applications it must:
    • Permit numeric attributes
    • Allow missing values
    • Be robust in the presence of noise
  • Basic schemes need to be extended to fulfill these requirements
4
Q

Pruning

A
  •  Prepruning tries to decide a priori when to stop creating subtrees
    • Halt construction of decision tree early
    • Use the same measure as for determining attributes, e.g., halt if InfoGain < threshold (see the sketch after this card)
    • Most frequent class becomes the leaf node
    • This turns out to be fairly difficult to do well in practice
      • due to the “combination-lock effect”: two attributes that individually contribute nothing can be powerful predictors when combined
  • Postpruning simplifies an existing decision tree
    • Construct complete decision tree
    • Then prune it back
    • Used in C4.5, CART
    • Needs more runtime than prepruning
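A self-contained sketch (not C4.5 itself) of an ID3-style builder with the prepruning test described above: growth stops when the best information gain falls below a threshold and the most frequent class becomes the leaf. The XOR-style toy data at the end illustrates the combination-lock effect. All names and the min_gain value are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain of splitting (rows, labels) on attribute attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, min_gain=0.1):
    """ID3-style builder with prepruning: stop when the best gain is below
    min_gain and return the most frequent class as a leaf."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    if info_gain(rows, labels, best) < min_gain:       # prepruning test
        return Counter(labels).most_common(1)[0][0]    # most frequent class
    tree = {}
    for value in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        tree[(best, value)] = build_tree(sub_rows, sub_labels,
                                         [a for a in attrs if a != best], min_gain)
    return tree

# Toy XOR data illustrating the "combination-lock effect": neither a nor b
# helps alone, so prepruning with min_gain > 0 collapses the tree to one leaf.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "yes", "yes", "no"]
print(build_tree(rows, labels, ["a", "b"], min_gain=0.1))  # single leaf
print(build_tree(rows, labels, ["a", "b"], min_gain=0.0))  # full XOR tree
```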
5
Q

Postpruning

A

Subtree replacement replaces a subtree with a single leaf node (main method).

Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent.

6
Q

When to Prune a Tree?

A
  • To determine whether a node should be replaced, compare the error rate estimate for the node with the combined (weighted) error rate of its children.
  • Replace the node with a leaf if its estimated error rate is less than the combined rate of its children (see the sketch after this card).
  • Prune only if it reduces the estimated error
    • Error on the training data is NOT a useful estimator
  • Use a hold-out set for pruning (“reduced-error pruning”)
    • Limits data you can use for training
  • C4.5’s method
    • Derive confidence interval from training data
    • Use a heuristic limit for the error rate, derived from this confidence interval, for pruning
    • Shaky statistical assumptions (because it is based on training data), but works well in practice
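A minimal sketch of the pruning decision on a hold-out set (“reduced-error pruning”); the function names and counts are hypothetical. The card says “less than”; the sketch uses “not worse than” (<=), which is how the test is often implemented.

```python
def combined_error_rate(child_errors, child_counts):
    """Weighted error rate of a subtree's children on the hold-out set."""
    return sum(child_errors) / sum(child_counts)

def should_replace_with_leaf(node_errors, node_count, child_errors, child_counts):
    """Reduced-error pruning test: replace the subtree by a leaf (majority class)
    if the leaf's hold-out error rate is not worse than the children's combined rate."""
    return node_errors / node_count <= combined_error_rate(child_errors, child_counts)

# Hypothetical hold-out counts: the leaf would misclassify 8 of 40 instances,
# while the two children together misclassify 5 + 6 of 25 + 15.
print(should_replace_with_leaf(8, 40, [5, 6], [25, 15]))   # 0.20 <= 0.275 -> True
```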
7
Q

Bernoulli Process

A
  •  The Bernoulli process is a discrete-time stochastic process consisting of a sequence of independent random variables (i.e., a sequence of Bernoulli trials)
  •  Applied to situations where an event either occurs (success) or does not occur (failure)
    • Let the probability of occurrence be p and the probability of no occurrence be q = 1 − p.
    • Let the total number of independent trials be n, and the number of successes be k.
  •  The probability of k successes in n trials is given by the probability mass function (pmf) of the binomial distribution B(n, p): P(X = k) = C(n, k) · p^k · q^(n−k) (see the sketch below)
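A small sketch of the binomial pmf using only the standard library; the example numbers are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent
    Bernoulli trials with success probability p."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# E.g. the chance of exactly 3 successes in 10 trials with p = 0.25:
print(binomial_pmf(3, 10, 0.25))   # ~0.2503
```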
8
Q

Central Limit Theorem Revisited

A
  • The central limit theorem states that the standardized average of any population of i.i.d. random variables Xi with mean μX and variance σX² is asymptotically ~ N(0, 1): Zn = (X̄n − μX) / (σX / √n)
  • Asymptotic normality implies that P(Zn < z) → Φ(z) as n → ∞, or P(Zn < z) ≈ Φ(z) for large n (a short simulation follows this card)
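A short simulation sketch (assuming NumPy is available) of the statement above: standardized means of i.i.d. non-normal variables behave approximately like N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20_000

# Heavily skewed i.i.d. population: exponential with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(reps, n))

# Standardized sample means: Z = (mean - mu) / (sigma / sqrt(n)) with mu = sigma = 1.
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

# For N(0, 1), P(Z < 1) = Phi(1) ~ 0.8413; the empirical fraction should be close.
print(np.mean(z < 1.0))
```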
9
Q

Using the Confidence Interval of a Normal Distribution

A
  • C4.5 uses a heuristic limit for the error rate, derived from the confidence interval of the error rate for pruning
  •  The x% confidence interval [−z ≤ X ≤ z] for a random variable X with zero mean is given by: Pr[−z ≤ X ≤ z] = x
    • With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]

10
Q

Confidence Limits

A
  • Confidence limits c for the standard normal distribution with 0 mean and a variance of 1:
  • There is a 25% probability of X being greater than 0.69: Pr[X ≥ 0.69] = 25%, hence Pr[−0.69 ≤ X ≤ 0.69] = 50%
  • To use this we have to transform our random variable f to have zero mean and unit variance
11
Q

Transforming f

A
  •  The tree is pruned more aggressively when the error estimate p goes up:
    • If c goes down => z goes up, and therefore p goes up
    • If n goes down => p goes up
    • with p as the (pessimistic) estimate of the error rate
12
Q

C4.5’s Method

A
  •  Error estimate e for a node (:= upper bound of confidence interval):
    • see pic (the standard formula is sketched in code after this card)
    • If c = 25% then z = 0.69 (from normal distribution)
    • f is the error on the training data 
    • n is the number of instances covered by the node
  • Even if information gain increases, e might increase as well
  • Error estimate for subtree is weighted sum of error estimates for all its leaves
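The “pic” above presumably shows C4.5’s upper confidence limit on the error rate; below is a sketch of the formula as it is usually quoted (e.g. in Witten & Frank), with z defaulting to the 0.69 used on the card. Function names and the example leaf statistics are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_from_confidence(c):
    """z such that Pr[X >= z] = c for a standard normal X
    (for c = 25% this gives ~0.67; the slides quote the rounded 0.69)."""
    return NormalDist().inv_cdf(1 - c)

def pessimistic_error(f, n, z=0.69):
    """Upper confidence limit e on the true error rate, given the observed
    training error rate f on n instances (C4.5 default: c = 25%, z = 0.69)."""
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

def subtree_error(leaf_stats):
    """Weighted sum of the leaves' error estimates (leaf_stats = list of (f, n))."""
    total = sum(n for _, n in leaf_stats)
    return sum(n * pessimistic_error(f, n) for f, n in leaf_stats) / total

print(round(z_from_confidence(0.25), 2))                 # ~0.67
print(pessimistic_error(2 / 6, 6))                       # a leaf with 2 errors in 6 -> ~0.47
# Subtree with three leaves vs. the node that would replace it (5 errors in 14):
print(subtree_error([(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]))   # ~0.51
print(pessimistic_error(5 / 14, 14))                         # ~0.45, so the subtree is pruned
```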
13
Q

C4.5 Example

A
14
Q

From Trees to Rules

A
  • Simple way: one rule for each leaf
    • Often these rules are more complex than necessary
  • C4.5rules:
    • greedily prune conditions from each rule if this reduces its estimated error (as discussed above)
  • Then
    • look at each class in turn
    • consider the rules for that class
    • find a “good” subset (guided by Occam’s Razor)
    • Finally, remove rules (greedily) if this decreases error on the training data
15
Q

C4.5 and C4.5rules: Summary

A
  • C4.5 decision tree algorithm has two important parameters
    • Confidence value (default 25%): lower values incur heavier pruning
    • Minimum number of instances in the two most popular branches (default 2)
  • Classification rules
    • C4.5rules slow for large and noisy datasets
    • Commercial version C5.0rules uses a different pruning technique
      • Much faster and a bit more accurate
16
Q

Knowledge Discovery Process

A
17
Q

Data Understanding: Quantity

A
  • Number of instances (records)
    • Rule of thumb: 5,000 or more desired (nice to have)
    • if less, results are less reliable; use special methods (boosting, …)
  • Number of attributes
    • Rule of thumb: Start out with less than 50 attributes
    • If many attributes, use attribute selection
  • Number of targets (class values)
    • Rule of thumb: >100 instances for each class
    • if very unbalanced, use stratified sampling
18
Q

Data Understanding

A
  • Visualization
  • Data summaries
    • Attribute means
    • Attribute variation
    • Attribute relationships
19
Q

Data Cleaning: Outline

A
  • Convert data to a standard format (e.g., arff or csv)
  • Unified date format
  • Missing values
  • Fix errors and outliers
  • Convert nominal fields whose values have order to numeric
  • Binning (i.e. discretization) of numeric data
20
Q

Data Cleaning: Missing Values

A
  • Missing data can appear in several forms:
    • blank/empty, “0”, “.”, “999”, “NA”, ...
  • Standardize missing value code(s)
  • Dealing with missing values:
    • Ignore records with missing values
    • Treat missing value as a separate value
    • Imputation: fill in with mean or median values (see the sketch below)
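A minimal pandas sketch (assuming pandas/NumPy are available; the columns and missing-value codes are made up) of the steps on this card: standardize the missing-value codes, then ignore, flag, or impute.

```python
import numpy as np
import pandas as pd

# Toy data with several different missing-value codes.
df = pd.DataFrame({"age":    [23, 999, 31, None, 45],
                   "income": ["50k", "NA", ".", "72k", "61k"]})

# 1. Standardize the missing-value codes to a single representation (NaN).
df = df.replace({999: np.nan, "NA": np.nan, ".": np.nan, "": np.nan})

# 2a. Ignore records with missing values ...
complete_rows = df.dropna()

# 2b. ... or treat "missing" as a separate value for nominal attributes ...
df["income_filled"] = df["income"].fillna("missing")

# 2c. ... or impute numeric attributes with the mean (or median).
df["age_imputed"] = df["age"].fillna(df["age"].mean())

print(df)
```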
21
Q

Conversion: Ordered to Numeric

A
  • Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
    • A 4.0
    • A- 3.7
    • B+3.3
    • B 3.0
  • Why is it important to preserve natural order?
    • To allow meaningful comparisons, e.g. Grade > 3.5
22
Q

Conversion: Nominal, Few Values

A
  • Multi-valued, unordered attributes with small no. of values
    • e.g. Color=Red, Orange, Yellow, …
    • for each value v create a binary “flag” variable C_v, which is 1 if Color = v and 0 otherwise (see the sketch below)

Nominal, many values: Ignore ID-like fields whose values are unique for each record
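A small sketch of the binary “flag” encoding for a low-cardinality nominal attribute; the Color values follow the card, everything else is illustrative.

```python
colors = ["Red", "Orange", "Yellow", "Red"]

# One binary "flag" attribute C_v per observed value v: 1 if Color == v else 0.
values = sorted(set(colors))
flags = [{f"C_{v}": int(c == v) for v in values} for c in colors]
for row in flags:
    print(row)
# {'C_Orange': 0, 'C_Red': 1, 'C_Yellow': 0} ... etc.
# (With pandas, pd.get_dummies(pd.Series(colors), prefix="C") does the same.)
```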

23
Q

Data Cleaning: Discretization

A
  • Discretization reduces the number of values for a continuous attribute
  • Why?
    • Some methods can only use nominal data
      • E.g., in ID3, Apriori, most versions of Naïve Bayes, CHAID
    • Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)
    • Some methods that handle numerical attributes assume normal distribution which is not always appropriate
  • Discretization is useful for generating a summary of data
  • Also called “binning” (see the sketch below)
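A sketch (assuming NumPy) contrasting equal-width and equal-frequency binning on a skewed toy sample; the pictures on the following cards presumably illustrate the same contrast.

```python
import numpy as np

values = np.array([1, 1, 2, 2, 3, 4, 5, 7, 9, 30, 45, 80])
k = 3  # number of bins

# Equal-width binning: bins of identical width over [min, max];
# skewed data tends to clump into the first bin.
width_edges = np.linspace(values.min(), values.max(), k + 1)
equal_width = np.digitize(values, width_edges[1:-1], right=True)

# Equal-frequency binning: bin edges at quantiles, so each bin gets
# (roughly) the same number of values.
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))
equal_freq = np.digitize(values, freq_edges[1:-1], right=True)

print("equal width edges:", width_edges, "->", equal_width)
print("equal freq  edges:", freq_edges, "->", equal_freq)
```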
24
Q

Discretization: Equal-Width

A
25
Q

Discretization: Equal-Width May Produce Clumping

A
26
Q

Discretization: Equal-Frequency

A
27
Q

Discretization: Class Dependent (Supervised Discretization)

A
  •  Treating numerical attributes as nominal discards the potentially valuable ordering information
  •  Alternative: Transform the k nominal values to k − 1 binary attributes. The (i − 1)-th binary attribute indicates whether the discretized value is less than i (see the sketch below)
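A tiny sketch of the recoding described above; the value range 1..k and the attribute names are illustrative.

```python
def ordered_to_binary(value, k):
    """Recode a discretized value in {1, ..., k} as k-1 binary attributes.
    The (i-1)-th binary attribute is 1 if the value is less than i."""
    return [int(value < i) for i in range(2, k + 1)]

k = 4
for v in range(1, k + 1):
    print(v, "->", ordered_to_binary(v, k))
# 1 -> [1, 1, 1]
# 2 -> [0, 1, 1]
# 3 -> [0, 0, 1]
# 4 -> [0, 0, 0]
```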
28
Q

Discretization Considerations

A
  • Equal Width is simplest, good for many classes
    • can fail miserably for unequal distributions
  • Equal Frequency gives better results
    • In practice, “almost-equal” height binning is used which avoids clumping and gives more intuitive breakpoints
    • Also used for clustering (when there is no “class”)
  • Class-dependent can be better for classification
    • Decision trees build discretization on the fly
    • Naïve Bayes requires initial discretization
  • Other methods exist …
29
Q

Unbalanced Target Distribution

A
  • Sometimes, classes have very unequal frequency
    • Churn prediction: 97% stay, 3% churn (in a month)
    • Medical diagnosis: 90% healthy, 10% disease
    • eCommerce: 99% don’t buy, 1% buy
    • Security: >99.99% of Germans are not terrorists
  • Similar situation with multiple classes
  • A majority-class classifier can be 97% correct, but useless (a balancing sketch follows this card)
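The next card’s picture presumably shows how to build a balanced training set; below is a minimal sketch of one common approach, down-sampling the majority class (the helper name, ratio, and toy churn data are made up). The evaluation/test set should keep the original class distribution.

```python
import random

def downsample_majority(records, label_of, ratio=1.0, seed=0):
    """Build a more balanced training set by keeping all minority-class
    records and a random sample of the majority class (ratio = majority
    records kept per minority record)."""
    random.seed(seed)
    by_class = {}
    for r in records:
        by_class.setdefault(label_of(r), []).append(r)
    minority = min(by_class.values(), key=len)
    majority = max(by_class.values(), key=len)
    kept = random.sample(majority, min(len(majority), int(ratio * len(minority))))
    return minority + kept

# Roughly 97% "stay" / 3% "churn" toy data:
records = [{"id": i, "churn": i % 33 == 0} for i in range(1000)]
balanced = downsample_majority(records, label_of=lambda r: r["churn"])
print(sum(r["churn"] for r in balanced), "churn vs",
      sum(not r["churn"] for r in balanced), "stay")
```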
30
Q

Building Balanced Train Sets

A
31
Q

Attribute Selection

A
  • Aka feature / variable / field selection
  • If there are too many attributes, select a subset that is most relevant.
  • Remove redundant and/or irrelevant attributes
    • Rule of thumb – keep top 50 attributes
32
Q

Reasons for Attribute Selection

A
  • Simpler model
    • More transparent
    • Easier to interpret
  • Faster model induction
    • What about overall time?
  • What about the accuracy?
    • The addition of irrelevant attributes can negatively impact the performance of kNN, DT, regressions, clustering, etc.
      • Experiments: Adding a random attribute to a DT learner can decrease the accuracy by 5-10% (Witten & Frank, 2005)
      • Instance-based methods are particularly susceptible to irrelevant attributes, especially when only a few training instances are available
33
Q

Attribute Selection Heuristics

A
  • Stepwise forward selection
    • Start with empty attribute set
    • Add “best” of attributes (e.g. using entropy)
    • Add “best” of remaining attributes
    • Repeat; stop once the top n attributes (a chosen threshold) have been selected (see the sketch after this card)
  • Stepwise backward selection
    • Start with entire attribute set
    • Remove “worst” of attributes
    • Repeat until n are left.
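A sketch of greedy stepwise forward selection; evaluate() is a placeholder for whatever “goodness” score is used (information gain of the subset, accuracy of a model built on it, etc.), and the toy gains are made up.

```python
def forward_selection(attributes, evaluate, n):
    """Greedy stepwise forward selection: start from the empty set and
    repeatedly add the attribute that most improves evaluate(subset),
    until n attributes have been selected."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda a: evaluate(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy evaluation: pretend each attribute has a fixed "gain" and a subset scores
# the sum of its gains (a real evaluate() would score a model or compute gain
# on the training data).
gains = {"outlook": 0.25, "humidity": 0.15, "windy": 0.05, "temperature": 0.03}
print(forward_selection(gains, lambda subset: sum(gains[a] for a in subset), n=2))
# ['outlook', 'humidity']
```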
34
Q

Attribute Selection

A
  • Using entropy for attribute selection
    • Calculate information gain of each attribute
    • Select the n attributes with the highest information gain
  • Experiences
    • These heuristics lead to local, not necessarily global, optima
    • Nevertheless, they perform reasonably well
    • Backward selection performed better than forward selection
    • Forward selection leads to smaller attribute sets and easier models
  • Attribute selection is actually a search problem
    • Want to select subset of attributes giving most accurate model
35
Q

Basic Approaches to Attribute Selection

A
  • Remove attributes with no or little variability
    • Examine the number of distinct attribute values
    • Rule of thumb: remove a field where almost all values are the same (e.g. null), except possibly in minp % or less of all records (see the sketch after this card)
  • Remove false predictors (“leakers”)
    • False predictors are fields correlated to target behavior, which describe events that happen at the same time or after
    • E.g. student final grade, for the task of predicting whether the student passed the course
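A pandas sketch (assuming pandas; minp, the column names, and the toy data are illustrative) of the two checks on this card: drop near-constant fields, and drop analyst-identified false predictors such as the final grade.

```python
import pandas as pd

def drop_low_variability(df, minp=1.0):
    """Drop columns where the most frequent value (including nulls) covers
    all but at most minp percent of the records."""
    keep = []
    for col in df.columns:
        top_share = df[col].value_counts(dropna=False, normalize=True).iloc[0]
        if (1 - top_share) * 100 > minp:
            keep.append(col)
    return df[keep]

# False predictors ("leakers") must be identified from domain knowledge,
# e.g. the final grade when predicting whether a student passed:
leakers = ["final_grade"]

df = pd.DataFrame({"hours_studied": [5, 2, 8, 1, 7, 3],
                   "campus": ["main"] * 6,               # near-constant field
                   "final_grade": [3.7, 1.0, 4.0, 0.7, 3.3, 1.3],
                   "passed": [1, 0, 1, 0, 1, 1]})
cleaned = drop_low_variability(df.drop(columns=leakers), minp=1.0)
print(list(cleaned.columns))   # campus (near-constant) and final_grade (leaker) are gone
```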
36
Q

Text Mining

A
  • Many text databases exist in practice
    • News articles
    • Research papers
    • Books
    • Digital libraries
    • E-mail messages
    • Web pages
37
Q

Text Mining Process

A
  • Text preprocessing
    • Syntactic/Semantic text analysis
  • Features Generation
    • Bag of words
  • Features Selection
    • Simple counting
    • Statistics
  •  Text/Data Mining
    • Classification of documents
    • Clustering of documents
  •  Analyzing result
38
Q

Text Attributes

A
  • For each word, generate a boolean attribute indicating whether the text of the instance contains the word
    • Alternatively, the attribute may indicate the number of the word’s occurrences
    • If “too many” words, keep only the more frequent ones (see the sketch after this card)
    • Stemming algorithms reduce words to their root, e.g. guarantee, guaranteeing, and guaranteed all map to the same root
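A sketch that builds boolean word attributes from the example titles on card 39 below, keeping only words that appear in at least two documents; the tokenizer and cutoff are crude placeholders (real pipelines add stemming and stop-word removal).

```python
import re
from collections import Counter

docs = ["Human machine interface for Lab ABC computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system"]

def tokenize(text):
    """Very crude tokenizer/normalizer."""
    return re.findall(r"[a-z]+", text.lower())

# Keep only the more frequent words (here: those appearing in at least 2 documents).
doc_tokens = [set(tokenize(d)) for d in docs]
df_counts = Counter(w for tokens in doc_tokens for w in tokens)
vocab = sorted(w for w, c in df_counts.items() if c >= 2)

# One boolean attribute per kept word: does the document contain it?
# (Word counts could be used instead of booleans.)
for tokens in doc_tokens:
    print({w: int(w in tokens) for w in vocab})
```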
39
Q

Data Mining: Technical Memo Example: Titles

  • c1 Human machine interface for Lab ABC computer applications
  • c2 A survey of user opinion of computer system response time
  • c3 The EPS user interface management system
  • c4 System and human system engineering testing of EPS
  • c5 Relation of user-perceived response time to error measurement
  • m1 The generation of random, binary, unordered trees
  • m2 The intersection graph of paths in trees
  • m3 Graph minors IV: Widths of trees and well-quasi-ordering
  • m4 Graph minors: A survey
A
40
Q

Data mining: “Search” versus “Discover”

A