The CRISP Data Mining Process
Precision of Classifier Performance

Holdout procedure: split the data into two sets

E.g., 66% training, 33% testing (see the sketch below)

The more training instances, the better (beyond a certain size, performance may stagnate)

The more test data, the better the estimate of the error rate

Use k-fold cross-validation

Split the data into k sets

Generate k models, each trained on k-1 sets (leave one set out)

Test each model on the left-out k-th set

Use leave-one-out (jackknife)

Generate n models, each trained on n-1 instances

Apply each model to the omitted n-th instance
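A minimal sketch of the holdout procedure in Python, assuming scikit-learn is available; the iris data and the decision tree are stand-ins for any dataset and classifier:

# Hypothetical holdout sketch: ~66% training, ~33% testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Reserve ~1/3 of the data for testing, train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
error_rate = 1.0 - model.score(X_test, y_test)
print(f"Holdout error rate: {error_rate:.3f}")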
Holdout Procedures
 Holdout procedure:
 Reserve some data for testing (usually ~1/3), selected at random
 Use remaining data for training
 Plentiful data: no problem!
 Common case: data set is limited
 Problems:
 Want both sets as large as possible
 Want both sets to be representative
"Smart" Holdout
 Simple check: Are the proportions of classes about the same in each data set?
 Stratified holdout
 Guarantee that classes are (approximately) proportionally represented in the test and training set
 Repeated holdout (in addition): repeat with different random splits multiple times to obtain more independent test data
 Randomly select the holdout set several times and average the error rate estimates (sketched below)
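A sketch of repeated stratified holdout, again assuming scikit-learn; the dataset and classifier are placeholders:

# Sketch: repeated stratified holdout; average the error estimates.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10 random stratified splits, each reserving ~1/3 for testing.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.33, random_state=0)
errors = []
for train_idx, test_idx in splitter.split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    errors.append(1.0 - model.score(X[test_idx], y[test_idx]))

print(f"Mean error: {np.mean(errors):.3f} +/- {np.std(errors):.3f}")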
Cross-Validation
Holdout w/ Cross-Validation
 Fixed number of k partitions of the data (folds)
 In turn: each partition is used for testing and the remaining instances for training
 Finally, each instance is used for testing exactly once
 May use stratification
 Standard practice: stratified ten-fold cross-validation
 The error rate is estimated by averaging the per-fold error rates (see the sketch below)
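A sketch of stratified ten-fold cross-validation with scikit-learn (dataset and classifier again stand-ins):

# Sketch: stratified ten-fold cross-validation; the error rate is the
# average over the ten folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(f"CV error rate: {1.0 - accuracies.mean():.3f}")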
Leave-One-Out Holdout
 n-fold cross-validation, where n is the number of instances in the data set
 Use all but one instance for training
 Each iteration is evaluated by predicting the omitted instance (see the sketch below)
 Advantages / Disadvantages
 Maximum use of the data for training
 Deterministic (no random sampling of test sets)
 High computational cost
 Non-stratified sample!
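A leave-one-out sketch using scikit-learn's LeaveOneOut splitter (placeholder data and model):

# Sketch: leave-one-out (jackknife) evaluation; n models, each tested
# on the single omitted instance. Deterministic but expensive.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(f"LOO error rate: {1.0 - scores.mean():.3f}")  # one score per instance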
Comparing Mining Algorithms
 Suppose we have two algorithms
 Obtain two different models
 Estimate the error rates for the two models
 Compare estimates
 e^{(1)} < e^{(2)}?
 Select the better one
 Problem?
 Are there “significant” differences in the error rates?
Comparing Random Estimates
 Estimated error rate is just an estimate (random)
 Student’s paired t-test tells us whether the means of two samples are significantly different
 Construct a t-test statistic (see the sketch below)

Need variance as well as point estimates

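A sketch of a paired t-test on per-fold error rates, assuming scipy and scikit-learn; the two classifiers are evaluated on the same folds. Note that CV folds share training data, so the test's independence assumption holds only approximately; treat the result as a rough check:

# Sketch: paired t-test on per-fold error rates of two classifiers
# evaluated on the same cross-validation folds.
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both

err_a = 1.0 - cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
err_b = 1.0 - cross_val_score(GaussianNB(), X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(err_a, err_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # small p => significant difference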
Counting the Costs of Classification Errors
 In practice, different types of classification errors often incur different costs. Examples:
 Predicting when customers leave a company for competitors (attrition/churn prognosis)
 Much more costly to lose a valuable customer
 Much less costly to act on a customer falsely flagged as leaving
 Loan decisions
 Oilslick detection
 Fault diagnosis
 Promotional mailing
Direct Marketing Paradigm
 Find most likely prospects to contact
 Not everybody needs to be contacted
 Number of targets is usually much smaller than number of prospects
 Typical applications (you are only targeting a small part of the population)
 retailers, catalogues, direct mail (and email)
 customer acquisition, cross-sell, attrition prediction
 ...
Direct Marketing Evaluation
 Accuracy on the entire dataset is not the right measure (in a decision tree, each leaf node provides a probability)
 Approach
 develop a target model
 score all prospects and rank them by decreasing score
 select top P% of prospects for action
 How do we decide what the best selection is? (see the sketch below)
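A sketch of the score-and-select step in numpy; the scores are simulated here, but in practice they would come from a trained model's predicted probabilities:

# Sketch: rank prospects by predicted response probability and select
# the top P% for action. 'scores' stands in for model output.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1_000_000)   # stand-in for P(responder) per prospect
P = 0.05                         # act on the top 5% of prospects

n_select = int(len(scores) * P)
selected = np.argsort(scores)[::-1][:n_select]  # indices, best first
print(f"Contacting {n_select} of {len(scores)} prospects")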
Generating a Gain Curve
 Instances are sorted according to their predicted probability of being positive
 In a gain curve:
 the x axis is the sample size
 the y axis is the number of positives (see the sketch below)
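A small numpy sketch of the gain curve points; labels and scores are toy values:

# Sketch: points of a gain curve. Sort instances by predicted
# probability (descending); x = sample size, y = cumulative positives.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])           # toy labels
y_prob = np.array([.9, .8, .75, .6, .55, .4, .3, .1])  # toy model scores

order = np.argsort(y_prob)[::-1]             # rank by decreasing score
cum_positives = np.cumsum(y_true[order])     # y axis
sample_size = np.arange(1, len(y_true) + 1)  # x axis
print(list(zip(sample_size, cum_positives)))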
Gain Curve: Random List vs. Model-ranked List
Lift Curve
ROC Curves
 Differences to gain chart:
 the y axis shows the percentage of true positives in the sample (rather than the absolute number):
 TP rate = (# of T predicted as T) / #T
 the x axis shows the percentage of false positives in the sample (rather than the sample size):
 FP rate = (# of F predicted as T) / #F
 i.e., instances predicted as true that are actually false
 ROC curves are similar to gain curves
 “ROC” stands for “receiver operating characteristic”
 Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
 Go through all sample sizes and plot FP vs. TP (see the sketch below)
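A sketch of ROC points via scikit-learn; the labels and scores are the same toy values as in the gain curve sketch:

# Sketch: ROC points; TP rate vs. FP rate over all score thresholds
# (equivalently, over all sample sizes of the ranked list).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([.9, .8, .75, .6, .55, .4, .3, .1])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print(list(zip(fpr, tpr)))
print(f"AUC = {roc_auc_score(y_true, y_prob):.3f}")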
ROC Curves for Two Classifiers
Comparison Studies of Classification Methods
 Large number of classification techniques from the fields of machine learning, neural networks, and statistics
 Many empirical comparisons in the 80s and 90s, with contradictory results
 StatLog Project (mid-90s)
 More than 20 methods and 20 datasets
 Performance of methods depends very much on the data set
 No general guidelines
 A comparison study is advisable in special cases
 State-of-the-practice
 Comparisons of several methods on a particular data set
 Sometimes even automated “mass modeling”
 Method ranking
“Prime Candidates” – Recommendations from StatLog

The following methods should be considered in real-world comparison studies:

Logistic Regression (Discriminant Analysis)

Decision trees

k-Nearest Neighbour

Non-parametric statistical methods

Neural Networks
 Decision trees and logistic regression are widely used in practice
 High performance / low error rate
 Speed of model building and classification
 Easy to explain
 as compared to NN or kNN
A Real World Example of Classification: Campaign Management
 Problem definition:
 Marketing campaign for online billing
 Selection of a few thousand out of a few million customers who are potential responders to a campaign
 Questions
 Probability of customer response?
 Which customer attributes are relevant for characterizing a responder?
Model Selection Criteria: Finding the Best Model
 Can we find a single ‘best’ model / classifier?
 Model selection criteria attempt to find a good compromise between:
 The complexity of a model
 Its prediction accuracy on the training data
 Reasoning: a good model is a simple model that achieves high accuracy on the given data
 Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts
Elegance vs. Errors
 Theory 1: very simple, elegant theory that explains the data almost perfectly
 Theory 2: significantly more complex theory that reproduces the data without mistakes
 Theory 1 is probably preferable
 Classical example: Kepler’s three laws on planetary motion
 Less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles
The MDL Principle
 MDL principle is a model selection criterion
 MDL stands for minimum description length
 The description length is defined as:
 space required to describe a theory
 +
 space required to describe the theory’s mistakes
 In our case the theory is the classifier and the mistakes are the errors on the training data
 Aim: we want a classifier with minimal DL (toy sketch below)
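A toy Python illustration of the trade-off; the bit costs are invented for the sketch, not a real encoding:

# Toy sketch: description length = bits for the theory + bits for its
# mistakes. The bit costs below are invented for illustration only.
def description_length(model_size_bits: int, n_errors: int,
                       bits_per_error: int = 10) -> int:
    return model_size_bits + n_errors * bits_per_error

# A small tree with a few errors can beat a large tree with none.
simple = description_length(model_size_bits=50, n_errors=3)     # 80 bits
complex_ = description_length(model_size_bits=500, n_errors=0)  # 500 bits
print(min(simple, complex_))  # MDL prefers the simple theory here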
Inductive Learning Systems
Randomness vs. Regularity
 0110001101...
 random string = incompressible = maximal information
 010101010101010101
 regularity of repetition allows compression
 Many forms of learning can be described as data compression in the following sense:
 if the training set is ‘representative’, then regularities in the training set will also be representative of regularities in an unseen test set
Kolmogorov Complexity
 The Kolmogorov complexity (K) of a binary object is the length of the shortest program that generates this object on a universal Turing machine
 Random strings are not compressible
 A message with low Kolmogorov complexity is compressible
 K as complexity measure is incomputable
 So in practical applications it always needs to be approximated, e.g. by Lempel-Ziv compression (as used in zip and unzip) or other compressors (see the demo below)
 Ray Solomonoff founded the theory of universal inductive inference, which draws on Kolmogorov complexity. Kolmogorov and Solomonoff developed the concept independently.
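A quick Python demo of this approximation using zlib (a Lempel-Ziv-based compressor, as used in zip): regular strings shrink, random ones do not:

# Sketch: approximating (in)compressibility with zlib.
import os
import zlib

regular = b"01" * 500        # pure repetition, 1000 bytes
random_ = os.urandom(1000)   # random bytes, 1000 bytes

print(len(zlib.compress(regular)))  # small: regularity is compressible
print(len(zlib.compress(random_)))  # ~1000 or more: randomness is not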
Kolmogorov Complexity
 Let x be a finite length binary string and U a universal computer.
 Let U(p) be the output of U when presented with program p.
 The Kolmogorov complexity of string x is the minimal description length of x; formally:
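With \ell(p) denoting the length of program p, this reads:

K_U(x) = \min \{ \ell(p) : U(p) = x \}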
Overview of LZ Algorithms
 The Lempel-Ziv algorithm can be used as an approximation of the Kolmogorov complexity.
 Demonstration
 Take a simple alphabet containing only two letters, a and b
 Create a sample stream of text
 aaababbbaaabaaaaaaabaabb
LZ Algorithms
 Rule: separate this stream of characters into pieces of text so that each piece is the shortest string of characters not seen so far.
 aaababbbaaabaaaaaaabaabb
 Applying the rule yields: a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb
Indexing
Before compression, the pieces of text from the breaking down process are indexed from 1 to n:
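Applying the rule to the sample stream, the indexed pieces (reconstructed from the parsing rule above) are:

Index: 1  2   3  4   5   6    7   8     9    10
Piece: a  aa  b  ab  bb  aaa  ba  aaaa  aab  aabb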
LZ
 Indices are used to number the pieces of data.
 The empty string (start of text) has index 0.
 The piece indexed by 1 is a. Thus a, together with the initial (empty) string, is numbered 0a.
 String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a.
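A small Python sketch of this LZ78-style parsing and indexing (a simplified illustration, not a full LZ compressor):

# Sketch of LZ78-style parsing: each new phrase is the shortest string
# not seen so far, coded as (index of longest known prefix, new char).
def lz78_parse(text: str):
    phrases = {"": 0}          # phrase -> index; index 0 = empty string
    codes, current = [], ""
    for ch in text:
        if current + ch in phrases:
            current += ch      # keep extending a known phrase
        else:
            codes.append((phrases[current], ch))
            phrases[current + ch] = len(phrases)
            current = ""
    if current:                # flush a trailing, already-known phrase
        codes.append((phrases[current[:-1]], current[-1]))
    return codes

print(lz78_parse("aaababbbaaabaaaaaaabaabb"))
# [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'), (2,'a'), (3,'a'),
#  (6,'a'), (2,'b'), (9,'b')]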
Good Compression?
 In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
 Example: 16 bits to transmit the code 2b (8 bits for the index, 8 for the character) instead of the 24 bits (8 + 8 + 8) needed for the actual text aab.
MDLapproach to Learning
 Occam’s Razor
 “Among equivalent models choose the simplest one.”
 Minimum Description Length (MDL)
 “Select model that describes data with minimal #bits.”
 model = shortest program that outputs data
 length of program = Kolmogorov Complexity
 Learning = finding regularities = compression