Chapter 8: Evaluation of Classifiers Flashcards Preview

Business Analytics > Chapter 8: Evaluation of Classifiers > Flashcards

Flashcards in Chapter 8: Evaluation of Classifiers Deck (33):

The CRISP Data Mining Process 


Precision of Classifier Performance 

  • Holdout procedure - split data into two sets

    • E.g. 66% training, 33% testing (aka holdout procedure)

    • The more training instances, the better (after a certain size the performance might stagnate)

    • The more test data, the better the estimation of the error rate 

  • Use k-fold cross validation

    • Split into k sets

    • Generate k models with k-1 sets (leave one set out) 

    • Test each model on kth set 

  • Use leave-one-out (jackknife)

    • Generate n models on (n-1) instances

    • Apply model to nth instance 


Holdout Procedures 

  • Holdout procedure:
    • Reserve some data for testing (usually ~1/3) [ select random}
    • Use remaining data for training
  • Plentiful data - no problem!
  • Common case: data set is limited
  • Problems:
    • Want both sets as large as possible
    • Want both sets to be representative 


"Smart" Holdout 

  • Simple check: Are the proportions of classes about the same in each data set?
  • Stratified holdout
    • Guarantee that classes are (approximately) proportionally represented in the test and training set
  • Repeated holdout (in addition) Do with different sets and multiple time to get independent Test data
    • Randomly select holdout set several times and average the error rate estimates 


Cross Validation 


Holdout w/ Cross-Validation 

  • Fixed number of k partitions of the data (folds)
  • In turn: each partition is used for testing and the remaining
  • instances for training
    • Finally each instance is used for testing once
  • May use stratification
  • Standard practice: stratified tenfold cross-validation Error rate is estimated by taking the average of error rates 


Leave-One-Out Holdout 

  • n-Fold Cross-Validation
    • n instances are in the data set
    • Use all but one instance for training
    • Each iteration is evaluated by predicting the omitted instance
  • Advantages / Disadvantages
    • Maximum use of the data for training
    • Deterministic (no random sampling of test sets)
    • High computational cost
    • Non-stratified sample! 


Comparing Mining Algorithms 

  • Suppose we have two algorithms
    • Obtain two different models
    • Estimate the error rates for the two models
    • Compare estimates
      • e(1) (2)?
    • Select the better one
  • Problem?
    • Are there “significant” differences in the error rates? 


Comparing Random Estimates 

  • Estimated error rate is just an estimate (random)
  • Student’s paired t-test tells us whether the means of two samples are significantly different
  • Construct a t-test statistic 
    • Need variance as well as point estimates


Counting the Costs of comparing classification errors

  • In practice, different types of classification errors often incur different costs Examples:
    • Predicting when customers leave a company for competitors (attrition/churn prognosis)
      • Much more costly to lose a valuable customer
      • Much less costly to act on a false customer 
  • Loan decisions
  • Oil-slick detection
  • Fault diagnosis
  • Promotional mailing 


Direct Marketing Paradigm 

  • Find most likely prospects to contact
  • Not everybody needs to be contacted
  • Number of targets is usually much smaller than number of prospects
  • Typical applications (You are only looking for small part)
    • retailers, catalogues, direct mail (and e-mail)
    • customer acquisition, cross-sell, attrition prediction
    •  ... 


Direct Marketing Evaluation 

  • Accuracy on the entire dataset is not the right measure (in decision tree leaf node provides propability)
  • Approach
    • develop a target model
    • score all prospects and rank them by decreasing score
    • select top P% of prospects for action
  • How to decide what is the best selection? 


Generating a Gain Curve 

  • Instances are sorted according to their predicted probability:
  • In a gain curve

    • x axis is sample size

    • y axis is number of positives 


Gain Curve: Random List vs. Model-ranked List 


Lift Curve 


ROC Curves 

  • Differences to gain chart:
    • y axis shows percentage of true positives in sample (rather than absolute number):
      • TP=(#T predicted T)/#T
    • x axis shows percentage of false positives in sample (rather than sample size):
      • FP=(#F predicted T)/#F
      • provided by true but false
  • ROC curves are similar to gain curves
    • “ROC” stands for “receiver operating characteristic”
  • Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
  • Go throught all sizes of a sample and plot FP vs. TP 


ROC Curves for Two Classifiers 


Comparison Studies of Classification Methods 

  • Large number of classification techniques from the field of ML, NN and Statistics
  • Many empirical comparisons in the 80‘s and 90‘s with contradictory results
  • StatLog Project (mid 90‘s)
    • More than 20 methods and 20 datasets
    • Performance of methods depends very much on the data set
    • No general guidelines
    • Comparison study is advisable in special cases
  • State-of-the-practice
    • Comparisons of several methods on a particular data set
    • Sometimes even automated „mass modeling“
    • Method tanking : 


„Prime Candidates“ – Recommendations from StatLog 

  • Following methods should be considered in real-world comparison studies 

    • Logistic Regression (Discriminant Analysis) 

    •  Decision trees

    • K-Nearest Neighbour

    • Nonparametric statistical methods

    • Neural Networks

  • Decision trees and logistic regression are widely used in practice
    • High performance / low error rate
    • Speed of model building and classification
    • Easy to explain
    • as compared to NN or kNN 


A Real World Example of Classification: Campaign Management 

  • Problem definition:
    • Marketing campaign for online billing
    • Selection of a few tousends from a few million customers, who are potential responders to a campaign
  • Questions
    • Probability of customer response?
    • Which customer attributes are relevant for characterizing a responder? 


Model Selection Criteria - best model

  • Can we find a single ‘best’ model / classifier?
  • Model selection criteria attempt to find a good compromise between:
    • The complexity of a model
    • Its prediction accuracy on the training data
  • Reasoning: a good model is a simple model that achieves high accuracy on the given data
  • Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts 


Elegance vs. Errors 

  • Theory 1: very simple, elegant theory that explains the data almost perfectly
  • Theory 2: significantly more complex theory that reproduces the data without mistakes
  • Theory 1 is probably preferable
  • Classical example: Kepler’s three laws on planetary motion
    • Less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles 


The MDL Principle 

  • MDL principle is a model selection criterion
    • MDL stands for minimum description length
  • The description length is defined as:
    • space required to describe a theory
    • +
    • space required to describe the theory’s mistakes
  • In our case the theory is the classifier and the mistakes are the errors on the training data
  • Aim: we want a classifier with minimal DL


Inductive Learning Systems 


Randomness vs. Regularity 

  • 0110001101...
    • random string=incompressible=maximal information
  • 010101010101010101
    • regularity of repetition allows compression
    • If the training set is ‘representative’ then regularities in the training set will be representative for regularities in an unseen test set.
  • A lot of forms of learning can be described as data compression in the following sense: 


Kolmogorov Complexity 

  • The Kolmogorov complexity (K) of a binary object is the length of the shortest program that generates this object on a universal Turing machine
    • Random strings are not compressible
    • A message with low Kolmogorov complexity is compressible
  • K as complexity measure is incomputable
    • So in practical applications it always needs to be approximated, e.g. by Lempel-Ziv (used in zip and unzip) compression or others
  • Ray Solomonoff founded the theory of universal inductive inference, which draws on Kolmogorov complexity. Kolmogorov and Solomonoff invented the concept in parallel. 


Kolmogorov Complexity 

  • Let x be a finite length binary string and U a universal computer.
  • Let U(p) be the output of U when presented with program p.
  • The Kolmogorov complexity of string x is the minimal description length of x.


Overview of LZ Algorithms 

  • The Lempel-Ziv algorithm can be used as an approximation of the Kolmogorov complexity.
  • Demonstration
    • Take a simple alphabet containing only two letters, a and b
    • Create a sample stream of text
  • aaababbbaaabaaaaaaabaabb 


LZ Algorithms 

  • Rule: Separate this stream of characters into pieces of text so that the shortest piece of data is the string of characters that we have not seen so far.
    • aaababbbaaabaaaaaaabaabb
    • a|aa|b|ab|bb|aaa|ba|aaaa|aab|aabb  



Before compression, the pieces of text from the breaking- down process are indexed from 1 to n: 



  • Indices are used to number the pieces of data.
    • The empty string (start of text) has index 0.
    • The piece indexed by 1 is a. Thus a, together with the initial string, must be numbered Oa.
    • String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a. 


Good Compression? 

  • In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
  • Example: 16 bits to transmit the code 2b instead of 24 bits (8 + 8 + 8) needed for the actual text aab. 


MDL-approach to Learning 

  • Occam’s Razor
    • “Among equivalent models choose the simplest one.”
  • Minimum Description Length (MDL)
    • “Select model that describes data with minimal #bits.”
    • model = shortest program that outputs data
    •  length of program = Kolmogorov Complexity
  • Learning = finding regularities = compression