Chapter 5: Naive Bayes Flashcards Preview

Business Analytics > Chapter 5: Naive Bayes > Flashcards

Flashcards in Chapter 5: Naive Bayes Deck (25):

Formal Definition of Classification 

  • Naive Byes = form of classification
  • Classification: Given a database 𝐷 = {𝑥1, 𝑥2, ... , 𝑥𝑛} of tuples (items, records) and a set of classes 𝐶 = {𝐶1, 𝐶2, ... , 𝐶𝑚}, the classification problem is to define a mapping 𝑓: 𝐷 𝐶 where each xi is assigned to one class. A class, 𝐶𝑗, contains precisely those tuples mapped to it; that is,
    𝐶𝑗 = {𝑥𝑖|𝑓(𝑥𝑖) = 𝐶𝑗, 1 𝑖 𝑛, and 𝑥𝑖 element 𝐷}.
    • The logistic regression can be used for classification.
  • Prediction is similar, but usually implies a mapping to numeric values instead of a class 𝐶𝑗 


Example Applications 

  • Determine if a bank customer for a loan is a low, medium, or high risk customer.
  • Churn prediction (typically a classification task). Leaves or not ) Determine if a sequence of credit card purchases indicates questionable behavior.
  • Identify customers that may be willing to purchase particular insurance policies.
  • Identify patterns of gene expression that indicate the patient has cancer.
  • Identify spam mail etc. 


Algorithms for Classification 

  • (Logistic) Regression

  • Rudimentary Rules (e.g., 1R)

  • Statistical Modeling (e.g., Naïve Bayes)

  • Decision Trees: Divide and Conquer

  • Classification Rules (e.g. PRISM)

  • Instance-Based Learning (e.g. kNN)

  • Support Vector Machines


1-Rule (1R) 

  • Generate a one-level decision tree

  • One attribute – easy to explain

  • Basic idea: 

    • Rules "testing" a single attribute

    • Classify according to frequency in training data

    • Evaluate error rate for each attribute

    • Choose the best attribute

    • That’s all!

  • Performs quite well, even compared to much more sophisticated algorithms!

    • Often the underlying structure of data is quite simple

      “Very simple classification rules perform well on most commonly used datasets” Holte, 1993 


Apply 1R on weather data

A image thumb

Other Features of naive bayes

  • Missing Values

    • Include “dummy” attribute value missing

  • Numeric Values

    • Discretization :
      • Sort training data by attribute value
      • Split range of numeric temperature values into categories 
      • Threshold values 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84 between eight categories 

      • 72 can be removed and replaced by 73.5 to get a bigger class

      • Problem of too many temperature classes -> define a min class size
        - Merge adjacent partitions having the same majority class 

A image thumb

Naïve Bayes Classifier 

  • 1R uses only a single attribute for classification
  • Naive Bayes classifier allows all attributes to contribute equally
  • Assumes
    • All attributes equally important
    • All attributes independent
      • This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute
  • Although based on assumptions that are almost never correct, this scheme works well in practice! 


Bayes Theorem: Some Notation 

  • Let 𝑃(𝑒) represent the prior or unconditional probability that proposition 𝑒 is true. 
    • Example: Let 𝑒 represent that a customer is a high credit risk. 𝑃(𝑒) = 0.1 means that there is a 10% chance a given customer is a high credit risk. 

  • Probabilities of events change when we know something about the world

    • The notation 𝑃(𝑒|h) is used to represent the conditional or posterior probability of 𝑒

    • Read “the probability of e given that all we know is h.” 
      ​- 𝑃(𝑒 = h𝑖𝑔h 𝑟𝑖𝑠𝑘| h = 𝑢𝑛𝑒𝑚𝑝𝑙𝑜𝑦𝑒𝑑) = 0.60 

  • The notation 𝑃(𝐸) is used to represent the probability distribution of all possible values of a random variable

    • 𝑃(𝑅𝑖𝑠𝑘) = < 0.7,0.2,0.1 > 


Conditional Probability and Bayes Rule 

  • The product rule for conditional probablities 
    • 𝑃 (𝑒 | h) = 𝑃(𝑒 ∩ h)/𝑃(h) 

    • 𝑃(𝑒 ∩ h) = 𝑃(𝑒|h)𝑃(h) = 𝑃(h|𝑒)𝑃(𝑒) (product rule)

    • 𝑃(𝑒 ∩ h) = 𝑃(𝑒)𝑃(h) (for independent random variables) 

  • Bayes’ rule relates conditional probabilities

    • 𝑃(𝑒 ∩ h) = 𝑃(𝑒|h)𝑃(h)

    • 𝑃(𝑒 ∩ h) = 𝑃(h|𝑒)𝑃(𝑒)

  • P(h |e) = P(e |h)P(h)  / P(e) 

A image thumb

Bayes theorem

A image thumb

Bayes cancer example

A image thumb

Frequency Tables

A image thumb

Naive Bayes – Probabilities 

A image thumb

Predicting a New Day 

A image thumb

Naive Bayes - Summary 

  • Want to classify a new instance (𝑒1, 𝑒2, ... , 𝑒𝑛) into finite number of categories from the set h.

    • Choose the most likely classification using

    • Bayes theorem MAP (maximum a posteriori classification) 

    • Are assumptions stronger or less stronger? All attributes are treated equally. So be careful which attribute to include. Maybe it is without any relation and corrupts the data

  • Assign the most probable category h𝑀𝐴𝑃 given (𝑒1, 𝑒2, ... , 𝑒𝑛), i.e. the and maximum likelihood

  • “Naive Bayes” since the attributes are treated as independent only then you can multiply the probabilities 


Bayesian Classifier – The Zero Frequency Problem 

A image thumb

Bayesian Classifier – The Zero Frequency Problem - Solution

  • What if an attribute value doesn’t occur with every class value P(Outlook = overcast | no)=0
  • Remedy: add 1 to the numerator for every attribute value-class combination, and the probability can never be zero

A image thumb

Missing Values 

  • Training: instance is not included in frequency count for attribute value- class combination
  • Classification: attribute will be omitted from calculation Example: 

A image thumb

Dealing with Numeric Attributes 

  • Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
  • The probability density function for the normal distribution is defined by two parameters:
    • The sample mean: 
    • The standard deviation : 
    • The density function f(x):  

A image thumb

Numeric Data: Unknown Distribution

  • What if the data distribution#does not follow a known distribution.

  • In this case we need a mechanism to estimate the density distribution.

  • A simple and intuitive approach is based on kernel density estimation. ( Model irrgolor pnpasility density function

  • Consider a random variable 𝑋 whose distribution 𝑓(𝑋) is unknown but a sample with a non-uniform distribution 


Kernel Density Estimation 

  • We want to derive a function 𝑓(𝑥) such that

  • (1)  𝑓(𝑥) is a probability density function, i.e.
    f (x)dx= 1

  • (2)  𝑓(𝑥) is a smooth approximation of the data points in 𝑋

  • (3)  𝑓(𝑥) can be used to estimate values 𝑥* which are not in 

A image thumb

Discussion of Naïve Bayes 

  • Naïve Bayes works surprisingly well (even if independence assumption is clearly violated)

  • Domingos, Pazzani: On the Optimality of the Simple Bayesian Classifier under Zero-One-Loss, Machine Learning (1997) 29.

  • However: adding too many redundant attributes will cause problems (e.g. identical attributes)

    • Note also: many numeric attributes are not normally distributed Time complexity

  • Calculating conditional probabilities: Time 𝑂(𝑛) where n is the number of instances

  • Calculating the class: Time 𝑂(𝑐𝑎) where c is the number of classes, a the attributes 


Bayesian (Belief) Networks: Multiple Variables with Dependency 

  • Naïve Bayes assumption of conditional independence is often too restrictive

  • Bayesian Belief network (Bayesian net) describe conditional independence among subsets of attributes: combining prior knowledge about dependencies among variables with observed training data.

  • Graphical representation: directed acyclic graph (DAG), one node for each attribute

    • Overall probability distribution factorized into component distributions

    • Graph’s nodes hold component distributions (conditional distributions) 


Probability Laws 

A image thumb