Formal Definition of Classification
 Naive Bayes is a form of classification.
 Classification: Given a database D = {x_1, x_2, ..., x_n} of tuples (items, records) and a set of classes C = {C_1, C_2, ..., C_m}, the classification problem is to define a mapping f: D → C where each x_i is assigned to one class. A class C_j contains precisely those tuples mapped to it; that is,
C_j = {x_i | f(x_i) = C_j, 1 ≤ i ≤ n, and x_i ∈ D}.
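As a minimal illustration of this definition, the sketch below maps a toy database D to classes via a made-up threshold rule (the income values and the cutoff are hypothetical, not from the slides):

```python
# A classifier is a mapping f: D -> C assigning every tuple x_i to
# exactly one class C_j. The threshold rule below is invented for
# illustration only.

def f(x):
    """Classify a loan applicant's income (hypothetical rule)."""
    return "high risk" if x < 30000 else "low risk"

D = [12000, 45000, 28000, 80000]
C_high = [x for x in D if f(x) == "high risk"]  # C_j = {x_i | f(x_i) = C_j}
C_low = [x for x in D if f(x) == "low risk"]

print(C_high)  # [12000, 28000]
print(C_low)   # [45000, 80000]
```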
 Logistic regression can be used for classification.
 Prediction is similar, but usually implies a mapping to numeric values instead of a class 𝐶𝑗
Example Applications
 Determine if a bank customer applying for a loan is a low, medium, or high risk customer.
 Churn prediction, i.e. whether a customer leaves or not (typically a classification task).
 Determine if a sequence of credit card purchases indicates questionable behavior.
 Identify customers that may be willing to purchase particular insurance policies.
 Identify patterns of gene expression that indicate the patient has cancer.
 Identify spam mail etc.
Algorithms for Classification

(Logistic) Regression

Rudimentary Rules (e.g., 1R)

Statistical Modeling (e.g., Naïve Bayes)

Decision Trees: Divide and Conquer

Classification Rules (e.g. PRISM)

Instance-Based Learning (e.g. k-NN)

Support Vector Machines
1Rule (1R)

Generate a one-level decision tree

One attribute – easy to explain

Basic idea:

Rules "testing" a single attribute

Classify according to frequency in training data

Evaluate error rate for each attribute

Choose the best attribute

That’s all!

Performs quite well, even compared to much more sophisticated algorithms!

Often the underlying structure of data is quite simple
“Very simple classification rules perform well on most commonly used datasets” (Holte, 1993)
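The steps above (build per-value frequency rules for each attribute, score each rule set by its training error, keep the best attribute) can be sketched in Python; the mini weather table below is illustrative:

```python
from collections import Counter, defaultdict

def one_r(instances, labels, attributes):
    """1R: for each attribute, build a rule predicting the most frequent
    class per attribute value; keep the attribute with the lowest
    training error."""
    best = None
    for a in attributes:
        counts = defaultdict(Counter)           # value -> class counts
        for x, y in zip(instances, labels):
            counts[x[a]][y] += 1
        # Rule table: each attribute value -> its majority class
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(y != rule[x[a]] for x, y in zip(instances, labels))
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best                                 # (attribute, rule, errors)

# Tiny illustrative slice of the weather data
instances = [{"outlook": "sunny", "windy": "true"},
             {"outlook": "sunny", "windy": "false"},
             {"outlook": "overcast", "windy": "false"},
             {"outlook": "rainy", "windy": "true"}]
labels = ["no", "no", "yes", "no"]
attribute, rule, errors = one_r(instances, labels, ["outlook", "windy"])
print(attribute, errors)  # outlook wins with 0 training errors
```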
Apply 1R on weather data
Other Features of 1R

Missing Values

Treat “missing” as a separate (“dummy”) attribute value

Numeric Values
 Discretization:
 Sort training data by attribute value
 Split range of numeric temperature values into categories

Threshold values 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84 separate eight categories

The threshold 72 can be removed and replaced by 73.5 to get a bigger class

Problem of too many temperature classes → define a minimum class size
 Merge adjacent partitions having the same majority class
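The final merging step can be sketched as follows; the partitions are hypothetical (threshold, majority class) pairs over the temperature attribute:

```python
def merge_partitions(partitions):
    """Merge adjacent discretization partitions whose majority class is
    the same. Each partition is an (upper threshold, majority class) pair."""
    merged = [partitions[0]]
    for threshold, cls in partitions[1:]:
        if cls == merged[-1][1]:
            merged[-1] = (threshold, cls)   # extend the previous partition
        else:
            merged.append((threshold, cls))
    return merged

# Hypothetical partitions over the temperature attribute
parts = [(64.5, "yes"), (66.5, "no"), (70.5, "no"), (77.5, "yes"), (80.5, "yes")]
print(merge_partitions(parts))  # [(64.5, 'yes'), (70.5, 'no'), (80.5, 'yes')]
```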
Naïve Bayes Classifier
 1R uses only a single attribute for classification
 Naive Bayes classifier allows all attributes to contribute equally
 Assumes
 All attributes equally important
 All attributes independent
 This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute
 Although based on assumptions that are almost never correct, this scheme works well in practice!
Bayes Theorem: Some Notation
 Let 𝑃(𝑒) represent the prior or unconditional probability that proposition 𝑒 is true.

Example: Let 𝑒 represent that a customer is a high credit risk. 𝑃(𝑒) = 0.1 means that there is a 10% chance a given customer is a high credit risk.

Probabilities of events change when we know something about the world

The notation P(e | h) is used to represent the conditional or posterior probability of e

Read “the probability of e given that all we know is h.”
 P(e = high risk | h = unemployed) = 0.60

The notation P(E) is used to represent the probability distribution of all possible values of a random variable

P(Risk) = ⟨0.7, 0.2, 0.1⟩

Conditional Probability and Bayes Rule
 The product rule for conditional probabilities

P(e | h) = P(e ∩ h) / P(h)

P(e ∩ h) = P(e | h) P(h) = P(h | e) P(e) (product rule)

P(e ∩ h) = P(e) P(h) (for independent random variables)

Bayes’ rule relates conditional probabilities

P(e ∩ h) = P(e | h) P(h)

P(e ∩ h) = P(h | e) P(e)

P(h | e) = P(e | h) P(h) / P(e)
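A quick numeric check of Bayes' rule, reusing the credit-risk example; the 0.1 prior matches the earlier slide, while the likelihood for non-high-risk customers is invented for illustration:

```python
# Bayes' rule: P(h|e) = P(e|h) * P(h) / P(e), with P(e) expanded by the
# law of total probability. The 0.1 likelihood for non-high-risk
# customers is an assumed value.
p_h = 0.1               # prior: P(high risk)
p_e_given_h = 0.6       # P(unemployed | high risk), from the slide
p_e_given_not_h = 0.1   # P(unemployed | not high risk), assumed

p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # P(unemployed)
p_h_given_e = p_e_given_h * p_h / p_e                  # posterior
print(round(p_h_given_e, 3))  # 0.4
```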

Bayes theorem
Bayes cancer example
Frequency Tables
Naive Bayes – Probabilities
Predicting a New Day
Naive Bayes – Summary

Want to classify a new instance (e_1, e_2, ..., e_n) into one of a finite number of categories from the set H.

Choose the most likely classification using Bayes theorem: MAP (maximum a posteriori) classification

All attributes are treated equally, so be careful which attributes you include: an attribute without any relation to the class only adds noise and can corrupt the result

Assign the most probable category h_MAP given (e_1, e_2, ..., e_n), i.e. h_MAP = argmax_h P(h | e_1, ..., e_n) = argmax_h P(e_1, ..., e_n | h) P(h)

“Naive Bayes”: the attributes are treated as independent, since only then can the individual probabilities be multiplied: P(e_1, ..., e_n | h) = ∏_i P(e_i | h)
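A minimal MAP classifier sketch under these assumptions; the priors and likelihoods below are frequency estimates from the weather data (9 “yes” days out of 14), and log-probabilities are used to avoid underflow:

```python
import math

def naive_bayes_map(evidence, priors, likelihoods):
    """MAP classification: pick h maximizing P(h) * prod_i P(e_i | h).
    Sums of logs replace the product to avoid numeric underflow."""
    best_h, best_score = None, -math.inf
    for h, p_h in priors.items():
        score = math.log(p_h) + sum(
            math.log(likelihoods[h][e]) for e in evidence)
        if score > best_score:
            best_h, best_score = h, score
    return best_h

# Frequency estimates from the weather data
priors = {"yes": 9/14, "no": 5/14}
likelihoods = {
    "yes": {"outlook=sunny": 2/9, "windy=true": 3/9},
    "no":  {"outlook=sunny": 3/5, "windy=true": 3/5},
}
print(naive_bayes_map(["outlook=sunny", "windy=true"], priors, likelihoods))
# -> "no": 5/14 * 3/5 * 3/5 beats 9/14 * 2/9 * 3/9
```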
Bayesian Classifier – The Zero Frequency Problem
Bayesian Classifier – The Zero Frequency Problem  Solution
 What if an attribute value doesn’t occur with every class value, e.g. P(Outlook = overcast | no) = 0?
 Remedy: add 1 to the numerator for every attribute value–class combination (Laplace correction), so the probability can never be zero
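A sketch of this add-one (Laplace) correction; the helper name is mine, and the five “no”-class Outlook values follow the weather data:

```python
from collections import Counter

def smoothed_probs(values, domain):
    """Add-one (Laplace) correction: add 1 to every attribute value count
    so no conditional probability is ever zero. The denominator grows by
    the number of possible values to keep the probabilities summing to 1."""
    counts = Counter(values)
    total = len(values) + len(domain)
    return {v: (counts[v] + 1) / total for v in domain}

# Outlook values observed in the 5 "no" instances of the weather data
outlook_no = ["sunny", "sunny", "sunny", "rainy", "rainy"]
probs = smoothed_probs(outlook_no, ["sunny", "overcast", "rainy"])
print(probs["overcast"])  # 1/8 = 0.125 instead of 0
```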

Missing Values
 Training: the instance is not included in the frequency count for the attribute value–class combination
 Classification: the attribute is omitted from the calculation
Dealing with Numeric Attributes
 Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
 The probability density function for the normal distribution is defined by two parameters:
 The sample mean: μ = (1/n) · Σ_{i=1..n} x_i
 The standard deviation: σ = √( Σ_{i=1..n} (x_i − μ)² / (n − 1) )
 The density function: f(x) = 1/(√(2π)·σ) · e^(−(x−μ)²/(2σ²))
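With these formulas, the class-conditional likelihood of a numeric attribute value can be computed as sketched below; the nine temperature values are the “yes”-class values from the weather data (mean 73, standard deviation ≈ 6.2):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# Temperature values for class "yes" in the weather data
temps = [83, 70, 68, 64, 69, 75, 75, 72, 81]
mu = sum(temps) / len(temps)                                     # sample mean
sigma = math.sqrt(sum((t - mu) ** 2 for t in temps)
                  / (len(temps) - 1))                            # sample std dev
print(round(gaussian_pdf(66, mu, sigma), 3))  # f(temperature=66 | yes) ~ 0.034
```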
Numeric Data: Unknown Distribution

What if the data distribution does not follow a known distribution?

In this case we need a mechanism to estimate the density distribution.

A simple and intuitive approach is based on kernel density estimation (it can model irregular probability density functions).

Consider a random variable X whose distribution f(X) is unknown, but for which a sample with a non-uniform distribution is available.
Kernel Density Estimation

We want to derive a function 𝑓(𝑥) such that

(1) f(x) is a probability density function, i.e.
∫ f(x) dx = 1

(2) 𝑓(𝑥) is a smooth approximation of the data points in 𝑋

(3) f(x) can be used to estimate values x* which are not in the sample X
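A Gaussian-kernel sketch of the standard estimator f̂(x) = (1/(n·h)) · Σ_i K((x − x_i)/h); the sample and bandwidth h below are made up:

```python
import math

def kde(x, sample, h):
    """Kernel density estimate with a standard-normal kernel K:
    f_hat(x) = 1/(n*h) * sum_i K((x - x_i) / h)."""
    def K(u):
        return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    n = len(sample)
    return sum(K((x - xi) / h) for xi in sample) / (n * h)

sample = [1.0, 1.2, 2.8, 3.0, 3.1]   # non-uniform sample from an unknown f(X)
# The estimate is higher where the sample points cluster (near 3)
print(kde(2.9, sample, h=0.5) > kde(2.0, sample, h=0.5))  # True
```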
Discussion of Naïve Bayes

Naïve Bayes works surprisingly well (even if independence assumption is clearly violated)

Domingos, Pazzani: On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning 29 (1997).

However: adding too many redundant attributes will cause problems (e.g. identical attributes)

Note also: many numeric attributes are not normally distributed

Time complexity:

Calculating conditional probabilities: Time 𝑂(𝑛) where n is the number of instances

Calculating the class: Time 𝑂(𝑐𝑎) where c is the number of classes, a the attributes
Bayesian (Belief) Networks: Multiple Variables with Dependency

Naïve Bayes assumption of conditional independence is often too restrictive

A Bayesian belief network (Bayesian net) describes conditional independence among subsets of attributes: it combines prior knowledge about dependencies among variables with observed training data.

Graphical representation: directed acyclic graph (DAG), one node for each attribute

Overall probability distribution factorized into component distributions

Graph’s nodes hold component distributions (conditional distributions)
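A toy factorization sketch: for a hypothetical chain Cloudy → Rain → WetGrass, each node stores only its conditional distribution given its parents, and the joint distribution is the product of these components (all numbers are invented):

```python
# Factorized joint distribution for a tiny hypothetical DAG:
#   Cloudy -> Rain -> WetGrass, i.e. P(C, R, W) = P(C) * P(R|C) * P(W|R)
p_c = {True: 0.5, False: 0.5}
p_r_given_c = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}   # P(R=r | C=c)
p_w_given_r = {True: {True: 0.9, False: 0.1},
               False: {True: 0.0, False: 1.0}}   # P(W=w | R=r)

def joint(c, r, w):
    """Each node contributes its conditional distribution given its parents."""
    return p_c[c] * p_r_given_c[c][r] * p_w_given_r[r][w]

# The factorized probabilities sum to 1 over all 8 configurations
total = sum(joint(c, r, w) for c in (True, False)
            for r in (True, False) for w in (True, False))
print(abs(total - 1.0) < 1e-9)  # True
```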
Probability Laws