Chapter 5: Naive Bayes Flashcards

1
Q

Formal Definition of Classification

A
  • Naive Bayes is a form of classification
  • Classification: Given a database D = {x1, x2, ..., xn} of tuples (items, records) and a set of classes C = {C1, C2, ..., Cm}, the classification problem is to define a mapping f: D → C where each xi is assigned to exactly one class. A class Cj contains precisely those tuples mapped to it; that is,
    Cj = {xi | f(xi) = Cj, 1 ≤ i ≤ n, and xi ∈ D}.
    • Logistic regression, for example, can be used for classification.
  • Prediction is similar, but usually implies a mapping to numeric values instead of to a class Cj (a toy illustration of the classification mapping is sketched below)
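A minimal Python sketch of this definition; the data, thresholds, and decision rule below are hypothetical and merely stand in for a learned classifier:

```python
# Toy illustration of f: D -> C and C_j = {x_i | f(x_i) = C_j}.
D = [{"income": 20000}, {"income": 55000}, {"income": 120000}]
C = ["high risk", "medium risk", "low risk"]

def f(x):
    # hypothetical decision rule standing in for a learned classifier
    if x["income"] < 30000:
        return "high risk"
    if x["income"] < 80000:
        return "medium risk"
    return "low risk"

# each class C_j collects precisely the tuples mapped to it
classes = {c: [x for x in D if f(x) == c] for c in C}
print(classes)
```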
2
Q

Example Applications

A
  • Determine whether a bank customer applying for a loan is a low, medium, or high credit risk.
  • Churn prediction (typically a classification task): does the customer leave or not?
  • Determine if a sequence of credit card purchases indicates questionable behavior.
  • Identify customers that may be willing to purchase particular insurance policies.
  • Identify patterns of gene expression that indicate that the patient has cancer.
  • Identify spam mail, etc.
3
Q

Algorithms for Classification

A
  • (Logistic) Regression
  • Rudimentary Rules (e.g., 1R)
  • Statistical Modeling (e.g., Naïve Bayes)
  • Decision Trees: Divide and Conquer
  • Classification Rules (e.g. PRISM)
  • Instance-Based Learning (e.g. kNN)
  • Support Vector Machines
4
Q

1-Rule (1R)

A
  • Generate a one-level decision tree
  • One attribute – easy to explain
  • Basic idea:
    • Rules "testing" a single attribute
    • Classify according to frequency in training data
    • Evaluate error rate for each attribute
    • Choose the best attribute
    • That's all!
  • Performs quite well, even compared to much more sophisticated algorithms!
    • Often the underlying structure of data is quite simple

"Very simple classification rules perform well on most commonly used datasets" (Holte, 1993)
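A compact Python sketch of the 1R procedure; the record layout and function name are assumptions, not from the slides. For every attribute it predicts the most frequent class per attribute value and keeps the attribute whose rule makes the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, label):
    """Return (best_attribute, rule, errors) for a list of record dicts."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)          # attribute value -> class counts
        for row in instances:
            by_value[row[attr]][row[label]] += 1
        # classify each attribute value by its most frequent class
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # errors = training instances not covered by the majority class of their value
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best
```

Applied to a dataset such as the weather data, a sketch like this would return the single winning attribute together with its value-to-class rule and its training error count.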

5
Q

Apply 1R on weather data

A
6
Q

Other Features of 1R

A
  • Missing Values
    • Include a "dummy" attribute value "missing"
  • Numeric Values
    • Discretization:
      • Sort training data by attribute value
      • Split the range of numeric temperature values into categories
      • Threshold values 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84 separate eight categories
      • The threshold at 72 can be removed and replaced by 73.5 to obtain a bigger class
      • Problem of too many temperature classes -> define a minimum class size
      • Merge adjacent partitions having the same majority class (see the sketch below)
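A rough sketch of this discretization under stated assumptions: cut the sorted values where the class changes, but only once a partition has reached a minimum number of instances, then merge adjacent partitions with the same majority class. The data layout and minimum size are placeholders.

```python
from collections import Counter

def majority(partition):
    return Counter(cls for _, cls in partition).most_common(1)[0][0]

def discretize(values, classes, min_size=3):
    # assumes non-empty input; returns a list of (value, class) partitions
    pairs = sorted(zip(values, classes))
    # grow partitions of at least min_size, cutting where the class changes
    partitions, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        nxt = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if len(current) >= min_size and nxt is not None and nxt != cls:
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    # merge adjacent partitions having the same majority class
    merged = [partitions[0]]
    for part in partitions[1:]:
        if majority(part) == majority(merged[-1]):
            merged[-1] = merged[-1] + part
        else:
            merged.append(part)
    return merged
```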
7
Q

Naïve Bayes Classifier

A
  • 1R uses only a single attribute for classification
  • Naive Bayes classifier allows all attributes to contribute equally
  • Assumes
    • All attributes equally important
    • All attributes independent
      • This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute
  • Although based on assumptions that are almost never correct, this scheme works well in practice!
8
Q

Bayes Theorem: Some Notation

A
  • Let P(e) represent the prior or unconditional probability that proposition e is true.
    • Example: Let e represent that a customer is a high credit risk. P(e) = 0.1 means that there is a 10% chance a given customer is a high credit risk.
  • Probabilities of events change when we know something about the world
    • The notation P(e|h) is used to represent the conditional or posterior probability of e
    • Read "the probability of e given that all we know is h."
      - P(e = high risk | h = unemployed) = 0.60
  • The notation P(E) is used to represent the probability distribution of all possible values of a random variable
    • P(Risk) = <0.7, 0.2, 0.1>
9
Q

Conditional Probability and Bayes Rule

A
  • The product rule for conditional probabilities
    • P(e|h) = P(e ∩ h) / P(h)
    • P(e ∩ h) = P(e|h)P(h) = P(h|e)P(e) (product rule)
    • P(e ∩ h) = P(e)P(h) (for independent random variables)
  • Bayes' rule relates the two conditional probabilities
    • P(e ∩ h) = P(e|h)P(h)
    • P(e ∩ h) = P(h|e)P(e)
  • Equating the two expressions and dividing by P(e) gives P(h|e) = P(e|h)P(h) / P(e)
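A tiny numeric illustration of the rule with made-up probabilities:

```python
# Hypothetical numbers: h = "high risk", e = "unemployed".
p_h = 0.10          # prior P(h)
p_e_given_h = 0.60  # likelihood P(e|h)
p_e = 0.15          # evidence P(e)

# Bayes' rule: P(h|e) = P(e|h) * P(h) / P(e)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)  # ~0.40: the posterior is four times the prior
```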
10
Q

Bayes theorem

A
11
Q

Bayes cancer example

A
12
Q

Frequency Tables

A
13
Q

Naive Bayes – Probabilities

A
14
Q

Predicting a New Day

A
15
Q

Naive Bayes - Summary

A
  • Want to classify a new instance (e1, e2, ..., en) into one of a finite number of categories from the set H.
    • Choose the most likely classification using Bayes' theorem: MAP (maximum a posteriori) classification
    • All attributes are treated equally, so be careful which attributes you include: an attribute with no relation to the class can corrupt the prediction
  • Assign the most probable category hMAP given (e1, e2, ..., en); with equal priors this coincides with the maximum-likelihood choice
  • "Naive Bayes" because the attributes are treated as independent; only under that assumption can the conditional probabilities be multiplied (see the sketch below)
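A minimal sketch of the MAP decision rule under the naive independence assumption, hMAP = argmax over h of P(h) * product over i of P(ei|h); the table layout and names are assumptions, not part of the slides:

```python
import math

def naive_bayes_map(evidence, priors, likelihoods):
    """Return the MAP class, computed in log space to avoid underflow.
    evidence: {attribute: value}; priors: {class: P(h)};
    likelihoods: {class: {attribute: {value: P(value | class)}}} (hypothetical tables)."""
    def log_posterior(h):
        return math.log(priors[h]) + sum(
            math.log(likelihoods[h][a][v]) for a, v in evidence.items())
    return max(priors, key=log_posterior)
```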
16
Q

Bayesian Classifier – The Zero Frequency Problem

A
17
Q

Bayesian Classifier – The Zero Frequency Problem – Solution

A
  • What if an attribute value does not occur with every class value, e.g. P(Outlook = overcast | no) = 0?
  • Remedy: add 1 to the count in the numerator for every attribute value-class combination, so the probability can never be zero (see the sketch below)
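A small sketch of the add-one (Laplace) correction with hypothetical counts; adding 1 to every count also increases the denominator by the number of distinct attribute values, so the estimates still sum to one:

```python
def laplace_estimate(count_value_and_class, count_class, n_values):
    """P(attribute = value | class) with add-one smoothing; never zero."""
    return (count_value_and_class + 1) / (count_class + n_values)

# e.g. a value that never occurs with the class (0 of 5 instances, 3 possible values)
print(laplace_estimate(0, 5, 3))  # 0.125 instead of 0.0
```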
18
Q

Missing Values

A
  • Training: the instance is not included in the frequency count for that attribute value-class combination
  • Classification: the attribute is omitted from the calculation (a hypothetical example is sketched below)
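A hypothetical sketch of omitting a missing attribute value at classification time: any factor whose value is unknown is simply skipped in the (log) product.

```python
import math

def log_score(h, evidence, priors, likelihoods):
    """Log-posterior score for class h; attributes with value None are omitted."""
    total = math.log(priors[h])
    for attr, val in evidence.items():
        if val is None:   # missing value: leave this attribute out of the calculation
            continue
        total += math.log(likelihoods[h][attr][val])
    return total
```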
19
Q

Dealing with Numeric Attributes

A
  • Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
  • The probability density function for the normal distribution is defined by two parameters:
    • The sample mean:
    • The standard deviation :
    • The density function f(x):
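A small sketch that estimates the two parameters from the (hypothetical) training values of one class and evaluates the Gaussian density at a new value:

```python
import math

def gaussian_density(x, values):
    """Estimate mean and standard deviation from one class's training values,
    then return the normal density f(x)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))  # sample std dev
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# illustrative numeric (e.g. temperature) values observed for one class
print(gaussian_density(66, [83, 70, 68, 64, 69, 75, 75, 72, 81]))
```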
20
Q

Numeric Data: Unknown Distribution

A
  • What if the data distribution does not follow a known distribution?
  • In this case we need a mechanism to estimate the density distribution.
  • A simple and intuitive approach is based on kernel density estimation (it can model irregular probability density functions).
  • Consider a random variable X whose distribution f(X) is unknown, but for which we have a sample with a non-uniform distribution.
21
Q

Kernel Density Estimation

A
  • We want to derive a function f(x) such that
  • (1) f(x) is a probability density function, i.e. ∫ f(x) dx = 1
  • (2) f(x) is a smooth approximation of the data points in X
  • (3) f(x) can be used to estimate values x* which are not in X (see the sketch below)
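A minimal Gaussian-kernel density estimation sketch; bandwidth and sample are hypothetical. The estimate averages a kernel bump centred at every sample point, so it integrates to 1 and yields a smooth value at points x* not in the sample:

```python
import math

def kde(x, sample, bandwidth=1.0):
    """Gaussian kernel density estimate f(x) = (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    def kernel(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(kernel((x - xi) / bandwidth) for xi in sample) / (len(sample) * bandwidth)

sample = [1.1, 1.3, 1.4, 4.8, 5.2, 5.3]   # non-uniform, bimodal sample
print(kde(3.0, sample))                    # density estimate at a point x* not in X
```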
22
Q

Discussion of Naïve Bayes

A
  • Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
    • Domingos, Pazzani: On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning 29 (1997).
  • However: adding too many redundant attributes will cause problems (e.g. identical attributes)
  • Note also: many numeric attributes are not normally distributed
  • Time complexity
    • Calculating the conditional probabilities: O(n), where n is the number of training instances
    • Classifying a new instance: O(c·a), where c is the number of classes and a the number of attributes
23
Q

Bayesian (Belief) Networks: Multiple Variables with Dependency

A
  • The Naïve Bayes assumption of conditional independence is often too restrictive
  • A Bayesian belief network (Bayesian net) describes conditional independence among subsets of attributes: it combines prior knowledge about dependencies among variables with observed training data.
  • Graphical representation: a directed acyclic graph (DAG) with one node for each attribute
    • The overall probability distribution is factorized into component distributions
    • The graph's nodes hold the component distributions (conditional distributions); a toy factorization is sketched below
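A toy illustration with an entirely hypothetical chain network A → B → C and made-up numbers, showing how the joint distribution factorizes into the per-node conditional distributions, P(A, B, C) = P(A) * P(B|A) * P(C|B):

```python
# Chain-structured DAG A -> B -> C with hypothetical conditional probability tables.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # overall distribution = product of the component (conditional) distributions
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

print(joint(True, True, False))  # 0.3 * 0.8 * 0.5 = 0.12
```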
24
Q

Probability Laws

A