Chapter 5: Naive Bayes Flashcards

1
Q

Formal Definition of Classification

A
  • Naive Bayes is a form of classification
  • Classification: Given a database D = {x1, x2, ..., xn} of tuples (items, records) and a set of classes C = {C1, C2, ..., Cm}, the classification problem is to define a mapping f: D → C where each xi is assigned to exactly one class. A class Cj contains precisely those tuples mapped to it; that is,
    Cj = {xi | f(xi) = Cj, 1 ≤ i ≤ n, and xi ∈ D}.
    • Logistic regression, for example, can be used for classification.
  • Prediction is similar, but usually implies a mapping to numeric values instead of to a class Cj (a toy illustration of the classification mapping is sketched below)
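A minimal Python sketch of this definition; the data, thresholds, and decision rule below are hypothetical and merely stand in for a learned classifier:

```python
# Toy illustration of f: D -> C and C_j = {x_i | f(x_i) = C_j}.
D = [{"income": 20000}, {"income": 55000}, {"income": 120000}]
C = ["high risk", "medium risk", "low risk"]

def f(x):
    # hypothetical decision rule standing in for a learned classifier
    if x["income"] < 30000:
        return "high risk"
    if x["income"] < 80000:
        return "medium risk"
    return "low risk"

# each class C_j collects precisely the tuples mapped to it
classes = {c: [x for x in D if f(x) == c] for c in C}
print(classes)
```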
2
Q

Example Applications

A
  • Determine whether a bank customer applying for a loan is a low, medium, or high credit risk.
  • Churn prediction (typically a classification task): does the customer leave or not?
  • Determine if a sequence of credit card purchases indicates questionable behavior.
  • Identify customers that may be willing to purchase particular insurance policies.
  • Identify patterns of gene expression that indicate that the patient has cancer.
  • Identify spam mail, etc.
3
Q

Algorithms for Classification

A
  • (Logistic) Regression
  • Rudimentary Rules (e.g., 1R)
  • Statistical Modeling (e.g., Naïve Bayes)
  • Decision Trees: Divide and Conquer
  • Classification Rules (e.g. PRISM)
  • Instance-Based Learning (e.g. kNN)
  • Support Vector Machines
4
Q

1-Rule (1R)

A
  • Generate a one-level decision tree
  • One attribute – easy to explain
  • Basic idea:
    • Rules "testing" a single attribute
    • Classify according to frequency in training data
    • Evaluate error rate for each attribute
    • Choose the best attribute
    • That's all!
  • Performs quite well, even compared to much more sophisticated algorithms!
    • Often the underlying structure of data is quite simple

"Very simple classification rules perform well on most commonly used datasets" (Holte, 1993)
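A compact Python sketch of the 1R procedure; the record layout and function name are assumptions, not from the slides. For every attribute it predicts the most frequent class per attribute value and keeps the attribute whose rule makes the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(instances, attributes, label):
    """Return (best_attribute, rule, errors) for a list of record dicts."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)          # attribute value -> class counts
        for row in instances:
            by_value[row[attr]][row[label]] += 1
        # classify each attribute value by its most frequent class
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        # errors = training instances not covered by the majority class of their value
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best
```

Applied to a dataset such as the weather data, a sketch like this would return the single winning attribute together with its value-to-class rule and its training error count.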

5
Q

Apply 1R on weather data

A
6
Q

Other Features of 1R

A
  • Missing Values
    • Include a "dummy" attribute value "missing"
  • Numeric Values
    • Discretization:
      • Sort training data by attribute value
      • Split the range of numeric temperature values into categories
      • Threshold values 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84 separate eight categories
      • The threshold at 72 can be removed and replaced by 73.5 to obtain a bigger class
      • Problem of too many temperature classes -> define a minimum class size
      • Merge adjacent partitions having the same majority class (see the sketch below)
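A rough sketch of this discretization under stated assumptions: cut the sorted values where the class changes, but only once a partition has reached a minimum number of instances, then merge adjacent partitions with the same majority class. The data layout and minimum size are placeholders.

```python
from collections import Counter

def majority(partition):
    return Counter(cls for _, cls in partition).most_common(1)[0][0]

def discretize(values, classes, min_size=3):
    # assumes non-empty input; returns a list of (value, class) partitions
    pairs = sorted(zip(values, classes))
    # grow partitions of at least min_size, cutting where the class changes
    partitions, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        nxt = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if len(current) >= min_size and nxt is not None and nxt != cls:
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    # merge adjacent partitions having the same majority class
    merged = [partitions[0]]
    for part in partitions[1:]:
        if majority(part) == majority(merged[-1]):
            merged[-1] = merged[-1] + part
        else:
            merged.append(part)
    return merged
```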
7
Q

Naïve Bayes Classifier

A
  • 1R uses only a single attribute for classification
  • Naive Bayes classifier allows all attributes to contribute equally
  • Assumes
    • All attributes equally important
    • All attributes independent
      • This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute
  • Although based on assumptions that are almost never correct, this scheme works well in practice!
8
Q

Bayes Theorem: Some Notation

A
  • Let P(e) represent the prior or unconditional probability that proposition e is true.
    • Example: Let e represent that a customer is a high credit risk. P(e) = 0.1 means that there is a 10% chance a given customer is a high credit risk.
  • Probabilities of events change when we know something about the world
    • The notation P(e|h) is used to represent the conditional or posterior probability of e
    • Read "the probability of e given that all we know is h."
      - P(e = high risk | h = unemployed) = 0.60
  • The notation P(E) is used to represent the probability distribution of all possible values of a random variable
    • P(Risk) = <0.7, 0.2, 0.1>
9
Q

Conditional Probability and Bayes Rule

A
  • The product rule for conditional probabilities
    • P(e|h) = P(e ∩ h) / P(h)
    • P(e ∩ h) = P(e|h)P(h) = P(h|e)P(e) (product rule)
    • P(e ∩ h) = P(e)P(h) (for independent random variables)
  • Bayes' rule relates the two conditional probabilities
    • P(e ∩ h) = P(e|h)P(h)
    • P(e ∩ h) = P(h|e)P(e)
  • Equating the two expressions and dividing by P(e) gives P(h|e) = P(e|h)P(h) / P(e)
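A tiny numeric illustration of the rule with made-up probabilities:

```python
# Hypothetical numbers: h = "high risk", e = "unemployed".
p_h = 0.10          # prior P(h)
p_e_given_h = 0.60  # likelihood P(e|h)
p_e = 0.15          # evidence P(e)

# Bayes' rule: P(h|e) = P(e|h) * P(h) / P(e)
p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)  # ~0.40: the posterior is four times the prior
```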
10
Q

Bayes theorem

A
11
Q

Bayes cancer example

A
12
Q

Frequency Tables

A
13
Q

Naive Bayes – Probabilities

A
14
Q

Predicting a New Day

A
15
Q

Naive Bayes - Summary

A
  • Want to classify a new instance (e1, e2, ..., en) into one of a finite number of categories from the set H.
    • Choose the most likely classification using Bayes' theorem: MAP (maximum a posteriori) classification
    • All attributes are treated equally, so be careful which attributes you include: an attribute with no relation to the class can corrupt the prediction
  • Assign the most probable category hMAP given (e1, e2, ..., en); with equal priors this coincides with the maximum-likelihood choice
  • "Naive Bayes" because the attributes are treated as independent; only under that assumption can the conditional probabilities be multiplied (see the sketch below)
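A minimal sketch of the MAP decision rule under the naive independence assumption, hMAP = argmax over h of P(h) * product over i of P(ei|h); the table layout and names are assumptions, not part of the slides:

```python
import math

def naive_bayes_map(evidence, priors, likelihoods):
    """Return the MAP class, computed in log space to avoid underflow.
    evidence: {attribute: value}; priors: {class: P(h)};
    likelihoods: {class: {attribute: {value: P(value | class)}}} (hypothetical tables)."""
    def log_posterior(h):
        return math.log(priors[h]) + sum(
            math.log(likelihoods[h][a][v]) for a, v in evidence.items())
    return max(priors, key=log_posterior)
```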
16
Q

Bayesian Classifier – The Zero Frequency Problem

A
17
Q

Bayesian Classifier – The Zero Frequency Problem – Solution

A
  • What if an attribute value does not occur with every class value, e.g. P(Outlook = overcast | no) = 0?
  • Remedy: add 1 to the count in the numerator for every attribute value-class combination, so the probability can never be zero (see the sketch below)
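A small sketch of the add-one (Laplace) correction with hypothetical counts; adding 1 to every count also increases the denominator by the number of distinct attribute values, so the estimates still sum to one:

```python
def laplace_estimate(count_value_and_class, count_class, n_values):
    """P(attribute = value | class) with add-one smoothing; never zero."""
    return (count_value_and_class + 1) / (count_class + n_values)

# e.g. a value that never occurs with the class (0 of 5 instances, 3 possible values)
print(laplace_estimate(0, 5, 3))  # 0.125 instead of 0.0
```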
18
Q

Missing Values

A
  • Training: the instance is not included in the frequency count for that attribute value-class combination
  • Classification: the attribute is omitted from the calculation (a hypothetical example is sketched below)
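A hypothetical sketch of omitting a missing attribute value at classification time: any factor whose value is unknown is simply skipped in the (log) product.

```python
import math

def log_score(h, evidence, priors, likelihoods):
    """Log-posterior score for class h; attributes with value None are omitted."""
    total = math.log(priors[h])
    for attr, val in evidence.items():
        if val is None:   # missing value: leave this attribute out of the calculation
            continue
        total += math.log(likelihoods[h][attr][val])
    return total
```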
19
Q

Dealing with Numeric Attributes

A
  • Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
  • The probability density function for the normal distribution is defined by two parameters:
    • The sample mean:
    • The standard deviation :
    • The density function f(x):
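A small sketch that estimates the two parameters from the (hypothetical) training values of one class and evaluates the Gaussian density at a new value:

```python
import math

def gaussian_density(x, values):
    """Estimate mean and standard deviation from one class's training values,
    then return the normal density f(x)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))  # sample std dev
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# illustrative numeric (e.g. temperature) values observed for one class
print(gaussian_density(66, [83, 70, 68, 64, 69, 75, 75, 72, 81]))
```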
20
Q

Numeric Data: Unknown Distribution

A
  • What if the data distribution does not follow a known distribution?
  • In this case we need a mechanism to estimate the density distribution.
  • A simple and intuitive approach is based on kernel density estimation (it can model irregular probability density functions).
  • Consider a random variable X whose distribution f(X) is unknown, but for which we have a sample with a non-uniform distribution.
21
Q

Kernel Density Estimation

A
  • We want to derive a function f(x) such that
  • (1) f(x) is a probability density function, i.e. ∫ f(x) dx = 1
  • (2) f(x) is a smooth approximation of the data points in X
  • (3) f(x) can be used to estimate values x* which are not in X (see the sketch below)
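A minimal Gaussian-kernel density estimation sketch; bandwidth and sample are hypothetical. The estimate averages a kernel bump centred at every sample point, so it integrates to 1 and yields a smooth value at points x* not in the sample:

```python
import math

def kde(x, sample, bandwidth=1.0):
    """Gaussian kernel density estimate f(x) = (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    def kernel(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(kernel((x - xi) / bandwidth) for xi in sample) / (len(sample) * bandwidth)

sample = [1.1, 1.3, 1.4, 4.8, 5.2, 5.3]   # non-uniform, bimodal sample
print(kde(3.0, sample))                    # density estimate at a point x* not in X
```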
22
Q

Discussion of Naïve Bayes

A
  • Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
    • Domingos, Pazzani: On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning 29 (1997).
  • However: adding too many redundant attributes will cause problems (e.g. identical attributes)
  • Note also: many numeric attributes are not normally distributed
  • Time complexity
    • Calculating the conditional probabilities: O(n), where n is the number of training instances
    • Classifying a new instance: O(c·a), where c is the number of classes and a the number of attributes
23
Q

Bayesian (Belief) Networks: Multiple Variables with Dependency

A
  • The Naïve Bayes assumption of conditional independence is often too restrictive
  • A Bayesian belief network (Bayesian net) describes conditional independence among subsets of attributes: it combines prior knowledge about dependencies among variables with observed training data.
  • Graphical representation: a directed acyclic graph (DAG) with one node for each attribute
    • The overall probability distribution is factorized into component distributions
    • The graph's nodes hold the component distributions (conditional distributions); a toy factorization is sketched below
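A toy illustration with an entirely hypothetical chain network A → B → C and made-up numbers, showing how the joint distribution factorizes into the per-node conditional distributions, P(A, B, C) = P(A) * P(B|A) * P(C|B):

```python
# Chain-structured DAG A -> B -> C with hypothetical conditional probability tables.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # overall distribution = product of the component (conditional) distributions
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

print(joint(True, True, False))  # 0.3 * 0.8 * 0.5 = 0.12
```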
24
Q

Probability Laws

A