Chapter 5: Naive Bayes Flashcards

Question

Answer 1

A

Naive Byes = form of classification
Classification: Given a database 𝐷 = {𝑥1, 𝑥2, … , 𝑥𝑛} of tuples (items, records) and a set of classes 𝐶 = {𝐶1, 𝐶2, … , 𝐶𝑚}, the classification problem is to define a mapping 𝑓: 𝐷 𝐶 where each xi is assigned to one class. A class, 𝐶_𝑗, contains precisely those tuples mapped to it; that is,
𝐶_𝑗 = {𝑥_𝑖|𝑓(𝑥_𝑖) = 𝐶_𝑗, 1 𝑖 𝑛, and 𝑥_𝑖 element 𝐷}.
- The logistic regression can be used for classification.
Prediction is similar, but usually implies a mapping to numeric values instead of a class 𝐶𝑗

Answer 2

A

Determine if a bank customer for a loan is a low, medium, or high risk customer.
Churn prediction (typically a classification task). Leaves or not ) Determine if a sequence of credit card purchases indicates questionable behavior.
Identify customers that may be willing to purchase particular insurance policies.
Identify patterns of gene expression that indicate the patient has cancer.
Identify spam mail etc.

Answer 3

A

Answer 4

A

Generate a one-level decision tree
One attribute – easy to explain
Basic idea:
- Rules “testing” a single attribute
- Classify according to frequency in training data
- Evaluate error rate for each attribute
- Choose the best attribute
- That’s all!
Performs quite well, even compared to much more sophisticated algorithms!
- Often the underlying structure of data is quite simple

“Very simple classification rules perform well on most commonly used datasets” Holte, 1993

Answer 5

A

Answer 6

A

1R uses only a single attribute for classification
Naive Bayes classifier allows all attributes to contribute equally
Assumes
- All attributes equally important
- All attributes independent
  - This means that knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute
Although based on assumptions that are almost never correct, this scheme works well in practice!

Answer 7

A

Let 𝑃(𝑒) represent the prior or unconditional probability that proposition 𝑒 is true.
- Example: Let 𝑒 represent that a customer is a high credit risk. 𝑃(𝑒) = 0.1 means that there is a 10% chance a given customer is a high credit risk.
Probabilities of events change when we know something about the world
- The notation 𝑃(𝑒|h) is used to represent the conditional or posterior probability of 𝑒
- Read “the probability of e given that all we know is h.”
  - 𝑃(𝑒 = h𝑖𝑔h 𝑟𝑖𝑠𝑘| h = 𝑢𝑛𝑒𝑚𝑝𝑙𝑜𝑦𝑒𝑑) = 0.60
The notation 𝑃(𝐸) is used to represent the probability distribution of all possible values of a random variable
- 𝑃(𝑅𝑖𝑠𝑘) = < 0.7,0.2,0.1 >

Answer 8

A

The product rule for conditional probablities
- 𝑃 (𝑒 | h) = 𝑃(𝑒 ∩ h)/𝑃(h)
- 𝑃(𝑒 ∩ h) = 𝑃(𝑒|h)𝑃(h) = 𝑃(h|𝑒)𝑃(𝑒) (product rule)
- 𝑃(𝑒 ∩ h) = 𝑃(𝑒)𝑃(h) (for independent random variables)
Bayes’ rule relates conditional probabilities
- 𝑃(𝑒 ∩ h) = 𝑃(𝑒|h)𝑃(h)
- 𝑃(𝑒 ∩ h) = 𝑃(h|𝑒)𝑃(𝑒)
P(h |e) = P(e |h)P(h) / P(e)

Answer 9

A

Want to classify a new instance (𝑒1, 𝑒2, … , 𝑒𝑛) into finite number of categories from the set h.
- Choose the most likely classification using
- Bayes theorem MAP (maximum a posteriori classification)
- Are assumptions stronger or less stronger? All attributes are treated equally. So be careful which attribute to include. Maybe it is without any relation and corrupts the data
Assign the most probable category h_𝑀𝐴𝑃 given (𝑒1, 𝑒2, … , 𝑒𝑛), i.e. the and maximum likelihood
“Naive Bayes” since the attributes are treated as independent only then you can multiply the probabilities

Answer 10

A

Answer 11

A

What if an attribute value doesn’t occur with every class value P(Outlook = overcast | no)=0
Remedy: add 1 to the numerator for every attribute value-class combination, and the probability can never be zero
*

Answer 12

A

Training: instance is not included in frequency count for attribute value- class combination
Classification: attribute will be omitted from calculation Example:

Answer 13

A

Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:
- The sample mean:
- The standard deviation :
- The density function f(x):

Answer 14

A

What if the data distribution#does not follow a known distribution.
In this case we need a mechanism to estimate the density distribution.
A simple and intuitive approach is based on kernel density estimation. ( Model irrgolor pnpasility density function
Consider a random variable 𝑋 whose distribution 𝑓(𝑋) is unknown but a sample with a non-uniform distribution

Answer 15

A

Answer 16

A

Naïve Bayes works surprisingly well (even if independence assumption is clearly violated)
Domingos, Pazzani: On the Optimality of the Simple Bayesian Classifier under Zero-One-Loss, Machine Learning (1997) 29.
However: adding too many redundant attributes will cause problems (e.g. identical attributes)
- Note also: many numeric attributes are not normally distributed Time complexity
Calculating conditional probabilities: Time 𝑂(𝑛) where n is the number of instances
Calculating the class: Time 𝑂(𝑐𝑎) where c is the number of classes, a the attributes

Answer 17

A

Naïve Bayes assumption of conditional independence is often too restrictive
Bayesian Belief network (Bayesian net) describe conditional independence among subsets of attributes: combining prior knowledge about dependencies among variables with observed training data.
Graphical representation: directed acyclic graph (DAG), one node for each attribute
- Overall probability distribution factorized into component distributions
- Graph’s nodes hold component distributions (conditional distributions)

Answer 18

A

(25 cards)