Naive Bayes and Sentiment Classification Flashcards

1
Q

Text categorization

A

The task of assigning a label or category to an entire text or document.

2
Q

Sentiment analysis

A

The extraction of sentiment - the positive or negative orientation that a writer expresses toward some object.

3
Q

Spam detection

A

The binary classification task of assigning an email to one of the two classes spam or not-spam.

4
Q

Naive Bayes

A

A probabilistic classifier.

For a given document d, out of all classes c ∈ C the classifier returns the class c^ which has the maximum posterior probability given the document.

c^
= argmax over c ∈ C of P(c | d)
= argmax over c ∈ C of P(d | c) P(c)

(by Bayes' rule; the denominator P(d) is dropped because it is the same for every class)

5
Q

Why is Naive Bayes considered a generative model?

A

Because we can read its equation as stating an implicit assumption about how a document is generated:

First a class is sampled from P(c) and then words are generated by sampling from P(d | c).

6
Q

2 Simplifying Assumptions of Naive Bayes classifiers

A
  1. Bag of words assumption - the position of a word doesn’t matter.
  2. Naive Bayes assumption - the feature probabilities P(fᵢ | c) are conditionally independent given the class c.

7
Q

Naive Bayes Assumption

A

c^
= argmax over c ∈ C of P(d | c) P(c)

where a document d is represented as a set of features f₁, f₂, ..., fₙ, so the likelihood P(d | c) = P(f₁, f₂, ..., fₙ | c).

Estimating the probability of every possible combination of features (e.g. every possible set of words and positions) would require a huge number of parameters.

The naive Bayes assumption is a conditional independence assumption that the probabilities P(fᵢ | c) are independent given the class c, and thus can be ‘naively’ multiplied:

P(f₁, f₂, ..., fₙ | c) = P(f₁| c) P(f₂ | c) ... P( fₙ | c)
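
A minimal sketch of this ‘naive’ multiplication, using made-up per-word conditionals for a hypothetical two-class sentiment task (the words and numbers are purely illustrative):

# Hypothetical per-word conditionals P(w | c) for a toy sentiment task.
cond_prob = {
    "positive": {"great": 0.10, "plot": 0.02, "boring": 0.001},
    "negative": {"great": 0.01, "plot": 0.02, "boring": 0.050},
}

def naive_likelihood(words, c):
    """Approximate P(w1, ..., wn | c) as the product of the per-word conditionals."""
    p = 1.0
    for w in words:
        p *= cond_prob[c][w]
    return p

# naive_likelihood(["great", "plot"], "positive") -> 0.10 * 0.02 = 0.002
# naive_likelihood(["great", "plot"], "negative") -> 0.01 * 0.02 = 0.0002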

8
Q

Naive Bayes Applied

A

positions ← all word positions in the test document

c_NB = argmax over c ∈ C of { P(c) ∏ᵢ P(wᵢ | c) }
where i ∈ positions

To avoid underflow and increase speed, calculations are done in log space:

c_NB = argmax over c ∈ C of { log P(c) + Σᵢ log P(wᵢ | c) }
where i ∈ positions

Because the prediction in log space is a linear function of the input features, naive Bayes is a linear classifier.
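
A minimal sketch of this decision rule in log space, assuming log_prior[c] and log_likelihood[c][w] have already been estimated from training data (these names and the vocab set are illustrative):

import math

def predict(test_words, classes, log_prior, log_likelihood, vocab):
    """Return argmax over c of log P(c) + sum over positions i of log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in test_words:
            if w in vocab:                      # unknown words are simply ignored
                score += log_likelihood[c][w]
        if score > best_score:
            best_class, best_score = c, score
    return best_class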

9
Q

Naive Bayes:

How to learn the probability P(c)?

A

For the class prior P(c), we ask what percentage of documents in our training set are in each class c.

Let Nc be the number of documents in our training data with class c and Ndoc be the total number of documents.

Then ^P(c) = Nc / Ndoc
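
A minimal sketch of this estimate, assuming the training labels are available as a list (names are illustrative):

from collections import Counter

def estimate_priors(train_labels):
    """^P(c) = Nc / Ndoc for each class c in the training labels."""
    n_doc = len(train_labels)
    return {c: n_c / n_doc for c, n_c in Counter(train_labels).items()}

# estimate_priors(["pos", "neg", "pos", "pos"]) -> {"pos": 0.75, "neg": 0.25}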

10
Q

Naive Bayes:

How to learn the probability P(fᵢ | c)?

A

To learn the probability P(fᵢ | c), we’ll assume a feature is just the existence of a word in the document’s bag of words.

We’ll want P(wᵢ | c), which we compute as the fraction of times the word wᵢ appears among all words in all documents of topic c.

We first concatenate all documents with category c into one big “category c” text. Then we use the frequency of wᵢ in this concatenated document to give a maximum likelihood estimate of the probability:

^P(wᵢ | c) = count(wᵢ , c) / Σ count(w, c) over all words w
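
A minimal sketch of this maximum likelihood estimate, assuming the training set is a list of (words, class) pairs (names are illustrative):

from collections import Counter, defaultdict

def estimate_likelihoods_mle(train_docs):
    """^P(w | c) = count(w, c) / total number of word tokens in class c."""
    word_counts = defaultdict(Counter)          # word_counts[c][w] == count(w, c)
    for words, c in train_docs:
        word_counts[c].update(words)            # effectively concatenates all class-c documents
    return {
        c: {w: n / sum(counts.values()) for w, n in counts.items()}
        for c, counts in word_counts.items()
    }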

11
Q

Laplace Smoothing in Naive Bayes

A

To prevent zero probabilities, 1 is added to the counts:

^P(wᵢ | c)
= [ count(wᵢ , c) + 1 ]
/ [ Σ count(w, c) over all words w in V  +  |V| ]

where V is the vocabulary.
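
A minimal sketch of the add-one estimate, reusing the same kind of per-class counts as the previous card (names are illustrative):

from collections import Counter, defaultdict

def estimate_likelihoods_laplace(train_docs):
    """^P(w | c) = (count(w, c) + 1) / (total count in class c + |V|)."""
    word_counts, vocab = defaultdict(Counter), set()
    for words, c in train_docs:
        word_counts[c].update(words)
        vocab.update(words)
    likelihoods = {}
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)
        # every vocabulary word gets a nonzero probability, even if unseen in class c
        likelihoods[c] = {w: (counts[w] + 1) / denom for w in vocab}
    return likelihoods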

12
Q

Naive Bayes

What to do with unknown words?

A

Ignore them.

Remove them from the test document and do not include any probability for them at all.

13
Q

Naive Bayes

Stop words

A

Very frequent words like ‘the’ and ‘a’ are often ignored.

Removed from both the training and test documents.

14
Q

Binary Multinomial Naive Bayes

A

Used for sentiment analysis.

For sentiment classification, whether a word occurs or not seems to matter more than its frequency.

So word counts are clipped at 1 for each document.
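
A minimal sketch of the clipping step, applied to each document before counting (the helper name is illustrative):

def binarize(doc_words):
    """Keep each word type at most once, i.e. clip per-document counts at 1."""
    return list(set(doc_words))

# binarize(["great", "great", "plot"]) -> ["great", "plot"] (in some order);
# training then proceeds exactly as in ordinary multinomial naive Bayes.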

15
Q

How to deal with negation for Naive Bayes.

A

During text normalisation, prepend the prefix NOT_ to every word after a token of logical negation until the next punctuation mark.
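
A minimal sketch of this normalisation on a tokenised sentence; the sets of negation tokens and punctuation marks below are illustrative, not exhaustive:

NEGATIONS = {"not", "no", "never"}               # illustrative logical-negation tokens
PUNCTUATION = {".", ",", "!", "?", ";", ":"}     # marks that end the negated span

def mark_negation(tokens):
    """Prepend NOT_ to every token after a negation word, up to the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS:
                negating = True
    return out

# mark_negation(["I", "did", "not", "like", "this", "movie", ",", "but"])
# -> ["I", "did", "not", "NOT_like", "NOT_this", "NOT_movie", ",", "but"]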

16
Q

Sentiment lexicons

A

Lists of words that are pre-annotated with positive or negative sentiment.

17
Q

Language id

A

The task of determining what language a given piece of text is written in.

18
Q

Gold labels

A

Human-defined labels for each document.

19
Q

Confusion matrix

A

A table for visualising how an algorithm performs with respect to the human gold labels, with one dimension for the system output and one for the gold labels, and each cell counting one of the possible outcomes (e.g. true positives, false positives).
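
For a binary task the matrix is the usual 2x2 layout (a generic sketch, not tied to any particular system or dataset):

                     gold positive           gold negative
system positive      true positive (TP)      false positive (FP)
system negative      false negative (FN)     true negative (TN)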

20
Q

Accuracy

A

The percentage of all observations our system labeled correctly.

21
Q

Precision

A

Measures the percentage of the items that the system detected (i.e. labeled as positive) that are in fact positive according to the gold labels.

Precision = TP / (TP + FP)

22
Q

Recall

A

Measures the percentage of items actually present in the input that were correctly identified by the system.

Recall = TP / (TP + FN)

23
Q

F-measure

A

A weighted harmonic mean of the precision and recall.

Fᵦ = (β² + 1) · P · R / [ β² · P + R ]

where P is precision and R is recall.

β > 1 favors recall, while β < 1 favors precision.

24
Q

F1 Score

A

An F-measure where β=1 and the weights of precision and recall are equally balanced.

F₁ = 2PR / (P + R)
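
A minimal sketch tying the last few cards together, computing precision, recall, and Fᵦ from raw counts (the function names and example numbers are illustrative):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1 = 2PR / (P + R)."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# With, say, TP=8, FP=2, FN=4:
# precision = 8/10 = 0.8, recall = 8/12 ≈ 0.667, F1 ≈ 0.727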

25
Q

Harmonic mean

A

The harmonic mean of a set of numbers is the reciprocal of the arithmetic mean of reciprocals.

HM(a₁, a₂, …, aₙ)
= n / ( 1/a₁ + 1/a₂ + … + 1/aₙ )

26
Q

Why is a Harmonic mean considered a conservative metric?

A

The harmonic mean of two values is closer to the minimum of the two values than the arithmetic mean is.

I.e. it weighs the lower of the two numbers more heavily.
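
A quick worked example with two arbitrary values, 0.2 and 0.8: the arithmetic mean is 0.5, while the harmonic mean is 2 / (1/0.2 + 1/0.8) = 2 / 6.25 = 0.32, much closer to the minimum of 0.2.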

27
Q

Classification evaluation

Macroaveraging

A

We compute the performance for each class and then average over classes.

More appropriate when performance on all classes is equally important.

28
Q

Classification evaluation

Microaveraging

A

We collect the decisions for each class into a single confusion matrix, and then compute precision and recall from that table.

More appropriate when performance on the larger classes is more important.
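
A minimal sketch contrasting the two averages for precision, given per-class (TP, FP, FN) counts in a dictionary (the layout and numbers are illustrative):

def macro_precision(per_class):
    """Average the per-class precisions, weighting every class equally."""
    precisions = [tp / (tp + fp) for tp, fp, _ in per_class.values()]
    return sum(precisions) / len(precisions)

def micro_precision(per_class):
    """Pool all counts into a single table first, so frequent classes dominate."""
    total_tp = sum(tp for tp, _, _ in per_class.values())
    total_fp = sum(fp for _, fp, _ in per_class.values())
    return total_tp / (total_tp + total_fp)

# per_class = {"urgent": (8, 10, 3), "normal": (60, 50, 40), "spam": (200, 30, 60)}
# macro_precision averages the three class precisions; micro_precision is
# dominated by the largest class ("spam" here).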

29
Q

Cross-validation

A

In cross-validation, we choose a number k and partition our data into k disjoint subsets called folds.

Now we choose one of those k folds as a test set, train our classifier on the remaining k-1 folds, and then compute the error rate on the test set.

Then we repeat with another fold as the test set, again training on the other k-1 folds.

We repeat this process k times, once per fold, and average the k test-set error rates to get an overall error rate.
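
A minimal sketch of k-fold cross-validation, assuming a train_and_evaluate(train, test) callable that returns an error rate (that callable and the flat-list data format are placeholders):

def cross_validate(data, k, train_and_evaluate):
    """Partition data into k folds; each fold serves once as the test set."""
    folds = [data[i::k] for i in range(k)]            # k disjoint subsets
    error_rates = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        error_rates.append(train_and_evaluate(train, test))
    return sum(error_rates) / k                       # average test-set error rate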