Supervised Classification 7 Flashcards

1
Q

Give examples of Text Classification. Three or more.

Think about how you would classify a piece of text or a book

A
  • Assigning subject categories, topics or genres
  • Spam detection
  • Authorship identification
  • Age/gender identification
  • Language identification
  • Sentiment analysis
2
Q

What is Rule-Based Classification?

A

Classification by hand-written rules based on combinations of words or other features:
• Spam: whether the message mentions my name or mentions money; feature phrases like ‘you are selected’, ‘this is not a spam’
• POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly), upper-case letters, 35-year
• Sentiment analysis: ‘rock’, ‘appreciation’, ‘marvel’, ‘masterpiece’
• Gender identification from names: number of vowels, the ending letter
• Accuracy can be high if the features are well designed and selected by experts
• However, building and maintaining these rules is expensive!
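
As a hedged illustration (the phrase list and threshold below are invented, not from the slides), a rule-based spam filter can be as simple as counting matched trigger phrases:

# Minimal sketch of a rule-based spam filter; phrases and threshold
# are illustrative assumptions, not expert-curated rules.
SPAM_PHRASES = ["you are selected", "this is not a spam", "money"]

def is_spam(text: str, threshold: int = 1) -> bool:
    """Flag a message as spam if it matches enough hand-written rules."""
    text = text.lower()
    hits = sum(phrase in text for phrase in SPAM_PHRASES)
    return hits >= threshold

print(is_spam("Congratulations, you are selected to win money!"))  # True
print(is_spam("Meeting moved to 3pm."))                            # False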

3
Q

What is Text Classification?

A
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
Output
• The predicted class c ∈ C for d
4
Q

What is Supervised Classification?

A

Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
• A training set of N hand-labeled documents {(d1 , c1), … , (dN , cN)}
Output
• A classifier that, for any document, predicts its class

5
Q

When is a classifier supervised?

A

• A classifier is called supervised if it is built based on training corpora containing the correct label for each input

6
Q

What is a dev-test set for?

A

Analyze errors, select features, and optimize hyper-parameters

7
Q

What is a test set for?

A

Test on held-out data (the model should not be optimized to this data!)

8
Q

What is the training set for?

A

To train the model
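
A common way to carve out the three splits (train / dev-test / test) with scikit-learn; the 80/10/10 ratios and toy documents below are assumptions, not from the slides:

from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First split off the held-out test set, then carve dev out of the rest
X_tmp, X_test, y_tmp, y_test = train_test_split(docs, labels, test_size=0.1, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=1/9, random_state=0)
print(len(X_train), len(X_dev), len(X_test))  # 80 10 10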

9
Q

What is a multinomial logistic regression?

A

It’s a classification algorithm for problems with more than two classes

• Predicts, for each class, the probability that the input falls into that category
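
A minimal sketch of the softmax step at the heart of multinomial logistic regression; the weights W and bias b below are made-up numbers, not learned parameters:

import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

W = np.array([[ 1.0, -0.5],    # one weight row per class (3 classes, 2 features)
              [-0.3,  0.8],
              [ 0.1,  0.1]])
b = np.array([0.0, 0.2, -0.1])

x = np.array([0.5, 1.5])       # feature vector for one document
probs = softmax(W @ x + b)     # one probability per class, summing to 1
print(probs, probs.sum())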

10
Q

How do you calculate the similarity between the prediction and the truth labels in multinomial logistic regression?

A
Loss function, e.g. cross-entropy loss:
CE(P1, P2) = −∑i P1(xi) log P2(xi)
Note that cross-entropy loss is not symmetric: in general CE(P1, P2) ≠ CE(P2, P1).

See slide 7 pg. 31
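
A small numeric sketch of the asymmetry, using two arbitrary example distributions:

import numpy as np

def cross_entropy(p1, p2):
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return -np.sum(p1 * np.log(p2))

p = [0.9, 0.1]   # e.g. the true label distribution
q = [0.6, 0.4]   # e.g. the model's predicted distribution

print(cross_entropy(p, q))  # ~0.551
print(cross_entropy(q, p))  # ~0.984 — a different value, so CE is asymmetric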

11
Q

What does using more features do in ML algorithms?

A
  • More features usually allow for more flexibility, hence better performance at training time
  • But using more features also usually increases the chance of overfitting
12
Q

What are the issues with vocabulary size in TF and TF-IDF?

A
Why a large vocabulary harms:
• Over-parameterization and overfitting
• Increased computational and representational costs
• Many ‘noisy features’, which may harm performance (especially when raw TF/IDF values are used)
13
Q

What are some methods to reduce the vocabulary size in TF(-IDF)?

A

• Remove extremely common words, e.g. stop words and punctuation
• Remove extremely uncommon words, i.e. words that appear in only a few documents
Among the rest, you may:
• Select the top-TF words, because they are more representative
• Select the top-IDF words, because they are more informative
• Select the top TF-IDF words, to strike a balance
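
A minimal sketch of these pruning steps using scikit-learn's TfidfVectorizer; the corpus and cut-off values are toy choices:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the movie was a masterpiece",
        "the plot was dull and the acting worse",
        "a dull movie with a dull plot"]

vec = TfidfVectorizer(
    stop_words="english",   # drop extremely common words
    min_df=2,               # drop words appearing in fewer than 2 documents
    max_features=10,        # keep only the top remaining words
)
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # e.g. ['dull' 'movie' 'plot']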

14
Q

Why do you decide your vocabulary at training time and keep it fixed at test time?

A

Because your model does not understand what each feature means; it relies on each feature's position to learn its importance/weight
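
A short sketch of the fit-on-train, transform-on-test pattern with scikit-learn's CountVectorizer (toy documents): the vocabulary is learned once on the training data, and unseen test-time words are simply dropped.

from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["spam spam money", "meeting at noon"]
test_docs = ["money for the meeting", "brand new word here"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary
X_test = vec.transform(test_docs)        # same columns; unseen words dropped

print(vec.get_feature_names_out())
print(X_test.toarray())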

15
Q

How do you calculate accuracy in a model performance evaluation?

A

Accuracy
• Of all predictions, how many are correct
• Acc=(TP+TN)/(TP+FP+FN+TN)
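
A worked example with made-up confusion counts:

TP, FP, FN, TN = 40, 5, 5, 50
accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  # 0.9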

16
Q

When the label distribution is highly unbalanced, you can be easily fooled by the accuracy. How do you avoid being fooled?

A

• Report the accuracy of the ‘simple majority’ baseline
• Check the label distribution of the training data
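
A sketch of the majority baseline using scikit-learn's DummyClassifier; the 90/10 label split is illustrative:

from sklearn.dummy import DummyClassifier

X = [[0]] * 100
y = [0] * 90 + [1] * 10   # highly unbalanced labels

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.9 — a real model must beat this to be useful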

17
Q

How do you calculate precision in a model performance evaluation?

A

Precision:
• % of labelled items that are correct
• Precision = TP/(TP+FP)
• If no items are labelled positive (TP+FP = 0), precision is undefined (N/A), not 0

18
Q

How do you calculate recall in a model performance evaluation?

A

Recall:
• % of correct items that have been labelled
• Recall = TP/(TP+FN)
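
A short sketch computing both metrics with scikit-learn on made-up labels; the zero_division argument handles the undefined (N/A) case mentioned above:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(precision_score(y_true, y_pred))  # TP=2, FP=1 -> 2/3
print(recall_score(y_true, y_pred))     # TP=2, FN=1 -> 2/3

# When nothing is labelled positive (TP+FP = 0), precision is undefined;
# sklearn lets you choose what to report via zero_division.
print(precision_score(y_true, [0, 0, 0, 0, 0, 0], zero_division=0))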

19
Q

What are the qualities of an aggressive classifier?

A
• Tends to label more items
• High recall, low precision
• Useful when you don't want to miss any spam; suitable for first-round filtering (shortlisting)
20
Q

What are the qualities of a conservative classifier?

A
• Tends to label fewer items; only labels the very certain ones
• High precision, low recall
• Useful when you don't want any false alarms; suitable for second-round selection
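
A hedged sketch showing how moving the decision threshold turns one probabilistic classifier aggressive or conservative; the spam probabilities below are invented:

import numpy as np

spam_probs = np.array([0.95, 0.7, 0.55, 0.4, 0.2])

aggressive = spam_probs >= 0.3    # low threshold: high recall, low precision
conservative = spam_probs >= 0.8  # high threshold: high precision, low recall

print(aggressive.sum(), "items flagged by the aggressive classifier")    # 4
print(conservative.sum(), "items flagged by the conservative classifier")  # 1
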
21
Q

What is F1 Score/F Measure and how is it calculated?

A

The general F-measure is a weighted harmonic mean of precision and recall; with P and R weighted equally (α = 1/2) it becomes the F1 score:

F1 = 2PR/(P+R)
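
A worked example with made-up P and R values, showing the harmonic mean sits closer to the lower of the two:

P, R = 0.5, 1.0
f1 = 2 * P * R / (P + R)
print(f1)  # 0.667 — closer to the lower value, unlike the arithmetic mean 0.75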

22
Q

What are ways to deal with imbalanced data?

A
• Class weights: assign higher weights to the minority class, i.e., incur a higher loss if a minority item is misclassified as majority
• Down-sampling: sample the majority class to make its frequency closer to the rarest class, and use the sampled subset of the data to train a model
  • Pros: easy to implement; allows many different sampling methods
  • Cons: smaller training data size; sometimes poor performance on real data (with real class distributions)
• Up-sampling: resample the minority class to increase its frequency
  • In NLP this means you need to create some new text of the minority class; this is also known as data augmentation
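
A minimal sketch of two of these strategies in scikit-learn; the data and choices are toy assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1], [1.2], [1.3], [1.4], [0.95]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # class 1 is the minority

# (1) class weights: mistakes on the minority class cost more in the loss
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# (2) up-sampling: resample minority items until the frequencies match
minority = [x for x, label in zip(X, y) if label == 1]
upsampled = resample(minority, replace=True, n_samples=7, random_state=0)
print(len(upsampled))  # 7, matching the majority-class count
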
23
Q

To select the most likely class for a given document what do you do?

A

Given an input document d, a classifier assigns a probability to each class, P(c|d), and selects the most likely one. By Bayes' rule, P(ci|d) ∝ P(d|ci)P(ci), since the denominator P(d) is the same for every class, so:
c = argmax(ci) P(ci|d) = argmax(ci) P(d|ci)P(ci)
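
A tiny sketch of the argmax selection; the class names and scores below are invented:

import numpy as np

classes = ["sports", "politics", "tech"]
scores = np.array([0.02, 0.10, 0.05])  # P(d|ci) * P(ci) for each class

print(classes[int(np.argmax(scores))])  # "politics"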

24
Q

Go over chapter 7, slides 19 to 24

A

Go over chapter 7, slides 19 to 24

25
Q

Watch the logistic regression video (chapter 4, video 3)

A

Watch the logistic regression video (chapter 4, video 3)
26
Q

What is the F1 score?

A

The F-score, also called the F1 score, is a measure of a model's accuracy on a dataset. It combines the model's precision and recall, and is defined as their harmonic mean.
27
Q

What tool in Python is used for F1, accuracy, precision, and recall?

A

sklearn (scikit-learn), via its sklearn.metrics module
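
A hedged sketch of the usual sklearn.metrics calls on toy labels:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # 4/6
print(precision_score(y_true, y_pred))  # 2/3
print(recall_score(y_true, y_pred))     # 2/3
print(f1_score(y_true, y_pred))         # 2/3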