Supervised Classification (Chapter 7) Flashcards
Give examples of Text Classification. Three or more.
Think about how you would classify a piece of text or a book
- Assigning subject categories, topics or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language identification
- Sentiment analysis
What is Rule-Based Classification?
Rules based on combinations of words or other features
• Spam: whether it mentions my name, whether it mentions money, phrases like ‘you are selected’, ‘this is not a spam’
• POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly),
upper case letters, 35-year
• Sentiment analysis: ‘rock’, ‘appreciation’, ‘marvel’, ‘masterpiece’
• Gender identification from names: number of vowels, the ending letter
• Accuracy can be high if the features are well designed and selected by experts
• However, building and maintaining these rules is expensive!
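As a concrete illustration, here is a minimal rule-based spam classifier in Python; the specific rules and phrases are made-up assumptions in the spirit of the examples above, not a fixed rule set:

```python
# Minimal rule-based spam classifier: hand-written rules over words and phrases.
# The rule set below is illustrative only; real systems use many expert-curated rules.

SPAM_PHRASES = ["you are selected", "this is not a spam"]
SPAM_WORDS = {"money", "prize", "winner"}

def is_spam(text: str, my_name: str = "alice") -> bool:
    lower = text.lower()
    if my_name in lower:      # mentions my name -> likely a legitimate message
        return False
    if any(phrase in lower for phrase in SPAM_PHRASES):
        return True
    return any(word in SPAM_WORDS for word in lower.split())

print(is_spam("Congratulations, you are selected to win money!"))  # True
```

Hand-maintaining rule lists like these is exactly the expense noted above.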
What is Text Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
Output
• The predicted class c ∈ C for d
What is Supervised Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
• A training set of N hand-labeled documents {(d1 , c1), … , (dN , cN)}
Output
• A classifier such that for any document, it predicts its class
When is a classifier supervised?
• A classifier is called supervised if it is built based on training corpora containing the correct label for each input
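A minimal sketch of supervised training with scikit-learn (assuming scikit-learn is installed; the tiny corpus and labels are invented for illustration):

```python
# Build a classifier from hand-labeled documents, then predict on new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["win free money now", "meeting at noon", "claim your prize money", "lunch tomorrow?"]
train_labels = ["spam", "ham", "spam", "ham"]   # hand-labeled training set {(d1,c1),...,(dN,cN)}

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = LogisticRegression().fit(X_train, train_labels)

# The resulting classifier predicts a class c in C for any document d.
print(clf.predict(vectorizer.transform(["free money inside"])))  # expected: ['spam']
```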
What is a dev-test set for?
Analyze errors, select features, and optimize hyper-parameters
What is a test set for?
Test on held-out data (the model should not be optimized to this data!)
What is the training set for?
Train the model
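A hedged sketch of producing the three splits; the 80/10/10 ratio and the use of scikit-learn's train_test_split are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]   # placeholder corpus
labels = [i % 2 for i in range(100)]           # placeholder labels

# Hold out 20%, then split the held-out part into dev-test and test (80/10/10 overall).
X_train, X_rest, y_train, y_rest = train_test_split(docs, labels, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Train on the training set; analyze errors and tune hyper-parameters on the dev-test set;
# evaluate on the test set only once, at the very end.
```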
What is a multinomial logistic regression?
It’s a classification algorithm
• Predicts the probability of the input falling into each category
How do you calculate the similarity between the prediction and the truth labels in multinomial logistic regression?
Use a loss function, e.g. cross-entropy loss:
CE(P1, P2) = −∑i P1(xi) log P2(xi)
Cross-entropy loss is not symmetric: CE(P1, P2) ≠ CE(P2, P1)
See chapter 7 slides, p. 31
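A small numeric sketch of the formula above, using two made-up probability distributions to show the asymmetry:

```python
import math

def cross_entropy(p, q):
    # CE(P1, P2) = -sum_i P1(x_i) * log P2(x_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p1 = [0.7, 0.2, 0.1]   # e.g. the truth distribution
p2 = [0.5, 0.3, 0.2]   # e.g. the predicted distribution

print(cross_entropy(p1, p2))   # ~0.887
print(cross_entropy(p2, p1))   # ~1.122 -- not symmetric!
```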
What does using more features in ML algorithms do?
- More features usually allow more flexibility, hence better performance at training time
- But using more features also usually increases the chance of overfitting
What are the issues with vocabulary size in TF and TF-IDF?
Why it harms:
• Over-parameterization, overfitting
• Increases both computational and representational expenses
• Introduces many ‘noisy features’, which may harm performance (especially when raw TF/IDF values are used)
What are some methods to reduce the vocabulary size in TF(-IDF)?
• Remove extremely common words, e.g. stop words and punctuation
• Remove extremely uncommon words, i.e. words that only appear in very few documents
Among the rest, you may:
• Select the top TF words, because they are more representative
• Select the top IDF words, because they are more informative
• Select the top TF-IDF words, to strike a balance
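A sketch of these pruning options with scikit-learn's TfidfVectorizer; the thresholds and corpus are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran", "a rare pangolin appeared"]

vectorizer = TfidfVectorizer(
    stop_words="english",   # remove extremely common words
    min_df=2,               # remove words appearing in fewer than 2 documents
    max_df=0.9,             # remove words appearing in more than 90% of documents
    max_features=5000,      # cap the vocabulary at the top-frequency terms
)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))   # ['cat', 'sat']
```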
Why do you decide your vocabulary at training time, and keep it fixed at test time?
Because your model does not understand what each feature means; it relies on the position of each feature to learn the importance/weight of each feature
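A minimal sketch of this fit-once, reuse-everywhere convention (the corpus is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["win free money", "meeting notes"]
test_docs = ["money for the meeting", "entirely unseen words here"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                 # vocabulary is decided here, at training time

X_test = vectorizer.transform(test_docs)   # same vocabulary; unseen test words are simply ignored
print(X_test.shape[1] == len(vectorizer.vocabulary_))   # True: each feature keeps its position
```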
How do you calculate accuracy in a model performance evaluation?
Accuracy
• Of all predictions, how many are correct
• Acc=(TP+TN)/(TP+FP+FN+TN)
When the label distribution is highly unbalanced, you can be easily fooled by the accuracy. How do you avoid being fooled?
• Report the accuracy of the ‘simple majority’ baseline
• Check the label distribution of the training data
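A tiny worked example of the majority baseline, with a made-up 95/5 label split:

```python
# If 95% of items are 'ham', always predicting 'ham' already scores 95% accuracy.
labels = ["ham"] * 95 + ["spam"] * 5
majority = max(set(labels), key=labels.count)    # 'ham'

baseline_acc = labels.count(majority) / len(labels)
print(baseline_acc)   # 0.95 -- a model must beat this before its accuracy means anything
```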
How do you calculate precision in a model performance evaluation?
Precision:
• % of labelled items that are correct
• TP/(TP+FP)
If the denominator (TP+FP) is 0, precision is N/A (undefined), not 0
How do you calculate recall in a model performance evaluation?
Recall:
• % of correct items that have been labelled
• TP/(TP+FN)
What are the qualities of an aggressive classifier?
• Tends to label more items
• High recall, low precision
• Use when you don’t want to miss any spam; suitable for first-round filtering (shortlisting)
What are the qualities of a conservative classifier?
• Tends to label fewer items; only labels the very certain ones
• High precision, low recall
• Use when you don’t want any false alarms; suitable for second-round selection
What is F1 Score/F Measure and how is it calculated?
It is the weighted harmonic mean of precision and recall:
F = 1 / (α(1/P) + (1 − α)(1/R))
With α = 1/2 (equally weighting P and R): F1 = 2PR / (P + R)
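A small sketch computing precision, recall, and F1 from confusion-matrix counts (the counts are made up), including the N/A case noted above:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else None   # N/A when nothing was labelled
    recall = tp / (tp + fn) if (tp + fn) else None
    if not precision or not recall:
        return precision, recall, None
    f1 = 2 * precision * recall / (precision + recall)  # F1 = 2PR / (P + R)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))   # (0.8, ~0.667, ~0.727)
```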
What are ways to deal with imbalanced data?
• Class weights: assign higher weights to the minority class, i.e., incur a higher loss when a minority item is misclassified as majority
• Down-sampling: sample the majority class to bring its frequency closer to the rarest class, and train on the sampled subset
  • Pros: easy to implement; allows many different sampling methods
  • Cons: smaller training data size; sometimes poor performance on real data (with real class distributions)
• Up-sampling: the minority class is resampled to increase the corresponding frequencies • In NLP, it means you need to create some new text of the minority class. This is also known as data augmentation.
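A sketch of the first two approaches; class_weight="balanced" is a real scikit-learn option, while the data and label names are placeholders:

```python
import random
from sklearn.linear_model import LogisticRegression

# Class weights: a misclassified minority item contributes more to the loss.
clf = LogisticRegression(class_weight="balanced")   # weights inversely proportional to class frequency

# Down-sampling: shrink the majority class to roughly the minority size.
majority = [("some ham text", "ham")] * 90      # placeholder examples
minority = [("some spam text", "spam")] * 10
random.seed(0)
train = random.sample(majority, len(minority)) + minority
```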
To select the most likely class for a given document, what do you do?
Given an input document d, a classifier assigns a probability to each class, P(c|d), and selects the most likely one. By Bayes’ rule (P(d) is the same for every class), this is:
c = argmax(ci) P(ci|d) = argmax(ci) P(d|ci) P(ci)
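A minimal sketch of this decision rule, with made-up per-class scores P(d|c)·P(c):

```python
# Unnormalized class scores P(d|c) * P(c) for one document (invented numbers).
scores = {"sports": 0.02 * 0.3, "politics": 0.05 * 0.5, "tech": 0.01 * 0.2}

predicted = max(scores, key=scores.get)   # c = argmax_c P(d|c) P(c)
print(predicted)   # 'politics'
```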
Go over chapter 7 slide 19 to 24