Supervised Classification (Chapter 7) Flashcards
Give examples of Text Classification. Three or more.
Think about how you would classify a piece of text or a book
- Assigning subject categories, topics or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language identification
- Sentiment analysis
What is Rule-Based Classification?
Rules based on combinations of words or other features
• Spam: whether it mentions my name, whether it mentions money, phrases like ‘you are selected’, ‘this is not a spam’
• POS tagging: prefixes (inconvenient, irregular), suffixes (friendly, quickly),
upper case letters, 35-year
• Sentiment analysis: ‘rock’, ‘appreciation’, ‘marvel’, ‘masterpiece’
• Gender identification from names: number of vowels, the ending letter
• Accuracy can be high if the features are well designed and selected by experts
• However, building and maintaining these rules is expensive!
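As a concrete illustration, here is a minimal rule-based spam classifier in Python; the specific rules and phrases are made-up assumptions in the spirit of the examples above, not a fixed rule set:

```python
# Minimal rule-based spam classifier: hand-written rules over words and phrases.
# The rule set below is illustrative only; real systems use many expert-curated rules.

SPAM_PHRASES = ["you are selected", "this is not a spam"]
SPAM_WORDS = {"money", "prize", "winner"}

def is_spam(text: str, my_name: str = "alice") -> bool:
    lower = text.lower()
    if my_name in lower:      # mentions my name -> likely a legitimate message
        return False
    if any(phrase in lower for phrase in SPAM_PHRASES):
        return True
    return any(word in SPAM_WORDS for word in lower.split())

print(is_spam("Congratulations, you are selected to win money!"))  # True
```

Hand-maintaining rule lists like these is exactly the expense noted above.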
What is Text Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
Output
• The predicted class c ∈ C for d
What is Supervised Classification?
Input
• A document d
• A fixed set of classes C = {C1, C2, … , Cm}
• A training set of N hand-labeled documents {(d1 , c1), … , (dN , cN)}
Output
• A classifier such that for any document, it predicts its class
When is a classifier supervised?
• A classifier is called supervised if it is built based on training corpora containing the correct label for each input
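A minimal sketch of supervised training with scikit-learn (assuming scikit-learn is installed; the tiny corpus and labels are invented for illustration):

```python
# Build a classifier from hand-labeled documents, then predict on new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["win free money now", "meeting at noon", "claim your prize money", "lunch tomorrow?"]
train_labels = ["spam", "ham", "spam", "ham"]   # hand-labeled training set {(d1,c1),...,(dN,cN)}

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = LogisticRegression().fit(X_train, train_labels)

# The resulting classifier predicts a class c in C for any document d.
print(clf.predict(vectorizer.transform(["free money inside"])))  # expected: ['spam']
```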
What is a dev-test set for?
Analyze errors, select features, and optimize hyper-parameters
What is a test set for?
Test on held-out data (the model should not be optimized to this data!)
What is the training set for?
Train the model
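A hedged sketch of producing the three splits; the 80/10/10 ratio and the use of scikit-learn's train_test_split are assumptions for illustration:

```python
from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]   # placeholder corpus
labels = [i % 2 for i in range(100)]           # placeholder labels

# Hold out 20%, then split the held-out part into dev-test and test (80/10/10 overall).
X_train, X_rest, y_train, y_rest = train_test_split(docs, labels, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Train on the training set; analyze errors and tune hyper-parameters on the dev-test set;
# evaluate on the test set only once, at the very end.
```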
What is a multinomial logistic regression?
It’s a classification algorithm
• Predicts the probability of the input falling into each category
How do you calculate the similarity between the prediction and the truth labels in multinomial logistic regression?
Use a loss function, e.g. cross-entropy loss:
CE(P1, P2) = −∑i P1(xi) log P2(xi)
Cross-entropy loss is not symmetric: CE(P1, P2) ≠ CE(P2, P1)
See chapter 7 slides, p. 31
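A small numeric sketch of the formula above, using two made-up probability distributions to show the asymmetry:

```python
import math

def cross_entropy(p, q):
    # CE(P1, P2) = -sum_i P1(x_i) * log P2(x_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p1 = [0.7, 0.2, 0.1]   # e.g. the truth distribution
p2 = [0.5, 0.3, 0.2]   # e.g. the predicted distribution

print(cross_entropy(p1, p2))   # ~0.887
print(cross_entropy(p2, p1))   # ~1.122 -- not symmetric!
```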
What does using more features in ML algorithms do?
- More features usually allow more flexibility, hence better performance at training time
- But using more features also usually increases the chance of overfitting
What are the issues with vocabulary size in TF and TF-IDF?
Why it harms:
• Over-parameterization, overfitting
• Increases both computational and representational expenses
• Introduces many ‘noisy features’, which may harm performance (especially when raw TF/IDF values are used)
What are some methods to reduce the vocabulary size in TF(-IDF)?
• Remove extremely common words, e.g. stop words and punctuation
• Remove extremely uncommon words, i.e. words that only appear in very few documents
Among the rest, you may:
• Select the top TF words, because they are more representative
• Select the top IDF words, because they are more informative
• Select the top TF-IDF words, to strike a balance
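A sketch of these pruning options with scikit-learn's TfidfVectorizer; the thresholds and corpus are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran", "a rare pangolin appeared"]

vectorizer = TfidfVectorizer(
    stop_words="english",   # remove extremely common words
    min_df=2,               # remove words appearing in fewer than 2 documents
    max_df=0.9,             # remove words appearing in more than 90% of documents
    max_features=5000,      # cap the vocabulary at the top-frequency terms
)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))   # ['cat', 'sat']
```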
Why do you decide your vocabulary at training time, and keep it fixed at test time?
Because your model does not understand what each feature means; it relies on the position of each feature to learn the importance/weight of each feature
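A minimal sketch of this fit-once, reuse-everywhere convention (the corpus is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["win free money", "meeting notes"]
test_docs = ["money for the meeting", "entirely unseen words here"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)                 # vocabulary is decided here, at training time

X_test = vectorizer.transform(test_docs)   # same vocabulary; unseen test words are simply ignored
print(X_test.shape[1] == len(vectorizer.vocabulary_))   # True: each feature keeps its position
```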
How do you calculate accuracy in a model performance evaluation?
Accuracy
• Of all predictions, how many are correct
• Acc=(TP+TN)/(TP+FP+FN+TN)
When the label distribution is highly unbalanced, you can be easily fooled by the accuracy. How do you avoid being fooled?
• Report the accuracy of the ‘simple majority’ baseline
• Check the label distribution of the training data
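A tiny worked example of the majority baseline, with a made-up 95/5 label split:

```python
# If 95% of items are 'ham', always predicting 'ham' already scores 95% accuracy.
labels = ["ham"] * 95 + ["spam"] * 5
majority = max(set(labels), key=labels.count)    # 'ham'

baseline_acc = labels.count(majority) / len(labels)
print(baseline_acc)   # 0.95 -- a model must beat this before its accuracy means anything
```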
How do you calculate precision in a model performance evaluation?
Precision:
• % of labelled items that are correct
• TP/(TP+FP)
If the denominator (TP+FP) is 0, precision is N/A (undefined), not 0
How do you calculate recall in a model performance evaluation?
Recall:
• % of correct items that have been labelled
• TP/(TP+FN)
What are the qualities of an aggressive classifier?
• Tends to label more items
• High recall, low precision
• Use when you don’t want to miss any spam; suitable for first-round filtering (shortlisting)
What are the qualities of a conservative classifier?
• Tends to label fewer items; only labels the very certain ones
• High precision, low recall
• Use when you don’t want any false alarms; suitable for second-round selection
What is F1 Score/F Measure and how is it calculated?
It is the weighted harmonic mean of precision and recall:
F = 1 / (α(1/P) + (1 − α)(1/R))
With α = 1/2 (equally weighting P and R): F1 = 2PR / (P + R)
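A small sketch computing precision, recall, and F1 from confusion-matrix counts (the counts are made up), including the N/A case noted above:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else None   # N/A when nothing was labelled
    recall = tp / (tp + fn) if (tp + fn) else None
    if not precision or not recall:
        return precision, recall, None
    f1 = 2 * precision * recall / (precision + recall)  # F1 = 2PR / (P + R)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))   # (0.8, ~0.667, ~0.727)
```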
What are ways to deal with imbalanced data?
• Class weights: assign higher weights to the minority class, i.e., incur a higher loss when a minority item is misclassified as majority
• Down-sampling: sample the majority class to bring its frequency closer to the rarest class, and train on the sampled subset
  • Pros: easy to implement; allows many different sampling methods
  • Cons: smaller training data size; sometimes poor performance on real data (with real class distributions)
• Up-sampling: the minority class is resampled to increase the corresponding frequencies • In NLP, it means you need to create some new text of the minority class. This is also known as data augmentation.
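A sketch of the first two approaches; class_weight="balanced" is a real scikit-learn option, while the data and label names are placeholders:

```python
import random
from sklearn.linear_model import LogisticRegression

# Class weights: a misclassified minority item contributes more to the loss.
clf = LogisticRegression(class_weight="balanced")   # weights inversely proportional to class frequency

# Down-sampling: shrink the majority class to roughly the minority size.
majority = [("some ham text", "ham")] * 90      # placeholder examples
minority = [("some spam text", "spam")] * 10
random.seed(0)
train = random.sample(majority, len(minority)) + minority
```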
To select the most likely class for a given document, what do you do?
Given an input document d, a classifier assigns a probability to each class, P(c|d), and selects the most likely one. By Bayes’ rule (P(d) is the same for every class), this is:
c = argmax(ci) P(ci|d) = argmax(ci) P(d|ci) P(ci)
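A minimal sketch of this decision rule, with made-up per-class scores P(d|c)·P(c):

```python
# Unnormalized class scores P(d|c) * P(c) for one document (invented numbers).
scores = {"sports": 0.02 * 0.3, "politics": 0.05 * 0.5, "tech": 0.01 * 0.2}

predicted = max(scores, key=scores.get)   # c = argmax_c P(d|c) P(c)
print(predicted)   # 'politics'
```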
Go over chapter 7 slide 19 to 24