Natural Language Processing Flashcards
(25 cards)
What are three techniques for handling word variation?
Tokenisation, Stemming, Lemmatisation
What is Tokenisation?
Breaking a sentence or paragraph into individual words (tokens)
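A minimal sketch of word-level tokenisation using a regular expression (an illustrative approach; the cards don't prescribe a method or library):

```python
import re

# A simple word tokeniser: split on anything that isn't a letter, digit or apostrophe.
sentence = "The quick brown fox jumps over the lazy dog."
tokens = re.findall(r"[A-Za-z0-9']+", sentence)
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```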
What is stemming?
reducing a word to its root form
What is lemmatisation?
reducing a word to its lemma (its dictionary base form)
What is the difference between stemming and lemmatisation?
Stemming simply cuts off affixes added for context (adjustable -> adjust), whereas lemmatisation maps the word to its dictionary base form (better -> good, was -> (to) be)
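A minimal sketch assuming NLTK's PorterStemmer and WordNetLemmatizer (the cards don't name a library; the lemmatiser also needs the WordNet corpus downloaded first):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet data once: nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("adjustable"))                # 'adjust' - suffix simply cut off
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   - mapped to its lemma
print(lemmatizer.lemmatize("was", pos="v"))      # 'be'
```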
What is a “bag-of-words” representation?
All words in the corpus are treated as independent items, ignoring order and context
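A toy sketch of a bag-of-words count using only the standard library; the document is an illustrative assumption:

```python
from collections import Counter

# Bag-of-words: word order and context are discarded, only counts remain.
doc = "the cat sat on the mat"
bag = Counter(doc.split())
print(bag)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```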
What does TF-IDF stand for?
term frequency - inverse document frequency
How are keywords identified?
- Calculate the frequency of every term in each document
- For each document, calculate the TF-IDF score of every term
- For n keywords, take the terms with the top n TF-IDF scores
How is TF-IDF calculated?
Multiply a term's frequency within a document (TF) by its inverse document frequency (IDF) across the collection of documents
How is IDF calculated?
IDF = log(n / n_t)
n = number of documents in the corpus
n_t = number of documents in the corpus containing the term
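A minimal sketch of this formula on a toy corpus, also showing the top-n keyword step from the earlier card; TF here is the raw in-document count, which is one common convention:

```python
import math
from collections import Counter

# Toy corpus (illustrative); each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
n = len(docs)

def idf(term):
    # n_t = number of documents containing the term
    n_t = sum(1 for d in docs if term in d)
    return math.log(n / n_t)

def tfidf(doc):
    counts = Counter(doc)                 # TF = raw count within the document
    return {term: counts[term] * idf(term) for term in counts}

# Top-2 keywords per document = terms with the highest TF-IDF scores.
for d in docs:
    scores = tfidf(d)
    print(sorted(scores, key=scores.get, reverse=True)[:2])
```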
How is text represented?
Using a sparse vector (most entries are zero, since each document contains only a small fraction of the vocabulary)
What values are recorded in a sparse matrix?
- Word Count
- Presence of a word
- TF-IDF
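A sketch assuming scikit-learn, where each of the three value types corresponds to one vectoriser setting; note scikit-learn's default IDF is a smoothed variant of the formula above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs)                # word counts
presence = CountVectorizer(binary=True).fit_transform(docs)   # presence of a word (0/1)
tfidf = TfidfVectorizer().fit_transform(docs)                 # TF-IDF scores

# All three are scipy sparse matrices: only nonzero entries are stored.
print(counts.shape, counts.nnz)
```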
What is topic modelling used for?
Dimensionality reduction
What are the advantages of topic modelling?
- It is language agnostic - only uses a document-term matrix as input
- There is no need to know in advance what the topics are
- A quick and common way to summarise text
What are the disadvantages of topic modelling?
- It is not well defined for new documents, so it is difficult to generalise to unseen documents
- There are many free parameters - their number grows linearly with the size of the corpus
- Prone to overfitting
- Doesn’t work well for short text
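A sketch using Latent Dirichlet Allocation from scikit-learn as one common topic model (the cards don't name an algorithm); the corpus and topic count are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

# The only input is a document-term matrix, so the method is language agnostic.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The number of topics is a free parameter chosen in advance.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # each document reduced to 2 topic weights
print(doc_topics.shape)             # (4, 2): dimensionality reduction

# Most heavily weighted words per topic.
terms = vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[-3:]])
```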
What is word embedding?
Representing every word as a dense vector
What can a word embedding vector represent?
- Word morphology
- The contexts where a word appears
- The global corpus statistics
- Concept hierarchy
- The relationship between a set of documents and the terms in them
How are word embeddings used?
Word embeddings can be fed to a neural network and used to predict the next word in the sentence
How does a continuous bag-of-words (CBOW) model work?
It predicts a target word using the surrounding words as context
How does the skip-gram model work?
It uses a single word to predict the surrounding context words
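A sketch assuming gensim (version 4+, where the parameter is vector_size); the sg flag switches between CBOW and skip-gram, and this corpus is far too small for meaningful embeddings:

```python
from gensim.models import Word2Vec

# Toy corpus (illustrative); real embeddings need far more text.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# sg=0 -> CBOW (context words predict a target word); sg=1 -> skip-gram.
cbow = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                  # (20,) dense embedding vector
print(skipgram.wv.most_similar("cat", topn=2))
```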
How can word embeddings be generated?
One-hot encoded words are fed into a neural network with a single hidden layer; after training to maximise prediction accuracy, the hidden-layer weight vector associated with each word is that word's embedding.
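A NumPy sketch with toy, untrained weights showing why the hidden-layer weight matrix holds the embeddings: multiplying a one-hot vector by it simply selects one row.

```python
import numpy as np

vocab = ["cat", "dog", "mat"]
V, d = len(vocab), 4   # vocabulary size and embedding dimension

# Hidden-layer weight matrix of the single-layer network (toy random values,
# not trained). After training, its rows are the word embeddings.
W = np.random.rand(V, d)

# One-hot encoding of "dog" ...
one_hot = np.zeros(V)
one_hot[vocab.index("dog")] = 1

# ... multiplied by W selects the corresponding row: the word's embedding.
embedding = one_hot @ W
print(np.allclose(embedding, W[vocab.index("dog")]))  # True
```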
How are semantic relationships between words represented?
They appear as geometric relationships between vectors in the vector space, e.g. vector("king") - vector("man") + vector("woman") ≈ vector("queen")
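A sketch of that analogy assuming gensim's downloader and pretrained GloVe vectors (this downloads data on first use; a toy model like the one above would not show the effect):

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe vectors trained on a large corpus.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman: the nearest vector is typically "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```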
What is polysemy?
When a word has multiple meanings
What is one solution for word embeddings and polysemy?
One embedding is generated for every sense of the word, known as sense embeddings