Natural Language Processing Flashcards
(25 cards)
What are three techniques for handling word variation?
Tokenisation, Stemming, Lemmatisation
What is Tokenisation?
Breaking a sentence or paragraph into individual words (tokens)
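A minimal sketch of word-level tokenisation using a regular expression (an illustrative approach; the cards don't prescribe a method or library):

```python
import re

# A simple word tokeniser: split on anything that isn't a letter, digit or apostrophe.
sentence = "The quick brown fox jumps over the lazy dog."
tokens = re.findall(r"[A-Za-z0-9']+", sentence)
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```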
What is stemming?
reducing a word to its root form
What is lemmatisation?
reducing a word to its lemma (its dictionary base form)
What is the difference between stemming and lemmatisation?
Stemming simply cuts off affixes added for context (adjustable -> adjust), whereas lemmatisation maps the word to its dictionary base form (better -> good, was -> (to) be)
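A minimal sketch assuming NLTK's PorterStemmer and WordNetLemmatizer (the cards don't name a library; the lemmatiser also needs the WordNet corpus downloaded first):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet data once: nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("adjustable"))                # 'adjust' - suffix simply cut off
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   - mapped to its lemma
print(lemmatizer.lemmatize("was", pos="v"))      # 'be'
```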
What is a “bag-of-words” representation?
All words in the corpus are treated as independent items, ignoring order and context
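A toy sketch of a bag-of-words count using only the standard library; the document is an illustrative assumption:

```python
from collections import Counter

# Bag-of-words: word order and context are discarded, only counts remain.
doc = "the cat sat on the mat"
bag = Counter(doc.split())
print(bag)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```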
What does TF-IDF stand for?
term frequency - inverse document frequency
How are keywords identified?
- Calculate the frequency of every term in each document
- For each document, calculate the TF-IDF score of every term
- For n keywords, take the terms with the top n TF-IDF scores
How is TF-IDF calculated?
Multiply a term's frequency within a document (TF) by its inverse document frequency (IDF) across the collection of documents
How is IDF calculated?
IDF = log(n / n_t)
n = number of documents in the corpus
n_t = number of documents in the corpus containing the term
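A minimal sketch of this formula on a toy corpus, also showing the top-n keyword step from the earlier card; TF here is the raw in-document count, which is one common convention:

```python
import math
from collections import Counter

# Toy corpus (illustrative); each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
n = len(docs)

def idf(term):
    # n_t = number of documents containing the term
    n_t = sum(1 for d in docs if term in d)
    return math.log(n / n_t)

def tfidf(doc):
    counts = Counter(doc)                 # TF = raw count within the document
    return {term: counts[term] * idf(term) for term in counts}

# Top-2 keywords per document = terms with the highest TF-IDF scores.
for d in docs:
    scores = tfidf(d)
    print(sorted(scores, key=scores.get, reverse=True)[:2])
```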
How is text represented?
Using a sparse vector (most entries are zero, since each document contains only a small fraction of the vocabulary)
What values are recorded in a sparse matrix?
- Word Count
- Presence of a word
- TF-IDF
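A sketch assuming scikit-learn, where each of the three value types corresponds to one vectoriser setting; note scikit-learn's default IDF is a smoothed variant of the formula above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

counts = CountVectorizer().fit_transform(docs)                # word counts
presence = CountVectorizer(binary=True).fit_transform(docs)   # presence of a word (0/1)
tfidf = TfidfVectorizer().fit_transform(docs)                 # TF-IDF scores

# All three are scipy sparse matrices: only nonzero entries are stored.
print(counts.shape, counts.nnz)
```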
What is topic modelling used for?
Dimensionality reduction
What are the advantages of topic modelling?
- It is language agnostic - only uses a document-term matrix as input
- There is no need to know in advance what the topics are
- A quick and common way to summarise text
What are the disadvantages of topic modelling?
- It is not well defined for new documents, so it is difficult to generalise to unseen documents
- There are many free parameters - their number grows linearly with the size of the corpus
- Prone to overfitting
- Doesn’t work well for short text
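A sketch using Latent Dirichlet Allocation from scikit-learn as one common topic model (the cards don't name an algorithm); the corpus and topic count are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

# The only input is a document-term matrix, so the method is language agnostic.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The number of topics is a free parameter chosen in advance.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # each document reduced to 2 topic weights
print(doc_topics.shape)             # (4, 2): dimensionality reduction

# Most heavily weighted words per topic.
terms = vectorizer.get_feature_names_out()
for topic in lda.components_:
    print([terms[i] for i in topic.argsort()[-3:]])
```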
What is word embedding?
Representing every word as a dense vector
What can a word embedding vector represent?
- Word morphology
- The contexts where a word appears
- The global corpus statistics
- Concept hierarchy
- The relationship between a set of documents and the terms in them
How are word embeddings used?
Word embeddings can be fed to a neural network and used to predict the next word in the sentence
How does a continuous bag-of-words (CBOW) model work?
It predicts a target word using the surrounding words as context
How does the skip-gram model work?
It uses a single word to predict the surrounding context words
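A sketch assuming gensim (version 4+, where the parameter is vector_size); the sg flag switches between CBOW and skip-gram, and this corpus is far too small for meaningful embeddings:

```python
from gensim.models import Word2Vec

# Toy corpus (illustrative); real embeddings need far more text.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# sg=0 -> CBOW (context words predict a target word); sg=1 -> skip-gram.
cbow = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                  # (20,) dense embedding vector
print(skipgram.wv.most_similar("cat", topn=2))
```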
How can word embeddings be generated?
One-hot encoded words are fed into a neural network with a single hidden layer; after training to maximise prediction accuracy, the hidden-layer weight vector associated with each word is that word's embedding.
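A NumPy sketch with toy, untrained weights showing why the hidden-layer weight matrix holds the embeddings: multiplying a one-hot vector by it simply selects one row.

```python
import numpy as np

vocab = ["cat", "dog", "mat"]
V, d = len(vocab), 4   # vocabulary size and embedding dimension

# Hidden-layer weight matrix of the single-layer network (toy random values,
# not trained). After training, its rows are the word embeddings.
W = np.random.rand(V, d)

# One-hot encoding of "dog" ...
one_hot = np.zeros(V)
one_hot[vocab.index("dog")] = 1

# ... multiplied by W selects the corresponding row: the word's embedding.
embedding = one_hot @ W
print(np.allclose(embedding, W[vocab.index("dog")]))  # True
```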
How are semantic relationships between words represented?
They appear as geometric relationships between vectors in the vector space, e.g. vector("king") - vector("man") + vector("woman") ≈ vector("queen")
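A sketch of that analogy assuming gensim's downloader and pretrained GloVe vectors (this downloads data on first use; a toy model like the one above would not show the effect):

```python
import gensim.downloader as api

# Pretrained 50-dimensional GloVe vectors trained on a large corpus.
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman: the nearest vector is typically "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```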
What is polysemy?
When a word has multiple meanings
What is one solution for word embeddings and polysemy?
One embedding is generated for every sense of the word, known as sense embeddings