Natural Language Processing Flashcards

(25 cards)

1
Q

What are three techniques for handling word variation?

A

Tokenisation, Stemming, Lemmatisation

2
Q

What is Tokenisation?

A

Breaking a sentence or paragraph into individual words (tokens)

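A minimal sketch of tokenisation in plain Python, splitting text into word tokens with a regular expression (real tokenisers such as NLTK's word_tokenize handle punctuation and edge cases more carefully):

```python
import re

def tokenise(text):
    # Lowercase, then keep runs of letters/digits/apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenise("The cats were sitting on the mat."))
# ['the', 'cats', 'were', 'sitting', 'on', 'the', 'mat']
```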
3
Q

What is stemming?

A

Reducing a word to its root form

4
Q

What is lemmatisation?

A

Reducing a word to its lemma (its dictionary base form)

5
Q

What is the difference between stemming and lemmatisation?

A

Stemming simply cuts off the additional characters that add context (adjustable -> adjust), whereas lemmatisation changes the word to its dictionary base form (better -> good, was -> (to) be)

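A rough sketch of the stemming vs. lemmatisation contrast from the last three cards, using NLTK (an assumption: nltk and its WordNet data are available; exact stems depend on the stemmer chosen):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-off download of the WordNet data the lemmatiser relies on.
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

print(stemmer.stem("adjustable"))           # stemming strips the suffix -> 'adjust'
print(stemmer.stem("better"))               # stemming cannot map to a new word -> 'better'
print(lemmatiser.lemmatize("better", "a"))  # lemmatisation (as adjective) -> 'good'
print(lemmatiser.lemmatize("was", "v"))     # lemmatisation (as verb) -> 'be'
```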
6
Q

What is a “bag-of-words” representation?

A

All words in the corpus are considered only as individual items, without regard to order or context

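A minimal bag-of-words sketch in plain Python: each document is reduced to unordered word counts, discarding order and context:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Each document becomes a multiset of word counts (a "bag").
bags = [Counter(doc.split()) for doc in docs]
print(bags[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```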
7
Q

What does TF-IDF stand for?

A

Term Frequency - Inverse Document Frequency

8
Q

How are keywords identified?

A

  1. Calculate the frequency of every term
  2. For each document, calculate all TF-IDF scores
  3. For n keywords, take the top n TF-IDF scores (sketched below)

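A hedged sketch of this top-n keyword procedure using scikit-learn's TfidfVectorizer (an assumption: scikit-learn is installed; its IDF uses smoothing, so scores differ slightly from the plain log(n/nt) formula on a later card):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
]

vectoriser = TfidfVectorizer()
tfidf = vectoriser.fit_transform(docs)            # document-term matrix of TF-IDF scores
terms = np.array(vectoriser.get_feature_names_out())

n = 3
for i, doc in enumerate(docs):
    scores = tfidf[i].toarray().ravel()
    keywords = terms[scores.argsort()[::-1][:n]]  # top-n TF-IDF scores for this document
    print(doc, "->", list(keywords))
```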
9
Q

How is TF-IDF calculated?

A

Multiply a term's frequency within a document (TF) by its inverse document frequency (IDF) across a collection of documents

10
Q

How is IDF calculated?

A

IDF = log(n / n_t)
n = number of documents in the corpus
n_t = number of documents in the corpus that contain the term

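A plain-Python sketch of the two formulas above: TF-IDF = TF x IDF, with IDF = log(n / n_t):

```python
import math
from collections import Counter

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["stocks", "fell", "sharply"],
]
n = len(docs)                                # number of documents in the corpus

def idf(term):
    n_t = sum(term in doc for doc in docs)   # documents that contain the term
    return math.log(n / n_t)

def tf_idf(term, doc):
    tf = Counter(doc)[term]                  # term frequency within this document
    return tf * idf(term)

print(tf_idf("the", docs[0]))   # common term, low IDF -> ~0.81
print(tf_idf("mat", docs[0]))   # rarer term, higher IDF -> ~1.10
```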
11
Q

How is text represented?

A

Using a sparse vector

12
Q

What values are recorded in a sparse matrix?

A
  1. Word Count
  2. Presence of a word
  3. TF-IDF
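A small sketch of a sparse document-term representation, storing only the non-zero word counts for each document (presence flags or TF-IDF scores could be stored the same way):

```python
from collections import Counter

vocab = ["cat", "dog", "mat", "sat", "the"]   # column order of the matrix
docs = ["the cat sat on the mat", "the dog sat"]

# Sparse rows: only {column index: value} pairs for non-zero entries.
sparse_rows = []
for doc in docs:
    counts = Counter(doc.split())
    sparse_rows.append({j: counts[w] for j, w in enumerate(vocab) if counts[w]})

print(sparse_rows)
# [{0: 1, 2: 1, 3: 1, 4: 2}, {1: 1, 3: 1, 4: 1}]
```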
13
Q

What is topic modelling used for?

A

Dimensionality reduction

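A hedged sketch of topic modelling as dimensionality reduction using scikit-learn's LDA (an assumption: scikit-learn is installed; the corpus and number of topics are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat chased the mouse",
    "dogs and cats make good pets",
    "stocks rose as the market rallied",
    "investors sold shares when the market fell",
]

# Documents -> document-term matrix -> low-dimensional topic mixtures.
dtm = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(dtm)

print(topic_mix.shape)   # (4, 2): each document is reduced to 2 topic weights
```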
14
Q

What are the advantages of topic modelling?

A
  1. It is language agnostic - only uses a document-term matrix as input
  2. There is no need to know in advance what the topics are
  3. A quick and common way to summarise text
15
Q

What are the disadvantages of topic modelling?

A
  1. It is not well defined, so it is difficult to generalise to new, unseen documents
  2. There are many free parameters (their number scales linearly with the size of the corpus)
  3. It is prone to overfitting
  4. It doesn't work well for short texts
16
Q

What is word embedding?

A

Representing every word as a dense vector

17
Q

What can a word embedding vector represent?

A
  1. Word morphology
  2. The contexts where a word appears
  3. Global corpus statistics
  4. Concept hierarchy
  5. The relationship between a set of documents and the terms in them
18
Q

How are word embeddings used?

A

Word embeddings can be fed to a neural network and used to predict the next word in the sentence

19
Q

How does a continuous bag-of-words (CBOW) model work?

A

It predicts a target word using the multiple words surrounding it as context

20
Q

How does the skip-gram model work?

A

It uses a single word to predict the multiple words surrounding it (its context)
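
A hedged sketch of training both architectures with gensim's Word2Vec (assumptions: gensim 4.x is installed; sg=0 selects CBOW and sg=1 selects skip-gram; the toy corpus is illustrative):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

# CBOW: predict a target word from its surrounding context words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram: predict the surrounding context words from a single target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)   # (50,) dense embedding vector for 'cat'
```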

21
Q

How can word embeddings be generated?

A

One-hot encoded words are fed into a single-layer neural network; the weight vector that maximises prediction accuracy becomes the word embedding of the given word.
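
A numpy sketch of that idea: multiplying a one-hot vector by the network's input weight matrix simply selects one row, and that row is the word's embedding (the weights here are random stand-ins for trained values):

```python
import numpy as np

vocab = ["cat", "dog", "mat"]
vocab_size, embed_dim = len(vocab), 4

# Stand-in for the trained input-layer weight matrix (vocab_size x embed_dim).
W = np.random.rand(vocab_size, embed_dim)

one_hot = np.zeros(vocab_size)
one_hot[vocab.index("dog")] = 1

embedding = one_hot @ W             # picks out the row of W for 'dog'
assert np.allclose(embedding, W[1])
print(embedding)                    # the dense embedding for 'dog'
```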

22
Q

How are semantic relationships between words represented?

A

They are represented as the relationship between vectors in vector space
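
A toy sketch of the classic analogy arithmetic (king - man + woman ≈ queen) using made-up 3-d vectors and cosine similarity; real embeddings would come from a trained model:

```python
import numpy as np

# Made-up vectors standing in for trained word embeddings.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The offset king - man captures a relationship; adding it to woman lands near queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
print(max(vecs, key=lambda w: cosine(vecs[w], target)))   # 'queen'
```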

23
Q

What is polysemy?

A

When a word has multiple meanings

24
Q

What is one solution for word embeddings and polysemy?

A

One embedding is generated for every sense of the word, known as sense embeddings

25
Q

What are some applications of sentence embedding?

A
  1. Measuring semantic similarity
  2. Clustering similar texts
  3. Paraphrase mining
  4. Automatic translation
  5. Text summarisation
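
A hedged sketch of the semantic-similarity use case with the sentence-transformers library (assumptions: the library is installed, and 'all-MiniLM-L6-v2' is just one commonly used model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]

# Each sentence becomes a dense vector; cosine similarity measures semantic closeness.
embeddings = model.encode(sentences)
print(util.cos_sim(embeddings[0], embeddings[1]))   # expected higher: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))   # expected lower: unrelated topic
```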