Natural Language Processing Flashcards

(34 cards)

1
Q

What is Natural Language Processing?

A

The act of handling natural language using some form of computational model

2
Q

What is text language similarity?

A

Detecting which language a piece of text is written in, based on the words it contains

3
Q

What is sentiment analysis?

A

Classifying the emotional content of a message

4
Q

What is topic extraction?

A

Detection of the topic(s) discussed in a piece of text

5
Q

What is text summarisation?

A

Producing a condensed version of a text that preserves its key information

6
Q

What is relationship extraction?

A

Extracting the relationships between objects in text

7
Q

What is question answering?

A

Answering questions posed to an NLP system in natural language

8
Q

What is language generation?

A

Generating natural language, e.g. in response to a question

9
Q

What is Bayesian spam detection?

A

A method of estimating the probability that a message is spam, using Bayes' theorem

10
Q

How is Bayesian spam detection calculated?

A

The product, over each word in the message, of the probability of that word given spam, multiplied by the prior probability of spam, divided by the probability of the message
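
In symbols, for a message of words w1…wn: P(spam | w1…wn) = P(spam) × Π P(wi | spam) / P(w1…wn). A minimal Python sketch, assuming the per-word likelihood tables (p_word_spam and p_word_ham are hypothetical names) have already been estimated from a labelled corpus:

    from math import prod

    def p_spam_given_message(words, p_word_spam, p_word_ham, p_spam=0.5):
        # prior P(spam) times the product of P(word | spam) over the message
        spam = p_spam * prod(p_word_spam.get(w, 1e-6) for w in words)
        # the 'probability of a message' denominator is the total over both classes
        ham = (1 - p_spam) * prod(p_word_ham.get(w, 1e-6) for w in words)
        return spam / (spam + ham)  # 1e-6 is a crude stand-in for proper smoothing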

11
Q

What is word variation?

A

The fact that two words can have the same root/lemma but have different meanings

12
Q

What is the lemma of a word?

A

The ‘root’ of that word - e.g. ‘speak’ and ‘speaks’ have the same lemma

13
Q

What is tokenisation?

A

Splitting a sentence or paragraph into individual tokens or words
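
A minimal sketch in plain Python (whitespace splitting; real tokenisers such as nltk.word_tokenize also split off punctuation):

    sentence = 'The cat sat on the mat.'
    tokens = sentence.split()
    # ['The', 'cat', 'sat', 'on', 'the', 'mat.'] - note the full stop stays attached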

14
Q

What is stemming?

A

Reducing a word to its root form by removing any prefixes and suffixes - e.g. ‘changing’ -> ‘chang’
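
A sketch using NLTK's Porter stemmer (assumes the nltk package is installed):

    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()
    stemmer.stem('changing')  # 'chang' - not necessarily a dictionary word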

15
Q

What is lemmatisation?

A

Reducing a word to its dictionary form (lemma) while preserving its meaning - e.g. ‘was’ -> ‘to be’, ‘changing’ -> ‘change’
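
A sketch using NLTK's WordNet lemmatiser (assumes nltk and its wordnet data are installed; supplying the part of speech improves results):

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('was', pos='v')       # 'be'
    lemmatizer.lemmatize('changing', pos='v')  # 'change'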

16
Q

Why might tokenisation differ between English and other languages?

A

Two separate tokens may be semantically linked - e.g. in Vietnamese, ‘thoi gian’ means a period of time, but the words are unrelated on their own (‘thoi’ means a shuttle or buffet and ‘gian’ means time). Conversely, a single token may combine multiple words - e.g. in Japanese, 姉 (‘ane’) means ‘older sister’ and 妹 (‘imōto’) means ‘younger sister’, while ‘sisters’ is a compound of the two: 姉妹 (pronounced ‘shimai’)

17
Q

What is meant by a ‘bag of words’ representation?

A

An entire corpus/document is represented simply as the frequencies of its stemmed or lemmatised words, with word order discarded
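
A minimal sketch with Python's standard library, assuming the text has already been tokenised and lemmatised:

    from collections import Counter
    bag = Counter(['change', 'change', 'speak', 'be'])
    # Counter({'change': 2, 'speak': 1, 'be': 1}) - order is discarded, only frequencies remain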

18
Q

What is meant by a ‘corpus’ in relation to word processing?

A

All of the text being used, or a document in its entirety

19
Q

How do word frequency distributions typically look?

A

Like a long tail (a Zipfian distribution), with a few very frequent connectives at the front and many rare descriptive words at the end

20
Q

To represent data among multiple documents, we could use…

A

A document-word frequency matrix
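
A sketch using scikit-learn's CountVectorizer (one row per document, one column per word):

    from sklearn.feature_extraction.text import CountVectorizer
    docs = ['the cat sat', 'the dog sat', 'the dog barked']
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(docs)  # sparse 3 x 5 document-word count matrix
    vectorizer.get_feature_names_out()       # ['barked', 'cat', 'dog', 'sat', 'the']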

21
Q

We can measure how much a word relates to a type of document by…

A

Dividing its frequency in that document type by its frequency in other document types
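
As a rough formula, writing freq(w, T) for the frequency of word w in documents of type T:

    relatedness(w, T) = freq(w, T) / freq(w, all other types)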

22
Q

What is one of the main problems with ‘bag of words’ representations?

A

The matrix generated is typically too sparse - there are too many zeroes

23
Q

What is a mathematical topic model used for?

A

Discovering the different topics in a collection of documents

24
Q

What is Latent Dirichlet Allocation (LDA)?

A

Separating a document-term matrix into two separate matrices - document-topic and topic-term - like in 2nd normal form!

25

Q

What does Latent Dirichlet Allocation (LDA) achieve?

A

The separation of words into topics, grouping them together to reduce dimensionality
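
A sketch with scikit-learn, reusing a document-word count matrix X such as the CountVectorizer output shown earlier (n_components is the number of topics to extract):

    from sklearn.decomposition import LatentDirichletAllocation
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)  # document-topic matrix
    topic_term = lda.components_      # topic-term matrix
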
26

Q

What is one problem with topic modelling?

A

It is not great for generative modelling, and it does not work well for short texts with few topics to discuss

27

Q

What is word embedding?

A

Representing every word as a dense vector, i.e. some continuous representation - e.g. the probability that the word will appear in a certain context

28

Q

How can we perform word embedding?

A

Create one-hot vectors representing each word and its context, then provide them to an autoencoder network to obtain the embedding values
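
A sketch of the one-hot step with NumPy (the autoencoder itself is omitted; its hidden layer would supply the dense embedding):

    import numpy as np
    vocab = ['the', 'cat', 'sat']
    one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
    one_hot['cat']  # array([0., 1., 0.])
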
29

30

Q

What can we do to solve the issues posed by single-word encoding (predicting only one word ahead)?

A

Predict multiple surrounding words for each input instead, which is called a skip-gram
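
A sketch using gensim's Word2Vec, where sg=1 selects the skip-gram variant (assumes gensim is installed; a real corpus would be far larger):

    from gensim.models import Word2Vec
    sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    model.wv['cat']  # 50-dimensional dense vector for 'cat'
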
31

Q

Embedded vectors translate semantic relationships to...

A

Vector relationships
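
The classic illustration, assuming a Word2Vec-style model trained on a large corpus:

    # king - man + woman ≈ queen
    model.wv.most_similar(positive=['king', 'woman'], negative=['man'])
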
32

Q

What is meant by polysemy?

A

Words that have multiple meanings - e.g. queen, queen or Queen (the ruler, the poker card or the band)

33

Q

What are sense embeddings?

A

Unique lemmatisations that help separate the meanings of polysemous words - e.g. "Queen" -> "Queen, the band" and "queen" -> "queen, the poker card"

34

Q

What is one way we can adjust our word embeddings for larger-scale corpora ('corpora' being the plural of 'corpus')?

A

By using sentence embeddings, or even document embeddings, which perform the same sort of task but over a larger corpus!
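
A sketch using the sentence-transformers library (the model name here is one common choice, not the only one):

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('all-MiniLM-L6-v2')
    vectors = model.encode(['The cat sat on the mat.', 'Dogs bark loudly.'])
    # one dense vector per sentence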