L15 - NLP Flashcards

1
Q

Define Bayesian

A

A statistical method that assigns probabilities to events and updates them as new evidence is observed.

2
Q

What is NLP?

A

Natural Language Processing is the subset of AI tasked with analysing and interpreting human language.

3
Q

Give some examples of how NLP is used…

A

Text summarisation
Sentiment analysis
Topic extraction
Question answering
Spam detection

4
Q

What is the Bayesian spam detection formula?

A

P(spam | message) = P(message | spam) P(spam) / P(message)
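
A minimal numeric sketch of applying this rule in Python; the class prior and word probabilities below are made-up illustration values, and the word-independence (naive Bayes) likelihood is an added assumption, not part of the formula itself.

    # Hypothetical corpus statistics (made-up numbers, for illustration only).
    p_spam = 0.4                       # P(spam): fraction of training emails that were spam
    p_ham = 1 - p_spam                 # P(not spam)

    # Per-class word likelihoods, e.g. P("winner" | spam), estimated from word counts.
    p_word_given_spam = {"winner": 0.05, "meeting": 0.001}
    p_word_given_ham = {"winner": 0.001, "meeting": 0.04}

    def posterior_spam(words):
        """P(spam | message) via Bayes' rule, assuming the words are independent."""
        like_spam, like_ham = p_spam, p_ham
        for w in words:
            like_spam *= p_word_given_spam.get(w, 1e-6)  # small floor for unseen words
            like_ham *= p_word_given_ham.get(w, 1e-6)
        # P(message) expands to the sum over both classes (law of total probability).
        return like_spam / (like_spam + like_ham)

    print(posterior_spam(["winner"]))   # close to 1 -> likely spam
    print(posterior_spam(["meeting"]))  # close to 0 -> likely not spam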

5
Q

How is NLP used for spam detection?

A

Historic data of both spam and non-spam emails is fed into an NLP model.

This creates a (message x word) matrix where each row is a message, each column is a word, and each value is the count of that word in the message.

From this matrix, classifications can be made regarding whether a message containing certain words is spam or not. This is the NLP classification model.

A new message can then be run through the model to determine whether it is spam.
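
A minimal sketch of this pipeline using scikit-learn (assuming it is installed); the training messages and labels below are made up for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical historic data: messages plus spam / non-spam labels.
    messages = [
        "win a free prize now",
        "claim your free winnings",
        "meeting moved to monday",
        "lunch with the project team",
    ]
    labels = ["spam", "spam", "not spam", "not spam"]

    # Build the (message x word) count matrix: rows = messages, columns = words.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)

    # Fit a classifier on the matrix, then score a new message.
    model = MultinomialNB()
    model.fit(X, labels)

    new_message = ["claim your free prize"]
    print(model.predict(vectorizer.transform(new_message)))  # predicted label for the new message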

6
Q

Explain how NLP can be used to determine whether a block of text is lyrics from a rock song…

A
  1. Download a large set of rock song lyrics.
  2. Break the lyrics down into a large set of words.
  3. Perform stemming or lemmatisation on each word.
  4. Count all occurrences of each word.
  5. Use these counts to predict whether other lyrics are from a rock song or not (see the sketch below).
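
A minimal sketch of these steps using NLTK's Porter stemmer and a plain counter (assuming NLTK is installed); the lyric text is a made-up stand-in for a real downloaded corpus.

    from collections import Counter
    from nltk.stem import PorterStemmer

    # Step 1 (stand-in): a tiny made-up "corpus" of rock lyrics.
    rock_lyrics = "we were born to run, running down the highway, born to be wild"

    # Steps 2-4: break into words, stem each word, count occurrences.
    stemmer = PorterStemmer()
    words = rock_lyrics.replace(",", "").split()
    rock_profile = Counter(stemmer.stem(w) for w in words)

    print(rock_profile.most_common(5))
    # Step 5: compare a new lyric's stemmed-word counts against rock_profile
    # (e.g. via a classifier trained on these counts) to judge if it looks like rock.
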
7
Q

Define Stemming…

A

Process of reducing a word to its root form by removing prefixes or suffixes.

8
Q

Define Lemmatisation…

A

Reducing words to their lemma (base form), which is the contextual root of the word.

For example, the lemma of studies is study, whereas the stem of studies is studi (not a real word).
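
A quick illustration of this difference with NLTK (assuming the library is installed and the WordNet data has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # The lemmatiser needs the WordNet corpus: nltk.download("wordnet")
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("studies"))          # 'studi' -> not a real word
    print(lemmatizer.lemmatize("studies"))  # 'study' -> the lemma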

9
Q

Why do we need stemming, lemmatisation or tokenisation?

A

They group words or phrases with similar meanings, reducing the dimensionality of the model's vector space and improving computational efficiency.

10
Q

What is an issue with stemming that lemmatisation solves?

A

Stemming can lead to non-real words, e.g. studies -> studi.

In contrast, the lemma of studies is study.

11
Q

What issue do Stemming and Lemmatisation solve?

A

High variation of similar words can lead to high dimensionality in the vector space, which degrades computational efficiency.

They reduce the dimensionality by mapping similar words to the same numeric value.

12
Q

Define tokenisation…

A

Breaking a sentence or paragraph into individual words (tokens).
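
A one-line illustration with NLTK's tokeniser (assuming the library and its 'punkt' tokeniser data are installed); note that punctuation becomes its own token.

    from nltk.tokenize import word_tokenize

    # Requires the tokeniser models: nltk.download("punkt")
    print(word_tokenize("NLP breaks sentences into individual tokens."))
    # ['NLP', 'breaks', 'sentences', 'into', 'individual', 'tokens', '.']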

13
Q

Give the Lemma of the following: was, changing, better, worse

A

be
change
good
bad
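
These can be checked with NLTK's WordNet lemmatiser (assuming the WordNet data is downloaded); the part-of-speech hint is needed to recover the contextual lemma.

    from nltk.stem import WordNetLemmatizer

    # Requires the WordNet corpus: nltk.download("wordnet")
    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("was", pos="v"))       # 'be'
    print(lemmatizer.lemmatize("changing", pos="v"))  # 'change'
    print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
    print(lemmatizer.lemmatize("worse", pos="a"))     # 'bad'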

14
Q

Why can applying tokenisation from English to another language cause issues?

A

Because other languages may not have the same sentence structure, and may need numerous words to convey one word in English.

15
Q

What is bag-of-words representation?

A

A vector-based way of representing text. It builds a word matrix containing the counts of each word in the text.

Each element in the vector represents the count of that word.

Columns correspond to all words in the text.

Each row is a sentence.

This can lead to sparse vectors.
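
A minimal sketch of building this word matrix with scikit-learn's CountVectorizer (assuming a recent version of the library); the two sentences are made up for illustration.

    from sklearn.feature_extraction.text import CountVectorizer

    # Each string is one "sentence", i.e. one row of the word matrix.
    texts = ["the cat sat on the mat", "the dog sat on the log"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)          # sparse (sentence x word) matrix

    print(vectorizer.get_feature_names_out())    # columns: every word in the texts
    print(X.toarray())                           # rows: per-sentence word counts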

16
Q

What does Bag-of-words usually result in?

A

A long-tail distribution.

Common words have high counts, less common words have low counts.

Common words such as ‘the’, ‘a’, ‘of’, etc. are often meaningless regarding context.

17
Q

What does the Bag-of-words technique give as output? What can this be used to do?

A

Text in the form of a numerical vector representation. This is now computable for operations such as classification and clustering.

18
Q

What are some issues with bag-of-words?

A
  1. High dimensionality
  2. Does not account for word importance. For example, we see a long-tail distribution due to the high count of common words.
  3. Scaling issues
19
Q

What model builds on Bag-of-words to account for word importance?

A

The TF-IDF (term frequency-inverse document frequency) approach.
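
TF-IDF re-weights each raw count by how rare the word is across documents (roughly tf(t, d) * log(N / df(t)), with library-specific smoothing), so words that appear in every document are down-weighted relative to their raw counts. A minimal sketch with scikit-learn's TfidfVectorizer (assuming the library is installed); the sentences are made up for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["the cat sat on the mat", "the dog sat on the log"]

    # Words unique to one document ('cat', 'dog', 'mat', 'log') receive a higher
    # idf than words shared by both documents ('the', 'sat', 'on').
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)

    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))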