L15 - NLP Flashcards

1
Q

Define Bayesian

A

A statistical method that assigns probabilities to events and updates them as new evidence is observed.

2
Q

What is NLP?

A

Natural Language Processing is the subset of AI tasked with analysing and interpreting human language.

3
Q

Give some examples of how NLP is used…

A

Text summarisation
Sentiment analysis
Topic extraction
Question answering
Spam detection

4
Q

What is the Bayesian spam detection formula?

A

P(spam | message) = P(message | spam) P(spam) / P(message)
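
A minimal numeric sketch of applying this rule in Python; the class prior and word probabilities below are made-up illustration values, and the word-independence (naive Bayes) likelihood is an added assumption, not part of the formula itself.

    # Hypothetical corpus statistics (made-up numbers, for illustration only).
    p_spam = 0.4                       # P(spam): fraction of training emails that were spam
    p_ham = 1 - p_spam                 # P(not spam)

    # Per-class word likelihoods, e.g. P("winner" | spam), estimated from word counts.
    p_word_given_spam = {"winner": 0.05, "meeting": 0.001}
    p_word_given_ham = {"winner": 0.001, "meeting": 0.04}

    def posterior_spam(words):
        """P(spam | message) via Bayes' rule, assuming the words are independent."""
        like_spam, like_ham = p_spam, p_ham
        for w in words:
            like_spam *= p_word_given_spam.get(w, 1e-6)  # small floor for unseen words
            like_ham *= p_word_given_ham.get(w, 1e-6)
        # P(message) expands to the sum over both classes (law of total probability).
        return like_spam / (like_spam + like_ham)

    print(posterior_spam(["winner"]))   # close to 1 -> likely spam
    print(posterior_spam(["meeting"]))  # close to 0 -> likely not spam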

5
Q

How is NLP used for spam detection?

A

Historic data of both spam and non-spam emails is fed into an NLP model.

This creates a (message x word) matrix where each row is a message, each column is a word, and each value is the count of that word in the message.

From this matrix, classifications can be made regarding whether a message containing certain words is spam or not. This is the NLP classification model.

A new message can then be run through the model to determine whether it is spam.
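
A minimal sketch of this pipeline using scikit-learn (assuming it is installed); the training messages and labels below are made up for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical historic data: messages plus spam / non-spam labels.
    messages = [
        "win a free prize now",
        "claim your free winnings",
        "meeting moved to monday",
        "lunch with the project team",
    ]
    labels = ["spam", "spam", "not spam", "not spam"]

    # Build the (message x word) count matrix: rows = messages, columns = words.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)

    # Fit a classifier on the matrix, then score a new message.
    model = MultinomialNB()
    model.fit(X, labels)

    new_message = ["claim your free prize"]
    print(model.predict(vectorizer.transform(new_message)))  # predicted label for the new message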

6
Q

Explain how NLP can be used to determine whether a block of text is lyrics from a rock song…

A
  1. Download a large set of rock song lyrics.
  2. Break the lyrics down into a large set of words.
  3. Perform stemming or lemmatisation on each word.
  4. Count all occurrences of each word.
  5. Use these counts to predict whether other lyrics are from a rock song or not (see the sketch below).
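
A minimal sketch of these steps using NLTK's Porter stemmer and a plain counter (assuming NLTK is installed); the lyric text is a made-up stand-in for a real downloaded corpus.

    from collections import Counter
    from nltk.stem import PorterStemmer

    # Step 1 (stand-in): a tiny made-up "corpus" of rock lyrics.
    rock_lyrics = "we were born to run, running down the highway, born to be wild"

    # Steps 2-4: break into words, stem each word, count occurrences.
    stemmer = PorterStemmer()
    words = rock_lyrics.replace(",", "").split()
    rock_profile = Counter(stemmer.stem(w) for w in words)

    print(rock_profile.most_common(5))
    # Step 5: compare a new lyric's stemmed-word counts against rock_profile
    # (e.g. via a classifier trained on these counts) to judge if it looks like rock.
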
7
Q

Define Stemming…

A

Process of reducing a word to its root form by removing prefixes or suffixes.

8
Q

Define Lemmatisation…

A

Reducing words to their lemma (base form), which is the contextual root of the word.

For example, the lemma of studies is study, whereas the stem of studies is studi (not a real word).
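
A quick illustration of this difference with NLTK (assuming the library is installed and the WordNet data has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # The lemmatiser needs the WordNet corpus: nltk.download("wordnet")
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("studies"))          # 'studi' -> not a real word
    print(lemmatizer.lemmatize("studies"))  # 'study' -> the lemma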

9
Q

Why do we need stemming, lemmatisation or tokenisation?

A

They group words or phrases with similar meanings, reducing the dimensionality of the model's vector space and improving computational efficiency.

10
Q

What is an issue with stemming that lemmatisation solves?

A

Stemming can lead to non-real words, e.g. studies -> studi.

In contrast, the lemma of studies is study.

11
Q

What issue do Stemming and Lemmatisation solve?

A

High variation of similar words can lead to high dimensionality in the vector space, which degrades computational efficiency.

They reduce the dimensionality by mapping similar words to the same numeric value.

12
Q

Define tokenisation…

A

Breaking a sentence or paragraph into individual words (tokens).
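
A one-line illustration with NLTK's tokeniser (assuming the library and its 'punkt' tokeniser data are installed); note that punctuation becomes its own token.

    from nltk.tokenize import word_tokenize

    # Requires the tokeniser models: nltk.download("punkt")
    print(word_tokenize("NLP breaks sentences into individual tokens."))
    # ['NLP', 'breaks', 'sentences', 'into', 'individual', 'tokens', '.']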

13
Q

Give the Lemma of the following: was, changing, better, worse

A

be
change
good
bad
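
These can be checked with NLTK's WordNet lemmatiser (assuming the WordNet data is downloaded); the part-of-speech hint is needed to recover the contextual lemma.

    from nltk.stem import WordNetLemmatizer

    # Requires the WordNet corpus: nltk.download("wordnet")
    lemmatizer = WordNetLemmatizer()

    print(lemmatizer.lemmatize("was", pos="v"))       # 'be'
    print(lemmatizer.lemmatize("changing", pos="v"))  # 'change'
    print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
    print(lemmatizer.lemmatize("worse", pos="a"))     # 'bad'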

14
Q

Why can applying tokenisation from English to another language cause issues?

A

Because other languages may not have the same sentence structure, and may need numerous words to convey one word in English.

15
Q

What is bag-of-words representation?

A

A vector-based way of representing text. It builds a word matrix containing the counts of each word in the text.

Each element in the vector represents the count of that word.

Columns correspond to all words in the text.

Each row is a sentence.

This can lead to sparse vectors.
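
A minimal sketch of building this word matrix with scikit-learn's CountVectorizer (assuming a recent version of the library); the two sentences are made up for illustration.

    from sklearn.feature_extraction.text import CountVectorizer

    # Each string is one "sentence", i.e. one row of the word matrix.
    texts = ["the cat sat on the mat", "the dog sat on the log"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)          # sparse (sentence x word) matrix

    print(vectorizer.get_feature_names_out())    # columns: every word in the texts
    print(X.toarray())                           # rows: per-sentence word counts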

16
Q

What does Bag-of-words usually result in?

A

A long-tail distribution.

Common words have high counts, less common words have low counts.

Common words such as ‘the’, ‘a’, ‘of’, etc. are often meaningless regarding context.

17
Q

What does the Bag-of-words technique give as output? What can this be used to do?

A

Text in the form of a numerical vector representation. This is now computable for operations such as classification and clustering.

18
Q

What are some issues with bag-of-words?

A
  1. High dimensionality
  2. Does not account for word importance. For example, we see a long-tail distribution due to the high count of common words.
  3. Scaling issues
19
Q

What model builds on Bag-of-words to account for word importance?

A

The TF-IDF (term frequency-inverse document frequency) approach.
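
TF-IDF re-weights each raw count by how rare the word is across documents (roughly tf(t, d) * log(N / df(t)), with library-specific smoothing), so words that appear in every document are down-weighted relative to their raw counts. A minimal sketch with scikit-learn's TfidfVectorizer (assuming the library is installed); the sentences are made up for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["the cat sat on the mat", "the dog sat on the log"]

    # Words unique to one document ('cat', 'dog', 'mat', 'log') receive a higher
    # idf than words shared by both documents ('the', 'sat', 'on').
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)

    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))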