NLP Concepts Flashcards

1
Q

What is NLP?

A

NLP (natural language processing) is the practice of teaching computers to understand and communicate with us in the same way we communicate with each other.

Typical applications include language translation, chatbots, assistants such as Siri and Alexa, sentiment analysis, text generation, and text classification.

2
Q

What is an NLP pipeline?

A

It is a series of processing tasks used to transform raw text data into a structured format suitable for ML.

3
Q

Name the steps in an NLP pipeline

A

Data acquisition
Text pre-processing
Feature extraction
Modeling
Evaluation
Deployment
Monitoring

4
Q

What is tokenization?

A

It is the process of breaking text down into words, or breaking words down into sub-words or characters.

These are called tokens. They are the building blocks of NLP models.

They also help build a vocabulary.
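
A minimal sketch of word-level and character-level tokenization in plain Python (a simplistic stand-in for real tokenizers such as those in NLTK or spaCy):

    import re

    text = "Tokenization breaks text into smaller units."

    # Word-level tokens: split on word characters.
    word_tokens = re.findall(r"\w+", text.lower())
    print(word_tokens)   # ['tokenization', 'breaks', 'text', 'into', 'smaller', 'units']

    # Character-level tokens for a single word.
    char_tokens = list("breaks")
    print(char_tokens)   # ['b', 'r', 'e', 'a', 'k', 's']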

5
Q

Why do we tokenize the data in an NLP pipeline?

A

Tokens are the basic building blocks of NLP models.
Tokenization converts unstructured text into a structured format.
It makes pre-processing easier.
Each token can serve as a feature.

6
Q

What is stemming?

A

It is the process of reducing a word to its root form. The root may or may not be part of the vocabulary of that language.
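
A short sketch using NLTK's PorterStemmer (assumes the nltk package is installed); notice that the resulting roots are not always real words:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "flies", "studies", "easily"]:
        # The stem may not be a dictionary word, e.g. "studies" -> "studi"
        print(word, "->", stemmer.stem(word))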

7
Q

What is lemmatization?

A

It is a smarter version of stemming. The root word that is derived is part of the language's vocabulary.

Ex: "ate", "eaten", and "eating" share the root (lemma) "eat".
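
A short sketch using NLTK's WordNetLemmatizer (assumes nltk is installed and the WordNet data has been fetched once with nltk.download('wordnet')):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # pos="v" tells the lemmatizer to treat the words as verbs.
    for word in ["ate", "eaten", "eating"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # Each form maps to the dictionary lemma "eat".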

8
Q

What is a corpus?

A

It is the entire body of documents we use in the context of our NLP application.

In a broader sense, it is a collection of texts or writings that researchers use to study how language works.

9
Q

What does a count vectorizer do?

A

An NLP technique that transforms a collection of text documents into a matrix of token counts (a feature matrix).
Used to implement BoW.
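
A minimal sketch with scikit-learn's CountVectorizer (assumes scikit-learn is installed; get_feature_names_out is the method name in recent versions):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # sparse document-term count matrix

    print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
    print(X.toarray())                          # one row of counts per document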

10
Q

What is Bag of Words (BoW)?

A

An NLP technique for text analysis tasks such as text classification, sentiment analysis, etc.

Treats text as an unordered collection of words.
Does not retain semantic meaning or word order.
Retains the word frequency of each document.
Typically produces sparse vectors.
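
A toy, pure-Python illustration of the same idea (in practice a count vectorizer, as on the previous card, builds this matrix for you):

    from collections import Counter

    docs = ["the cat sat on the mat", "the dog chased the cat"]

    # Build the shared vocabulary (word order is discarded).
    vocab = sorted({word for doc in docs for word in doc.split()})

    # Represent each document as a vector of word counts over that vocabulary.
    for doc in docs:
        counts = Counter(doc.split())
        print([counts.get(word, 0) for word in vocab])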

11
Q

What is TF-IDF?

A

It is a statistic used to evaluate the importance of a term (word or phrase) within a document relative to a collection of documents, or corpus.

12
Q

How is TF calculated?

A

TF = (number of times the term occurs in the document) / (total number of terms in the document).

For example, if a term appears 3 times in a 100-word document, its TF is 3/100 = 0.03.

13
Q

How is IDF calculated?

A

IDF = log(number of documents in the corpus / number of documents in which the term appears).

TF-IDF is the product of TF and IDF.

The base of the log varies by convention; the natural log (base e) is commonly used, e.g. by scikit-learn, which also applies smoothing.
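
A worked toy example using the plain, unsmoothed formulas from these cards (libraries such as scikit-learn apply smoothing, so their numbers differ slightly):

    import math

    # Toy corpus: 4 documents, the term appears in 2 of them.
    n_docs = 4
    docs_with_term = 2

    # Term frequency in one document: the term occurs 3 times out of 100 terms.
    tf = 3 / 100                              # 0.03
    idf = math.log(n_docs / docs_with_term)   # ln(2) ~= 0.693
    tfidf = tf * idf                          # ~= 0.021
    print(tf, idf, tfidf)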

14
Q

What are word embeddings?

A

Dense vector representations of words in a continuous vector space, typically with far fewer dimensions than the vocabulary size.

Designed to capture the semantic meaning and relationships between words.

15
Q

What is Word2Vec?

A

It is a method for transforming words into numerical vectors that capture the meaning and context of words based on their co-occurrence in large text datasets.

It generates word embeddings, converting words into dense vectors using a shallow neural network.
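
A minimal sketch with the gensim library (assumes gensim 4.x; older versions use size= instead of vector_size=):

    from gensim.models import Word2Vec

    # Tiny toy corpus: each sentence is a list of tokens.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["cats", "and", "dogs", "are", "pets"],
    ]

    # sg=0 trains CBOW, sg=1 trains skip-gram (see the next two cards).
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

    print(model.wv["cat"].shape)          # (50,) dense embedding for "cat"
    print(model.wv.most_similar("cat"))   # nearest neighbours (noisy on a toy corpus)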

16
Q

What is a continuous bag of words (CBOW)?

A

It predicts a target word based on the surrounding context words.

17
Q

What is skip-gram?

A

It predicts the context words around a target word. It is the opposite of CBOW.

18
Q

What is an N-gram?

A

It is a contiguous sequence of N items (typically words), used for various NLP tasks, like language modeling or text analysis.

They are useful for understanding the relationships and patterns in sequential data, such as predicting the next word in a sentence.
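
A tiny pure-Python sketch of extracting N-grams from a tokenized sentence (NLTK provides an equivalent ngrams utility):

    def ngrams(tokens, n):
        # Slide a window of length n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the quick brown fox jumps".split()
    print(ngrams(tokens, 2))   # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
    print(ngrams(tokens, 3))   # trigrams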

19
Q

What are some of the things we do in pre-processing?

A

It is a data-cleaning step. It involves lowercasing; removing URLs, HTML tags, stop words, and punctuation; stemming; and lemmatization.

This is done on a case-by-case basis.
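
A rough sketch of a few of these cleaning steps; the stop-word list here is a made-up stand-in for a real one from NLTK or spaCy:

    import re

    STOP_WORDS = {"a", "an", "the", "is", "at", "on", "and", "it"}

    def preprocess(text):
        text = text.lower()                         # lowercasing
        text = re.sub(r"https?://\S+", " ", text)   # remove URLs
        text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
        text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation and digits
        tokens = [t for t in text.split() if t not in STOP_WORDS]   # drop stop words
        return tokens   # stemming/lemmatization could follow here

    print(preprocess("Check <b>this</b> out at https://example.com, it is GREAT!"))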

20
Q

Why do we pre-process the text?

A

The idea is to keep only the tokens that capture the crux of the document, so that the vocabulary size is reduced, thereby reducing the number of features in the dataset.

21
Q

What is a transformer?

A

A deep learning architecture characterized by the self-attention mechanism.
It captures contextual information across an entire sequence.
Used in language translation, sentiment analysis, and text generation.
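
A tiny sketch that runs a pretrained transformer via the Hugging Face transformers library (assumes the package is installed; a default model is downloaded on first use):

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    print(classifier("I really enjoyed studying these flashcards!"))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]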