NLP _ 01 Flashcards

(92 cards)

1
Q

What is most likely the first step of NLP?

A

Text preprocessing

2
Q

What is noise removal?

A

Stripping text of formatting (e.g., HTML tags).
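In Python this kind of noise removal is commonly done with the `re` module; a minimal sketch, using a made-up HTML string:

```python
import re

# Made-up raw text containing HTML formatting (noise)
raw = "<p>So many <b>squids</b> are jumping!</p>"

# Strip anything that looks like an HTML tag
clean = re.sub(r"<[^>]+>", "", raw)
print(clean)  # So many squids are jumping!
```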

3
Q

What is Tokenization?


A

Breaking text into individual words (tokens)

4
Q

What is normalization?

A

Cleaning text data in any way other than noise removal and tokenization

5
Q

What is stemming?

A

It is a blunt axe used to chop off word prefixes and suffixes.

6
Q

What is lemmatization?


A

It is a scalpel that brings words down to their root forms.

7
Q

What would I import to use regex?

A

import re

8
Q

What Python package could I use for NLP?

A

import nltk

9
Q

What method of nltk would I use to tokenize text?

A

from nltk.tokenize import word_tokenize

10
Q

Give an example of a list comprehension:

A

lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

11
Q

How would you import WordNetLemmatizer?

A

from nltk.stem import WordNetLemmatizer

12
Q

How would you import PorterStemmer?

A

from nltk.stem import PorterStemmer

13
Q

By default, lemmatize() treats every word as a…?

A

Noun

14
Q

Language models are probabilistic machine models of …?

A

language used for NLP comprehension tasks

15
Q

Language models learn a …?

A

probability of word occurrence over a sequence of words and use it to estimate the relative likelihood of different phrases.

16
Q

Common language models include:

A

Statistical models:
- bag of words (unigram model)
- n-gram models
Neural Language Modeling (NLM)

17
Q

What is text similarity in NLP?

A

Text similarity is a facet of NLP concerned with the similarity between texts.

18
Q

What are two popular text similarity metrics?

A
  • Levenshtein distance
  • Cosine similarity
19
Q

How would you describe the metric: Levenshtein distance?

A

It is defined as the minimum number of edit operations (deletions, insertions, or substitutions) required to transform one text into another.
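nltk's `edit_distance` (see card 57) computes this for you; here is a from-scratch dynamic-programming sketch of the same idea:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform string a into string b."""
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```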

20
Q

Define the metric: Cosine similarity

A

It is defined as the cosine of the angle between two vectors. To determine the cosine similarity, text documents need to be converted into vectors.
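A minimal sketch of cosine similarity over word-count vectors, using only the standard library (real pipelines would typically vectorize with sklearn or gensim):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat", "the cat ran"))  # ~0.667
```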

21
Q

What are common forms of language prediction?

A
  • **Auto-suggest** and suggested replies
22
Q

Natural Language processing is concerned with …?

A

enabling computers to interpret, analyze, and **approximate** the generation of human speech.

23
Q

What is parsing w.r.t. NLP?

A

It is the process concerned with segmenting text based on syntax.

24
What is Part-of-Speech (POS) tagging?
It identifies parts of speech (**verbs**, **nouns**, **adjectives**, etc.).
25
What helps computers understand the relationship between the words in a sentence?
A dependency grammar tree
26
What does a Dependency grammar tree help you understand?
The relationship between the words in a sentence.
27
What does NER stand for?
Named entity recognition
28
What does NER help identify?
Proper nouns (e.g., "Natalia" or "Berlin") in a text. This can be a clue to figure out the topic of the text.
29
When you have ____ coupled with POS tagging you can identify specific phrase chunks
Regex parsing
30
When you couple Regex parsing and POS tagging you can...?
Identify specific phrase chunks.
31
A very common unigram model, a statistical language model, is commonly known as…?
The bag-of-words model
32
Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning…?
The topic or sentiment of a text. When grammar and word order are irrelevant, this is a good model.
33
what would I import to get word counts for the bag of words model?
`from collections import Counter`
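A minimal bag-of-words sketch using `Counter`; the sentence is a made-up example:

```python
from collections import Counter

# Tokenize a toy sentence, then count word occurrences
tokens = "the cat sat on the mat".split()
bag_of_words = Counter(tokens)
print(bag_of_words["the"])  # 2
print(bag_of_words["cat"])  # 1
```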
34
How would I import a part-of-speech function for lemmatization?
`from part_of_speech import get_part_of_speech`
35
For parsing entire phrases or conducting language prediction , you will want a model that .....?
pays attention to each word's neighbors.
36
Unlike bag-of-words, the **n-gram model** considers a ....?
....sequence of some **number (n)** of units and calculates the probability of each unit in a body of language given the preceding sequence of **length n**. Because of this, n-gram probabilities with larger n values can be impressive at language prediction.
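A minimal sketch of extracting and counting n-grams (here bigrams) from a toy token list:

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of length n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love nlp and i love python".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("i", "love")])  # 2
```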
37
What tactic can help with adjusting probabilities for unknown words but is not always ideal?
Language smoothing
38
What is **Language smoothing**?
A tactic that can help adjust probabilities for unknown words, *but it isn't always ideal*
39
For a model that more accurately predicts human language patterns, you want `n` (your sequence length) ...?
....to be as **large** as **possible**.
40
What happens if you make your **n-grams** too long?
The number of examples to train on shrinks, and you won't have enough data to train on.
41
What are the common **Neural language models (NLMs)**?
1. LSTMs 2. Transformer models
42
What is **Topic Modeling**?
It is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language.
43
A common *technique* is to *deprioritize* the most common words and prioritize less frequently used terms as topics in a process known as…?
...**term frequency-inverse document frequency** (tf-idf)
44
What libraries in Python have modules to handle **tf-idf**?
gensim and sklearn
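Before reaching for gensim or sklearn, the tf-idf idea can be sketched from scratch; the mini corpus and the helper `tf_idf` below are made-up illustrations, not library APIs:

```python
import math

# Made-up mini corpus: each document is a list of tokens
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf_idf(term, doc, docs):
    """Term frequency in doc, weighted by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in most documents, so it is deprioritized;
# "cat" is rarer across the corpus, so it scores higher in document 0
print(tf_idf("cat", documents[0], documents) > tf_idf("the", documents[0], documents))  # True
```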
45
What is **LDA or Latent Dirichlet allocation**?
LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts (i.e., documents).
46
What is word embedding?
The process of word-to-vector mapping
47
**word-to-vector** mapping is also called?
word embedding
48
If you would like to **visualize** the topic model results, you could use…?
**word2vec**: it is a great technique that can map out your topic model results spatially as vectors so that similarly used words are closer together.
49
How is the **Levenshtein distance** calculated?
The distance is calculated as the minimum number of insertions, deletions, and substitutions that would need to occur for one word to become another.
50
Define: **Levenshtein distance**
The minimal edit distance between two words.
51
What is phonetic similarity?
How much words or phrases **sound** the *same*.
52
Define: **Lexical similarity**
the degree to which texts use the same vocabulary and phrases
53
Define: **Semantic similarity**
the degree to which documents contain similar meaning or topics
54
Addressing ________ _________ - including spelling correction - is a major challenge within natural language processing
**Text similarity**
55
What is it called when documents/text contain similar meaning or topics?
**Semantic similarity**
56
What is it called when documents/texts use the same vocabulary and phrases to a high degree?
**Lexical similarity**
57
How would I import a tool to measure the Levenshtein distance?
`from nltk.metrics import edit_distance`
58
What Python module has a built-in function to check the Levenshtein distance?
nltk
59
What is the application of NLP concerned with predicting text given preceding text?
Language prediction
60
What is the first step to language prediction?
Picking your language model
61
**Bag of words** alone is generally ...?
not a great model for language prediction.
62
W.r.t. language prediction, if you go the n-gram route, you will most likely pick what model?
**Markov chains**
63
Define the language model: **Markov chains**
A model that predicts the statistical likelihood of each following word (or character) based on the training corpus. Markov chains are memory-less and make statistical predictions based entirely on the current n-gram on hand.
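A toy, memory-less Markov chain sketch over a made-up corpus (unigram context; real models would use larger n-grams and far larger corpora):

```python
import random
from collections import defaultdict

corpus = "i love nlp and i love python".split()

# Map each word to the list of words observed to follow it
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

# Predict the next word from the current word alone (memory-less)
print(random.choice(transitions["i"]))  # "love" (both observed followers are "love")
```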
64
What is a supervised machine learning algorithm that leverages a **probabilistic** theorem to make predictions and classifications?
Naive Bayes classifiers
65
Define: **sentiment analysis**
Determining whether a given block of language expresses negative or positive feelings.
66
Text preprocessing is a stage of ....?
NLP focused on cleaning and preparing text for other NLP tasks
67
Parsing is an ....?
**NLP technique** concerned with breaking up text based on syntax
68
What are two python libraries that can handle syntax parsing?
spaCy & NLTK
69
What are common text preprocessing steps?
Noise removal, tokenization, and normalization (including stemming, lemmatization, and stopword removal)
70
Tokenization will... ?
break multi-word strings into smaller components
71
Normalization is a ....?
A catch-all term for processing text data; this includes stemming and lemmatization
72
Noise removal is when we...?
remove unnecessary characters and formatting
73
Stemming is....?
A text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes)
74
Lemmatization is a ....?
A text preprocessing normalization task concerned with bringing words down to their root forms. (See: https://www.codecademy.com/learn/paths/data-science-nlp/tracks/dsnlp-text-preprocessing/modules/nlp-text-preprocessing/cheatsheet)
75
Stopword Removal is the process of ....?
removing words from a string that don't provide any information about the tone of a statement. (See: https://www.codecademy.com/learn/paths/data-science-nlp/tracks/dsnlp-text-preprocessing/modules/nlp-text-preprocessing/cheatsheet)
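A minimal stopword-removal sketch; the `stop_words` set here is a tiny hand-rolled list, whereas `nltk.corpus.stopwords` provides a full one:

```python
# Tiny hand-rolled stopword list; nltk.corpus.stopwords provides a full one
stop_words = {"a", "an", "the", "is", "are", "of", "in", "and"}

tokens = "the quick brown fox is in the garden".split()
filtered = [token for token in tokens if token not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'garden']
```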
76
Using part-of-speech can ...?
improve the results of lemmatization
77
What are two common Python libraries used in text preprocessing?
NLTK and re
78
`_________` is a technique that developers use in a variety of domains
**Text cleaning**
79
When you are **text cleaning** you may want to remove unwanted info such as: 1. ` ______??________` 2. Special characters 3. Numeric digits 4. Leading, ending, and vertical whitespace 5. HTML formatting
Punctuation and accents
80
When you are **text cleaning** you may want to remove unwanted info such as: 1. Punctuation and accents 2. ` ______??________` 3. Numeric digits 4. Leading, ending, and vertical whitespace 5. HTML formatting
Special Characters
81
When you are **text cleaning** you may want to remove unwanted info such as: 1. Punctuation and accents 2. Special characters 3. ---??--- 4. Leading, ending, and vertical whitespace 5. HTML formatting
Numeric digits
82
When you are **text cleaning** you may want to remove unwanted info such as: 1. Punctuation and accents 2. Special characters 3. Numeric digits 4. ---??--- 5. HTML formatting
Leading, ending, and vertical whitespace
83
When you are **text cleaning** you may want to *remove unwanted info* such as: 1. Punctuation and accents 2. Special characters 3. Numeric digits 4. Leading, ending, and vertical whitespace 5. ---??---
HTML formatting
84
The type of noise you need to remove from text usually depends on the ....?
The source (e.g., a marketing journal vs. a medical journal)
85
You can use the `_____ ` method in Python's regular expression library for most of your noise removal needs.
`.sub()`
86
The `.sub()` method has three required arguments: 1. ---?--- 2. `replacement_text` – text that replaces all matches in the input string 3. `input` – the input string that will be edited by the `.sub()` method
`pattern` – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters.
87
The `.sub()` method has three required arguments: 1. `pattern` – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters. 2. ---?--- 3. `input` – the input string that will be edited by the `.sub()` method
`replacement_text` – text that replaces all matches in the input string
88
The `.sub()` method has three required arguments: 1. `pattern` – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters. 2. `replacement_text` – text that replaces all matches in the input string 3. ---?---
`input` – the input string that will be edited by the `.sub()` method
89
The method `.sub()` returns a ....?
a `string` with all instances of the `pattern` replaced by the `replacement_text`.
90
How could you remove the HTML tags `<p>` and `</p>` from a string?
`import re`
`text = "<p>This is a paragraph</p>"`
`result = re.sub(r'<.?p>', '', text)`
`print(result)  # This is a paragraph`
91
What is it common practice to replace HTML tags with?
empty string `''`