NLP _ 01 Flashcards

(92 cards)

1
Q

What is most likely the first step of NLP?

A

Text preprocessing

2
Q

What is noise removal?

A

Stripping text of formatting (e.g., HTML tags).
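In Python this kind of noise removal is commonly done with the `re` module; a minimal sketch, using a made-up HTML string:

```python
import re

# Made-up raw text containing HTML formatting (noise)
raw = "<p>So many <b>squids</b> are jumping!</p>"

# Strip anything that looks like an HTML tag
clean = re.sub(r"<[^>]+>", "", raw)
print(clean)  # So many squids are jumping!
```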

3
Q

What is Tokenization?


A

Breaking text into individual words (tokens)

4
Q

What is normalization?

A

Cleaning text data in any way other than noise removal and tokenization

5
Q

What is stemming?

A

It is a blunt axe used to chop off word prefixes and suffixes.

6
Q

What is lemmatization?


A

It is a scalpel that brings words down to their root forms.

7
Q

What would I import to use regex?

A

import re

8
Q

What Python package could I use for NLP?

A

import nltk

9
Q

What method of nltk would I use to tokenize text?

A

from nltk.tokenize import word_tokenize

10
Q

Give an example of a list comprehension:

A

lemmatized = [lemmatizer.lemmatize(token) for token in tokenized]

11
Q

How would you import WordNetLemmatizer?

A

from nltk.stem import WordNetLemmatizer

12
Q

How would you import PorterStemmer?

A

from nltk.stem import PorterStemmer

13
Q

By default, lemmatize() treats every word as a…?

A

Noun

14
Q

Language models are probabilistic machine models of …?

A

language used for NLP comprehension tasks

15
Q

Language models learn a …?

A

probability of word occurrence over a sequence of words and use it to estimate the relative likelihood of different phrases.

16
Q

Common language models include:

A

Statistical models:
- bag of words (unigram model)
- n-gram models
Neural Language Modeling (NLM)

17
Q

What is text similarity in NLP?

A

Text similarity is a facet of NLP concerned with the similarity between texts.

18
Q

What are two popular text similarity metrics?

A
  • Levenshtein distance
  • Cosine similarity
19
Q

How would you describe the metric: Levenshtein distance?

A

It is defined as the minimum number of edit operations (deletions, insertions, or substitutions) required to transform one text into another.
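nltk's `edit_distance` (see card 57) computes this for you; here is a from-scratch dynamic-programming sketch of the same idea:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to transform string a into string b."""
    # prev[j] holds the edit distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```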

20
Q

Define the metric: Cosine similarity

A

It is defined as the cosine of the angle between two vectors. To determine the cosine similarity, text documents need to be converted into vectors.
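A minimal sketch of cosine similarity over word-count vectors, using only the standard library (real pipelines would typically vectorize with sklearn or gensim):

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    vocab = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocab)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat", "the cat ran"))  # ~0.667
```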

21
Q

What are common forms of language prediction?

A
  • **Auto-suggest** and suggested replies
22
Q

Natural Language processing is concerned with …?

A

enabling computers to interpret, analyze, and **approximate** the generation of human speech.

23
Q

What is parsing w.r.t. NLP?

A

It is the process concerned with segmenting text based on syntax.

24
What is Part-of-Speech (POS) tagging?
It identifies parts of speech (**verbs**, **nouns**, **adjectives**, etc.).
25
What helps computers understand the relationship between the words in a sentence?
A dependency grammar tree
26
What does a Dependency grammar tree help you understand?
The relationship between the words in a sentence.
27
What does NER stand for?
Named entity recognition
28
What does NER help identify?
Proper nouns (e.g., "Natalia" or "Berlin") in a text. This can be a clue to figure out the topic of the text.
29
When you have ____ coupled with POS tagging you can identify specific phrase chunks
Regex parsing
30
When you couple Regex parsing and POS tagging you can...?
Identify specific phrase chunks.
31
A very common unigram model, a statistical language model, is commonly known as…?
The bag-of-words model
32
Bag-of-words can be an excellent way of looking at language when you want to make predictions concerning…?
The topic or sentiment of a text. When grammar and word order are irrelevant, this is a good model.
33
what would I import to get word counts for the bag of words model?
`from collections import Counter`
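A minimal bag-of-words sketch using `Counter`; the sentence is a made-up example:

```python
from collections import Counter

# Tokenize a toy sentence, then count word occurrences
tokens = "the cat sat on the mat".split()
bag_of_words = Counter(tokens)
print(bag_of_words["the"])  # 2
print(bag_of_words["cat"])  # 1
```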
34
How would I import a part-of-speech function for lemmatization?
`from part_of_speech import get_part_of_speech`
35
For parsing entire phrases or conducting language prediction , you will want a model that .....?
pays attention to each word's neighbors.
36
Unlike bag-of-words, the **n-gram model** considers a ....?
....sequence of some **number (n)** of units and calculates the probability of each unit in a body of language given the preceding sequence of **length n**. Because of this, n-gram probabilities with larger n values can be impressive at language prediction.
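A minimal sketch of extracting and counting n-grams (here bigrams) from a toy token list:

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of length n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i love nlp and i love python".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("i", "love")])  # 2
```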
37
What tactic can help with adjusting probabilities for unknown words but is not always ideal?
Language smoothing
38
What is **Language smoothing**?
A tactic that can help adjust probabilities for unknown words, *but it isn't always ideal*
39
For a model that more accurately predicts human language patterns, you want `n` (your sequence length) ...?
....to be as **large** as **possible**.
40
What happens if you make your **n-grams** too long?
The number of examples to train on shrinks, and you won't have enough data to train on.
41
What are the common **Neural language models (NLMs)**?
1. LSTMs 2. Transformer models
42
What is **Topic Modeling**?
It is an area of NLP dedicated to uncovering latent, or hidden, topics within a body of language.
43
A common *technique* is to *deprioritize* the most common words and prioritize less frequently used terms as topics in a process known as…?
...**term frequency-inverse document frequency** (tf-idf)
44
What libraries in Python have modules to handle **tf-idf**?
gensim and sklearn
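Before reaching for gensim or sklearn, the tf-idf idea can be sketched from scratch; the mini corpus and the helper `tf_idf` below are made-up illustrations, not library APIs:

```python
import math

# Made-up mini corpus: each document is a list of tokens
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

def tf_idf(term, doc, docs):
    """Term frequency in doc, weighted by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)  # documents containing the term
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in most documents, so it is deprioritized;
# "cat" is rarer across the corpus, so it scores higher in document 0
print(tf_idf("cat", documents[0], documents) > tf_idf("the", documents[0], documents))  # True
```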
45
What is **LDA or Latent Dirichlet allocation**?
LDA is a statistical model that takes your documents and determines which words keep popping up together in the same contexts (i.e., documents).
46
What is word embedding?
The process of word-to-vector mapping
47
**word-to-vector** mapping is also called?
word embedding
48
If you would like to **visualize** the topic model results, you could use…?
**word2vec**: it is a great technique that can map out your topic model results spatially as vectors so that similarly used words are closer together.
49
How is the **Levenshtein distance** calculated?
The distance is calculated as the minimum number of insertions, deletions, and substitutions that would need to occur for one word to become another.
50
Define: **Levenshtein distance**
The minimal edit distance between two words.
51
What is phonetic similarity?
How much words or phrases **sound** the *same*.
52
Define: **Lexical similarity**
the degree to which texts use the same vocabulary and phrases
53
Define: **Semantic similarity**
the degree to which documents contain similar meaning or topics
54
Addressing ________ _________ - including spelling correction - is a major challenge within natural language processing
**Text similarity**
55
What is it called when documents/text contain similar meaning or topics?
**Semantic similarity**
56
What is it called when documents/texts use the same vocabulary and phrases to a high degree?
**Lexical similarity**
57
How would I import a tool to measure the Levenshtein distance?
`from nltk.metrics import edit_distance`
58
What Python module has a built-in function to check the Levenshtein distance?
nltk
59
What is the application of NLP concerned with predicting text given preceding text?
Language prediction
60
What is the first step to language prediction?
Picking your language model
61
**Bag of words** alone is generally ...?
not a great model for language prediction.
62
W.r.t. language prediction, if you go the n-gram route, you will most likely pick what model?
**Markov chains**
63
Define the language model: **Markov chains**
A model that predicts the statistical likelihood of each following word (or character) based on the training corpus. Markov chains are memory-less and make statistical predictions based entirely on the current n-gram on hand.
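A toy, memory-less Markov chain sketch over a made-up corpus (unigram context; real models would use larger n-grams and far larger corpora):

```python
import random
from collections import defaultdict

corpus = "i love nlp and i love python".split()

# Map each word to the list of words observed to follow it
transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

# Predict the next word from the current word alone (memory-less)
print(random.choice(transitions["i"]))  # "love" (both observed followers are "love")
```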
64
What is a supervised machine learning algorithm that leverages a **probabilistic** theorem to make predictions and classifications?
Naive Bayes classifiers
65
Define: **sentiment analysis**
Determining whether a given block of language expresses negative or positive feelings.
66
Text preprocessing is a stage of ....?
NLP focused on cleaning and preparing text for other NLP tasks
67
Parsing is an ....?
**NLP technique** concerned with breaking up text based on syntax
68
What are two python libraries that can handle syntax parsing?
spaCy & NLTK
69
What are common text preprocessing steps?
Noise removal, tokenization, and normalization (including stemming, lemmatization, and stopword removal)
70
Tokenization will... ?
break multi-word strings into smaller components
71
Normalization is a ....?
A catch-all term for processing text data; this includes stemming and lemmatization
72
Noise removal is when we...?
remove unnecessary characters and formatting
73
Stemming is....?
A text preprocessing normalization task concerned with bluntly removing word affixes (prefixes and suffixes)
74
Lemmatization is a ....?
A text preprocessing normalization task concerned with bringing words down to their root forms. (See: https://www.codecademy.com/learn/paths/data-science-nlp/tracks/dsnlp-text-preprocessing/modules/nlp-text-preprocessing/cheatsheet)
75
Stopword Removal is the process of ....?
removing words from a string that don't provide any information about the tone of a statement. (See: https://www.codecademy.com/learn/paths/data-science-nlp/tracks/dsnlp-text-preprocessing/modules/nlp-text-preprocessing/cheatsheet)
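A minimal stopword-removal sketch; the `stop_words` set here is a tiny hand-rolled list, whereas `nltk.corpus.stopwords` provides a full one:

```python
# Tiny hand-rolled stopword list; nltk.corpus.stopwords provides a full one
stop_words = {"a", "an", "the", "is", "are", "of", "in", "and"}

tokens = "the quick brown fox is in the garden".split()
filtered = [token for token in tokens if token not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'garden']
```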
76
Using part-of-speech can ...?
improve the results of lemmatization
77
What are two common Python libraries used in text preprocessing?
NLTK and re
78
`_________` is a technique that developers use in a variety of domains
**Text cleaning**
79
When you are **text cleaning** you may want to remove unwanted info such as: 1. ` ______??________` 2. Special characters 3. Numeric digits 4. Leading, ending, and vertical whitespace 5. HTML formatting
Punctuation and accents
80
When you are **text cleaning** you may want to remove unwanted info such as: 1. Punctuation and accents 2. ` ______??________` 3. Numeric digits 4. Leading, ending, and vertical whitespace 5. HTML formatting
Special Characters
81
When you are **text cleaning** you may want to remove unwanted info such as: 1. Punctuation and accents 2. Special characters 3. ---??--- 4. Leading, ending, and vertical whitespace 5. HTML formatting
Numeric digits
82
When you are **text cleaning** you may want to remove unwanted info such as: 1. Punctuation and accents 2. Special characters 3. Numeric digits 4. ---??--- 5. HTML formatting
Leading, ending, and vertical whitespace
83
When you are **text cleaning** you may want to *remove unwanted info* such as: 1. Punctuation and accents 2. Special characters 3. Numeric digits 4. Leading, ending, and vertical whitespace 5. ---??---
HTML formatting
84
The type of noise you need to remove from text usually depends on the ....?
The source (e.g., a marketing journal vs. a medical journal)
85
You can use the `_____ ` method in Python's regular expression library for most of your noise removal needs.
`.sub()`
86
The `.sub()` method has three required arguments: 1. ---?--- 2. `replacement_text` – text that replaces all matches in the input string 3. `input` – the input string that will be edited by the `.sub()` method
`pattern` – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters.
87
The `.sub()` method has three required arguments: 1. `pattern` – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters. 2. ---?--- 3. `input` – the input string that will be edited by the `.sub()` method
`replacement_text` – text that replaces all matches in the input string
88
The `.sub()` method has three required arguments: 1. `pattern` – a regular expression that is searched for in the input string. There must be an r preceding the string to indicate it is a raw string, which treats backslashes as literal characters. 2. `replacement_text` – text that replaces all matches in the input string 3. ---?---
`input` – the input string that will be edited by the `.sub()` method
89
The method `.sub()` returns a ....?
a `string` with all instances of the `pattern` replaced by the `replacement_text`.
90
How could you remove the HTML tags `<p>` and `</p>` from a string?
`import re`
`text = "<p>This is a paragraph</p>"`
`result = re.sub(r'<.?p>', '', text)`
`print(result)  # This is a paragraph`
91
What is it common practice to replace HTML tags with?
empty string `''`