Natural Language Processing Flashcards

1
Q

Pipeline of natural language processing

A

1) Text processing
2) Feature extraction
3) Modeling

2
Q

Text processing

A
  • When reading HTML, strip out the tags
  • Convert all letters to lowercase
  • It is sometimes a good idea to remove punctuation
  • It is sometimes a good idea to remove stop words like “are”, “for”, “a”, “the”
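The bullets above can be sketched in plain Python; this is a minimal cleaning function, assuming the raw text is already loaded into a string (the function name is illustrative):

```python
import re

def clean_text(raw):
    """Lowercase text and strip punctuation, keeping letters, digits, spaces."""
    text = raw.lower()                         # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("Hello, World! It's 2024."))  # hello world it s 2024
```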
3
Q

Feature extraction

A
  • Letters in Unicode are represented by numbers, but treating them as plain numbers can mislead the models
  • There are many ways to represent text information
  • If you want a graph-based model to discover insights, represent words as nodes with relations between them
  • If you want to recognize spam or classify text sentiment, use bag of words
  • For text generation or translation, use word2vec-style word embeddings
4
Q

Modeling

A
  • Create a statistical or machine learning model
5
Q

How to read a file in Python

A
with open("hola.txt", "r") as f:
    text = f.read()
6
Q

How to read tabular data or CSV files

A
  • You can use pandas:
    import pandas as pd
    df = pd.read_csv("hola.csv")
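As a self-contained sketch, the same call works on an in-memory buffer, which avoids needing a hola.csv file on disk (the column names here are made up):

```python
import io
import pandas as pd

# Simulate a small CSV file in memory (io.StringIO behaves like an open file)
csv_data = io.StringIO("name,score\nana,10\nluis,7\n")
df = pd.read_csv(csv_data)

print(df.shape)           # (2, 2)
print(list(df.columns))   # ['name', 'score']
```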

7
Q

How to fetch a website or a file from the web?

A
import requests
# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
8
Q

How to clean the text from a website?

A
  • Use the BeautifulSoup library
    from bs4 import BeautifulSoup
    # Remove HTML tags using the Beautiful Soup library
    soup = BeautifulSoup(r.text, "html5lib")
    print(soup.get_text())
9
Q

Tips for text cleaning

A
  • Convert all letters to lowercase
  • For document classification or clustering, remove punctuation

10
Q

How to eliminate punctuation

A
  • Use the regular expression library re (import re)
  • Replace punctuation characters with a space
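A minimal sketch of the bullets above with re.sub (the \W pattern treats every non-word character, including spaces, as a match):

```python
import re

text = "Hey! Is this... punctuation-free, yet?"
# \W matches any character that is not a letter, digit, or underscore
no_punct = re.sub(r"\W", " ", text)
print(no_punct)
```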
11
Q

Useful libraries

A
  • NLTK
  • BeautifulSoup
  • re
12
Q

What is a token?

A
  • Something that represents a single unit of meaning, such as the word “dog”
13
Q

Tokenization with NLTK

A
  • Tokenize words with word_tokenize; it is smarter than str.split
  • Tokenize sentences with sent_tokenize
  • Use TweetTokenizer to tokenize tweets
14
Q

Stop word removal

A
  • Remove words like “are”, “the” that add little extra information
  • nltk ships lists of stop words (nltk.corpus.stopwords)
  • [word for word in querywords if word.lower() not in stopwords]
15
Q

Part-of-speech tagging

A
  • It is helpful in some applications to classify words as verbs, nouns, etc.
  • Use NLTK pos_tag
16
Q

Named entity recognition

A
  • Classify a noun by the type of entity it names: person, organization, geopolitical entity, etc.
  • Use NLTK ne_chunk
17
Q

Stemming and Lemmatization

A
  • Methods to reduce variations of a word to a stem or root word. Example: Started → Start
  • In stemming the result is not always a real word, but it is more efficient
  • NLTK has a stemming method, PorterStemmer.stem()
  • Lemmatization is more computationally expensive because it uses a dictionary, but the resulting word is a real word
  • NLTK has a lemmatization method, WordNetLemmatizer.lemmatize(), which by default treats words as nouns
18
Q

Lesson summary

A

1) Normalize
2) Tokenize
3) Remove stop words
4) Stem / Lemmatize
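The four steps above can be chained into one function. This is a sketch: the stop word set is a tiny hard-coded stand-in for NLTK's full list, and tokenization is a simple regex split:

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "is", "are", "for"}  # illustrative subset
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"\W", " ", text.lower())              # 1) normalize
    tokens = text.split()                                # 2) tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3) remove stop words
    return [stemmer.stem(t) for t in tokens]             # 4) stem

print(preprocess("The dogs are barking!"))   # ['dog', 'bark']
```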

19
Q

Bag of words

A
  • Interpret each document as an unordered collection of words
  • To compare how similar two bags of words are, use the dot product and cosine similarity
  • Each document becomes a vector of word frequencies
  • To compare documents, build a matrix: rows are documents, columns are word frequencies
  • Every word carries the same importance (a limitation)
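A sketch of the document-as-frequency-vector idea with cosine similarity, using only the standard library:

```python
import math
from collections import Counter

def bow(text):
    """Bag of words: word -> frequency, order discarded."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)   # only shared words contribute
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = bow("the cat sat on the mat")
d2 = bow("the cat ate the mouse")
print(round(cosine_similarity(d1, d2), 3))   # 0.668
```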
20
Q

TF-IDF

A
  • Highlight words that are more unique to a document

  • tfidf(t, d) = tf(t, d) · idf(t) = count(d, t)/|d| · log(|D| / |{d ∈ D : t ∈ d}|)
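The formula computed directly on a toy corpus: term frequency is count(d, t)/|d|, and idf is the log of the corpus size over the number of documents containing the term:

```python
import math

corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat and the dog".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # documents containing the term
    return tf * math.log(len(docs) / df)

# 'the' appears in every document -> idf = log(1) = 0, so it is not highlighted
print(tfidf("the", corpus[0], corpus))             # 0.0
print(round(tfidf("sat", corpus[0], corpus), 3))   # unique to doc 0 -> high
```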

21
Q

Word2Vec

A
  • The idea is a model that can predict a word given its neighboring words (continuous bag of words, CBOW) or, given a word, predict its neighbors (continuous skip-gram model)
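As a sketch of what the skip-gram variant trains on (not the model itself): for each center word, emit (center, context) pairs within a window:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["i", "like", "nlp"], window=1))
# [('i', 'like'), ('like', 'i'), ('like', 'nlp'), ('nlp', 'like')]
```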
22
Q

GloVe

A
  • Uses a co-occurrence probability matrix of the words in a corpus.
  • P(Water | Ice) = 0.2
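A sketch of where probabilities like P(Water | Ice) come from. Real GloVe weights co-occurrences by distance and fits word vectors to the log counts; this only builds the raw counts and normalizes them:

```python
from collections import Counter

def cooccurrence(tokens, window=1):
    """Count how often each ordered word pair co-occurs within the window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def cooccur_prob(counts, word, context):
    """P(context | word) estimated from the co-occurrence counts."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    return counts[(word, context)] / total

counts = cooccurrence("ice is cold water is wet".split())
print(cooccur_prob(counts, "is", "cold"))   # 0.25
```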
23
Q

Language model

A

A language model captures the distributional statistics of words. In its most basic form, we take each unique word i in a corpus and count how many times it occurs.
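In that most basic (unigram) form, the model is just a word count, which can be normalized into probabilities:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)                      # count of each unique word
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}
print(probs["the"])   # 2 occurrences out of 6 words
```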

24
Q

Bigram Model

A
  • A matrix over a corpus that gives the probability of a word occurring given the previous word
  • It is used for generating text
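A sketch of both bullets: estimate P(next | previous) from bigram counts, then generate text. Real generators sample from the distribution; this one is made deterministic by always taking the most likely next word:

```python
from collections import Counter, defaultdict

corpus = "i like nlp and i like dogs and i like nlp".split()

# Count how often each word follows each previous word
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def prob(prev, nxt):
    return transitions[prev][nxt] / sum(transitions[prev].values())

print(prob("i", "like"))     # 1.0: "i" is always followed by "like"
print(prob("like", "nlp"))   # 2/3

# Greedy generation: always pick the most frequent successor
word, out = "i", ["i"]
for _ in range(3):
    word = transitions[word].most_common(1)[0][0]
    out.append(word)
print(" ".join(out))   # i like nlp and
```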
25
Q

Regex to eliminate punctuation

A
  • “\W” matches non-word characters (anything except letters, digits, and underscore); replace matches with a space, e.g. re.sub(r"\W", " ", text)