Natural Language Processing Flashcards

1
Q

Pipeline of natural language processing

A

1) Text processing
2) Feature extraction
3) Modeling

2
Q

Text processing

A
  • When reading HTML, strip out the tags
  • Convert all letters to lowercase
  • It is sometimes a good idea to remove punctuation
  • It is sometimes a good idea to remove stop words like “are”, “for”, “a”, “the”
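The bullets above can be sketched in plain Python; this is a minimal cleaning function, assuming the raw text is already loaded into a string (the function name is illustrative):

```python
import re

def clean_text(raw):
    """Lowercase text and strip punctuation, keeping letters, digits, spaces."""
    text = raw.lower()                         # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("Hello, World! It's 2024."))  # hello world it s 2024
```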
3
Q

Feature extraction

A
  • Letters in Unicode are represented by numbers, but treating them as plain numbers can mislead the models
  • There are many ways to represent text information
  • If you want a graph-based model to discover insights, represent words as nodes with relations between them
  • If you want to recognize spam or classify text sentiment, use bag of words
  • For text generation or translation, use word2vec-style word embeddings
4
Q

Modeling

A
  • Create a statistical or machine learning model
5
Q

How to read a file in Python

A
with open("hola.txt", "r") as f:
    text = f.read()
6
Q

How to read tabular data or CSV files

A
  • You can use pandas:
    import pandas as pd
    df = pd.read_csv("hola.csv")
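As a self-contained sketch, the same call works on an in-memory buffer, which avoids needing a hola.csv file on disk (the column names here are made up):

```python
import io
import pandas as pd

# Simulate a small CSV file in memory (io.StringIO behaves like an open file)
csv_data = io.StringIO("name,score\nana,10\nluis,7\n")
df = pd.read_csv(csv_data)

print(df.shape)           # (2, 2)
print(list(df.columns))   # ['name', 'score']
```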

7
Q

How to fetch a website or a file from the web?

A
import requests
# Fetch a web page
r = requests.get("https://www.udacity.com/courses/all")
8
Q

How to clean the text from a website?

A
  • Use the BeautifulSoup library
    from bs4 import BeautifulSoup
    # Remove HTML tags using the Beautiful Soup library
    soup = BeautifulSoup(r.text, "html5lib")
    print(soup.get_text())
9
Q

Tips for text cleaning

A
  • Convert all letters to lowercase
  • For document classification or clustering, remove punctuation

10
Q

How to eliminate punctuation

A
  • Use the regular expression library re (import re)
  • Replace punctuation characters with a space
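A minimal sketch of the bullets above with re.sub (the \W pattern treats every non-word character, including spaces, as a match):

```python
import re

text = "Hey! Is this... punctuation-free, yet?"
# \W matches any character that is not a letter, digit, or underscore
no_punct = re.sub(r"\W", " ", text)
print(no_punct)
```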
11
Q

Useful libraries

A
  • NLTK
  • BeautifulSoup
  • re
12
Q

What is a token?

A
  • Something that represents a single unit of meaning, such as the word “dog”
13
Q

Tokenization with NLTK

A
  • Tokenize words with word_tokenize; it is smarter than str.split
  • Tokenize sentences with sent_tokenize
  • Use TweetTokenizer to tokenize tweets
14
Q

Stop word removal

A
  • Remove words like “are”, “the” that add little extra information
  • nltk ships lists of stop words (nltk.corpus.stopwords)
  • [word for word in querywords if word.lower() not in stopwords]
15
Q

Part-of-speech tagging

A
  • It is helpful in some applications to classify words as verbs, nouns, etc.
  • Use NLTK pos_tag
16
Q

Named entity recognition

A
  • Classify a noun by the type of entity it names: person, organization, geopolitical entity, etc.
  • Use NLTK ne_chunk
17
Q

Stemming and Lemmatization

A
  • Methods to reduce variations of a word to a stem or root word. Example: Started → Start
  • In stemming the result is not always a real word, but it is more efficient
  • NLTK has a stemming method, PorterStemmer.stem()
  • Lemmatization is more computationally expensive because it uses a dictionary, but the resulting word is a real word
  • NLTK has a lemmatization method, WordNetLemmatizer.lemmatize(), which by default treats words as nouns
18
Q

Lesson summary

A

1) Normalize
2) Tokenize
3) Remove stop words
4) Stem / Lemmatize
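The four steps above can be chained into one function. This is a sketch: the stop word set is a tiny hard-coded stand-in for NLTK's full list, and tokenization is a simple regex split:

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "is", "are", "for"}  # illustrative subset
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"\W", " ", text.lower())              # 1) normalize
    tokens = text.split()                                # 2) tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3) remove stop words
    return [stemmer.stem(t) for t in tokens]             # 4) stem

print(preprocess("The dogs are barking!"))   # ['dog', 'bark']
```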

19
Q

Bag of words

A
  • Interpret each document as an unordered collection of words
  • To compare how similar two bags of words are, use the dot product and cosine similarity
  • Each document becomes a vector of word frequencies
  • To compare documents, build a matrix: rows are documents, columns are word frequencies
  • Every word carries the same importance (a limitation)
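A sketch of the document-as-frequency-vector idea with cosine similarity, using only the standard library:

```python
import math
from collections import Counter

def bow(text):
    """Bag of words: word -> frequency, order discarded."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)   # only shared words contribute
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = bow("the cat sat on the mat")
d2 = bow("the cat ate the mouse")
print(round(cosine_similarity(d1, d2), 3))   # 0.668
```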
20
Q

TF-IDF

A
  • Highlight words that are more unique to a document

  • tfidf(t, d) = tf(t, d) · idf(t) = count(d, t)/|d| · log(|D| / |{d ∈ D : t ∈ d}|)
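The formula computed directly on a toy corpus: term frequency is count(d, t)/|d|, and idf is the log of the corpus size over the number of documents containing the term:

```python
import math

corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat and the dog".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)   # documents containing the term
    return tf * math.log(len(docs) / df)

# 'the' appears in every document -> idf = log(1) = 0, so it is not highlighted
print(tfidf("the", corpus[0], corpus))             # 0.0
print(round(tfidf("sat", corpus[0], corpus), 3))   # unique to doc 0 -> high
```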

21
Q

Word2Vec

A
  • The idea is a model that can predict a word given its neighboring words (continuous bag of words, CBOW) or, given a word, predict its neighbors (continuous skip-gram model)
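As a sketch of what the skip-gram variant trains on (not the model itself): for each center word, emit (center, context) pairs within a window:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["i", "like", "nlp"], window=1))
# [('i', 'like'), ('like', 'i'), ('like', 'nlp'), ('nlp', 'like')]
```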
22
Q

GloVe

A
  • Uses a co-occurrence probability matrix of the words in a corpus.
  • P(Water | Ice) = 0.2
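A sketch of where probabilities like P(Water | Ice) come from. Real GloVe weights co-occurrences by distance and fits word vectors to the log counts; this only builds the raw counts and normalizes them:

```python
from collections import Counter

def cooccurrence(tokens, window=1):
    """Count how often each ordered word pair co-occurs within the window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def cooccur_prob(counts, word, context):
    """P(context | word) estimated from the co-occurrence counts."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    return counts[(word, context)] / total

counts = cooccurrence("ice is cold water is wet".split())
print(cooccur_prob(counts, "is", "cold"))   # 0.25
```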
23
Q

Language model

A

A language model captures the distributional statistics of words. In its most basic form, we take each unique word i in a corpus and count how many times it occurs.
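In that most basic (unigram) form, the model is just a word count, which can be normalized into probabilities:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)                      # count of each unique word
total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}
print(probs["the"])   # 2 occurrences out of 6 words
```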

24
Q

Bigram Model

A
  • A matrix over a corpus that gives the probability of a word occurring given the previous word
  • It is used for generating text
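A sketch of both bullets: estimate P(next | previous) from bigram counts, then generate text. Real generators sample from the distribution; this one is made deterministic by always taking the most likely next word:

```python
from collections import Counter, defaultdict

corpus = "i like nlp and i like dogs and i like nlp".split()

# Count how often each word follows each previous word
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def prob(prev, nxt):
    return transitions[prev][nxt] / sum(transitions[prev].values())

print(prob("i", "like"))     # 1.0: "i" is always followed by "like"
print(prob("like", "nlp"))   # 2/3

# Greedy generation: always pick the most frequent successor
word, out = "i", ["i"]
for _ in range(3):
    word = transitions[word].most_common(1)[0][0]
    out.append(word)
print(" ".join(out))   # i like nlp and
```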
25
Q

Regex to eliminate punctuation

A
  • “\W” matches non-word characters (anything except letters, digits, and underscore); replace matches with a space, e.g. re.sub(r"\W", " ", text)