Natural Language Processing Flashcards
(32 cards)
Text Processing
Deals with algorithms and techniques that extract valuable information from documents or texts written in a natural language.
What are the applications of Text Processing?
6
1) Social Media Text Mining
2) Text Classification
3) Recommender Systems
4) Conversational Agents
5) Document Summarization
6) Semantic web
What is the difference between structured and unstructured data?
unstructured data = data that lacks a predefined semantic structure (e.g., free text)
structured data = data organized under a fixed schema (e.g., a relational database)
Vector Space
Represents documents as vectors of features (terms) and their weights in an n-dimensional space
Term Frequency (TF)
The number of times a term occurs in a document
Inverse Document Frequency (IDF)
Measures how rare a term is across the collection: terms that appear in few documents get a high IDF, terms that appear in many get a low one.
idf(t, D) = log2(N / df(t)) = log2(total number of documents / number of documents containing the term)
Suppose you have 3 documents: D1 = "midterm 1 marks", D2 = "midterm 2 marks" and D3 = "total". The term frequencies in this example would be:
| | midterm | 1 | marks | 2 | total |
| D1 | 1 | 1 | 1 | 0 | 0 |
| D2 | 1 | 0 | 1 | 1 | 0 |
| D3 | 0 | 0 | 0 | 0 | 1 |
| Term Doc Freq | 2 | 1 | 2 | 1 | 1 |
IDF Calculations
N = total number of documents
df = number of documents in the collection that contain the term
IDF(midterm) = IDF(marks) = log2(3/2) = 0.585
IDF(1) = IDF(2) = IDF(total) = log2(3/1) = 1.585
TF.IDF
TF.IDF(D1, midterm) = TF.IDF(D2, midterm) = TF.IDF(D1, marks) = TF.IDF(D2, marks) = 1 × 0.585 = 0.585
TF.IDF(D1, 1) = TF.IDF(D2, 2) = TF.IDF(D3, total) = 1 × 1.585 = 1.585
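A minimal Python sketch of this computation (the document names and whitespace tokenization follow the example above):
```python
import math

docs = {
    "D1": ["midterm", "1", "marks"],
    "D2": ["midterm", "2", "marks"],
    "D3": ["total"],
}
N = len(docs)

# Document frequency: number of documents containing each term.
df = {}
for terms in docs.values():
    for term in set(terms):
        df[term] = df.get(term, 0) + 1

# IDF with the log base 2 used above.
idf = {term: math.log2(N / count) for term, count in df.items()}

for doc_id, terms in docs.items():
    for term in set(terms):
        tf = terms.count(term)  # raw term frequency
        print(doc_id, term, round(tf * idf[term], 3))  # e.g. D1 midterm 0.585
```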
Precision
Number of relevant documents retrieved by a search / total number of documents retrieved (a measure of result relevancy)
TP/(TP+FP)
Recall
Number of relevant documents retrieved by a search / total number of relevant documents in the collection (a measure of coverage)
TP/(TP + FN)
F-Score
2 × (Precision × Recall) / (Precision + Recall)
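A minimal sketch of all three metrics from confusion counts (the counts below are made-up values for illustration):
```python
tp, fp, fn = 8, 2, 4  # true positives, false positives, false negatives

precision = tp / (tp + fp)                                # 8 / 10 = 0.8
recall = tp / (tp + fn)                                   # 8 / 12 ≈ 0.667
f_score = 2 * precision * recall / (precision + recall)   # ≈ 0.727

print(precision, recall, f_score)
```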
Word Embeddings
Vector of numbers representing a specific word.
Ex: a TF-IDF word embedding would be based on the frequency of a word
Give an example of how word embeddings work for the sentences “I like apples” and “I love apples”.
1) Let the vocabulary be X = {I, like, love, apples}
2) Create a vector for each word. I = [1,0,0,0]’ ; like = [0,1,0,0]’ ; love = [0,0,1,0]’ ; apples = [0,0,0,1]’
The method above uses one-hot encoding, but it has a flaw: every pair of words is treated as equally different, even though “like” and “love” are similar in meaning.
Similar words should be close to each other in the vector space.
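A minimal sketch of the one-hot encoding described above:
```python
vocabulary = ["I", "like", "love", "apples"]

def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

print(one_hot("like"))  # [0, 1, 0, 0]
print(one_hot("love"))  # [0, 0, 1, 0]
# The two vectors are orthogonal, so "like" and "love" look unrelated --
# exactly the weakness noted above.
```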
What are the Word2Vec methods?
2
1) Common Bag of Words (CBOW)
2) Skip Gram
Explain the CBOW method.
CBOW (Continuous Bag of Words)
Goal: Predict the target word based on its context words.
Input: Context words (surrounding words).
Output: The target (center) word.
Example: Given the sentence: “The cat sits on the mat”
For the word “sits”, with window size 2, context = [“The”, “cat”, “on”, “the”]
CBOW tries to predict “sits” using those context words.
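A sketch of how (context, target) training pairs could be generated for CBOW, using the sentence and window size from the example (the pair-extraction helper here is illustrative, not a library API):
```python
sentence = "The cat sits on the mat".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # Up to `window` words on each side of the target.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[2])  # (['The', 'cat', 'on', 'the'], 'sits')
```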
Explain the skip gram method.
Skip-gram
Goal: Predict the context words from a target word.
Input: The target (center) word.
Output: Context words (surrounding words).
Example: Given the same sentence:
For the word “sits”, with window size 2, output = [“The”, “cat”, “on”, “the”]
Skip-gram uses “sits” to predict each of those surrounding words.
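The same window logic, sketched for skip-gram: the direction flips, emitting one (target, context_word) pair per surrounding word.
```python
sentence = "The cat sits on the mat".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for context_word in sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]:
        pairs.append((target, context_word))

print([p for p in pairs if p[0] == "sits"])
# [('sits', 'The'), ('sits', 'cat'), ('sits', 'on'), ('sits', 'the')]
```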
String matching
The process of comparing two or more strings using a specific set of rules
What are the applications of string matching?
4
1) Search engines
2) Spell Checkers
3) Profile matches
4) DNA sequence comparison
What are the types of string matching algorithms?
2
1) Exact string matching (Jennifer ≠ Jenifer) - given a pair of strings, every character of the first string must match the corresponding character of the second
2) Approximate string matching (Jennifer ≈ Jenifer) - given a pair of strings, the two strings only need to be similar enough
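A sketch of the difference, using equality for exact matching and the standard library's difflib similarity ratio as a stand-in for any string-similarity measure (the 0.85 threshold is an illustrative choice):
```python
from difflib import SequenceMatcher

a, b = "Jennifer", "Jenifer"

print(a == b)                        # False: exact match fails
ratio = SequenceMatcher(None, a, b).ratio()
print(ratio, ratio >= 0.85)          # ~0.93, True: similar enough
```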
What are the string matching algorithms?
4
1) Phonetic
2) Token-based
3) Pattern-matching
4) Hybrid
Token-based string matching
A type of string matching that compares tokens rather than characters, so “John Smith” = “Smith John” since both contain the tokens “John” and “Smith”
What are the different token-based string matching methods?
6
1) Jaccard Similarity
2) K-Shingling
3) Minhashing
4) Hamming Distance
5) Cohen TF.IDF
6) Monge and Elkan’s Atomic String
Explain the Jaccard Similarity method of token-based string matching.
The Jaccard similarity is a simple and intuitive way to measure the similarity between two sets. In the context of token matching, it compares the sets of tokens (e.g., words, characters, or other elements) from two text strings to see how much they overlap.
Formula:
Jaccard Similarity = |A ∩ B| / |A ∪ B|
Where:
A and B are sets of tokens from the two strings.
|A ∩ B| is the number of tokens common to both sets.
|A ∪ B| is the total number of unique tokens in both sets.
Example:
Let’s say we have two sentences:
String 1: “apple banana mango”
String 2: “banana mango peach”
Tokens:
Set A = {“apple”, “banana”, “mango”}
Set B = {“banana”, “mango”, “peach”}
Intersection: {“banana”, “mango”} → size = 2
Union: {“apple”, “banana”, “mango”, “peach”} → size = 4
Jaccard Similarity = 2 / 4 = 0.5
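A minimal sketch of this computation, tokenizing on whitespace as in the example:
```python
def jaccard(s1, s2):
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)   # |intersection| / |union|

print(jaccard("apple banana mango", "banana mango peach"))  # 0.5
```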
Explain the K-Shingles method of token-based string matching.
The k-shingles of a document are all the consecutive substrings of length k found in it.
Ex: For the string “abcdabd” with k = 2, the set of shingles is D = {ab, bc, cd, da, bd}.
When using this method, we often want to remove stop words like “and”, “you” and “to”.
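A minimal sketch reproducing the example above:
```python
def shingles(text, k):
    # All consecutive substrings of length k, as a set (duplicates collapse).
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}
```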
Explain the Minhashing method of token-based string matching.
MinHash estimates the Jaccard similarity between sets: for each hash function, a set’s signature value is the minimum hash of the rows in which the set has a 1.
Suppose that the universal set is {m,n,o,p,q}, numbered as rows 0-4.
Here S1 = {n,o,p}, S2 = {m,p}, S3 = {n,o,q} and S4 = {m,n,q}
Characteristic matrix, with the two hash functions evaluated per row:
| Row | S1 | S2 | S3 | S4 | (x+3) mod 5 | (2x+1) mod 5 |
| 0 (m) | 0 | 1 | 0 | 1 | 3 | 1 |
| 1 (n) | 1 | 0 | 1 | 1 | 4 | 3 |
| 2 (o) | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 (p) | 1 | 1 | 0 | 0 | 1 | 2 |
| 4 (q) | 0 | 0 | 1 | 1 | 2 | 4 |
h1(x) = (x+3) mod 5
h2(x) = (2x+1) mod 5
Signature Matrix
Step 1: Initialize all entries to infinity.
| | S1 | S2 | S3 | S4 |
| h1 | inf | inf | inf | inf |
| h2 | inf | inf | inf | inf |
Step 2: Row 0: S2 and S4 have 1s. h1(0) = 3 and h2(0) = 1 are less than infinity, so replace them.
| | S1 | S2 | S3 | S4 |
| h1 | inf | 3 | inf | 3 |
| h2 | inf | 1 | inf | 1 |
Step 3: Row 1: S1, S3 and S4 have 1s. h1(1) = 4 and h2(1) = 3 replace infinity in S1 and S3; S4 keeps its smaller values 3 and 1.
| | S1 | S2 | S3 | S4 |
| h1 | 4 | 3 | 4 | 3 |
| h2 | 3 | 1 | 3 | 1 |
Step 4: Row 2: S1 and S3 have 1s. h1(2) = 0 and h2(2) = 0 are smaller than the current entries, so replace them.
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 3 | 0 | 3 |
| h2 | 0 | 1 | 0 | 1 |
Step 5: Row 3: S1 and S2 have 1s. For S2, h1(3) = 1 replaces 3, while h2(3) = 2 does not beat the current 1; S1 keeps 0 and 0.
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 1 | 0 | 3 |
| h2 | 0 | 1 | 0 | 1 |
Step 6: Row 4: S3 and S4 have 1s. For S4, h1(4) = 2 replaces 3, while h2(4) = 4 does not beat the current 1; S3 keeps 0 and 0.
| | S1 | S2 | S3 | S4 |
| h1 | 0 | 1 | 0 | 2 |
| h2 | 0 | 1 | 0 | 1 |
Sets whose signature columns agree are estimated to be similar: S1 and S3 agree on both rows, reflecting their large overlap ({n, o}); with only two hash functions the estimate is coarse.
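A minimal Python sketch of the walkthrough above, with m through q numbered 0-4 as in the characteristic matrix:
```python
INF = float("inf")

sets = {
    "S1": {1, 2, 3},   # {n, o, p}
    "S2": {0, 3},      # {m, p}
    "S3": {1, 2, 4},   # {n, o, q}
    "S4": {0, 1, 4},   # {m, n, q}
}
hashes = [lambda x: (x + 3) % 5, lambda x: (2 * x + 1) % 5]

# Each signature entry is the minimum hash value over rows where the set has a 1.
signature = {name: [INF] * len(hashes) for name in sets}
for row in range(5):
    for name, members in sets.items():
        if row in members:
            for j, h in enumerate(hashes):
                signature[name][j] = min(signature[name][j], h(row))

print(signature)
# {'S1': [0, 0], 'S2': [1, 1], 'S3': [0, 0], 'S4': [2, 1]}
```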