Module 9 Flashcards

1
Q

What is NLP?

A

Natural Language Processing; it produces machine-driven analyses of text

2
Q

Why is NLP a hard problem?

A

Language is ambiguous; different people may interpret the same text differently

3
Q

Applications of NLP (amn)

A
  • automatic summarization
  • machine translation
  • named entity recognition
4
Q

What is a corpus?

A

a collection of written texts that serves as a dataset

5
Q

What are tokens and tokenization?

A

a token is a string of contiguous characters between two spaces; it can also be an integer, a real number, or a number with a colon (e.g., a time such as 2:00)

tokenization is the process of converting text into tokens
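
A minimal Python sketch (assumes NLTK and its "punkt" tokenizer data are available; any tokenizer would illustrate the idea):

```python
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer model

text = "The meeting starts at 2:00 and lasts 1.5 hours."
tokens = nltk.word_tokenize(text)
print(tokens)  # words, numbers, and punctuation become separate tokens
```
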

6
Q

What is text preprocessing, and what are its 3 steps?

A
text data is not analyzable without pre-processing
steps
- Noise removal
- Lexicon normalization
- Object standardization
7
Q

What is noise removal?

A

removal of all noisy entities from the text, i.e., anything not relevant to the data

8
Q

What are stopwords?

A

common words such as "is" and "am"

9
Q

What is a general approach to noise removal?

A
  • prepare a dictionary of noisy entities and iterate over the text object word by word, eliminating the words that appear in the noise dictionary
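
A minimal Python sketch of this approach (the noise dictionary here is illustrative, not a standard list):

```python
# prepared dictionary of noisy entities (stopwords, hashtags, etc.)
noise_dict = {"is", "a", "this", "the", "rt"}

def remove_noise(text):
    # iterate the text word by word, dropping words found in the dictionary
    words = text.split()
    return " ".join(w for w in words if w.lower() not in noise_dict)

print(remove_noise("this is a sample text"))  # -> "sample text"
```
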
10
Q

What is lexicon normalization?

A

converts all disparities (variant forms) of a word to its normal form
reduces high dimensionality (many surface forms) to low dimensionality (one normal form)
player, played -> play

11
Q

What are the most common normalization practices?

A

Stemming and lemmatization

12
Q

What is lemmatization?

A

reduces a word to its root -> the dictionary headword form (lemma)
am, are, is -> be
car, cars, car’s -> car
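
A minimal sketch with NLTK's WordNet lemmatizer (assumes the "wordnet" corpus is downloaded; verb forms need pos="v"):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cars"))          # -> car (default pos is noun)
print(lemmatizer.lemmatize("is", pos="v"))   # -> be
print(lemmatizer.lemmatize("are", pos="v"))  # -> be
```
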

13
Q

What are morphemes?

A

small meaningful units that make up words

14
Q

What is stemming?

A

stemming is a rudimentary rule-based process that strips suffixes from words
- automate(s), automatic, automation are all reduced to automat
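
A minimal sketch with NLTK's PorterStemmer (exact stems depend on the stemmer's rules; the point is the crude suffix stripping):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["automates", "automatic", "automation", "played", "player"]:
    print(word, "->", stemmer.stem(word))
```
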

15
Q

Other text preprocessing steps (egs)

A

encoding-decoding noise removal
grammar checking
spelling correction

16
Q

What are text-to-feature techniques used for? List the techniques. (SESW)

A
  • To analyze pre-processed data
  • techniques
    1. Syntactical Parsing
    2. Entities / N-gram / word-based features
    3. Statistical features
    4. Word embeddings
17
Q

What is syntactical parsing, what does it involve, and which attributes are important?

A
  • involves the analysis of the words in a sentence for grammar and their arrangement in a manner that shows the relationships among the words
  • Dependency Grammar and Part of Speech (POS) tags are the important attributes
18
Q

What is dependency grammar?

A
  • class of syntactic text analysis that deals with binary relations between two words
  • every relation can be represented as a triplet: (relation, governor, dependent)
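
A minimal Python sketch with spaCy (assumes the en_core_web_sm model is installed), printing each relation as a triplet:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ben submitted the report")
for token in doc:
    # triplet: (relation, governor/head, dependent)
    print((token.dep_, token.head.text, token.text))
```
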
19
Q

What is POS tagging?

A
  • defines the usage and function of a word in a sentence
20
Q

Describe the POS tagging problem

A
  • to determine the POS tag for a particular instance of a word
  • difficult because words often have more than one POS
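
A minimal sketch with NLTK's pos_tag (assumes the "averaged_perceptron_tagger" data is downloaded); "book" illustrates the ambiguity:

```python
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # one-time download
tokens = "Please book my flight , I am reading a good book".split()
# "book" is typically tagged as a verb (VB) first and a noun (NN) second
print(nltk.pos_tag(tokens))
```
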

21
Q

Where can POS tagging be used? (WINE)

A

Word sense disambiguation (e.g., "book" as noun vs. verb)
Improving word-based features
Normalization and lemmatization
Efficient stopword removal

22
Q

What are the most important chunks of a sentence?

Which algorithms are generally ensemble models of rule-based parsing, dictionary lookups, etc.?

A

Entities; entity detection algorithms

23
Q

What is Named Entity Recognition (NER)?

A
  • Process of detecting named entities such as persons, locations, etc., from text
    example — {“person”: “Ben”}
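
A minimal sketch with spaCy's pre-trained NER (assumes the en_core_web_sm model is installed; the sentence is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline
doc = nlp("Ben moved to Toronto in 2019")
print({ent.label_: ent.text for ent in doc.ents})
# e.g. {'PERSON': 'Ben', 'GPE': 'Toronto', 'DATE': '2019'}
```
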
24
Q

What are the three blocks NER has (NPE)

A
  1. Noun phrase identification - extracts all noun phrases using dependency parsing and POS
  2. Phrase classification - all extracted noun phrases are classified (location, name, etc.)
  3. Entity disambiguation - validation layer on top of results
25
Q

What is topic modeling, and what does it derive?

A

the process of automatically identifying topics in a text corpus; it derives the hidden patterns among words in an unsupervised manner
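
A minimal sketch with LDA from scikit-learn, one common unsupervised topic-modeling technique (the four toy documents are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are assets", "investors trade stocks"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
for topic in lda.components_:           # one row of word weights per topic
    top = topic.argsort()[-3:][::-1]    # indices of the 3 heaviest words
    print([vec.get_feature_names_out()[i] for i in top])
```
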
26
Q

Describe N-grams as features. Which ones are more informative, and which is the most important?

A

a combination of N words together is called an n-gram; n-grams with N > 1 are more informative than single words, and bigrams (N = 2) are considered the most important features
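
A minimal Python sketch of extracting n-grams from a token list:

```python
def ngrams(tokens, n):
    # slide a window of size n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a sample text".split()
print(ngrams(tokens, 2))  # bigrams: [('this', 'is'), ('is', 'a'), ...]
```
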
27
Q

What operations does Bag of Words involve?

A

1. Tokenization: all words are tokenized
2. Vocabulary creation: the unique words form the vocabulary
3. Vector creation: each row of the matrix is a sentence; the columns correspond to the vocabulary (one per unique word)
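
A minimal sketch of the three operations using scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat", "the dog sat on the mat"]
vectorizer = CountVectorizer()             # tokenizes internally
X = vectorizer.fit_transform(sentences)    # vocabulary + vector creation
print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(X.toarray())                         # one count vector per sentence (rows)
```
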
28
Q

What is TF-IDF, and what does it convert?

A

a weighted model used for information retrieval; it converts text documents into vector models
29
Q

What is TF?

A

Term frequency = frequency of the word in the document / total number of words in the document
30
Q

What is IDF?

A

Inverse document frequency = log(total number of documents / number of documents containing word W)
31
Q

What is significant about TF-IDF?

A

it gives the relative importance of a term in the corpus
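
A minimal Python sketch computing TF and IDF exactly as defined in the previous two cards (library implementations such as scikit-learn's TfidfVectorizer use smoothed variants; the two toy documents are illustrative):

```python
import math

docs = [["the", "cat", "sat"],
        ["the", "dog", "sat", "on", "the", "mat"]]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

# "the" occurs in every document, so its IDF (and weight) collapses to 0
print(tf("the", docs[1]) * idf("the", docs))  # -> 0.0
# "cat" is rarer across the corpus, so it gets relative importance
print(tf("cat", docs[0]) * idf("cat", docs))  # -> ~0.23
```
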
32
Q

What is text classification?

A

a technique to systematically classify a text object into predefined categories
33
Q

What is text matching/similarity?

A

matching text objects to find the similarities between them
34
Q

What is Levenshtein distance? List the edit operations.

A

the minimum number of edits required to transform one string into another; the allowed edit operations are insertion, deletion, and substitution of a single character
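
A minimal dynamic-programming sketch of the distance:

```python
def levenshtein(a, b):
    dp = list(range(len(b) + 1))  # edit distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```
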
35
Q

What is phonetic matching?

A

takes a keyword as input and produces a character string that identifies phonetically similar words; it helps in searching large text corpora, correcting spelling errors, and matching relevant names
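
A simplified Soundex-style Python sketch (the full Soundex algorithm has additional rules, e.g. for h/w, omitted here), showing how phonetically similar names map to the same code:

```python
def soundex(word):
    codes = {c: str(d) for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}
    word = word.lower()
    result, prev = word[0].upper(), codes.get(word[0])
    for c in word[1:]:
        code = codes.get(c)            # vowels map to None and reset prev
        if code and code != prev:      # skip adjacent duplicate codes
            result += code
        prev = code
    return (result + "000")[:4]        # pad/truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # both -> R163
```
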
36
Q

What is cosine similarity?

A

when text is represented in vector notation, similarity can be measured as the cosine of the angle between the vectors; cosine similarity ranges from 0 to 1
- closer to 1 = the two vectors have the same orientation
- closer to 0 = the two vectors have less similarity
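
A minimal sketch with NumPy on two toy count vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1, 1, 0, 1])  # e.g. Bag of Words vectors for two sentences
v = np.array([1, 1, 1, 0])
print(cosine_similarity(u, v))  # ~0.67: fairly similar orientation
```
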
37
Q

What is text summarization?

A

given an article, automatically summarize it to produce the most important sentences
38
Q

What is machine translation?

A

automatically translating text from one human language to another
39
Q

What are Natural Language Generation and Understanding?

A

Generation: converting information from computer databases or semantic intents into readable human language
Understanding: converting chunks of text into logical structures for computer programs
40
Q

What is optical character recognition?

A

given an image representing text, determine the corresponding text
41
Q

What is document to information?

A

parsing the textual data present in documents into an analyzable and clean format
42
Q

What is a Naive Bayesian classifier? What are its input and output?

A

determines the most probable class label for an object, assuming the attributes are independent
- Input: discrete variables
- Output: a probability score (proportional to the true probability) and a class label (based on the highest probability score)
43
Q

Use cases of NBC

A

spam filtering, fraud detection
44
Q

Describe Bayes' law

A

P(C|A) = P(A & C) / P(A) = P(A|C) P(C) / P(A)
where C is the class label and A is the attribute
45
Q

How does the naive assumption simplify Bayes' law?

A

assume the attributes are conditionally independent given the class, so the likelihood factors into a product:

```
P(A|C) = Π_j P(aj | C)
P(C|A) ∝ Π_j P(aj | C) · P(C)
```
46
Q

How to build the naive classifier

A

1. Get P(Ci) for all class labels
2. Get P(aj | Ci) for all attributes and class labels
3. Assign the class label that maximizes the value of the naive assumption
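
A minimal Python sketch of these three steps on an illustrative toy corpus (training data and labels are made up), applying the log and smoothing fixes listed on the next card:

```python
import math
from collections import Counter, defaultdict

train = [("spam", ["win", "money", "now"]),
         ("spam", ["win", "prize"]),
         ("ham",  ["meeting", "now"])]

class_counts = Counter(label for label, _ in train)   # for P(Ci)
word_counts = defaultdict(Counter)                    # for P(aj | Ci)
vocab = set()
for label, words in train:
    word_counts[label].update(words)
    vocab.update(words)

def classify(words):
    scores = {}
    for c in class_counts:
        # log P(C) + sum_j log P(aj | C), with add-one smoothing
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in words:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)  # label maximizing the naive score

print(classify(["win", "money"]))  # -> spam
```
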
47
Q

List the Naive Bayesian implementation considerations

A

- Numerical underflow: results from multiplying many probabilities near 0; preventable by computing with logarithms
- Zero probabilities: arise from unobserved attribute/class pairs; handled by smoothing
48
Q

List precision and recall

A

```
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
```
49
Q

What are the two problems with using VSM?

A

- synonymy: many ways to refer to the same object (car, automobile) -> poor recall (small cosine even though related)
- polysemy: most words have more than one meaning (model) -> poor precision (large cosine even though not related)
50
Q

Solution to the VSM problems

A

Latent Semantic Indexing
51
Q

List the four steps of Latent Semantic Analysis

A

1. Build the term-by-document matrix
2. Convert matrix entries to weights
3. Apply rank-reduced singular value decomposition
4. Compute similarities between entities in the semantic space with cosine
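
A minimal sketch of the four steps with scikit-learn (TruncatedSVD performs the rank-reduced SVD; the three toy documents are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is fast", "the automobile is quick", "cats drink milk"]
X = TfidfVectorizer().fit_transform(docs)          # steps 1-2: weighted matrix
Z = TruncatedSVD(n_components=2).fit_transform(X)  # step 3: rank-reduced SVD
print(cosine_similarity(Z))  # step 4: similarities in the semantic space
```
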
52
Q

What is SVD?

A

- a tool for dimension reduction
- a similarity measure based on co-occurrence
- finds the optimal projection into a low-dimensional space
- a generalized least-squares method