Module 9 Flashcards
What is NLP?
Natural language processing (NLP): a field that produces machine-driven analyses of text
Why is NLP a hard problem?
Language is ambiguous; different people may interpret the same text differently
Applications of NLP (amn)
- automatic summarization
- machine translation
- named entity recognition
What is a corpus?
a collection of written texts that serves as a dataset
What are a token and tokenization?
token: a string of contiguous characters between two spaces; it can also be an integer, a real number, or a number with a colon
tokenization: converting text to tokens
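A minimal sketch of the definition above, assuming tokens are separated by plain whitespace. The sample sentence is made up for illustration, and practical tokenizers usually also deal with punctuation.

```python
# Whitespace tokenization: each token is a run of contiguous
# characters between spaces (the sample text is an assumption).
text = "The meeting at 10:30 lasted 2 hours and cost 3.5 dollars"

# str.split() with no arguments splits on any run of whitespace.
tokens = text.split()
print(tokens)
```

Note that the integer `2`, the real number `3.5`, and the colon number `10:30` each come out as a single token.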
What is text preprocessing + 3 steps
data is not analyzable without pre-processing steps
- Noise removal
- Lexicon normalization
- Object standardization
What is noise removal?
removal of all noisy entities in the text, i.e. those not relevant to the context of the data
What are stopwords?
commonly used words of a language, e.g. is, am
What is a general approach to noise removal?
- prepare a dictionary of noisy entities, then iterate over the text word by word and eliminate every word that also appears in the dictionary
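The dictionary approach above could be sketched roughly as follows. The `NOISE_WORDS` set and the `remove_noise` helper are hypothetical names chosen for illustration; a real noise dictionary depends on the dataset.

```python
# Hypothetical noise dictionary; a real one depends on the dataset.
NOISE_WORDS = {"is", "am", "a", "the", "this"}

def remove_noise(text, noise_words=NOISE_WORDS):
    """Drop every word that also appears in the noise dictionary."""
    words = text.split()
    # Compare in lowercase so "This" and "this" are treated alike.
    kept = [w for w in words if w.lower() not in noise_words]
    return " ".join(kept)

print(remove_noise("this is a sample text"))
```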
What is lexicon normalization?
converts all disparities (variant forms) of a word into its normalized form
reduces high-dimensional features (many word forms) to a low-dimensional space (one feature)
player, played -> play
What are the most common normalization practices?
Stemming and lemmatization
What is lemmatization?
reduces a word to its root, i.e. its dictionary headword form (lemma)
am are is -> be
car cars car’s -> car
What are morphemes?
small meaningful units that make up words
What is stemming?
a rudimentary rule-based process that strips suffixes from a word
- automate(s), automatic, automation are all reduced to automat
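To make the suffix-stripping idea concrete, here is a toy sketch. The suffix list and the `toy_stem` helper are assumptions picked only to reproduce the example above; this is not the Porter stemmer or any other published algorithm.

```python
# Toy suffix list, longest first; chosen only to reproduce the
# "automat" example and not taken from any real stemmer.
SUFFIXES = ("ion", "ic", "es", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not emptied.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", toy_stem(w))
```

Because the rules are purely mechanical, the result ("automat") need not be a dictionary word, which is the main contrast with lemmatization.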
other text preprocessing steps (egs)
encoding-decoding noise
grammar checker
spelling correction
What are text-to-features techniques used for, and what are the techniques? (SESW)
- To analyze pre-processed data, it must first be converted into features
- techniques
1. Syntactical Parsing
2. Entities / N-gram / word-based features
3. Statistical features
4. Word embeddings
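As one illustration of item 2, word-level N-grams can be produced by sliding a window of N tokens over the text. The `word_ngrams` helper is a hypothetical name, and whitespace tokenization is assumed.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words as tuples."""
    tokens = text.split()
    # The range is empty when the text has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("natural language processing is fun", 2))
```

With `n=2` this yields bigrams; each tuple can then be counted or used directly as a feature.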
What is syntactical parsing, what does it involve, and what are its important attributes?
- involves analyzing the words in a sentence for grammar, and their arrangement, to show the relationships among the words
- Dependency grammar and part-of-speech (POS) tags are the important attributes
What is dependency grammar?
- a class of syntactic text analysis that deals with binary relations between two words
- every relation can be represented as a triplet: (relation, governor, dependent)
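A small sketch of how such triplets might be stored, assuming the layout (relation, governor, dependent). The sentence, the `Dependency` type, and the relation labels `nsubj` and `dobj` are hand-written for illustration and are not the output of any parser.

```python
from collections import namedtuple

# Assumed triplet layout: (relation, governor, dependent).
Dependency = namedtuple("Dependency", ["relation", "governor", "dependent"])

# Hand-annotated relations for "Ben reads books" (not parser output).
parse = [
    Dependency("nsubj", "reads", "Ben"),    # "Ben" is the subject of "reads"
    Dependency("dobj", "reads", "books"),   # "books" is the object of "reads"
]

for dep in parse:
    print(f"{dep.relation}({dep.governor}, {dep.dependent})")
```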
What is POS tagging?
- defines the usage and function of a word in the sentence
Describe the POS tagging problem
- to determine the POS tag for a particular instance of a word
- words often have more than one possible POS
Where can POS tagging be used? (WINE)
Word sense disambiguation (e.g. "book" as a noun or as a verb)
Improving word-based features
Normalization and lemmatization
Efficient stopword removal
What are the most important chunks of a sentence?
Which algorithms are generally ensemble models of rule-based parsing, etc.?
Entities; entity detection algorithms
What is Named Entity Recognition (NER)?
- process of detecting named entities such as person names, locations, etc. in the text
example: {“person”: “Ben”}
What are the three blocks of NER? (NPE)
- Noun phrase identification: extracts all noun phrases using dependency parsing and POS tagging
- Phrase classification: all extracted noun phrases are classified into categories (location, name, etc.)
- Entity disambiguation: a validation layer on top of the results
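The three blocks above can be mimicked with a deliberately simplified toy pipeline. Every piece here is a stand-in and an assumption: capitalized tokens replace real noun phrase extraction (no dependency parsing or POS tagging is done), the hypothetical `GAZETTEER` lookup replaces a real classifier, and validation only drops unlabeled candidates.

```python
# Hypothetical gazetteer mapping known phrases to entity types.
GAZETTEER = {"Ben": "person", "Paris": "location"}

def find_candidates(text):
    # Stand-in for noun phrase identification: capitalized tokens only.
    return [tok.strip(".,") for tok in text.split() if tok[:1].isupper()]

def classify(candidates):
    # Phrase classification by dictionary lookup (None if unknown).
    return [(c, GAZETTEER.get(c)) for c in candidates]

def validate(labelled):
    # Stand-in for entity disambiguation: drop unclassified candidates.
    return {label: phrase for phrase, label in labelled if label is not None}

text = "Yesterday Ben flew to Paris."
print(validate(classify(find_candidates(text))))
```

Here "Yesterday" is picked up as a candidate because it is capitalized, then discarded in the last step because the lookup cannot label it.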