Lec 6 | Language Flashcards
(48 cards)
It spans all tasks where the AI gets human language as input.
Natural Language Processing (NLP)
the AI is given text as input and it produces a summary of the text as output.
automatic summarization
the AI is given a corpus of text and the AI extracts data as output
information extraction
the AI is given text and returns the language of the text as output.
language identification
the AI is given a text in the origin language and it outputs the translation in the target language.
machine translation
the AI is given text and it extracts the names of the entities in the text (for example, names of companies).
named entity recognition
the AI is given speech and it produces the same words in text.
speech recognition
the AI is given text and it needs to classify it as some type of text.
text classification
where the AI needs to choose the right meaning of a word that has multiple meanings (e.g. bank means both a financial institution and the ground on the sides of a river).
Word Sense Disambiguation
Sentence structure
Syntax
The meaning of words or sentences
Semantics
A system of rules for generating sentences in a language.
Formal Grammar
The text is abstracted from its meaning to represent the structure of the sentence using formal grammar.
Context-Free Grammar
What do these non-terminal symbols mean?
- N
- V
- NP
- VP
- S
- D
- P
- ADJ
- Noun
- Verb
- Noun Phrase
- Verb Phrase
- Sentence
- Determiner
- Preposition
- Adjective
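To see how these symbols combine into rewriting rules, here is a minimal sketch of a context-free grammar in plain Python. The rules, the tiny vocabulary, and the names GRAMMAR and generate_sentence are assumptions made up for this illustration, not part of the lecture.

```python
import random

# Hypothetical toy grammar: each non-terminal maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V"], ["V", "NP"]],
    "D": [["the"], ["a"]],
    "N": [["she"], ["city"], ["car"]],
    "V": [["saw"], ["walked"]],
}


def generate_sentence(symbol="S", rng=random):
    """Expand a symbol into a list of words by recursively applying rules."""
    if symbol not in GRAMMAR:
        # Anything without a rule is a terminal symbol, i.e. an actual word.
        return [symbol]
    expansion = rng.choice(GRAMMAR[symbol])
    words = []
    for part in expansion:
        words.extend(generate_sentence(part, rng))
    return words


if __name__ == "__main__":
    print(" ".join(generate_sentence()))
```

Every sentence this produces follows S -> NP VP, so the structure is valid even when the meaning is odd. That is the sense in which the grammar abstracts structure away from meaning.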
A sequence of n items from a sample of text.
n-gram
What n-gram is this?
the items are characters
character n-gram
What n-gram is this?
the items are words
word n-gram
Contiguous sequences of 1, 2, or 3 items from a sample of text.
3 answers
unigram, bigram, and trigram
Where can we use/implement n-gram?
It is useful for text-processing.
Since some words occur together more often than others, it is also possible to predict the next word with some probability. A helpful step in natural language processing is breaking the sentence into n-grams.
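Breaking text into n-grams can be sketched in a few lines of Python. The helper name ngrams and the sample sentence are assumptions for this example. The same function gives word n-grams when passed a list of words and character n-grams when passed a string.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "the cat sat on the mat".split()

unigrams = ngrams(words, 1)       # word unigrams
bigrams = ngrams(words, 2)        # word bigrams
trigrams = ngrams(words, 3)       # word trigrams
char_bigrams = ngrams("cat", 2)   # character bigrams

# Counting n-grams shows which items occur together most often.
unigram_counts = Counter(unigrams)
```

Counts like these are the raw material for estimating how likely one word is to follow another.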
the task of splitting a sequence of characters into pieces (tokens).
Tokenization
Tokens can be words as well as sentences, in which case the tasks are called what?
word tokenization or sentence tokenization
What challenges arise when splitting text into words? How do we deal with them?
Words with apostrophes (e.g. “o’clock”) and hyphens (e.g. “pearl-grey”) are hard to split correctly. Additionally, some punctuation, such as periods, is important for sentence structure. Deciding how to handle these cases is part of the process of tokenization.
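One possible way to handle these cases is a regular expression that keeps apostrophes and hyphens when they sit inside a word. This is only a rough sketch using Python's standard re module. The pattern and the function names word_tokenize and sentence_tokenize are assumptions for this example, and the sentence splitter is naive (it would wrongly split after abbreviations such as "Mr.").

```python
import re

# A word is a run of word characters, optionally joined to further runs
# by a single apostrophe (straight or curly) or hyphen.
WORD_RE = re.compile(r"\w+(?:['’-]\w+)*")


def word_tokenize(text):
    """Split text into word tokens, dropping surrounding punctuation."""
    return WORD_RE.findall(text)


def sentence_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


if __name__ == "__main__":
    print(word_tokenize("It's nine o'clock; the pearl-grey sky."))
    print(sentence_tokenize("Hello there. How are you? Fine!"))
```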
- Consists of nodes, the value of each of which has a probability distribution based on a finite number of previous nodes.
- Can be used to generate text.
Markov Models
How do we use Markov Models?
We train the model on a text, and then establish probabilities for every n-th token in an n-gram based on the n - 1 tokens preceding it.
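The idea can be sketched with a small Markov chain built from plain Python dictionaries. This is an illustrative assumption, not the lecture's implementation. The names train and generate_text and the sample text are made up, and order is the number of preceding tokens that the next token depends on.

```python
import random
from collections import defaultdict


def train(tokens, order=1):
    """Map each run of `order` tokens to the tokens observed right after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        model[state].append(tokens[i + order])
    return model


def generate_text(model, start, length, rng=random):
    """Extend `start` by sampling followers until `length` tokens or a dead end."""
    out = list(start)
    order = len(start)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break  # this state was never followed by anything in training
        out.append(rng.choice(followers))
    return out


if __name__ == "__main__":
    tokens = "the cat sat on the mat and the cat ran".split()
    model = train(tokens, order=1)
    print(" ".join(generate_text(model, ("the",), 8)))
```

Followers are stored with repeats, so sampling uniformly from the list picks each next token in proportion to how often it followed that state in the training text.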