NLP Flashcards
Summarization
the task of automatically producing a shorter version of a text (or a collection of texts) that preserves its most important information
Ambiguity
the property that a word, phrase or sentence admits more than one analysis or meaning (lexical, syntactic, semantic, ...); resolving ambiguity is a central difficulty in NLP
morpheme
a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming).
morphological
relating to the form or structure of things.
Zipf’s law
Zipf’s law states that, given a large sample of words, the frequency of any word is inversely proportional to its rank in the frequency table: the word with rank N has a frequency proportional to 1/N.
Thus the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
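A quick way to see this empirically (a minimal sketch; the file name corpus.txt is just a placeholder for whatever plain-text corpus you have at hand): count word frequencies and check that frequency × rank stays roughly constant.

```python
# Rough empirical check of Zipf's law on a plain-text file.
from collections import Counter
import re

with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
ranked = counts.most_common()

# Under Zipf's law, freq * rank should stay roughly constant:
# the 2nd word appears ~1/2 as often as the 1st, the 3rd ~1/3, etc.
for rank, (word, freq) in enumerate(ranked[:20], start=1):
    print(f"{rank:>3}  {word:<12}  freq={freq:<8}  freq*rank={freq * rank}")
```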
arg max or argmax
the arguments of the maxima are the points of the domain of some function at which the function values are maximized
In other words, arg max is the set of points, x, for which f(x) attains the function’s largest value (if it exists). Arg max may be the empty set, a singleton, or contain multiple elements.
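A toy illustration (the scores below are invented): over a finite domain, the arg max is simply the set of inputs whose value equals the maximum.

```python
# arg max over a finite domain: the input(s) at which f attains its maximum.
# The scores are made-up numbers for illustration.
scores = {"cat": 0.2, "dog": 0.5, "fish": 0.5, "bird": 0.1}

best = max(scores.values())
argmax = {x for x, fx in scores.items() if fx == best}
print(argmax)  # {'dog', 'fish'} -- the arg max can contain multiple elements
```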
Language model?
- Models that assign probabilities to sequences of words
- The simplest model that assigns probabilities to sentences and sequences of words is the N-gram
- Goal: assign useful probabilities P(x) to sentences x
- Related task: assign the probability P(w_i+1 | w_i) of an upcoming word
-> Probabilities should broadly indicate the plausibility of sentences (see the bigram sketch below)
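A minimal sketch of scoring a sentence with a bigram model via the chain rule; the probability table is invented, not estimated from a real corpus.

```python
import math

# Toy bigram probabilities P(w_i | w_{i-1}); <s> and </s> mark sentence boundaries.
bigram_p = {
    ("<s>", "i"): 0.6, ("i", "like"): 0.3,
    ("like", "nlp"): 0.2, ("nlp", "</s>"): 0.5,
}

def sentence_logprob(words):
    """Chain rule with the bigram (Markov) assumption:
    log P(sentence) = sum_i log P(w_i | w_{i-1})."""
    tokens = ["<s>"] + words + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log(bigram_p.get((prev, cur), 1e-10))  # tiny floor for unseen bigrams
    return logp

print(sentence_logprob(["i", "like", "nlp"]))  # higher = more plausible sentence
```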
Noisy Channel Model
- Goal: predict the intended sentence given the observed acoustics
- decode by combining a channel model P(acoustics | sentence) with a language-model prior P(sentence): pick the sentence with the highest P(acoustics | sentence) * P(sentence)
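A minimal decoding sketch under that view; the two candidate sentences and all scores are invented for illustration.

```python
# Noisy channel decoding: pick the sentence w maximizing P(acoustics | w) * P(w).
candidates = {
    # sentence: (channel model P(acoustics | sentence), language model prior P(sentence))
    "recognize speech":   (0.40, 0.010),
    "wreck a nice beach": (0.45, 0.0001),
}

best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # "recognize speech": the language-model prior outweighs the channel score
```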
Unigram, bigram, trigram
n-gram language models conditioning on 0, 1 or 2 previous words respectively: unigram P(w_i), bigram P(w_i | w_i-1), trigram P(w_i | w_i-2, w_i-1)
stupid backoff
a simple back-off scheme for very large language models (Brants et al., 2007): use the relative frequency of the highest-order n-gram that was seen; otherwise back off to the lower-order score multiplied by a fixed weight (e.g. 0.4); the resulting scores are not normalized probabilities
Prior probability
a probability assessed before making reference to certain relevant observations, especially subjectively or on the assumption that all possible outcomes are given the same probability.
the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account
Channel model probability
the noisy-channel likelihood P(observation | intended sentence), e.g. P(acoustics | words) in speech recognition; it is combined with the language-model prior P(sentence) when decoding
Markov Model / Markov Chain
A finite-state automaton with probabilistic state transitions
Makes the Markov assumption: the next state depends only on the current state and is independent of the previous history (see the sampling sketch below)
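A minimal sketch of a first-order Markov chain over words; the transition probabilities are invented for illustration.

```python
import random

# First-order Markov chain: the next word depends only on the current word.
transitions = {
    "<s>":    [("the", 0.7), ("a", 0.3)],
    "the":    [("cat", 0.5), ("dog", 0.5)],
    "a":      [("cat", 0.4), ("dog", 0.6)],
    "cat":    [("sleeps", 1.0)],
    "dog":    [("barks", 1.0)],
    "sleeps": [("</s>", 1.0)],
    "barks":  [("</s>", 1.0)],
}

def sample_sentence():
    state, out = "<s>", []
    while state != "</s>":
        words, probs = zip(*transitions[state])
        state = random.choices(words, weights=probs)[0]
        if state != "</s>":
            out.append(state)
    return " ".join(out)

print(sample_sentence())  # e.g. "the dog barks"
```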
Higher-order Markov Language Models
k-gram models are equivalent to bigram (first-order) models whose states are (k-1)-tuples of words: the state is the tuple of the k-1 most recent words
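A tiny sketch of that reduction for k = 3: a trigram context is encoded as a state made of the two most recent words, so each transition only looks at the current state. The probabilities are invented.

```python
# Trigram model P(w_i | w_{i-2}, w_{i-1}) as a first-order chain over word pairs.
trigram_p = {(("<s>", "i"), "like"): 0.3, (("i", "like"), "nlp"): 0.2}

state = ("<s>", "i")            # current state = the two most recent words
next_word = "like"
p = trigram_p[(state, next_word)]
state = (state[1], next_word)   # transition to the new state ("i", "like")
print(p, state)
```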
Maximum likelihood estimator
P_MLE(w_i | w_i-1) = count(w_i-1, w_i) / count(w_i-1)
- it is the estimate that maximizes the likelihood of the training data, i.e. the estimate under which it is most likely that the word w_i occurs the number of times we actually observed (e.g. x times in a y-million-word corpus); see the sketch below
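A minimal sketch of the estimator on a made-up two-sentence corpus:

```python
from collections import Counter

# P_MLE(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
corpus = [["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "i", "like", "cats", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_mle(word, prev):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("like", "i"))    # 2/2 = 1.0
print(p_mle("nlp", "like"))  # 1/2 = 0.5
```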
Back-off Models
- use the trigram if you have good evidence for it, otherwise the bigram, otherwise the unigram (see the sketch below)
related idea: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff
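A minimal back-off sketch in the stupid-backoff style (unnormalized scores, fixed back-off weight); the toy corpus is made up.

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off weight
corpus = [["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "i", "like", "cats", "</s>"]]

uni = Counter(w for s in corpus for w in s)
bi = Counter(tuple(s[i:i + 2]) for s in corpus for i in range(len(s) - 1))
tri = Counter(tuple(s[i:i + 3]) for s in corpus for i in range(len(s) - 2))
total = sum(uni.values())

def score(w, prev2, prev1):
    # highest-order n-gram with evidence wins; otherwise back off with a penalty
    if tri[(prev2, prev1, w)] > 0:
        return tri[(prev2, prev1, w)] / bi[(prev2, prev1)]
    if bi[(prev1, w)] > 0:
        return ALPHA * bi[(prev1, w)] / uni[prev1]
    return ALPHA * ALPHA * uni[w] / total

print(score("nlp", "i", "like"))    # trigram seen: 1/2
print(score("nlp", "you", "like"))  # backs off to the bigram ("like", "nlp")
```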
Laplace Smoothing
- add-one: pretend we saw each word one more time than we actually did -> add one to all counts (see the sketch below)
- more generally: add-k smoothing (add a fractional count k instead of 1)
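A minimal add-one (and add-k) smoothed bigram estimate on a made-up one-sentence corpus, where V is the vocabulary size:

```python
from collections import Counter

# P_add1(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + V)
corpus = [["<s>", "i", "like", "nlp", "</s>"]]
uni = Counter(w for s in corpus for w in s)
bi = Counter((a, b) for s in corpus for a, b in zip(s, s[1:]))
V = len(uni)  # vocabulary size

def p_smoothed(word, prev, k=1):
    # k = 1 gives add-one (Laplace) smoothing; other k gives add-k smoothing
    return (bi[(prev, word)] + k) / (uni[prev] + k * V)

print(p_smoothed("like", "i"))     # seen bigram: (1 + 1) / (1 + 5)
print(p_smoothed("cats", "like"))  # unseen bigram still gets some probability mass
```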
Back-off: Linear Interpolation
related idea to backoff: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff
-> how to set the lambdas: tune them on a held-out corpus (separate from the training and test sets); see the sketch below
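A minimal interpolation sketch; the lambdas and component estimates are placeholder numbers, in practice the lambdas are tuned on held-out data (e.g. by grid search or EM).

```python
LAMBDAS = (0.5, 0.3, 0.2)  # weights for trigram, bigram, unigram; must sum to 1

def p_interp(p_tri, p_bi, p_uni, lambdas=LAMBDAS):
    l3, l2, l1 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# with made-up component estimates (e.g. the trigram was never seen):
print(p_interp(p_tri=0.0, p_bi=0.4, p_uni=0.1))  # 0.5*0.0 + 0.3*0.4 + 0.2*0.1 = 0.14
```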
interpolate?
interpolation is a method of constructing new data points within the range of a discrete set of known data points
Extrapolation
extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results
gradient descent
an iterative optimization algorithm: repeatedly move the parameters a small step in the direction of the negative gradient of the loss function until (approximate) convergence
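A minimal 1-D sketch on a toy loss L(theta) = (theta - 3)^2; the learning rate and number of steps are arbitrary.

```python
# Gradient descent on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def grad(theta):
    return 2 * (theta - 3)

theta, lr = 0.0, 0.1
for _ in range(100):
    theta -= lr * grad(theta)  # step in the direction of the negative gradient

print(theta)  # converges towards the minimizer theta = 3
```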
perplexity
=> the most common intrinsic evaluation of a language model
(vs. extrinsic evaluation, e.g. plugging the LM into a speech recognition or MT system)
=> based on the inverse probability of the test set, normalized by the number of words: PP(W) = P(w_1 ... w_N)^(-1/N)
=> perplexity is a function of the probability the model assigns to the test sentences
=> also: perplexity is the “weighted equivalent branching factor” (also “average branching factor”)
=> in general: the lower the perplexity, the better the model
=> minimizing perplexity is the same as maximizing the probability of the test set (see the sketch below)
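A minimal sketch of computing perplexity from per-word log probabilities (the numbers are made up); working in log space avoids underflow.

```python
import math

# PP(W) = P(w_1 ... w_N)^(-1/N), computed from per-word log probabilities.
def perplexity(log_probs):
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# per-word probabilities a (hypothetical) model assigned to a tiny test set
log_probs = [math.log(p) for p in (0.2, 0.1, 0.25, 0.05)]
print(perplexity(log_probs))  # lower perplexity = better model of the test set
```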
a better model of a text
is one which assigns a higher probability to the word that actually occurs
the best language model
is one that best predicts an unseen test set
-> gives the highest P(sentence)