NLP Flashcards

1
Q

Summarization

A

idea?

2
Q

Ambiguity

A

t

3
Q

morpheme

A

a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming ).

4
Q

morphological

A

relating to the form or structure of things.

5
Q

Zipf’s law

A

Zipf’s law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So the word at rank N has a frequency proportional to 1/N.
Thus the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
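A minimal sketch (not from the lecture) of how one might check Zipf’s law empirically: count word frequencies with Python’s collections.Counter and compare each observed frequency to freq(top word) / rank. The corpus file name is hypothetical.

```python
from collections import Counter

def zipf_check(tokens):
    counts = Counter(tokens)
    ranked = counts.most_common()        # [(word, freq), ...], most frequent first
    top_freq = ranked[0][1]
    for rank, (word, freq) in enumerate(ranked[:20], start=1):
        predicted = top_freq / rank      # Zipf: frequency roughly proportional to 1/rank
        print(f"{rank:>3}  {word:<15} observed={freq:<6} zipf_predicted={predicted:.1f}")

# zipf_check(open("corpus.txt").read().lower().split())  # "corpus.txt" is a hypothetical file
```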

6
Q

arg max or argmax

A

the arguments of the maxima are the points of the domain of some function at which the function values are maximized

In other words, arg max is the set of points, x, for which f(x) attains the function’s largest value (if it exists). Arg max may be the empty set, a singleton, or contain multiple elements.
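A small illustrative sketch (my own, not from the cards): arg max over a finite domain, returned as a set since it may contain several points.

```python
def argmax_set(f, domain):
    """Return the set of points in `domain` where f attains its largest value."""
    best = max(f(x) for x in domain)
    return {x for x in domain if f(x) == best}

print(argmax_set(lambda x: -(x - 2) ** 2, range(-5, 6)))   # {2}        (singleton)
print(argmax_set(lambda x: x % 3, range(9)))               # {2, 5, 8}  (multiple maximizers)
```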

7
Q

Language model?

A
  • Models that assign probabilities to sequences of words
  • the simplest model that assigns probabilities to sentences and sequences of words is the N-gram LM
  • Goal: Assign useful probabilities P(x) to sentences x
  • related task: assign the probability P(w_i+1 | w_1, ..., w_i) of an upcoming word

-> Probabilities should broadly indicate plausibility of sentences

8
Q

Noisy Channel Model

A
  • Goal: predict the sentence given the acoustics (speech recognition)
  • noisy channel decomposition: ŵ = argmax_w P(w | acoustics) = argmax_w P(acoustics | w) · P(w), i.e. channel model × language model (prior)
9
Q

Unigram, bigram, trigram

A

t

10
Q

stupid backoff

A

t

11
Q

Prior probability

A

a probability as assessed before making reference to certain relevant observations, especially subjectively or on the assumption that all possible outcomes be given the same probability.

the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account

12
Q

Channel model probability

A

t

13
Q

Markov Model / Markov Chain

A

A finite state automaton with probabilistic state transitions

Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history

14
Q

Higher-order Markov Language Models

A

k-gram models are equivalent to bi-gram models whose state space consists of (k−1)-tuples of the bi-gram state space (the words)

15
Q

Maximum likelihood estimator

A

P_MLE(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

  • it is the estimate that makes it most likely that the word w_i will occur x times in a y-million-word corpus
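A minimal sketch of the MLE bigram estimate above, counting unigrams and bigrams from a tokenized corpus (the toy sentence is just an illustration):

```python
from collections import Counter

def mle_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def p(w, prev):
        # P_MLE(w | prev) = count(prev, w) / count(prev)
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

p = mle_bigram("the cat sat on the mat".split())
print(p("cat", "the"))   # 0.5 -- "the" occurs twice, once followed by "cat"
```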
16
Q

Back-off Models

A
  • use trigram if you have good evidence, otherwise bigram, otherwise unigram

related idea: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff

17
Q

Laplace Smoothing

A
  • add-one: pretend we saw each word one more time than we did -> add one to all counts
  • also: add-k smoothing (add a fraction k instead of 1)
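A minimal sketch of add-one smoothing for bigrams, where V is the vocabulary size: P_add1(w | prev) = (count(prev, w) + 1) / (count(prev) + V).

```python
from collections import Counter

def laplace_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)                    # vocabulary size
    def p(w, prev):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return p

p = laplace_bigram("the cat sat on the mat".split())
print(p("mat", "cat"))   # the unseen bigram ("cat", "mat") still gets non-zero probability
```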
18
Q

Back-off: Linear Interpolation

A

related idea to backoff: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff

-> how to set the lambdas: use a held-out corpus (aside from the training set and test set)
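A minimal sketch of linear interpolation, assuming unigram/bigram/trigram estimates p1, p2, p3 already exist (hypothetical functions) and lambdas tuned on a held-out corpus that sum to 1:

```python
def interpolated_prob(w, u, v, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """P_interp(w | u, v) = l3*p3(w | u, v) + l2*p2(w | v) + l1*p1(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9    # the lambdas must sum to 1
    return l3 * p3(w, u, v) + l2 * p2(w, v) + l1 * p1(w)
```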

19
Q

interpolate?

A

interpolation is a method of constructing new data points within the range of a discrete set of known data points

20
Q

Extrapolation

A

extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results

21
Q

gradient descent

A

tbd!

22
Q

perplexity

A

=> intrinsic evaluation of a language model (vs. extrinsic evaluation, e.g. plugging the LM into a speech recognition or MT system)
-> perplexity is the most common intrinsic metric
=> it is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w_1 ... w_N)^(-1/N)

=> perplexity is a function of the probability of the sentence
=> also: perplexity is the “weighted equivalent branching factor” (also “average branching factor”)
=> in general: the lower the perplexity, the better the model
=> minimizing perplexity is the same as maximizing probability
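A minimal sketch of computing perplexity from per-word log probabilities, working in log space to avoid underflow (the uniform-0.1 example is just a sanity check):

```python
import math

def perplexity(log_probs):
    """log_probs: natural-log probabilities the model assigns to each test-set word."""
    N = len(log_probs)
    return math.exp(-sum(log_probs) / N)   # PP = P(w_1..w_N)^(-1/N)

# A model that gives each of 10 words probability 0.1 has perplexity 10:
print(perplexity([math.log(0.1)] * 10))    # 10.0
```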

23
Q

a better model of a text

A

is one which assigns a higher probability to the word that actually occurs

24
Q

the best language model

A

is one that best predicts an unseen test set

-> gives the highest P(sentence)

25
chain rule
permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities: P(x_1, ..., x_n) = P(x_1) · P(x_2 | x_1) · ... · P(x_n | x_1, ..., x_{n-1})
26
long-distance dependencies/information
language has long-distance dependencies, but in practice N-gram models are mostly sufficient
27
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
28
smoothing
Methods for assigning probability mass to rare events
29
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram; in practice, interpolation works better than backoff
30
interpolation
mix unigram, bigram, trigram; in practice, interpolation works better than backoff
31
Add-one (Laplace) Smoothing
how do we deal with bigrams that have zero probability? -> ok for text classification, not for language modeling
32
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
33
smoothing
Methods for assigning probability mass to rare events
34
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram
35
interpolation
mix unigram, bigram, trigram
36
log-likelihood
tbd
37
dot-product
tbd
38
Kneser-Ney smoothing
tbd
39
Open vs Closed classes
tbd
40
generative vs discriminative model
- generative: models the joint probability of the inputs x and the labels y - discriminative: models the conditional probability P(y | x) - discriminative models are usually harder to train (we have to solve an optimization problem, e.g. with gradient descent); with generative models we usually just count
41
log-linear model
- a conditional probability model - sometimes also called a logistic regression model - a discriminative model - they have a feature function phi(x, y) and a weight vector w 1) model: ... cf 03-19 2) learning: argmax over w 3) prediction: argmax over y
42
Sequence Methods
e.g. HMM -> a generative model (we model the joint probability of input and output)
43
named entity recognition (NER)
a sequence labeling task e.g. twoDigitNum, allCaps, lowercase see 03-24
44
Part of speech tags
- t
45
a sequence of categories
e.g. verb noun conj.verb adverb
46
transition & emission in HMM?
transition: y_i -> y_i+1 emission: y_i -> x_i
47
transition probabilities
--> HMM: Pr(y_i | y_i−1), the probability of the next tag/state given the current one
48
two sequences in HMM
In a hidden Markov model there are two sequences: 1. a Markov chain (the labels, i.e. the y_i’s) 2. the emissions from the Markov chain (the emitted x_i’s)
49
joint probability
A joint probability is a statistical measure where the likelihood of two events occurring together and at the same point in time are calculated. Joint probability is the probability of event Y occurring at the same time event X occurs.
50
HMM
An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence. -> with an HMM we model the joint probability of the input and the output (we also model the probability of seeing a certain sequence of words (the input), not only the tags) -> HMMs make strong assumptions
51
e(x_i,y_i)
emission probability: prob of word x_i given category y_i
52
smoothed estimation (trigram HMM)
1) for transition prob: smooth the transition parameters just like n-grams models using lambda_1,2,3 2) for emission prob: usually pseudowords are used - use training corpus to do different counts (linear interpolation)
53
HMM: Smoothing with Pseudowords
finesse this problem of low-frequency words, or words never seen in training, by closing the vocabulary: map those low-frequency words to a much smaller, finite set of perhaps 20 “word classes” which preserve information about the spellings
54
HMM: Inference
HMM: training, smoothing; next: inference. Problem statement: for a given input sentence, find the most likely sequence of tags (search over all possible tag sequences) -> argmax over sequences of y’s -> but this is exponential in the sentence length => cf. Viterbi algorithm -> tractable solution
55
Viterbi algorithm
a dynamic programming algorithm for finding the most likely sequence of hidden states (sequence of tags y's) that results in a sequence of observed events
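A minimal Viterbi sketch for a bigram HMM, assuming the model is given as plain dictionaries start[y], trans[y_prev][y] and emit[y][x] (these names are my own, not from the lecture):

```python
def viterbi(words, tags, start, trans, emit):
    # V[i][y] = (best score of any tag sequence ending in y at position i, best predecessor)
    V = [{y: (start[y] * emit[y].get(words[0], 0.0), None) for y in tags}]
    for i in range(1, len(words)):
        V.append({})
        for y in tags:
            best_prev, best_score = None, -1.0
            for yp in tags:
                score = V[i - 1][yp][0] * trans[yp][y] * emit[y].get(words[i], 0.0)
                if score > best_score:
                    best_prev, best_score = yp, score
            V[i][y] = (best_score, best_prev)
    # backtrace from the best final state
    y = max(V[-1], key=lambda t: V[-1][t][0])
    path = [y]
    for i in range(len(words) - 1, 0, -1):
        y = V[i][y][1]
        path.append(y)
    return list(reversed(path))
```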
56
Dynamic programming
dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time at the expense of a (hopefully) modest expenditure in storage space
57
HMM makes strong assumptions
- > about how the different predictions in the sequence relate to one another: 1) The word (emitted value) x_i is independent of the rest of the variables given y_i i.e. for each x_i we assume it is sufficient to look at y_i i.e. we don't need to condition on the full sequence of labels -> ideally we want to assume more dependence between the words => BUT: Words are dependent on other words in the sentence, even given their POS => we want to assume more dependence between the words! 2) a POS tag is independent of past tags given the previous tag => BUT: POS tags have dependencies that cannot be bounded in distance
58
feature in log-linear model
- > the features in a log-linear model serve the same function as the emission probabilities in an HMM - > the higher the (weighted) feature value, the higher the probability of a certain y
59
prediction
= tagging
60
1. model 2. inference 3. learning
1. (model) 2. inference: given the observed variables, how do we find the most likely y (cf. Viterbi algorithm) 3. learning: how to find the parameters
61
first-order model
t
62
multinomial
you have a chart / lookup table for each word and label and for each pair of labels (?)
63
log linear model
you have a feature function
64
what's the difference between a feature-based classifier and a classifier which just has a multinomial?
feature-based classifier: if you have completely different sequences of words but they have similar features, there is a connection between the sequences because of the features -> you get some influence between them even though they are not the same. multinomial: you have a chart of labels/words where the entries are independent; it tells you: if you have this word, then it has this value. you don’t model any relation.
65
feature vector
a feature vector consists of, say, m features, many of which can be binary features or “indicator functions” that ask some question
66
gradient ascent
hill-climbing technique for concave functions (whose local maximum is also the global maximum)
67
named entity recognition (NER)
tbd
68
co-reference resolution
t
69
relation extraction
t
70
apposition
t
71
ontology
t
72
supervision
t
73
precision, recall
precision: % of selected items that are correct; recall: % of correct items that are selected (the complementary view to precision)
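A minimal sketch computing precision, recall and (for completeness) F1 from true positive, false positive and false negative counts:

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0   # % of selected items that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # % of correct items that are selected
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(tp=8, fp=2, fn=4))   # (0.8, 0.666..., 0.727...)
```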
74
open information extraction system
should be able to identify the ontology (i.e. the tables) as well as populate the database (it does not have a predefined ontology, e.g. through Wikipedia infoboxes; it only has text!) how does it define relations? just as strings/pieces of text
75
how to cluster semantically equivalent relations? (cf. 11-16: three different relation instances can represent the same relation; syntax does not align with semantics!)
a first step: Semantic Role Labeling (SRL). idea: take a sentence, split it into events and propositions, and be able to identify that this relation is expressed and that it is very similar to other instances of the same type of relation - > idea: the roles are invariant to the different sentences but stay the same for equivalent values, i.e. the role is the same regardless of what syntactic position they take - it’s kind of an abstraction over the sentence-level paraphrasing which you can do
76
Semantic Roles cf. 11-24
- An underlying relationship an argument has with the main verb in a clause - This relation is (ideally) invariant to paraphrases - do not necessarily align with syntactic roles
77
3 main representation schemes for SRL
1) FrameNet scheme: judgment frame; roles: judge, evaluee, reason; the verb “blame” invokes the judgment frame -> very accurate, but coverage is very limited (only a few thousand frames) - semantic roles are defined by the frame - a frame does a clustering of verbs (blame, hail, etc. share the same frame) - is more difficult to parse - frames are manually defined - a big problem: frame coverage
2) PropBank scheme: idea: semantic roles are more fine-grained than in FrameNet - idea: try to abstract over the different instantiations of the same verb - more frequently used than FrameNet; PropBank is more agnostic, does not generalize over verbs at all - more fine-grained: e.g. “She blames ...” and “She praises ...” would be put in the same frame with FrameNet but use 2 different semantic roles in PropBank
3) VerbNet scheme: takes the most coarse-grained approach - only 22 roles -> very generalized; sometimes very difficult to pick the right role because they are very abstractly defined - the crux: use the same set of roles across different classes??
78
semantic role labeling (SRL)
you want to abstract away from the syntax to directly represent the semantic relations, e.g. FrameNet (a big dictionary of frames); several verbs can invoke the same frame (e.g. the judgment frame); frames are manually defined - SRL is a simplified form of semantic representation
79
different type of dictionary: semantic fields
clustering of words that share a meaning component
80
syntactic parse
t
81
error propagation
usually when you separate one prediction task into 2 stages, a situation arises called “error propagation”: if you make a mistake at the first stage, then no matter what you do, you’re already doomed when you get to the second step -> that’s a reason why people don’t use pipelines
82
sparsity
tbd
83
classifier training vs feature?
t
84
precision, recall, F-score
t
85
NLP
- information retrieval - machine translation - question answering - summarization
86
information extraction
map text to database entries
87
Ambiguity
- key challenge - Language has much underlying structure, which is ambiguously expressed - Ambiguity is at all levels: - > words can sound the same but not mean the same (bank vs. bank) - > Sentences may look the same, but have different meanings - Alongside ambiguity, there is also redundancy: many different ways to say the same thing
88
morpheme
is the smallest grammatical unit in a language. In other words, it is the smallest meaningful unit of a language. The linguistics field of study dedicated to morphemes is called morphology. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. When it stands by itself, it is considered as a root because it has a meaning of its own (e.g. the morpheme cat) and when it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (e.g. the –s in cats to indicate that it is plural). Every word comprises one or more morphemes.
89
Shannon’s game (Language Modeling Problem)
given a sequence of words, what’s the distribution of the next word
90
N-gram
An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. We’ll see how to use N-gram models to estimate the probability of the last word of an N-gram given the previous words, and also to assign probabilities to entire sequences => N-grams work well for word prediction only if the test corpus looks like the training corpus!
91
unigram model
=> simplest Markov model: unigram model -> simply estimate prob of a sequence of words by product of probs by individual words (words are independent in this model)
92
bigram model
- Condition on the previous single word - slightly more intelligent than the unigram model - can be extended further to trigram, 4-gram, 5-gram => in general, an insufficient model of language because language has long-distance dependencies
93
markov assumption
a simplifying assumption (because we cannot count how many times a long sequence of words occurs): -> the next state only depends on the current state and is independent of the previous history -> P(the | its water is so transparent that) ≈ P(the | that) or P(the | its water is so transparent that) ≈ P(the | transparent that) more formally: P(w_i | w_1 w_2 ... w_{i-1}) ≈ P(w_i | w_{i-k} ... w_{i-1}) => estimate using just the last few words => simplest Markov model: unigram model -> simply estimate the probability of a sequence of words as the product of the probabilities of the individual words
94
Markov Model
``` The assumption that the probability of a word depends only on the previous word is called a Markov assumption. => Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past => Markov models assume the probability of a sequence can be given as a product of the transition probabilities ```
95
estimate probability of a sentence using bigram
``` P(<s> I want english food </s>) = P(I | <s>) x P(want | I) x P(english | want) x P(food | english) x P(</s> | food) ```
96
why log space
- to avoid arithmetic underflow - adding is faster than multiplying: log(p_1 x p_2 x p_3) = log(p_1) + log(p_2) + log(p_3)
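A tiny demonstration of the underflow point: multiplying many small probabilities collapses to 0.0, while summing their logs stays usable.

```python
import math

probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product)                          # 0.0 -- underflows (true value is 1e-500)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                          # ~ -1151.29, perfectly representable
```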
97
a held-out corpus
A held-out corpus is an additional training corpus that we use to set hyperparameters like these λ values, by choosing the λ values that maximize the likelihood of the held-out corpus
98
how to deal with bigrams with zero probability?
- simplest idea: add-one smoothing
99
text classification
t
100
Discounting Methods
-> Maximum likelihood estimates tend to be too high for low-count items => One simple way to handle this is to discount all counts by a constant term (say, 0.5). But what do we do with the missing probability mass? -> collect it from the discounts made and distribute this mass to bigrams with count 0
101
Kneser-Ney Smoothing
cf P_continuation
102
classification: 3 main ideas
- Representation in a feature space - Scoring by linear functions - Learning by optimizing
103
joint vs. conditional probability
tbd
104
con·di·tion·al prob·a·bil·i·ty
the probability of an event ( A ), given that another ( B ) has already occurred
105
Naive Bayes
- Simplest generative model. Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document
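A minimal sketch (not the lecture’s code) of a multinomial Naive Bayes text classifier with add-one smoothing, scored in log space; the toy documents are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    n_docs, V = sum(class_counts.values()), len(vocab)
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)                    # log prior
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + V))  # add-one likelihood
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([("good great fun".split(), "pos"), ("bad boring awful".split(), "neg")])
print(predict_nb("great fun".split(), *model))   # pos
```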
106
feature space
t
107
bag of words model
the bag of words assumption: the position of the words is ignored => we make use of the frequency of each word
108
supervised learning
for the training data, you know the labels
109
feature function
t
110
likelihood
(or log-likelihood) the probability of seeing the labels given the inputs
111
regularization
to combat overfitting with a log-linear model, we add a regularization term -> idea: penalize large weights w, especially for rarely occurring features
112
neural net
non-linear, non-convex architecture vs. log-linear models: linear, convex
113
feed forward neural network
t
114
neural network
t
115
backpropagation algorithm
t
116
one-hot vector
cf. Feed-forward Neural Network
117
softmax layer
- > a standard way of forming a multi-class classifier - Softmax layers turn vector outputs (scores) into a probability distribution
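A minimal sketch of a numerically stable softmax (subtracting the max score before exponentiating avoids overflow):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ~[0.66, 0.24, 0.10], sums to 1
```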
118
morphological paradigms
t
119
POS tagging problem
The POS tagging problem is to determine the (single) POS tag for a particular word token (instance)
120
main sources of information for POS tagging
1. Knowledge of word probabilities (to man is rare) | 2. Knowledge of neighboring words
121
Sequence Labeling as Classification
aim: predict the category of each word. Approach 1: classify each token independently, but use as input features information about the surrounding tokens (sliding window), without the categories of the neighboring words. Approach 2: using outputs as inputs: better input features are usually the categories of the surrounding tokens (not easy to integrate), but these are not available yet -> use the category of either the preceding or succeeding tokens by going forward or backward and using the previous outputs - > forward classification - > backward classification
122
probabilistic sequence model
- build a model that gives you a prob for the entire sequence - allows integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment => classic solution: Hidden Markov Model (HMM) -> a global model: means it maximizes over sequences rather than local words only i.e. gives a prob for the entire sequence of words
123
hidden markov model (HMM)
an HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model
124
HMM - why hidden?
we didn’t observe part-of-speech tags in the world; we saw words and had to infer the correct tags from the word sequence. We call the part-of-speech tags hidden because they are not observed
125
Trigram HMM
t
126
conditional independence assumption
given x and y_{i-1}, y_i is conditionally independent of everything that preceded it (similar to Markov assumptions) -> in an MEMM, unlike HMMs, the words are always conditioned on; there is no assumption that the words are conditionally independent of each other given the class
127
Markov chain
A Markov chain is a special case of a weighted automaton in which weights are probabilities (the probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through -> A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world
128
first-order Markov chain
A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state
129
discriminative models
- allows us more dependencies between different words (than e.g. HMM) through the use of discriminative models which also allows to use feature representations
130
Maximum Entropy Markov Models (MEMM)
- here, I assume that p(y_i | y_{i-1}, x_1 ... x_m) is a log-linear model - the model is the product of transitions - unlike an HMM, here I don’t assume anything about the independence of the words: I condition on the words, and there is no assumption that words are independent or conditionally independent - 2nd assumption: the structure of each of the factors in the product forms a log-linear model - f = feature function - w = weight vector - the denominator is there to normalize the expression
131
why exponentiate? (e.g. in log-linear model)
t
132
multinomial vs. feature function based model?
- multinomial: you have a chart/table/lookup-table for each word and label: coarse-grained - feature function: fine-grained difference: cf. 20171119_120054[1].mp4 26:00ff
133
history (context: feature function)
why history? - > y_i is conditioned on y_i-1 and all the words = this is called the history - > so it includes everything you condition on, not just observed values but also labels (e.g. the preceding tag)
134
analogy MEMM and HMM (re transition and emission prob)
MEMM: with the feature function we’re trying to find relevant information for defining which y’s are more probable given everything that I condition on - the features with word/tag pairs in the feature function f in an MEMM serve the same purpose as the emission probability in an HMM (of course: if the emission probability is high for a certain y given an x, then the feature function together with the weight vector should also yield a high probability for this y) - some features in f correspond to transition probabilities in an HMM: e.g. f_103 considers the two previous tags. You can have both trigram and bigram transition probabilities in f; you don’t have to restrict yourself to one of them
135
block notation
t
136
Ratnaparkhi’s Tagger
05-05 - f_100 corresponds to the emission probability e(x_i | y_i) in an HMM - f_101 etc. are additional features that an HMM could not cover - you can use trigram, bigram and unigram models in the same feature function set - transition feature f_103 corresponds to the trigram transition probability q(y_i | y_{i-2}, y_{i-1}) in an HMM, transition feature f_104 is the same but bigram, and transition feature f_105 the same but unigram => AND therefore: it corresponds to the backoff smoothing method: if f_103 does not fire then maybe f_104 fires, and so forth -> f_103 would have a very high weight, f_104 a bit less, f_105 again a bit less (like a backoff model) => you can combine any number of features! -> becomes more effective
137
log-linear model: when we have probabilistic classifiers, we also need to care about 2 other things aside from the model?
- we need to train it and to do inference - given training data, we want to estimate the parameters (= the weights w), and we need to know how to do inference: given weight values and x, what is the most likely y
138
MLE in MEMM
sl 05-05
139
differentiate the log likelihood of a log-linear model
sl 05-05
140
inference in MEMMs
pretty much the same as for HMM: use Viterbi sl 05-10/11
141
difference between inference and learning? (HMM, MEMM)
- inference (ie finding the max y's): finding a maximum path on the grid representation - > we have w and want to find the most likely y's - learning:
142
label bias
- a problem with MEMMs, HMMs -> occurs with any locally-normalized model - the model favors labels that predict the next state well over labels that predict the next state poorly; it will be biased towards them because of the high probability => your model predicts an unlikely/infrequent sequence of labels just because their probability does not decay exponentially (e.g. long book names), even though that sequence is very rare - can be harmful, but not always - e.g. in POS tagging it is usually not a problem
143
local normalization
a normalization term as denominator makes sure that the sum of the probabilities of all possible values of y_j is 1; equivalently, on a grid representation: the sum of all outgoing edges of a node is 1
144
prediction vs inference vs learning?
inference := the process of inferring something. a conclusion reached on the basis of evidence and reasoning.
145
Sequence Conditional Random Fields (CRFs)
- CRFs are similar to MEMMs, but the normalization is global - this solves the label bias problem: the scores of the next state given the current one need not sum to 1 - the normalization is not done per transition; rather, the sum over all paths given x sums to 1 - each transition is not a probability anymore but a score (or potential) - you still normalize, but over the sum of all paths - we assume all different paths for a sentence sum to 1; it may be that the outgoing edges of a node do not sum to 1 - CRFs were the standard NLP model until NNs came
146
why normalize at all?
- it makes sure that if you increase the prob of some path you will decrease the prob of others - it serves as a regularization because otherwise you could assign infinity scores, there would be no limit, no boundaries - > normalization makes you work with a fixed budget, then the problem has a unique maximum that you can find
147
how do we train w for CRF?
- how to estimate/train w, how to find the most likely weights? - no closed-form formula - write the conditional log-likelihood and maximize it - > compute the gradient (05-17), but it has an intractable subtrahend term; then transform it so that it has a polynomial term Pr(...) (with the number of tags squared), and compute this term Pr(...) using dynamic programming for all values y_j and y_{j-1} and over all j’s
148
gradient computation CRF
t
149
forward-backward algorithm (CRF)
- > similar to Viterbi but instead of max I take the sum - M_i: unnormalized transition weight (not a probability!) to transition from y to y' alpha: is the sum over all paths of length i-1 that end with y (forward term) beta: same but start with y and go all the way to the end (backward term) - how is it helpful? I can compute it (tractable)
150
domain adaptation
- > trained with wall street journal, then apply it to wikipedia - Accuracies degrade outside of domain - e.g. try to find instances you're most confident in - e.g. use lexicons
151
Sentiment Analysis: Definition
- Sentiment analysis is the detection of attitudes • Simplest task: Is the attitude of this text positive or negative? • More complex: Rank the attitude of this text from 1 to 5 • Advanced: Detect the target, source, or complex attitude types - a simple classifier: Log-linear or Naïve Bayes classifier
152
Negation in Sentiment Analysis
- One practice is to add NOT_ to every word between a negation and the following punctuation (useful if you want to stick to the bag-of-words model, but it still fails: “didn’t like” would still get a positive polarity) - if you have this heuristic of adding NOT_, then instead of “like” you would have “NOT_like” - > but it is still a very crude solution
153
Sentiment Analysis as Sequence Labeling
- This construal of sentiment analysis attempts to capture the meaning of a word in context by encoding (parts of) the sentence as features - Recently: using Recurrent Neural Networks (RNNs) - RNNs are FSAs: 1) No longer finite in the same sense 2) The state is an embedding of the history in a continuous space
154
Recurrent Neural Networks
- Map from dense sequence to dense representation - > idea RNN: save the history as a state 06-17 - R is parametrized and parameters are shared across steps -> parameters of R are learned, it's the history, everything we've learned so far ~ representation learning -> it obviates the need to create a feature function -> let the system find the features -> it is the NN dogma - with RNN, you don't do any global optimization (you compute a y and move to the next) - RNN: no dynamic programming for inference - > the RNN computes a classification for each word on its own - RNN models are able to take the context (both preceding and following) into account, as well as the linear order between the words (bag of words cannot)
155
sequence labeling
t
156
Bi-RNN
- When using RNNs for sentence classification (such as sentiment analysis), it is often practical to use Bi-RNNs - Bi-RNN: 2 RNNs, one going forward and the other backward; the output function is a function of both states - This allows the states associated with each word to encode relevant information from the words following them and preceding them - difference from bag-of-words: the result of the latter is just based on what a word is, with a Bi-RNN it is based on what that word is and the state, which encodes both the words preceding it and the words following it - > it is kind of a modified/enhanced bag-of-words
157
sentiment analysis with Bi-RNN
- bottom line: encode the state/output for each word and somehow aggregate them - one simple way to do sentiment analysis (or other sentence classification) with Bi-RNNs is to average the output sequence and train a binary logistic classifier for predicting the sentiment
158
Neural Language Models
t
159
Character-level Language Models
t
160
what is grammar?
- eg The rules and principles that guide the structure of sentences in language
161
why do we need grammar?
Two main motivations: - Structural Analysis:decomposing complex units into simpler sub-structures can assist learning and generalization - Semantic Information: grammatical structure often reflects semantic structure and distinctions - > Grammatical analysis in NLP involves breaking the text into simpler parts and categorizing them - > Categorization allows information to propagate from one type to another (important!) -> allow insight that plain sequences (either a word is close or distant) do not (whereas with categorization you have a tree where words can be in all sorts of configurations to one another even if they are distant)
162
a word
- Practical: anything that is between two spaces - What about languages that don’t have spaces? - Phonological words - Grammatical words
163
morphology
is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning
164
clause (grammar)
- basic syntactic unit - difficult to define - some kind of description of some activity, state or property - more formally: simplest kind of sentence - a sentence can have several clauses - simple clauses have a predicate, core arguments (usually subjects and objects), and non-core arguments (e.g., location, manner, instrument) - > predicate = some main relation
165
Inter-Clause Relations (3x) (grammar)
- basic element of grammar Three main inter-relations between clauses within a sentence: -> complement clauses: Clauses that serve as an argument in another clause •“John said he will kick the ball” -> relative clauses: Clauses that modify an argument of another clause •“John, who has studied the language for years, speaks German well” -> linked clauses: • After he had studied it for years, John could speak German well
166
Basic Units (grammar)
- basic elements of grammar: 1) noun phrase: - an argument of a clause is generally a noun phrase. Two major classes: 1) Proper nouns or names 2) Common nouns - Noun phrases generally have a head (that determines their category; a word the phrase centers around, which determines its type (usually a noun)) -> a word that determines the semantic type of the whole phrase, and is an essential component of its meaning: The *man* who wasn’t there 2) Syntactic heads: - the (main) verb is considered the head of a clause (in English at least) (take it as a convention) - Arguments and modifiers are considered its dependents: John ran home (“ran” is the head of the clause, “John” and “home” are its dependents) -> Some schemes view all syntactic relations as head-dependent relations
167
Dependency parse
- syntax is represented as head-dependent relations between words (cf tree structure) - Nodes are words, edges are relations - constituents are subtrees, but the words are all the nodes - The arguments and modifiers of a word are its children - Edge labels mark the type of relation - The verb (=predicate) of the main clause is the root => the assumption: every phrase has a head -> this assumption allows me to build a syntactic tree where edges between words represent bi-lexical dependencies
168
predicate
- the thing asserted (usually the verb) | - the property that you are asserting
169
Phrase-based (constituency) parse
- represent the sentence as a collection of nested phrases - > it brackets words into phrases, parent phrases and so on - Words are at the leaves, phrases/constituents are subtrees - Non-terminal labels represent the phrase type. This is determined by the type of the headword (e.g., Noun Phrases (NP) are phrases headed by a noun)
170
what information can you get from syntactic structure? why do we want to use hierarchy or syntactic representation? what’s wrong with just representing sentences as sequences?
- linear structure might lead us astray: two words might be far away from each other but might have a direct dependency between them (“war horses which were bought by the police like to drink”) - the same goes for POS tags or sentiment analysis (e.g. negation scope) -> this syntactic structure helps us achieve all kinds of useful things: do better language modeling, better disambiguation, negation scope - > linear structure is just an approximation - > in order to really model what’s going on we need some kind of hierarchy
171
information from syntactic structure we can get?
- phrases (e.g. for machine translation) - > use a phrase-based parse /constituency tree of the English sentence and then change subtrees in order to get order of words in Japanese (tree transformation) (word permutation only won't help here) - > phrase move around in the tree and not single words - > same would be possible with dependency tree (then you'd move subtrees) - extract grammatical relations - > extract subject, object or all events from a sentence - allows to do syntactic disambiguation - > some are detectable directly from the syntactic tree - a proxy to semantics - > by knowing the syntactic tree I already know quite a bit about of information and can answer semantical questions
172
dependency vs constituency tree
dependency: - constituents are subtrees, but the words are all the nodes - Verbs are heads of clauses, nouns are heads of noun phrases constituency: - Words are at the leaves, phrases/constituents are subtrees - does not have the notion of a headword, it just puts brackets and has categories but does not tell you the headword - > however, you can extract the headword based on the categories in the tree - > basically it is possible to convert constituency trees into dependency trees (vice versa is trickier)
173
head
the head is the word that determines the type of the phrase
174
examples dependency trees (universal dependency format)
cf | 08-syntax_von_pptx_1210.pdf, sl. 30,31
175
Graph-Based Dependency Parsing
- task of taking a piece of text and generating a dependency tree for it - > what learning algorithm do we use?
176
sources of information for dependency parsing
- Bi-lexical affinities - Dependency distance (mostly nearby words) - Intervening material (e.g. a verb is unlikely to appear between two dependent words) - valency (how many dependents a word has, e.g. give has two objects)
177
dependency parsing evaluation (evaluation measures)
- accuracy = #correct edges / #number of words | - labeled/unlabeled attachment score
178
Graph-based Parsing (context: dependency tree)
- a "structured prediction problem" | - MST Parser: minimum spanning tree (same algorithm as maximum spanning tree)
179
MST parser
MST: minimum spanning tree (same algorithm as maximum spanning tree) 1) build a complete directed graph (except into ROOT) - > each spanning tree rooted in ROOT is a possible parse tree - > gives the space of possible parses - > first assign a weight to each edge: how likely it is to appear in the MST - how do I score the edges? - find a model that gives a weight to each edge (whether it is likely or not) based on information that tells us if this edge is relevant - we use a feature function phi (usually binary functions) and a weight vector to assign scores to edges - usually POS tagging is part of the preprocessing for an MST parser (then you can use POS tags in the feature function)
180
MST parser: inference and learning
inference: finding the maximum directed spanning tree - > score each edge based on its features (Chu-Liu/Edmonds algorithm) learning: - use a log-linear model: Pr(T) over all possible trees - using the gradient of the log-likelihood I can find the optimal theta that maximizes the log-likelihood - > but computing the gradient is computationally expensive (we have to sum over all possible trees for V) - > the common way to do the optimization/find the most likely theta is not by computing the gradient but by using the averaged perceptron algorithm
181
averaged/structured perceptron algorithm
- sl 08-13 - > a (supervised) learning algorithm (for learning theta)! - > the learned theta is used in Pr(T) - > Pr(T) is used to find the (best/most likely) dependency tree for a sentence - instead of taking the expected value, take the maximum value/tree (done by inference: you compute T' (the maximum spanning tree) using inference; T_i is the correct tree from the training data) - learning of theta is done as follows: you take the sum over all the edges in the correct tree (gold standard) minus the sum over all the edges in the predicted tree
182
higher-order features (graph-based parsing)
- normally features are based on a single edge alone - higher-order features are types of features that cannot be encoded at the edge level alone - MST inference works well when the model factorizes over edges e.g. sibling relations: -> features defined over two edges that have the same parent grandparent features: -> features defined over two edges, where one’s child is the parent of the other valency: -> the number of children a node has, defined over all edges sharing a parent => formally, the score function factorizes over sets of edges problem: you cannot run the MST algorithm anymore - > if you want to take a scoring function that is over edge pairs or triplets, you cannot run exact inference anymore (it becomes intractable), but approximate methods exist for computing inference
183
transition-based parsers
- In transition-based parsing, the problem is decomposed to finding a sequence of transitions - we don't assume any grammar or context in transition based parsing
184
arc-standard parsers
- you have a buffer and a stack and a transition set, you build edges and nodes - transition set T: SHIFT, LEFT-ARC_l, RIGHT-ARC_l - can only parse projective trees - > straightforward way to define a transition system - bottom-up approach
185
arc-eager transition system
- you have a buffer and a stack and a transition set - more incremental ~ it is more eager to create arcs - transition set T: SHIFT, LEFT-ARC_l, RIGHT-ARC_l, REDUCE - complexity at most 2n - Builds left-dependents bottom-up and right-dependents top-down - arc-standard has to find whole subtree in order to attach it arc-eager does not - we don't know yet how to choose the next transition - arc-eager does top-down and bottom-up parsing
186
projectivity
- intuitive: a tree in which there are no crossing arcs - > i.e. all sub-trees are sub-strings - projective trees are much easier to handle (crossing edges make it much more difficult)
187
transition systems
- Transitions operate on the parser configuration (or state): c = (E,B,A) A transition system is defined as: S = (C, T , c_s , C_t ) - transition sequence - > we transition from one configuration to next by transition - transition: a function from configuration to configuration
188
greedy-transition based parsing
- want to find the most probable sequence of transitions - in greedy parsing, instead of modeling this joint probability (the whole sequence i...m which maximizes this product sl 09-21), we make the assumption that its enough to choose the best transition each time given what we have so far (you make that decision and can't change it - that's why it's greedy) - a score s(t, c) estimates this probability where t= a candidate transition c= current configuration
189
inference in greedy-transition based parsing?
argmax over t_1 ... t_m of ∏_{i=1}^{m} P(t_i | c_{i-1})
190
oracle for arc-standard/arc-eager
- a rule-based algorithm, does not need training | - it gets as input a correct tree
191
feedforward vs recurrent neural networks
t
192
transition classifiers (kinds of score functions)
- linear transition classifier - NN transition classifier - Bi-RNN encoder
193
error propagation in transition-based parsing
- once wrong transition is picked, then recovery very difficult since parsing is greedy
194
beam search algorithm
- compromise between greedy and exhaustive search | - k-beam
195
why is parsing hard?
- An exponential number of options per sentence - Ambiguity - Many many rules to learn (exceptions) - Language speakers have no explicit knowledge of the rules of grammar
196
Constituency Parsing and Context-free Grammars (CFGs)
- you get annotated trees (e.g. from a treebank) and you derive the parsing rules from them
197
Probabilistic CFG (PCFG)
- not a deterministic parse - think of it as a multinomial (each symbol has 1-n possible derivations) - each tree has a probability - independence assumption: The problem is that the independence assumptions are very strong: Each production is independent of any other production e. g.: - All clauses have the same distribution
198
A sequence of rules is the same as a parse tree
- the two representations are equivalent
199
what's the idea in context free grammar?
- rules are context-free: they ignore context | - each rule is independent of the others
200
Grammar-based vs. Discriminative Approaches to Parsing
t
201
model - learning - inference
t
202
constituency parsing problem
you want to find the tree with the highest score
203
Chomsky Normal Form (CNF)
- A common binarization is the Chomsky normal form, a grammar where each production is either of the form A -> B C or A -> w, where A, B and C are non-terminals and w is a terminal - > it is often useful to treat all CFG rules as binary - > every CFG grammar can be converted to a CNF grammar that generates the same sentences
204
Cocke-Kasami-Younger (CKY) Constituency Parsing (algorithm)
- input assumed to be in CNF - > a recognizer that will only determine if a sentence is in the language - > It is simple to extend it into a parser that also constructs a parse tree
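A minimal CKY recognizer sketch, assuming the CNF grammar is given as lists of binary rules (A, B, C) for A -> B C and lexical rules (A, w) for A -> w; the toy grammar is hypothetical:

```python
def cky_recognize(words, binary_rules, lexical_rules, start="S"):
    n = len(words)
    # chart[i][j]: set of non-terminals that can derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {A for (A, word) in lexical_rules if word == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                    # split point
                for (A, B, C) in binary_rules:
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]

binary = [("S", "NP", "VP"), ("VP", "V", "NP")]
lexical = [("NP", "she"), ("NP", "fish"), ("V", "eats")]
print(cky_recognize("she eats fish".split(), binary, lexical))   # True
```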
205
Head Lexicalization
- to overcome attachment ambiguity - trees are only different in terms of the different rules applied -> nothing relates the words, it’s only unlexicalized rules => a deficiency of the vanilla PCFG -> head lexicalization is a standard way to overcome this -> assume each constituent has a headword -> add headwords to each phrasal node -> now you have more info at the rule node because the headword is the most informative word of that subtree/constituent (*) -> I’ll need to retrain my PCFG but it will cause huge sparsity problems: not just, say, 15 non-terminals, but 15 times the size of the vocabulary -> horrible sparsity => what to do now? use e.g. the Collins parser. bottom line: lexicalization can resolve ambiguities and lead to a huge boost in performance - the cost is huge sparsity (*) The head word of a phrase gives a good representation of the phrase’s structure and meaning
206
headword
the most informative word/element in the subtree
207
the Collins parser
- keep the head lexicalization - to get over sparsity (introduced by head lexicalization), let's make more (conditional) independence assumptions - > each rule is independent of the other - > each rule is not just a multinomial, it is a product of several terms - > no dependency between different siblings => This model still has many parameters, but it reduces each of the terms in the product to depend on 2-3 lexicalized categories, instead of n+m+2 for a regular lexicalized PCFG of this model (cf. 10-36)
208
learning and inference in constituency parsing
tbd
209
Unsupervised Learning
- paradigm of supervised learning: you have labeled data (you know the right answer) - clustering: the main technique for generalization without having a labeled set - we want to cluster words according to their part of speech without having annotated data - discriminative models would not make sense: we don’t have labels (working with the relation between labels and observed features is not possible) - a generative model can be useful
210
POS Tagging: a structured prediction task
t
211
unsupervised POS induction
Given a large corpus of raw text, partition the words into categories with respect to their grammatical function POS tags are context-dependent -> makes the prediction much more difficult (e.g. the word 'back') -> words will receive one POS tag only but if the most likely is taken the main tag will account for more than 90% of the probability mass (type-level clustering)
212
Supervised vs. (Completely) Unsupervised Learning
t
213
Distributional Clustering
-> words will receive one POS tag only but if the most likely is taken the main tag will account for more than 90% of the probability mass (type-level clustering)
214
Unsupervised learning for Hidden Markov Models
- adjectives have similar distribution (approx. definition of a POS tag) - nouns also -> POS Induction through PCA over word co-occurrence Matrices
215
type vs token (contexts POS)
tokens are instances of type
216
The Prototype Tagger
- another type-level POS induction algorithm
217
structured prediction
we label the entire sequence
218
1-to-1 accuracy measurement
each class is assigned exactly one cluster
219
m-to-1 accuracy measurement
a class may be assigned to multiple clusters
220
Markov Chain
- simplest Markov model | - It models the state of a system with a random variable that changes through time
221
conditional likelihood
In discriminative models we often talk about the conditional likelihood