NLP Flashcards

1
Q

Summarization

A

idea?

2
Q

Ambiguity

A

t

3
Q

morpheme

A

a meaningful morphological unit of a language that cannot be further divided (e.g. in, come, -ing, forming incoming ).

4
Q

morphological

A

relating to the form or structure of things.

5
Q

Zipf’s law

A

Zipf’s law states that given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So the word at rank N has a frequency proportional to 1/N.
Thus the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
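A minimal sketch (not from the lecture) of how one might check Zipf’s law empirically: count word frequencies with Python’s collections.Counter and compare each observed frequency to freq(top word) / rank. The corpus file name is hypothetical.

```python
from collections import Counter

def zipf_check(tokens):
    counts = Counter(tokens)
    ranked = counts.most_common()        # [(word, freq), ...], most frequent first
    top_freq = ranked[0][1]
    for rank, (word, freq) in enumerate(ranked[:20], start=1):
        predicted = top_freq / rank      # Zipf: frequency roughly proportional to 1/rank
        print(f"{rank:>3}  {word:<15} observed={freq:<6} zipf_predicted={predicted:.1f}")

# zipf_check(open("corpus.txt").read().lower().split())  # "corpus.txt" is a hypothetical file
```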

6
Q

arg max or argmax

A

the arguments of the maxima are the points of the domain of some function at which the function values are maximized

In other words, arg max is the set of points, x, for which f(x) attains the function’s largest value (if it exists). Arg max may be the empty set, a singleton, or contain multiple elements.
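A small illustrative sketch (my own, not from the cards): arg max over a finite domain, returned as a set since it may contain several points.

```python
def argmax_set(f, domain):
    """Return the set of points in `domain` where f attains its largest value."""
    best = max(f(x) for x in domain)
    return {x for x in domain if f(x) == best}

print(argmax_set(lambda x: -(x - 2) ** 2, range(-5, 6)))   # {2}        (singleton)
print(argmax_set(lambda x: x % 3, range(9)))               # {2, 5, 8}  (multiple maximizers)
```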

7
Q

Language model?

A
  • Models that assign probabilities to sequences of words
  • the simplest model that assigns probabilities to sentences and sequences of words is the N-gram LM
  • Goal: Assign useful probabilities P(x) to sentences x
  • related task: assign the probability P(w_i+1 | w_1, ..., w_i) of an upcoming word

-> Probabilities should broadly indicate plausibility of sentences

8
Q

Noisy Channel Model

A
  • Goal: predict the sentence given the acoustics (speech recognition)
  • noisy channel decomposition: ŵ = argmax_w P(w | acoustics) = argmax_w P(acoustics | w) · P(w), i.e. channel model × language model (prior)
9
Q

Unigram, bigram, trigram

A

t

10
Q

stupid backoff

A

t

11
Q

Prior probability

A

a probability as assessed before making reference to certain relevant observations, especially subjectively or on the assumption that all possible outcomes be given the same probability.

the prior of an uncertain quantity is the probability distribution that would express one’s beliefs about this quantity before some evidence is taken into account

12
Q

Channel model probability

A

t

13
Q

Markov Model / Markov Chain

A

A finite state automaton with probabilistic state transitions

Makes the Markov assumption that the next state depends only on the current state and is independent of the previous history

14
Q

Higher-order Markov Language Models

A

k-gram models are equivalent to bi-gram models whose state space consists of (k−1)-tuples of the bi-gram state space (the words)

15
Q

Maximum likelihood estimator

A

P_MLE(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})

  • it is the estimate that makes it most likely that the word w_i will occur x times in a y-million-word corpus
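A minimal sketch of the MLE bigram estimate above, counting unigrams and bigrams from a tokenized corpus (the toy sentence is just an illustration):

```python
from collections import Counter

def mle_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def p(w, prev):
        # P_MLE(w | prev) = count(prev, w) / count(prev)
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

p = mle_bigram("the cat sat on the mat".split())
print(p("cat", "the"))   # 0.5 -- "the" occurs twice, once followed by "cat"
```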
16
Q

Back-off Models

A
  • use trigram if you have good evidence, otherwise bigram, otherwise unigram

related idea: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff

17
Q

Laplace Smoothing

A
  • add-one: pretend we saw each word one more time than we did -> add one to all counts
  • also: add-k smoothing (add a fraction k instead of 1)
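A minimal sketch of add-one smoothing for bigrams, where V is the vocabulary size: P_add1(w | prev) = (count(prev, w) + 1) / (count(prev) + V).

```python
from collections import Counter

def laplace_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)                    # vocabulary size
    def p(w, prev):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return p

p = laplace_bigram("the cat sat on the mat".split())
print(p("mat", "cat"))   # the unseen bigram ("cat", "mat") still gets non-zero probability
```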
18
Q

Back-off: Linear Interpolation

A

related idea to backoff: interpolation -> mix unigram, bigram and trigram (mix all three all the time)
-> most of the time interpolation works better than backoff

-> how to set the lambdas: use a held-out corpus (aside from the training set and test set)
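A minimal sketch of linear interpolation, assuming unigram/bigram/trigram estimates p1, p2, p3 already exist (hypothetical functions) and lambdas tuned on a held-out corpus that sum to 1:

```python
def interpolated_prob(w, u, v, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """P_interp(w | u, v) = l3*p3(w | u, v) + l2*p2(w | v) + l1*p1(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9    # the lambdas must sum to 1
    return l3 * p3(w, u, v) + l2 * p2(w, v) + l1 * p1(w)
```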

19
Q

interpolate?

A

interpolation is a method of constructing new data points within the range of a discrete set of known data points

20
Q

Extrapolation

A

extrapolation is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results

21
Q

gradient descent

A

tbd!

22
Q

perplexity

A

=> intrinsic evaluation of a language model (vs. extrinsic evaluation, e.g. plugging the LM into a speech recognition or MT system)
-> perplexity is the most common intrinsic metric
=> it is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w_1 ... w_N)^(-1/N)

=> perplexity is a function of the probability of the sentence
=> also: perplexity is the “weighted equivalent branching factor” (also “average branching factor”)
=> in general: the lower the perplexity, the better the model
=> minimizing perplexity is the same as maximizing probability
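A minimal sketch of computing perplexity from per-word log probabilities, working in log space to avoid underflow (the uniform-0.1 example is just a sanity check):

```python
import math

def perplexity(log_probs):
    """log_probs: natural-log probabilities the model assigns to each test-set word."""
    N = len(log_probs)
    return math.exp(-sum(log_probs) / N)   # PP = P(w_1..w_N)^(-1/N)

# A model that gives each of 10 words probability 0.1 has perplexity 10:
print(perplexity([math.log(0.1)] * 10))    # 10.0
```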

23
Q

a better model of a text

A

is one which assigns a higher probability to the word that actually occurs

24
Q

the best language model

A

is one that best predicts an unseen test set

-> gives the highest P(sentence)

25
chain rule
permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities: P(x_1, ..., x_n) = P(x_1) · P(x_2 | x_1) · ... · P(x_n | x_1, ..., x_{n-1})
26
long-distance dependencies/information
language has long-distance dependencies, but in practice N-gram models are mostly sufficient
27
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
28
smoothing
Methods for assigning probability mass to rare events
29
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram; in practice, interpolation works better than backoff
30
interpolation
mix unigram, bigram, trigram; in practice, interpolation works better than backoff
31
Add-one (Laplace) Smoothing
how do we deal with bigrams that have zero probability? -> ok for text classification, not for language modeling
32
how do we deal with bigrams that have zero probability?
simplest idea: Add-one (Laplace) smoothing
33
smoothing
Methods for assigning probability mass to rare events
34
backoff
use trigram if you have good evidence, otherwise back off and use bigram, otherwise unigram
35
interpolation
mix unigram, bigram, trigram
36
log-likelihood
tbd
37
dot-product
tbd
38
Kneser-Ney smoothing
tbd
39
Open vs Closed classes
tbd
40
generative vs discriminative model
- generative: models the joint probability of the inputs x and the labels y - discriminative: models the conditional probability P(y | x) - discriminative models are usually harder to train (we have to solve an optimization problem, e.g. with gradient descent); with generative models we usually just count
41
log-linear model
- a conditional probability model - sometimes also called a logistic regression model - a discriminative model - they have a feature function phi(x, y) and a weight vector w 1) model: ... cf 03-19 2) learning: argmax over w 3) prediction: argmax over y
42
Sequence Methods
e.g. HMM -> a generative model (we model the joint probability of input and output)
43
named entity recognition (NER)
a sequence labeling task e.g. twoDigitNum, allCaps, lowercase see 03-24
44
Part of speech tags
- t
45
a sequence of categories
e.g. verb noun conj.verb adverb
46
transition & emission in HMM?
transition: y_i -> y_i+1 emission: y_i -> x_i
47
transition probabilities
--> HMM: Pr(y_i | y_i−1), the probability of the next tag/state given the current one
48
two sequences in HMM
In a hidden Markov model there are two sequences: 1. a Markov chain (the labels, i.e. the y_i’s) 2. the emissions from the Markov chain (the emitted x_i’s)
49
joint probability
A joint probability is a statistical measure where the likelihood of two events occurring together and at the same point in time are calculated. Joint probability is the probability of event Y occurring at the same time event X occurs.
50
HMM
An HMM is a probabilistic sequence model: given a sequence of units (words, letters, morphemes, sentences, whatever), it computes a probability distribution over possible sequences of labels and chooses the best label sequence. -> with an HMM we model the joint probability of the input and the output (we also model the probability of seeing a certain sequence of words (the input), not only the tags) -> HMMs make strong assumptions
51
e(x_i,y_i)
emission probability: prob of word x_i given category y_i
52
smoothed estimation (trigram HMM)
1) for transition prob: smooth the transition parameters just like n-grams models using lambda_1,2,3 2) for emission prob: usually pseudowords are used - use training corpus to do different counts (linear interpolation)
53
HMM: Smoothing with Pseudowords
finesse this problem of low-frequency words, or words never seen in training, by closing the vocabulary: map those low-frequency words to a much smaller, finite set of perhaps 20 “word classes” which preserve information about the spellings
54
HMM: Inference
HMM: training, smoothing; next: inference. Problem statement: for a given input sentence, find the most likely sequence of tags (search over all possible tag sequences) -> argmax over sequences of y’s -> but this is exponential in the sentence length => cf. Viterbi algorithm -> tractable solution
55
Viterbi algorithm
a dynamic programming algorithm for finding the most likely sequence of hidden states (sequence of tags y's) that results in a sequence of observed events
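A minimal Viterbi sketch for a bigram HMM, assuming the model is given as plain dictionaries start[y], trans[y_prev][y] and emit[y][x] (these names are my own, not from the lecture):

```python
def viterbi(words, tags, start, trans, emit):
    # V[i][y] = (best score of any tag sequence ending in y at position i, best predecessor)
    V = [{y: (start[y] * emit[y].get(words[0], 0.0), None) for y in tags}]
    for i in range(1, len(words)):
        V.append({})
        for y in tags:
            best_prev, best_score = None, -1.0
            for yp in tags:
                score = V[i - 1][yp][0] * trans[yp][y] * emit[y].get(words[i], 0.0)
                if score > best_score:
                    best_prev, best_score = yp, score
            V[i][y] = (best_score, best_prev)
    # backtrace from the best final state
    y = max(V[-1], key=lambda t: V[-1][t][0])
    path = [y]
    for i in range(len(words) - 1, 0, -1):
        y = V[i][y][1]
        path.append(y)
    return list(reversed(path))
```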
56
Dynamic programming
dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions. The next time the same subproblem occurs, instead of recomputing its solution, one simply looks up the previously computed solution, thereby saving computation time at the expense of a (hopefully) modest expenditure in storage space
57
HMM makes strong assumptions
- > about how the different predictions in the sequence relate to one another: 1) The word (emitted value) x_i is independent of the rest of the variables given y_i i.e. for each x_i we assume it is sufficient to look at y_i i.e. we don't need to condition on the full sequence of labels -> ideally we want to assume more dependence between the words => BUT: Words are dependent on other words in the sentence, even given their POS => we want to assume more dependence between the words! 2) a POS tag is independent of past tags given the previous tag => BUT: POS tags have dependencies that cannot be bounded in distance
58
feature in log-linear model
- > the features in a log-linear model serve the same function as the emission probabilities in an HMM - > the higher the (weighted) feature value, the higher the probability of a certain y
59
prediction
= tagging
60
1. model 2. inference 3. learning
1. (model) 2. inference: given the observed variables, how do we find the most likely y (cf. Viterbi algorithm) 3. learning: how to find the parameters
61
first-order model
t
62
multinomial
you have a chart / lookup table for each word and label and for each pair of labels (?)
63
log linear model
you have a feature function
64
what's the difference between a feature-based classifier and a classifier which just has a multinomial?
feature-based classifier: if you have completely different sequences of words but they have similar features, there is a connection between the sequences because of the features -> you get some influence between them even though they are not the same. multinomial: you have a chart of labels/words where the entries are independent; it tells you: if you have this word, then it has this value. you don’t model any relation.
65
feature vector
a feature vector consists of, say, m features, many of which can be binary features or “indicator functions” that ask some question
66
gradient ascent
hill-climbing technique for concave functions (whose local maximum is also the global maximum)
67
named entity recognition (NER)
tbd
68
co-reference resolution
t
69
relation extraction
t
70
apposition
t
71
ontology
t
72
supervision
t
73
precision, recall
precision: % of selected items that are correct; recall: % of correct items that are selected (the complementary view to precision)
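A minimal sketch computing precision, recall and (for completeness) F1 from true positive, false positive and false negative counts:

```python
def prf(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0   # % of selected items that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # % of correct items that are selected
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(tp=8, fp=2, fn=4))   # (0.8, 0.666..., 0.727...)
```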
74
open information extraction system
should be able to identify the ontology (i.e. the tables) as well as populate the database (it does not have a predefined ontology, e.g. through Wikipedia infoboxes; it only has text!) how does it define relations? just as strings/pieces of text
75
how to cluster semantically equivalent relations? (cf. 11-16: three different relation instances can represent the same relation; syntax does not align with semantics!)
a first step: Semantic Role Labeling (SRL). idea: take a sentence, split it into events and propositions, and be able to identify that this relation is expressed and that it is very similar to other instances of the same type of relation - > idea: the roles are invariant to the different sentences but stay the same for equivalent values, i.e. the role is the same regardless of what syntactic position they take - it’s kind of an abstraction over the sentence-level paraphrasing which you can do
76
Semantic Roles cf. 11-24
- An underlying relationship an argument has with the main verb in a clause - This relation is (ideally) invariant to paraphrases - do not necessarily align with syntactic roles
77
3 main representation schemes for SRL
1) FrameNet scheme: judgment frame; roles: judge, evaluee, reason; the verb “blame” invokes the judgment frame -> very accurate, but coverage is very limited (only a few thousand frames) - semantic roles are defined by the frame - a frame does a clustering of verbs (blame, hail, etc. share the same frame) - is more difficult to parse - frames are manually defined - a big problem: frame coverage
2) PropBank scheme: idea: semantic roles are more fine-grained than in FrameNet - idea: try to abstract over the different instantiations of the same verb - more frequently used than FrameNet; PropBank is more agnostic, does not generalize over verbs at all - more fine-grained: e.g. “She blames ...” and “She praises ...” would be put in the same frame with FrameNet but use 2 different semantic roles in PropBank
3) VerbNet scheme: takes the most coarse-grained approach - only 22 roles -> very generalized; sometimes very difficult to pick the right role because they are very abstractly defined - the crux: use the same set of roles across different classes??
78
semantic role labeling (SRL)
you want to abstract away from the syntax to directly represent the semantic relations, e.g. FrameNet (a big dictionary of frames); several verbs can invoke the same frame (e.g. the judgment frame); frames are manually defined - SRL is a simplified form of semantic representation
79
different type of dictionary: semantic fields
clustering of words that share a meaning component
80
syntactic parse
t
81
error propagation
usually when you separate one prediction task into 2 stages, a situation arises called “error propagation”: if you make a mistake at the first stage, then no matter what you do, you’re already doomed when you get to the second step -> that’s a reason why people don’t use pipelines
82
sparsity
tbd
83
classifier training vs feature?
t
84
precision, recall, F-score
t
85
NLP
- information retrieval - machine translation - question answering - summarization
86
information extraction
map text to database entries
87
Ambiguity
- key challenge - Language has much underlying structure, which is ambiguously expressed - Ambiguity is at all levels: - > words can sound the same but not mean the same (bank vs. bank) - > Sentences may look the same, but have different meanings - Alongside ambiguity, there is also redundancy: many different ways to say the same thing
88
morpheme
is the smallest grammatical unit in a language. In other words, it is the smallest meaningful unit of a language. The linguistics field of study dedicated to morphemes is called morphology. A morpheme is not identical to a word, and the principal difference between the two is that a morpheme may or may not stand alone, whereas a word, by definition, is freestanding. When it stands by itself, it is considered as a root because it has a meaning of its own (e.g. the morpheme cat) and when it depends on another morpheme to express an idea, it is an affix because it has a grammatical function (e.g. the –s in cats to indicate that it is plural). Every word comprises one or more morphemes.
89
Shannon’s game (Language Modeling Problem)
given a sequence of words, what’s the distribution of the next word
90
N-gram
An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence of words like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please turn your”, or “turn your homework”. We’ll see how to use N-gram models to estimate the probability of the last word of an N-gram given the previous words, and also to assign probabilities to entire sequences => N-grams work well for word prediction only if the test corpus looks like the training corpus!
91
unigram model
=> simplest Markov model: unigram model -> simply estimate prob of a sequence of words by product of probs by individual words (words are independent in this model)
92
bigram model
- Condition on the previous single word - slightly more intelligent than the unigram model - can be extended further to trigram, 4-gram, 5-gram => in general, an insufficient model of language because language has long-distance dependencies
93
markov assumption
a simplifying assumption (because we cannot count how many times a long sequence of words occurs): -> the next state only depends on the current state and is independent of the previous history -> P(the | its water is so transparent that) ≈ P(the | that) or P(the | its water is so transparent that) ≈ P(the | transparent that) more formally: P(w_i | w_1 w_2 ... w_{i-1}) ≈ P(w_i | w_{i-k} ... w_{i-1}) => estimate using just the last few words => simplest Markov model: unigram model -> simply estimate the probability of a sequence of words as the product of the probabilities of the individual words
94
Markov Model
``` The assumption that the probability of a word depends only on the previous word is called a Markov assumption. => Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past => Markov models assume the probability of a sequence can be given as a product of the transition probabilities ```
95
estimate probability of a sentence using bigram
``` P(<s> I want english food </s>) = P(I | <s>) x P(want | I) x P(english | want) x P(food | english) x P(</s> | food) ```
96
why log space
- to avoid arithmetic underflow - adding is faster than multiplying: log(p_1 x p_2 x p_3) = log(p_1) + log(p_2) + log(p_3)
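A tiny demonstration of the underflow point: multiplying many small probabilities collapses to 0.0, while summing their logs stays usable.

```python
import math

probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
print(product)                          # 0.0 -- underflows (true value is 1e-500)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                          # ~ -1151.29, perfectly representable
```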
97
a held-out corpus
A held-out corpus is an additional training corpus that we use to set hyperparameters like these λ values, by choosing the λ values that maximize the likelihood of the held-out corpus
98
how to deal with bigrams with zero probability?
- simplest idea: add-one smoothing
99
text classification
t
100
Discounting Methods
-> Maximum likelihood estimates tend to be too high for low-count items => One simple way to handle this is to discount all counts by a constant term (say, 0.5). But what do we do with the missing probability mass? -> collect it from the discounts made and distribute this mass to bigrams with count 0
101
Kneser-Ney Smoothing
cf P_continuation
102
classification: 3 main ideas
- Representation in a feature space - Scoring by linear functions - Learning by optimizing
103
joint vs. conditional probability
tbd
104
con·di·tion·al prob·a·bil·i·ty
the probability of an event ( A ), given that another ( B ) has already occurred
105
Naive Bayes
- Simplest generative model. Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document
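A minimal sketch (not the lecture’s code) of a multinomial Naive Bayes text classifier with add-one smoothing, scored in log space; the toy documents are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(tokens, class_counts, word_counts, vocab):
    n_docs, V = sum(class_counts.values()), len(vocab)
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / n_docs)                    # log prior
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + V))  # add-one likelihood
        if score > best_score:
            best, best_score = c, score
    return best

model = train_nb([("good great fun".split(), "pos"), ("bad boring awful".split(), "neg")])
print(predict_nb("great fun".split(), *model))   # pos
```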
106
feature space
t
107
bag of words model
the bag of words assumption: the position of the words is ignored => we make use of the frequency of each word
108
supervised learning
for the training data, you know the labels
109
feature function
t
110
likelihood
(or log-likelihood) the probability of seeing the labels given the inputs
111
regularization
to combat overfitting with a log-linear model, we add a regularization term -> idea: penalize large weights w, especially for rarely occurring features
112
neural net
non-linear, non-convex architecture vs. log-linear models: linear, convex
113
feed forward neural network
t
114
neural network
t
115
backpropagation algorithm
t
116
one-hot vector
cf. Feed-forward Neural Network
117
softmax layer
- > a standard way of forming a multi-class classifier - Softmax layers turn vector outputs (scores) into a probability distribution
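A minimal sketch of a numerically stable softmax (subtracting the max score before exponentiating avoids overflow):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ~[0.66, 0.24, 0.10], sums to 1
```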
118
morphological paradigms
t
119
POS tagging problem
The POS tagging problem is to determine the (single) POS tag for a particular word token (instance)
120
main sources of information for POS tagging
1. Knowledge of word probabilities (to man is rare) | 2. Knowledge of neighboring words
121
Sequence Labeling as Classification
aim: predict the category of each word. Approach 1: classify each token independently, but use as input features information about the surrounding tokens (sliding window), without the categories of the neighboring words. Approach 2: using outputs as inputs: better input features are usually the categories of the surrounding tokens (not easy to integrate), but these are not available yet -> use the category of either the preceding or succeeding tokens by going forward or backward and using the previous outputs - > forward classification - > backward classification
122
probabilistic sequence model
- build a model that gives you a prob for the entire sequence - allows integrating uncertainty over multiple, interdependent classifications and collectively determine the most likely global assignment => classic solution: Hidden Markov Model (HMM) -> a global model: means it maximizes over sequences rather than local words only i.e. gives a prob for the entire sequence of words
123
hidden markov model (HMM)
an HMM allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model
124
HMM - why hidden?
we didn’t observe part-of-speech tags in the world; we saw words and had to infer the correct tags from the word sequence. We call the part-of-speech tags hidden because they are not observed
125
Trigram HMM
t
126
conditional independence assumption
given x and y_{i-1}, y_i is conditionally independent of everything that preceded it (similar to Markov assumptions) -> in an MEMM, unlike HMMs, the words are always conditioned on; there is no assumption that the words are conditionally independent of each other given the class
127
Markov chain
A Markov chain is a special case of a weighted automaton in which weights are probabilities (the probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through -> A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world
128
first-order Markov chain
A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state
129
discriminative models
- allows us more dependencies between different words (than e.g. HMM) through the use of discriminative models which also allows to use feature representations
130
Maximum Entropy Markov Models (MEMM)
- here, I assume that p(y_i | y_{i-1}, x_1 ... x_m) is a log-linear model - the model is the product of transitions - unlike an HMM, here I don’t assume anything about the independence of the words: I condition on the words, and there is no assumption that words are independent or conditionally independent - 2nd assumption: the structure of each of the factors in the product forms a log-linear model - f = feature function - w = weight vector - the denominator is there to normalize the expression
131
why exponentiate? (e.g. in log-linear model)
t
132
multinomial vs. feature function based model?
- multinomial: you have a chart/table/lookup-table for each word and label: coarse-grained - feature function: fine-grained difference: cf. 20171119_120054[1].mp4 26:00ff
133
history (context: feature function)
why history? - > y_i is conditioned on y_i-1 and all the words = this is called the history - > so it includes everything you condition on, not just observed values but also labels (e.g. the preceding tag)
134
analogy MEMM and HMM (re transition and emission prob)
MEMM: with the feature function we’re trying to find relevant information for defining which y’s are more probable given everything that I condition on - the features with word/tag pairs in the feature function f in an MEMM serve the same purpose as the emission probability in an HMM (of course: if the emission probability is high for a certain y given an x, then the feature function together with the weight vector should also yield a high probability for this y) - some features in f correspond to transition probabilities in an HMM: e.g. f_103 considers the two previous tags. You can have both trigram and bigram transition probabilities in f; you don’t have to restrict yourself to one of them
135
block notation
t
136
Ratnaparkhi’s Tagger
05-05 - f_100 corresponds to the emission probability e(x_i | y_i) in an HMM - f_101 etc. are additional features that an HMM could not cover - you can use trigram, bigram and unigram models in the same feature function set - transition feature f_103 corresponds to the trigram transition probability q(y_i | y_{i-2}, y_{i-1}) in an HMM, transition feature f_104 is the same but bigram, and transition feature f_105 the same but unigram => AND therefore: it corresponds to the backoff smoothing method: if f_103 does not fire then maybe f_104 fires, and so forth -> f_103 would have a very high weight, f_104 a bit less, f_105 again a bit less (like a backoff model) => you can combine any number of features! -> becomes more effective
137
log-linear model: when we have probabilistic classifiers, we also need to care about 2 other things aside from the model?
- we need to train it and to do inference - given training data, we want to estimate the parameters (= the weights w), and we need to know how to do inference: given weight values and x, what is the most likely y
138
MLE in MEMM
sl 05-05
139
differentiate the log likelihood of a log-linear model
sl 05-05
140
inference in MEMMs
pretty much the same as for HMM: use Viterbi sl 05-10/11
141
difference between inference and learning? (HMM, MEMM)
- inference (ie finding the max y's): finding a maximum path on the grid representation - > we have w and want to find the most likely y's - learning:
142
label bias
- a problem with MEMMs, HMMs -> occurs with any locally-normalized model - the model favors labels that predict the next state well over labels that predict the next state poorly; it will be biased towards them because of the high probability => your model predicts an unlikely/infrequent sequence of labels just because their probability does not decay exponentially (e.g. long book names), even though that sequence is very rare - can be harmful, but not always - e.g. in POS tagging it is usually not a problem
143
local normalization
a normalization term as denominator makes sure that the sum of the probabilities of all possible values of y_j is 1; equivalently, on a grid representation: the sum of all outgoing edges of a node is 1
144
prediction vs inference vs learning?
inference := the process of inferring something. a conclusion reached on the basis of evidence and reasoning.
145
Sequence Conditional Random Fields (CRFs)
- CRFs are similar to MEMMs, but the normalization is global - this solves the label bias problem: the scores of the next state given the current one need not sum to 1 - the normalization is not done per transition; rather, the sum over all paths given x sums to 1 - each transition is not a probability anymore but a score (or potential) - you still normalize, but over the sum of all paths - we assume all different paths for a sentence sum to 1; it may be that the outgoing edges of a node do not sum to 1 - CRFs were the standard NLP model until NNs came
146
why normalize at all?
- it makes sure that if you increase the prob of some path you will decrease the prob of others - it serves as a regularization because otherwise you could assign infinity scores, there would be no limit, no boundaries - > normalization makes you work with a fixed budget, then the problem has a unique maximum that you can find
147
how do we train w for CRF?
- how to estimate/train w, how to find the most likely weights? - no closed-form formula - write the conditional log-likelihood and maximize it - > compute the gradient (05-17), but it has an intractable subtrahend term; then transform it so that it has a polynomial term Pr(...) (with the number of tags squared), and compute this term Pr(...) using dynamic programming for all values y_j and y_{j-1} and over all j’s
148
gradient computation CRF
t
149
forward-backward algorithm (CRF)
- > similar to Viterbi but instead of max I take the sum - M_i: unnormalized transition weight (not a probability!) to transition from y to y' alpha: is the sum over all paths of length i-1 that end with y (forward term) beta: same but start with y and go all the way to the end (backward term) - how is it helpful? I can compute it (tractable)
150
domain adaptation
- > trained with wall street journal, then apply it to wikipedia - Accuracies degrade outside of domain - e.g. try to find instances you're most confident in - e.g. use lexicons
151
Sentiment Analysis: Definition
- Sentiment analysis is the detection of attitudes • Simplest task: Is the attitude of this text positive or negative? • More complex: Rank the attitude of this text from 1 to 5 • Advanced: Detect the target, source, or complex attitude types - a simple classifier: Log-linear or Naïve Bayes classifier
152
Negation in Sentiment Analysis
- One practice is to add NOT_ to every word between a negation and the following punctuation (useful if you want to stick to the bag-of-words model, but it still fails: “didn’t like” would still get a positive polarity) - if you have this heuristic of adding NOT_, then instead of “like” you would have “NOT_like” - > but it is still a very crude solution
153
Sentiment Analysis as Sequence Labeling
- This construal of sentiment analysis attempts to capture the meaning of a word in context by encoding (parts of) the sentence as features - Recently: using Recurrent Neural Networks (RNNs) - RNNs are FSAs: 1) No longer finite in the same sense 2) The state is an embedding of the history in a continuous space
154
Recurrent Neural Networks
- Map from dense sequence to dense representation - > idea RNN: save the history as a state 06-17 - R is parametrized and parameters are shared across steps -> parameters of R are learned, it's the history, everything we've learned so far ~ representation learning -> it obviates the need to create a feature function -> let the system find the features -> it is the NN dogma - with RNN, you don't do any global optimization (you compute a y and move to the next) - RNN: no dynamic programming for inference - > the RNN computes a classification for each word on its own - RNN models are able to take the context (both preceding and following) into account, as well as the linear order between the words (bag of words cannot)
155
sequence labeling
t
156
Bi-RNN
- When using RNNs for sentence classification (such as sentiment analysis), it is often practical to use Bi-RNNs - Bi-RNN: 2 RNNs, one going forward and the other backward; the output function is a function of both states - This allows the states associated with each word to encode relevant information from the words following them and preceding them - difference from bag-of-words: the result of the latter is just based on what a word is, with a Bi-RNN it is based on what that word is and the state, which encodes both the words preceding it and the words following it - > it is kind of a modified/enhanced bag-of-words
157
sentiment analysis with Bi-RNN
- bottom line: encode the state/output for each word and somehow aggregate them - one simple way to do sentiment analysis (or other sentence classification) with Bi-RNNs is to average the output sequence and train a binary logistic classifier for predicting the sentiment
158
Neural Language Models
t
159
Character-level Language Models
t
160
what is grammar?
- eg The rules and principles that guide the structure of sentences in language
161
why do we need grammar?
Two main motivations: - Structural Analysis:decomposing complex units into simpler sub-structures can assist learning and generalization - Semantic Information: grammatical structure often reflects semantic structure and distinctions - > Grammatical analysis in NLP involves breaking the text into simpler parts and categorizing them - > Categorization allows information to propagate from one type to another (important!) -> allow insight that plain sequences (either a word is close or distant) do not (whereas with categorization you have a tree where words can be in all sorts of configurations to one another even if they are distant)
162
a word
- Practical: anything that is between two spaces - What about languages that don’t have spaces? - Phonological words - Grammatical words
163
morphology
is the study of words, how they are formed, and their relationship to other words in the same language. It analyzes the structure of words and parts of words, such as stems, root words, prefixes, and suffixes. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning
164
clause (grammar)
- basic syntactic unit - difficult to define - some kind of description of some activity, state or property - more formally: simplest kind of sentence - a sentence can have several clauses - simple clauses have a predicate, core arguments (usually subjects and objects), and non-core arguments (e.g., location, manner, instrument) - > predicate = some main relation
165
Inter-Clause Relations (3x) (grammar)
- basic element of grammar Three main inter-relations between clauses within a sentence: -> complement clauses: Clauses that serve as an argument in another clause •“John said he will kick the ball” -> relative clauses: Clauses that modify an argument of another clause •“John, who has studied the language for years, speaks German well” -> linked clauses: • After he had studied it for years, John could speak German well
166
Basic Units (grammar)
- basic elements of grammar: 1) noun phrase: - an argument of a clause is generally a noun phrase. Two major classes: 1) Proper nouns or names 2) Common nouns - Noun phrases generally have a head (that determines their category; a word the phrase centers around, which determines its type (usually a noun)) -> a word that determines the semantic type of the whole phrase, and is an essential component of its meaning: The *man* who wasn’t there 2) Syntactic heads: - the (main) verb is considered the head of a clause (in English at least) (take it as a convention) - Arguments and modifiers are considered its dependents: John ran home (“ran” is the head of the clause, “John” and “home” are its dependents) -> Some schemes view all syntactic relations as head-dependent relations
167
Dependency parse
- syntax is represented as head-dependent relations between words (cf tree structure) - Nodes are words, edges are relations - constituents are subtrees, but the words are all the nodes - The arguments and modifiers of a word are its children - Edge labels mark the type of relation - The verb (=predicate) of the main clause is the root => the assumption: every phrase has a head -> this assumption allows me to build a syntactic tree where edges between words represent bi-lexical dependencies
168
predicate
- the thing asserted (usually the verb) | - the property that you are asserting
169
Phrase-based (constituency) parse
- represent the sentence as a collection of nested phrases - > it brackets words into phrases, parent phrases and so on - Words are at the leaves, phrases/constituents are subtrees - Non-terminal labels represent the phrase type. This is determined by the type of the headword (e.g., Noun Phrases (NP) are phrases headed by a noun)
170
what information can you get from syntactic structure? why do we want to use hierarchy or syntactic representation? what’s wrong with just representing sentences as sequences?
- linear structure might lead us astray: two words might be far away from each other but might have a direct dependency between them (“war horses which were bought by the police like to drink”) - the same goes for POS tags or sentiment analysis (e.g. negation scope) -> this syntactic structure helps us achieve all kinds of useful things: do better language modeling, better disambiguation, negation scope - > linear structure is just an approximation - > in order to really model what’s going on we need some kind of hierarchy
171
information from syntactic structure we can get?
- phrases (e.g. for machine translation) - > use a phrase-based parse /constituency tree of the English sentence and then change subtrees in order to get order of words in Japanese (tree transformation) (word permutation only won't help here) - > phrase move around in the tree and not single words - > same would be possible with dependency tree (then you'd move subtrees) - extract grammatical relations - > extract subject, object or all events from a sentence - allows to do syntactic disambiguation - > some are detectable directly from the syntactic tree - a proxy to semantics - > by knowing the syntactic tree I already know quite a bit about of information and can answer semantical questions
172
dependency vs constituency tree
dependency: - constituents are subtrees, but the words are all the nodes - Verbs are heads of clauses, nouns are heads of noun phrases constituency: - Words are at the leaves, phrases/constituents are subtrees - does not have the notion of a headword, it just puts brackets and has categories but does not tell you the headword - > however, you can extract the headword based on the categories in the tree - > basically it is possible to convert constituency trees into dependency trees (vice versa is trickier)
173
head
the head is the word that determines the type of the phrase
174
examples dependency trees (universal dependency format)
cf | 08-syntax_von_pptx_1210.pdf, sl. 30,31
175
Graph-Based Dependency Parsing
- task of taking a piece of text and generating a dependency tree for it - > what learning algorithm do we use?
176
sources of information for dependency parsing
- Bi-lexical affinities - Dependency distance (mostly nearby words) - Intervening material (e.g. a verb is unlikely to appear between two dependent words) - valency (how many dependents a word has, e.g. give has two objects)
177
dependency parsing evaluation (evaluation measures)
- accuracy = #correct edges / #number of words | - labeled/unlabeled attachment score
178
Graph-based Parsing (context: dependency tree)
- a "structured prediction problem" | - MST Parser: minimum spanning tree (same algorithm as maximum spanning tree)
179
MST parser
MST: minimum spanning tree (same algorithm as maximum spanning tree) 1) build a complete directed graph (except into ROOT) - > each spanning tree rooted in ROOT is a possible parse tree - > gives the space of possible parses - > first assign a weight to each edge: how likely it is to appear in the MST - how do I score the edges? - find a model that gives a weight to each edge (whether it is likely or not) based on information that tells us if this edge is relevant - we use a feature function phi (usually binary functions) and a weight vector to assign scores to edges - usually POS tagging is part of the preprocessing for an MST parser (then you can use POS tags in the feature function)
180
MST parser: inference and learning
inference: finding the maximum directed spanning tree - > score each edge based on its features (Chu-Liu/Edmonds algorithm) learning: - use a log-linear model: Pr(T) over all possible trees - using the gradient of the log-likelihood I can find the optimal theta that maximizes the log-likelihood - > but computing the gradient is computationally expensive (we have to sum over all possible trees for V) - > the common way to do the optimization/find the most likely theta is not by computing the gradient but by using the averaged perceptron algorithm
181
averaged/structured perceptron algorithm
- sl 08-13 - > a (supervised) learning algorithm (for learning theta)! - > the learned theta is used in Pr(T) - > Pr(T) is used to find the (best/most likely) dependency tree for a sentence - instead of taking the expected value, take the maximum value/tree (done by inference: you compute T' (the maximum spanning tree) using inference; T_i is the correct tree from the training data) - learning of theta is done as follows: you take the sum over all the edges in the correct tree (gold standard) minus the sum over all the edges in the predicted tree
182
higher-order features (graph-based parsing)
- normally features are based on a single edge alone - higher-order features are types of features that cannot be encoded at the edge level alone - MST inference works well when the model factorizes over edges e.g. sibling relations: -> features defined over two edges that have the same parent grandparent features: -> features defined over two edges, where one’s child is the parent of the other valency: -> the number of children a node has, defined over all edges sharing a parent => formally, the score function factorizes over sets of edges problem: you cannot run the MST algorithm anymore - > if you want to take a scoring function that is over edge pairs or triplets, you cannot run exact inference anymore (it becomes intractable), but approximate methods exist for computing inference
183
transition-based parsers
- In transition-based parsing, the problem is decomposed to finding a sequence of transitions - we don't assume any grammar or context in transition based parsing
184
arc-standard parsers
- you have a buffer and a stack and a transition set, you build edges and nodes - transition set T: SHIFT, LEFT-ARC_l, RIGHT-ARC_l - can only parse projective trees - > straightforward way to define a transition system - bottom-up approach
185
arc-eager transition system
- you have a buffer and a stack and a transition set - more incremental ~ it is more eager to create arcs - transition set T: SHIFT, LEFT-ARC_l, RIGHT-ARC_l, REDUCE - complexity at most 2n - Builds left-dependents bottom-up and right-dependents top-down - arc-standard has to find whole subtree in order to attach it arc-eager does not - we don't know yet how to choose the next transition - arc-eager does top-down and bottom-up parsing
186
projectivity
- intuitive: a tree in which there are no crossing arcs - > i.e. all sub-trees are sub-strings - projective trees are much easier to handle (crossing edges make it much more difficult)
187
transition systems
- Transitions operate on the parser configuration (or state): c = (E,B,A) A transition system is defined as: S = (C, T , c_s , C_t ) - transition sequence - > we transition from one configuration to next by transition - transition: a function from configuration to configuration
188
greedy-transition based parsing
- want to find the most probable sequence of transitions - in greedy parsing, instead of modeling this joint probability (the whole sequence i...m which maximizes this product sl 09-21), we make the assumption that its enough to choose the best transition each time given what we have so far (you make that decision and can't change it - that's why it's greedy) - a score s(t, c) estimates this probability where t= a candidate transition c= current configuration
189
inference in greedy-transition based parsing?
argmax over t_1 ... t_m of ∏_{i=1}^{m} P(t_i | c_{i-1})
190
oracle for arc-standard/arc-eager
- a rule-based algorithm, does not need training | - it gets as input a correct tree
191
feedforward vs recurrent neural networks
t
192
transition classifiers (kinds of score functions)
- linear transition classifier - NN transition classifier - Bi-RNN encoder
193
error propagation in transition-based parsing
- once wrong transition is picked, then recovery very difficult since parsing is greedy
194
beam search algorithm
- compromise between greedy and exhaustive search | - k-beam
195
why is parsing hard?
- An exponential number of options per sentence - Ambiguity - Many many rules to learn (exceptions) - Language speakers have no explicit knowledge of the rules of grammar
196
Constituency Parsing and Context-free Grammars (CFGs)
- you get annotated trees (e.g. from a treebank) and you derive the parsing rules from them
197
Probabilistic CFG (PCFG)
- not a deterministic parse - think of it as a multinomial (each symbol has 1-n possible derivations) - each tree has a probability - independence assumption: The problem is that the independence assumptions are very strong: Each production is independent of any other production e. g.: - All clauses have the same distribution
198
A sequence of rules is the same as a parse tree
- the two representations are equivalent
199
what's the idea in context free grammar?
- rules are context-free: they ignore context | - each rule is independent of the others
200
Grammar-based vs. Discriminative Approaches to Parsing
t
201
model - learning - inference
t
202
constituency parsing problem
you want to find the tree with the highest score
203
Chomsky Normal Form (CNF)
- A common binarization is the Chomsky normal form, a grammar where each production is either of the form A -> B C or A -> w, where A, B and C are non-terminals and w is a terminal - > it is often useful to treat all CFG rules as binary - > every CFG grammar can be converted to a CNF grammar that generates the same sentences
204
Cocke-Kasami-Younger (CKY) Constituency Parsing (algorithm)
- input assumed to be in CNF - > a recognizer that will only determine if a sentence is in the language - > It is simple to extend it into a parser that also constructs a parse tree
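A minimal CKY recognizer sketch, assuming the CNF grammar is given as lists of binary rules (A, B, C) for A -> B C and lexical rules (A, w) for A -> w; the toy grammar is hypothetical:

```python
def cky_recognize(words, binary_rules, lexical_rules, start="S"):
    n = len(words)
    # chart[i][j]: set of non-terminals that can derive words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = {A for (A, word) in lexical_rules if word == w}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                    # split point
                for (A, B, C) in binary_rules:
                    if B in chart[i][k] and C in chart[k][j]:
                        chart[i][j].add(A)
    return start in chart[0][n]

binary = [("S", "NP", "VP"), ("VP", "V", "NP")]
lexical = [("NP", "she"), ("NP", "fish"), ("V", "eats")]
print(cky_recognize("she eats fish".split(), binary, lexical))   # True
```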
205
Head Lexicalization
- to overcome attachment ambiguity - trees are only different in terms of the different rules applied -> nothing relates the words, it’s only unlexicalized rules => a deficiency of the vanilla PCFG -> head lexicalization is a standard way to overcome this -> assume each constituent has a headword -> add headwords to each phrasal node -> now you have more info at the rule node because the headword is the most informative word of that subtree/constituent (*) -> I’ll need to retrain my PCFG but it will cause huge sparsity problems: not just, say, 15 non-terminals, but 15 times the size of the vocabulary -> horrible sparsity => what to do now? use e.g. the Collins parser. bottom line: lexicalization can resolve ambiguities and lead to a huge boost in performance - the cost is huge sparsity (*) The head word of a phrase gives a good representation of the phrase’s structure and meaning
206
headword
the most informative word/element in the subtree
207
the Collins parser
- keep the head lexicalization - to get over sparsity (introduced by head lexicalization), let's make more (conditional) independence assumptions - > each rule is independent of the other - > each rule is not just a multinomial, it is a product of several terms - > no dependency between different siblings => This model still has many parameters, but it reduces each of the terms in the product to depend on 2-3 lexicalized categories, instead of n+m+2 for a regular lexicalized PCFG of this model (cf. 10-36)
208
learning and inference in constituency parsing
tbd
209
Unsupervised Learning
- paradigm of supervised learning: you have labeled data (you know the right answer) - clustering: the main technique for generalization without having a labeled set - we want to cluster words according to their part of speech without having annotated data - discriminative models would not make sense: we don’t have labels (working with the relation between labels and observed features is not possible) - a generative model can be useful
210
POS Tagging: a structured prediction task
t
211
unsupervised POS induction
Given a large corpus of raw text, partition the words into categories with respect to their grammatical function POS tags are context-dependent -> makes the prediction much more difficult (e.g. the word 'back') -> words will receive one POS tag only but if the most likely is taken the main tag will account for more than 90% of the probability mass (type-level clustering)
212
Supervised vs. (Completely) Unsupervised Learning
t
213
Distributional Clustering
-> words will receive one POS tag only but if the most likely is taken the main tag will account for more than 90% of the probability mass (type-level clustering)
214
Unsupervised learning for Hidden Markov Models
- adjectives have similar distribution (approx. definition of a POS tag) - nouns also -> POS Induction through PCA over word co-occurrence Matrices
215
type vs token (contexts POS)
tokens are instances of type
216
The Prototype Tagger
- another type-level POS induction algorithm
217
structured prediction
we label the entire sequence
218
1-to-1 accuracy measurement
each class is assigned exactly one cluster
219
m-to-1 accuracy measurement
a class may be assigned to multiple clusters
220
Markov Chain
- simplest Markov model | - It models the state of a system with a random variable that changes through time
221
conditional likelihood
In discriminative models we often talk about the conditional likelihood