SLP Flashcards

(107 cards)

1
Q

What is logistic regression called for more than two classes?

A

multinomial logistic regression

2
Q

What is the most important difference between naive Bayes and logistic regression?

A

logistic regression is a discriminative classifier, naive Bayes is a generative classifier

3
Q

Generative model

A

A model that learns how the data it is classifying could be generated; it models what each class looks like.

4
Q

Discriminative model

A

A model that only tries to learn to distinguish the classes.

5
Q

Naive Bayes assigns a class c to a document d. How does it do this?

A

It computes a likelihood P(d|c) and a prior P(c), and picks the class c that maximizes their product.

6
Q

What are the four components of a machine learning system for classification?

A
  1. feature representation of the input
  2. classification function that computes estimated class
  3. objective function that we want to optimize
  4. algorithm for optimizing the objective function
7
Q

What are the two phases of logistic regression?

A
  1. training
  2. test
8
Q

What is the bias term?

A

A real number that is added to the weighted inputs

9
Q

How does a logistic classifier make a decision on a test instance after learning the weights?

A

It multiplies each feature x_i by its weight w_i, sums the weighted features, and adds the bias term b.

10
Q

What is the formula for z in logistic regression?

A

z = w · x + b

11
Q

What does z express in logistic regression?

A

the weighted sum of the evidence for the class it is computed for

12
Q

What do we do to z in logistic regression to create a probability?

A

Pass it through the sigmoid function (logistic function)

13
Q

exp(x) ==

A

e^x

14
Q

What is the sigmoid function?

A

sigma(z) = 1 / (1 + exp(-z))

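A minimal sketch (not from the cards themselves) putting the last few cards together in Python: compute z = w · x + b, pass it through the sigmoid, and threshold at 0.5. The feature values, weights, and bias are made-up numbers for illustration.

  import math

  def sigmoid(z):
      # Map the real number z into the range (0, 1).
      return 1.0 / (1.0 + math.exp(-z))

  def predict(x, w, b):
      # Multiply each x_i by its weight w_i, sum, and add the bias term.
      z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
      return sigmoid(z)

  x = [3.0, 2.0, 1.0]    # hypothetical feature values
  w = [0.5, -1.2, 0.7]   # hypothetical learned weights
  b = 0.1                # bias term
  p = predict(x, w, b)
  print(p, "-> class 1" if p > 0.5 else "-> class 0")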
15
Q

What does the sigmoid function do to the real number z?

A

it maps it into the range (0,1)

16
Q

logit

A

the input to the sigmoid function:
z

17
Q

Why is the input to the sigmoid function often called logit?

A

Because the logit function is the inverse of the sigmoid: applying logit to the sigmoid's output recovers z.

18
Q

What is the logit function logit(p)?

A

logit(p) = sigma^(-1)(p) = ln(p / (1 - p))

19
Q

Period disambiguation

A

Deciding if a period is the end of a sentence or part of a word.

20
Q

Representation learning

A

Ways to learn features automatically in an unsupervised way from the input

21
Q

What does it mean to standardize input values?

A

Rescale them so they have zero mean and a standard deviation of one.

22
Q

What is the formula to normalize input features values to lie between 0 and 1?

A

(x_i - min(x_i)) / (max(x_i) - min(x_i))

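A small sketch contrasting the two rescaling schemes from the last two cards, on toy data: z-score standardization (zero mean, unit standard deviation) and min-max normalization into [0, 1].

  def standardize(values):
      # Rescale to zero mean and a standard deviation of one.
      mean = sum(values) / len(values)
      std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
      return [(v - mean) / std for v in values]

  def normalize(values):
      # Rescale to lie between 0 and 1: (x_i - min) / (max - min).
      lo, hi = min(values), max(values)
      return [(v - lo) / (hi - lo) for v in values]

  xs = [2.0, 4.0, 6.0, 8.0]
  print(standardize(xs))  # mean 0, std 1
  print(normalize(xs))    # [0.0, 0.333..., 0.666..., 1.0]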
23
Q

What is softmax regression another name for?

A

multinomial logistic regression

24
Q

hard classification

A

when an observation cannot be in multiple classes

25
one-hot vector
a vector with one value = 1 and all other values = 0
26
What does the softmax function do?
It takes a vector z of K arbitrary real values and maps them to a probability distribution: each value lies in (0,1) and all the values sum to 1.
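A minimal softmax sketch matching this card: map a vector z of K arbitrary real values to a probability distribution. Subtracting max(z) before exponentiating is a standard numerical-stability trick, not part of the definition.

  import math

  def softmax(z):
      m = max(z)                           # for numerical stability
      exps = [math.exp(v - m) for v in z]
      total = sum(exps)
      return [e / total for e in exps]

  print(softmax([1.0, 2.0, 3.0]))  # each value in (0, 1), sums to 1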
27
loss function
the distance between the system output and the gold output
28
What type of loss function is commonly used for logistic regression and neural networks?
the cross-entropy loss
29
What algorithm is standardly used for iteratively updating the weights to minimize the loss function?
gradient descent
30
conditional maximum likelihood estimation
a loss function that prefers the correct class labels of the training examples to be more likely.
31
What does conditional maximum likelihood estimation do?
It chooses the parameters w,b that maximize the log probability of the true y labels in the training data given the observations x.
32
Bernoulli distribution
observations where there are only two discrete possible outcomes
33
What is the goal of gradient descent?
to find the optimal weights to minimize the loss function we have defined for the model
34
how do we refer to the set of parameters learned by the model?
theta
35
convex function
has at most one minimum
36
Why is stochastic gradient descent called stochastic?
because it chooses a single random example at a time
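A sketch of one stochastic-gradient-descent step for binary logistic regression with cross-entropy loss, on a single example as the card describes. It uses the standard closed-form gradient (sigma(z) - y) * x_i for weight w_i and (sigma(z) - y) for the bias; the data and learning rate are made up.

  import math

  def sigmoid(z):
      return 1.0 / (1.0 + math.exp(-z))

  def sgd_step(x, y, w, b, lr=0.1):
      # One update of the parameters theta = (w, b) on one example (x, y).
      z = sum(wi * xi for wi, xi in zip(w, x)) + b
      err = sigmoid(z) - y   # derivative of the loss with respect to z
      w = [wi - lr * err * xi for wi, xi in zip(w, x)]
      b = b - lr * err
      return w, b

  w, b = [0.0, 0.0], 0.0
  w, b = sgd_step([3.0, 2.0], 1, w, b)
  print(w, b)  # weights nudged toward making y=1 more likely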
37
L2 regularization
a quadratic function of the weight values
38
L1 regularization
a linear function of the weight values
39
What does L2 regularization prefer?
weight vectors with many small weights
40
What does L1 regularization prefer?
sparse solutions with some larger weights but many more weights set to zero
41
What does L1 regularization lead to?
sparser weight vectors and fewer features
42
derivative of ln(x)
1/x
43
derivative of the sigmoid
sigma(z) * (1-sigma(z))
44
chain rule of derivatives: derivative of u(v(x))
du/dv * (dv/dx)
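A quick numerical check of the two derivative cards above, comparing the closed form sigma(z) * (1 - sigma(z)) against a finite-difference approximation:

  import math

  def sigmoid(z):
      return 1.0 / (1.0 + math.exp(-z))

  z, h = 0.7, 1e-6
  numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
  analytic = sigmoid(z) * (1 - sigmoid(z))
  print(numeric, analytic)  # agree to ~6 decimal places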
45
How is the syntactic structure of a sentence described in dependency formalisms?
In terms of directed binary grammatical relations between the words.
46
Typed dependency structure
A dependency structure where the labels of relations among words are drawn from a fixed inventory of grammatical relations.
47
Why are dependency grammars more common than constituency grammars in NLP, looking at semantics?
The head-dependent relations are a good proxy for the semantic relationship between predicates and their arguments.
48
What are the arguments in a grammatical relation?
a head and a dependent
49
What is the head in a grammatical relation
the central organizing word
50
What is the dependent in a grammatical relation?
A kind of modifier
51
Give examples of grammatical functions
subject, direct object, indirect object...
52
UD (abbreviation)
Universal Dependencies
53
What is the UD project?
An open community effort to annotate dependencies and other aspects of grammar across more than 100 languages.
54
How can we divide the core set of frequently used grammatical relations of UD in two sets?
Clausal relations describe syntactic roles with respect to a predicate, and modifier relations categorize the ways that words can modify their heads.
55
How can we formally represent a dependency structure?
As a directed graph G = (V,A) with vertices V and ordered pairs of vertices A, called arcs.
56
What does the set of arcs (A) in a dependency structure capture?
the head-dependent and grammatical function relationships between the elements in the set of vertices V.
57
A dependency tree is a directed graph that satisfies the following 3 constraints:
  1. it has a single designated root node without incoming arcs
  2. each vertex has exactly one incoming arc (except for the root node)
  3. there is a unique path from the root node to each vertex in V
58
What do the 3 constraints for a dependency tree (=directed graph) ensure?
Each word has a single head, the dependency structure is connected, and there is a single root node from which there is a unique path to each of the words in the sentence.
59
When is an arc from a head to a dependent projective?
If there is a path from the head to every word that lies between the head and the dependent in the sentence.
60
When is a dependency tree said to be projective?
If all the arcs that make it up are projective.
61
How can you detect projectivity in a dependency tree when drawing it?
It is projective if there are no crossing edges.
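A sketch of that "no crossing edges" test in Python, with arcs represented as hypothetical (head index, dependent index) pairs:

  def is_projective(arcs):
      # Treat each arc as a span over word positions; two arcs cross
      # if their spans interleave.
      spans = [tuple(sorted(arc)) for arc in arcs]
      for i in range(len(spans)):
          for j in range(i + 1, len(spans)):
              (a, b), (c, d) = spans[i], spans[j]
              if a < c < b < d or c < a < d < b:
                  return False
      return True

  print(is_projective([(2, 4), (1, 3)]))  # False: the arcs interleave
  print(is_projective([(1, 4), (2, 3)]))  # True: the arcs are nested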
62
Describe transition-based parsing in general:
There is a stack on which we build the parse, a buffer of tokens to be parsed and a parser taking actions on the parse via a predictor called an oracle.
63
What does the parser do in transition-based parsing?
It walks through the sentence left-to-right and shifts items from the buffer onto the stack.
64
What happens to the items that the parser shifts from the buffer onto the stack in transition-based parsing?
At each time point the top two elements on the stack are examined and the oracle makes a decision about what transition to apply to build the parse.
65
What are the possibe transitions that the oracle can do in transition-based parsing?
  1. assign the current word as the head of a previously seen word
  2. assign a previously seen word as the head of the current word
  3. postpone dealing with the current word and store it for later processing
66
What is the LEFTARC transition operator in transition-based parsing?
It asserts a head-dependent relation between the word at the top of the stack and the second word, and removes the second word from the stack.
67
What does the RIGHTARC transition operator do in transition-based parsing?
It asserts a head-dependent relation between the second word on the stack and the word at the top, it then removes the top word from the stack.
68
What does the SHIFT transition operator do in transition-based parsing?
It removes the word from the front of the input buffer and pushes it onto the stack.
69
What operations are called reduce operations in transition-based parsing?
LEFTARC and RIGHTARC, because reducing means combining elements on the stack.
70
When can the LEFTARC operator not be applied in transition-based parsing?
When ROOT is the second element of the stack (because the root node cannot have incoming arcs).
71
What do the LEFTARC and RIGHTARC operators require in transition-based parsing?
That there are two elements on the stack.
72
Arc standard approach to transition-based parsing
Where the transition operators only assert relations between elements at the top of the stack, and once an item has been assigned its head, it is removed from the stack and is unavailable for further processing.
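A minimal Python sketch of the arc-standard operators from the last few cards. A configuration is (stack, buffer, arcs); the "oracle" here is just a hand-written transition sequence for a toy sentence, for illustration only.

  def shift(stack, buffer, arcs):
      stack.append(buffer.pop(0))       # front of buffer -> top of stack

  def leftarc(stack, buffer, arcs):
      second = stack.pop(-2)            # remove the second word...
      arcs.append((stack[-1], second))  # ...as a dependent of the top word

  def rightarc(stack, buffer, arcs):
      top = stack.pop()                 # remove the top word...
      arcs.append((stack[-1], top))     # ...as a dependent of the word below it

  stack, buffer, arcs = ["ROOT"], ["book", "me", "the", "flight"], []
  for op in [shift, shift, rightarc, shift, shift, leftarc, rightarc, rightarc]:
      op(stack, buffer, arcs)
  print(arcs)
  # [('book', 'me'), ('flight', 'the'), ('book', 'flight'), ('ROOT', 'book')]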
73
What is a configuration in transition-based parsing?
The current state of the parse: the stack, an input buffer of words/tokens, and a set of relations representing a dependency tree.
74
What kind of algorithm is transition-based parsing?
A greedy algorithm running in time linear in the length of the sentence.
75
How is the oracle in transition-based parsing generally created?
By supervised machine learning methods, using configurations annotated with the correct transition to make (drawn from dependency trees).
76
In training an oracle, a training parser goes through the sentence while knowing the correct dependency tree. Why is there an extra restriction on using the RIGHTARC operation in training?
To ensure that a word is not popped from the stack and lost for further processing before all its dependents have been assigned to it. (restriction = use only if all of the dependents of the word at the top of the stack have already been assigned)
77
What has the oracle access to during training? Describe formally:
  1. a current configuration with a stack S and a set of dependency relations R_c
  2. a reference parse consisting of a set of vertices V and a set of dependency relations R_p
78
What two classifiers for choosing transitions are introduced for transition-based parsing?
  1. a classic feature-based algorithm
  2. a neural classifier using embedding features
79
What is the feature template in transition-based parsing with feature-based learning?
(s1.w, op), (s2.w, op), (s1.t, op), (s2.t, op), (b1.w, op), (b1.t, op), (s1.wt, op), where s = stack, b = word buffer, w = word form, t = part-of-speech tag, and op = operator.
80
How is the feature for the word form at the top of the stack denoted in the feature template (feature-based learning, transition-based parsing)?
s1.w
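A sketch of instantiating this feature template for one toy configuration, assuming a made-up tag dictionary (w = word form, t = POS tag, op = the transition being scored):

  def extract_features(stack, buffer, tags, op):
      s1, s2, b1 = stack[-1], stack[-2], buffer[0]
      return [
          ("s1.w", s1, op), ("s2.w", s2, op),
          ("s1.t", tags[s1], op), ("s2.t", tags[s2], op),
          ("b1.w", b1, op), ("b1.t", tags[b1], op),
          ("s1.wt", s1 + "/" + tags[s1], op),
      ]

  tags = {"ROOT": "ROOT", "book": "VB", "flight": "NN"}
  for f in extract_features(["ROOT", "book"], ["flight"], tags, "SHIFT"):
      print(f)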
81
What loss function is used in transition-based parsing with a neural classifier?
cross-entropy loss
82
How does transition-based parsing with a neural classifier generally work?
Pass the sentence through an encoder, take the representations of the top two words on the stack and the first word on the buffer, concatenate them, and present the result to a feed-forward network that predicts the transition.
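A toy numpy sketch of that pipeline, with made-up dimensions and random values standing in for the encoder outputs and a trained classifier:

  import numpy as np

  rng = np.random.default_rng(0)
  d = 8                                # hypothetical embedding size
  # Stand-ins for encoder representations of the top two stack words
  # and the first buffer word:
  s1, s2, b1 = (rng.normal(size=d) for _ in range(3))

  x = np.concatenate([s1, s2, b1])     # concatenated input, shape (3d,)
  W1, c1 = rng.normal(size=(16, 3 * d)), np.zeros(16)
  W2, c2 = rng.normal(size=(3, 16)), np.zeros(3)

  h = np.maximum(0, W1 @ x + c1)       # feed-forward hidden layer (ReLU)
  scores = W2 @ h + c2                 # one score per transition
  probs = np.exp(scores - scores.max()); probs /= probs.sum()
  print(["LEFTARC", "RIGHTARC", "SHIFT"][int(np.argmax(probs))])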
83
What kind of search strategy does beam search use?
A breadth-first search strategy with a heuristic filter that makes sure that the search frontier stays within a fixed-size beam width.
84
Why are graph-based parsing methods more accurate than transition-based parsers?
They score entire trees rather than relying on greedy local decisions. They can also produce non-projective trees.
85
What does it mean if a score is edge-factored?
the overall score for a tree is the sum of the scores of the edges that comprise the tree.
86
What two problems do graph-based algorithms have to solve?
  1. assigning a score to each edge
  2. finding the best parse tree given the scores of all potential edges
87
What is the first step in graph-based parsing when given a sentence S?
Create a graph G which is a fully-connected, weighted, directed graph where vertices are input words and the directed edges represent all possible head-dependent assignments.
88
What do the weights of each edge in the initial graph G in graph-based parsing represent?
the score for each possible head-dependent relation assigned by some scoring algorithm.
89
What is finding the best dependency parse for sentence S equivalent to in graph-based parsing?
finding the maximum spanning tree over G
90
Are the absolute values of the edge scores in graph-based parsing critical for determining the maximum spanning tree?
No; only the relative weights of the edges entering each vertex matter.
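A sketch of these graph-based ideas with made-up edge scores: an edge-factored score matrix over a fully-connected directed graph, from which we pick the highest-scoring head for each word. Note this greedy pick can create cycles; a real parser computes the true maximum spanning tree (e.g., with the Chu-Liu/Edmonds algorithm).

  words = ["ROOT", "book", "that", "flight"]
  # scores[h][d] = score of the edge from head h to dependent d
  scores = [
      [None, 12.0,  4.0,  4.0],   # edges out of ROOT
      [None, None,  5.0,  8.0],   # edges out of "book"
      [None,  6.0, None,  7.0],   # edges out of "that"
      [None,  5.0,  7.0, None],   # edges out of "flight"
  ]

  for d in range(1, len(words)):  # ROOT takes no head
      best = max((scores[h][d], h) for h in range(len(words))
                 if h != d and scores[h][d] is not None)
      print(f"{words[best[1]]} -> {words[d]}  (score {best[0]})")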
91
EM metric
exact match metric = how many sentences are parsed completely correctly
92
Labeled attachment
the proper assignment of a word to its head along with the correct dependency relation = LAS
93
Unlabeled attachment
the correctness of the assigned head, ignoring the dependency relation = UAS
94
LS (abbreviation)
label accuracy score
95
label accuracy score
the percentage of tokens with correct labels, ignoring where the relations are coming from.
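A small sketch computing these metrics from gold and predicted (head, relation) pairs for one toy sentence:

  gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
  pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
  n = len(gold)

  uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # heads only
  las = sum(g == p for g, p in zip(gold, pred)) / n        # head + relation
  ls  = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n  # labels only
  em  = 1.0 if gold == pred else 0.0  # exact match, per sentence

  print(f"UAS={uas:.2f} LAS={las:.2f} LS={ls:.2f} EM={em}")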
96
Distributional hypothesis
Aspects of meaning can be learned solely from the texts we encounter over our lives, based on the complex association of words with the words they co-occur with.
97
Pretraining
Learning knowledge about language and the world from vast amounts of text
98
Large Language Models (LLMs)
the pretrained language models resulting from pretraining
99
What does generative AI consist of?
text generation, code generation, and image generation
100
Conditional generation
The task of generating text conditioned on an input piece of text
101
Greedy decoding
Generating the most likely word given the context
102
How is the output yt chosen at each time step t in greedy decoding generation?
By computing the probability of each possible output (every word in the vocabulary) and then choosing the highest-probability word (the argmax).
103
Decoding
the task of choosing a word to generate based on the model's probabilities
104
Autoregressive generation
Repeatedly choosing the next word conditioned on the previous choices
105
Causal language modeling ==
autoregressive generation
106
Sampling from a model's distributions over words means that we...
choose words randomly according to the probability the model assigns them, i.e., iteratively choose each word to generate according to its probability in context
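A sketch contrasting greedy decoding with sampling at a single time step, given a made-up model distribution over a toy vocabulary:

  import random

  vocab = ["the", "a", "flight", "book"]
  probs = [0.5, 0.2, 0.2, 0.1]   # hypothetical model probabilities

  # Greedy decoding: always take the argmax word.
  greedy = vocab[max(range(len(vocab)), key=lambda i: probs[i])]

  # Sampling: choose a word at random according to its probability.
  sampled = random.choices(vocab, weights=probs, k=1)[0]

  print("greedy:", greedy, "| sampled:", sampled)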
107