Quiz 4 - Module 3 Flashcards

1
Q

LSTM output gate (ot)

A
  • result of affine transformation of previous hidden state and current input passed through sigmoid
  • modulates the value of the hidden state
  • decides how much of the cell state we want to surface
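As a worked equation (a common parameterization; the weight matrices Wo, Uo and bias bo are notational assumptions, not from the card):

```latex
o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o), \qquad h_t = o_t \odot \tanh(c_t)
```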
2
Q

RNN Language Model: Inference

A
  • Start with the first word; in practice, use a special symbol to indicate the start of a sentence
  • Feed in the words of the history until we run out of history
  • Take hidden state h and transform it
    • project h into a high-dimensional space (same dimension as the vocabulary)
  • Normalize the transformed h
    • use softmax
  • Result: the model's probability distribution over the believed next word
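A minimal NumPy sketch of the projection-and-normalize step described above; the names W_proj and next_word_distribution are illustrative, not from the course:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_distribution(h, W_proj):
    """Project hidden state h into vocabulary space, then normalize."""
    logits = W_proj @ h              # shape: (vocab_size,)
    return softmax(logits)           # probability of each candidate next word

# toy usage: hidden size 4, vocabulary of 10 words
h = np.random.randn(4)
W_proj = np.random.randn(10, 4)
p_next = next_word_distribution(h, W_proj)
print(p_next.argmax())               # index of the model's most likely next word
```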
3
Q

Why are graph embeddings useful?

A
  • task-agnostic entity representations
  • features are useful on downstream tasks without much data
  • nearest neighbors are semantically meaningful
4
Q

Contextualized Word Embedding Algorithms

A

ELMo, BERT

5
Q

The most standard form of attention in current neural networks is implemented with the ____

A

Softmax

6
Q

Many to Many Sequence Modeling examples

A
  • speech recognition
  • optical character recognition
7
Q

Token-level tasks

A
  • ex: named entity recognition
  • input a sentence without any masked tokens + positions, go through transformer encoder architecture, output classifications of entities (persons, locations, dates)
8
Q

Steps of Beam Search Algorithm

A
  • Search exponential space in linear time
  • Beam size k determines width of search
  • At each step, extend each of k elements by one token
  • Top k overall then become the hypotheses for the next step
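A toy Python sketch of the loop described above, assuming a hypothetical step_fn that scores possible next tokens for a prefix (an illustration, not the course's implementation):

```python
import heapq

def beam_search(step_fn, start_token, end_token, k=4, max_len=20):
    """step_fn(tokens) -> list of (next_token, log_prob) candidates."""
    beams = [(0.0, [start_token])]              # (cumulative log-prob, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == end_token:         # finished hypotheses carry over
                candidates.append((score, tokens))
                continue
            for tok, logp in step_fn(tokens):   # extend each beam by one token
                candidates.append((score + logp, tokens + [tok]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])  # keep top k
    return max(beams, key=lambda c: c[0])       # best overall hypothesis
```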
9
Q

Self-Attention improves on the multi-layer softmax attention method by ___

A

“Multi-query hidden-state propagation”

Having a controller state for every single input.

The size of the controller state grows with the input

10
Q

Data Scarcity Issues

A
  • Language similarity missing
    • the language is different from the source (i.e., not similar to English the way Spanish/French are)
  • Domain incorrect
    • e.g., medical terms rather than social language
  • Evaluation
    • no access to a real test set
11
Q

Many to One Sequence Modeling examples

A
  • Sentiment Analysis
  • Topic Classification
12
Q

Attention

A

A weighting or probability distribution over inputs that depends on the computational state and the inputs themselves

13
Q

Differentiably Selecting a Vector from a set

A
  • Given vectors {u1, …, un} and query vector q
  • The most similar vector to q can be found via softmax(Uq)
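Written out explicitly (with U the matrix whose rows are the ui):

```latex
a = \mathrm{softmax}(Uq), \qquad a_j = \frac{\exp(u_j^\top q)}{\sum_k \exp(u_k^\top q)}
```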
14
Q

Alignment in machine translation

A

For each word in the target, get a distribution over words in the source

15
Q

Graph embeddings are a form of ____ learning on graphs

A

unsupervised learning

16
Q

What makes Non-Local Neural Networks differ from fully connected Neural Networks?

A

The output is a weighted summation dynamically computed based on the data. In a fully connected layer, the weights are not dynamic (they are learned and then applied regardless of the input).

The similarity function in a non-local neural network is data dependent. This allows the network to learn the connectivity pattern, learn for each piece of data what is important, and then sum up the contributions across those pieces of data to form the output.

17
Q

Distribution over inputs that depends on computational state and the inputs themselves

A

Attention

18
Q

Roll a fair die and guess. Perplexity?

A

6
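Worked out for a sequence of N fair die rolls, each assigned probability 1/6 by the model:

```latex
\mathrm{PPL} = \left( \prod_{i=1}^{N} \frac{1}{1/6} \right)^{1/N} = 6
```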

19
Q

T/F: Softmax is useful for random selection

A

True

20
Q

Recurrent Neural Networks are typically designed for ____ data

A

sequential

21
Q

Sequence Transduction

A

Sequence to Sequence (Many to Many Sequence Modeling)

22
Q

what information is important for graph representations?

A
  • state
    • compactly representing all the data we have processed thus far
  • neighborhood
    • what other elements to incorporate?
    • selecting from a set of elements with similarity or attention
  • propagation of info
    • how to update info given selected elements
23
Q

What dominates computation cost in machine translation?

A

Inference

  • Expensive
    • step-by-step computation (auto-regressive: predict a different token at each step)
    • output projection (vocabulary size × output length × beam size)
    • deeper models
  • Strategies
    • smaller vocabs
    • more efficient computation
    • reduce depth/increase parallelism
24
Q

What allows information to propagate directly between distant computational nodes while making minimal structural assumptions?

A

The attention algorithm

25
Q

Current (Standard) Approach to (Soft) Attention

A
  • Take a set of vectors u1,…un
  • Inner product each of the vectors with controller q
    • unordered set
  • Take the softmax of the set of numbers to get weights aj
  • The output is the weighted sum of the inputs: Σj aj uj
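A minimal NumPy sketch of these steps (illustrative shapes and names, not the course's code):

```python
import numpy as np

def soft_attention(U, q):
    """U: (n, d) stacked input vectors u_1..u_n; q: (d,) controller/query.
    Returns the attention weights a and the weighted sum of the inputs."""
    scores = U @ q                              # inner product with controller q
    scores = scores - scores.max()              # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()   # softmax weights a_j
    output = a @ U                              # sum_j a_j * u_j
    return a, output

a, out = soft_attention(np.random.randn(5, 8), np.random.randn(8))
```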
26
Q

How to evaluate word embeddings

A
  • intrinsic
    • evaluation on a specific/intermediate subtask
      • ex - nearest neighbor of a particular word vector
    • fast to compute
    • helps to understand the system
    • not clear if really helpful unless correlation to real task is established
  • extrinsic
    • evaluation on real task
    • can take a long time to compute
    • unclear if the subsystem is the problem or its interaction
    • if replacing exactly one subsystem with another improves accuracy -> winning
27
Q

RNNs, when unrolled are just ____ with _____ transformations and ____

A

RNNs, when unrolled are just feed-forward Neural Networks with affine transformations and nonlinearities

28
Q

How do we define the probability of a context word given a center word?

A

Use the softmax of the inner product between the context word and the center word. Both words are represented by vectors.
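In symbols (vc the center-word vector, uo the context-word vector, V the vocabulary):

```latex
P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
```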

29
Q

Graph Embedding

A

Optimize the objective that connected nodes have more similar embeddings than unconnected nodes, via gradient descent

30
Q

Multi-layer Soft Attention

A

Layers of attention where each layer takes as input the output of the previous attention layer. The controller q is the hidden state.

31
Q

LSTM input gate (gt)

A
  • result of affine transformation of previous hidden state and current input passed through sigmoid
  • decides how much the input should affect the cell state
32
Q

BLEU score

A

Precision-based metric that measures n-gram overlap with a human reference
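One common statement of the score, with modified n-gram precisions pn, weights wn, and brevity penalty BP (r = reference length, c = candidate length):

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad \mathrm{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right)
```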

33
Q

fastText

A

Sub-word embeddings

Adds sub-word information to word2vec, which better handles out-of-vocabulary words

34
Q

Word2Vec Idea/Context

A
  • Idea - use words to predict their context words
  • Context - a fixed window of size 2m
35
Q

Applications of Language Modeling

A
  • predictive typing
    • in search fields
    • for keyboards
    • for assisted typing, e.g. sentence completion
  • automatic speech recognition
    • how likely is user to have said “my hair is wet” vs “my hairy sweat”
  • basic grammar correction
    • p(They’re happy together) > p(Their happy together)
36
Q

Non-Autoregressive Machine Translation

A

The model generates all the tokens of a sequence in parallel, resulting in faster generation than auto-regressive models, but at the cost of lower accuracy

37
Q

Conditional Language Modeling

A

Condition the language modeling equation on a new context, c

38
Q

Hierarchical Compositionality for NLP

A

character -> word -> NP/VP/… -> clause -> sentence -> story

39
Q

Flip a fair coin and guess. Perplexity?

A

2

40
Q

Total loss for knowledge distillation

A

Weighted sum of the student loss (computed against the ground-truth labels) and the distillation loss (computed against the teacher's soft predictions)
41
Q

T/F: Attentions are soft, not hard, where the distribution is used directly as a weighted average.

A

False - Attention can be soft or hard.

  • Hard - where samples are drawn from the distribution over the input
  • Soft - where the distribution is used directly as a weighted average
42
Q

Important property of attention as a layer

A

Representational power grows with the size of the input

43
Q

Standard specializations of Transformers for text

A
  • Position encodings depending on the location of a token in the text
  • For language models: causal attention
    • graph structure of a sequence (exclude connections that don't go left to right)
  • Training code outputs a prediction at each token simultaneously (and takes a gradient at each token simultaneously)
    • multiplies training speed by the size of the context
44
Q

What architecture was created to attempt to alleviate the vanishing gradient problem?

A

Long Short-Term Memory (LSTM) Network

45
Q

Masked Language Modeling

A
  • Take as input a sequence of words
  • Cover up some words with special mask tokens (e.g., [MASK])
  • Take word embeddings with mask (+ positional embeddings) and feed to transformer encoder
    • No notion of position of inputs, hence positional embeddings added
  • Make predictions of masked words
  • Gives a significant boost in performance
46
Q

Cross-lingual Masked Language Modeling

A
  • don’t have to stick to a single language
  • join two sentences (e.g., English and French) into one sequence with a separator between them
  • Mask the word(s) of interest in both languages
  • Add position and language embeddings to the word embeddings
  • Feed through transformer encoder architecture
  • Make predictions of the masked words
  • Strength: cross-lingual transfer
47
Q

Quantization

A

Speed up by performing matrix multiplication in a smaller precision domain

48
Q

Cross-lingual transfer

A

Take a pre-trained model and further train it on a classification task using English data. The model will then also be able to classify in other languages, even though the training language was only English.

49
Q

Byte-pair encoding

A

Like compression: the most frequent adjacent symbol pair is iteratively replaced by a new merged symbol
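A toy sketch of the merge loop on a word-frequency dictionary (illustrative only; real BPE tokenizers differ in details such as end-of-word markers and tie-breaking):

```python
from collections import Counter

def most_frequent_pair(words):
    """words: dict mapping a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def bpe_merges(words, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merged = "".join(pair)
        new_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                # replace each occurrence of the most frequent adjacent pair
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] = new_words.get(tuple(out), 0) + freq
        words = new_words
        merges.append(pair)
    return merges

corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}
print(bpe_merges(corpus, 3))   # learned merges, most frequent pair first
```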

50
Q

Masked Language Modeling is considered a _____ task

A

Masked Language Modeling is a pre-training task

51
Q

T/F: Softmax is differentiable

A

True - this is why it is the mechanism of choice for attention

52
Q

Hyperbolic Embeddings

A
  • Learning hierarchical representations by embedding entities into hyperbolic space
  • Discover hierarchies from similarity measurements
  • Needs fewer dimensions than word2vec (as few as 2)
  • in the circular (disk) visualization, more specific/detailed objects lie toward the perimeter
53
Q

Neural Machine Translation:

The probability of each output token estimated separately (left-to-right) is based on:

A
  • Entire input sentence (encoder outputs)
  • All previously predicted tokens (decoder “state”)
54
Q

Loss function for student (knowledge distillation)

A

Cross-entropy between the student's predictions and the ground-truth (hard) labels

55
Q

Differentiably Selecting a Vector from a set

A
  • Given vectors {u1, …, un} and query vector q
  • Inner product q with each ui and take the softmax: softmax(Uq) gives a soft (differentiable) selection of the most similar vector
56
Q

In Neural Machine Translation, argmax p(t | s) is considered ____

A
  • intractable
    • exponential search space of possible sequences
    • estimated by beam search
      • typical beam size: 4 to 6
57
Q

Teacher Forcing

A

Using the actual word instead of the predicted word to feed to the next time step of the RNN. It allows the model to keep training effectively even if it would have made a mistake in previous time steps.

58
Q

What is the objective function for Skip-gram?

A

(average) negative log-likelihood
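Written out (window size m, corpus of length T):

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
```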

59
Q

Multi-head attention

A
  • Combines multiple attention heads being trained in the same way on the same data - but with different weight matrices, and yielding different values
  • Each of the L attention heads yields values for each token - these values are then multiplied by trained parameters and added
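A minimal NumPy sketch of combining L heads (illustrative shapes; real implementations batch this, add masking, and fold the heads into single matrix multiplies):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    """x: (n, d) token vectors; Wq/Wk/Wv: per-head (d, dk) projections;
    Wo: (L*dk, d) trained output projection that combines the heads."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = x @ Wq_h, x @ Wk_h, x @ Wv_h           # different weights per head
        A = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))  # scaled dot-product attention
        heads.append(A @ V)                              # values produced by this head
    return np.concatenate(heads, axis=1) @ Wo            # multiply by trained params and add

n, d, dk, L = 4, 8, 2, 3
rng = np.random.default_rng(0)
Wq = [rng.standard_normal((d, dk)) for _ in range(L)]
Wk = [rng.standard_normal((d, dk)) for _ in range(L)]
Wv = [rng.standard_normal((d, dk)) for _ in range(L)]
Wo = rng.standard_normal((L * dk, d))
out = multi_head_attention(rng.standard_normal((n, d)), Wq, Wk, Wv, Wo)
```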
60
Q

Cross Entropy

A

The expected number of bits required to represent an event drawn from the reference distribution (p*) when using a coding scheme optimal for p
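In symbols:

```latex
H(p^{*}, p) = -\sum_{x} p^{*}(x) \log p(x)
```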

61
Q

LSTM forget gate (ft)

A
  • result of affine transformation of previous hidden state and current input passed through sigmoid
  • decides how much of previous cell state to keep around
  • ft = 0, forget everything
  • ft = 1, remember everything
62
Q

Applications of Conditional Language Modeling

A
  • Topic aware language model
    • c = the topic, s = the text
  • Text summarization
    • c = long document, s = summary
  • Machine Translation
    • c = French, s = English
  • Image captioning
    • c = image, s = caption
  • Optical character recognition
    • c = image of a line, s = its content
  • speech recognition
    • c = recording, s = content
63
Q

What is the problem with modeling sequences with Multi-layer perceptrons?

A
  • Cannot easily support variable-sized sequences as inputs
  • Cannot easily support variable-sized sequences as outputs
  • No inherent temporal structures
    • no notion that input 1 comes before input 2
  • No practical way of holding state
    • does not generalize when words change order
  • The size of the network grows with the maximum allowed size of the input or output sequences
64
Q

T/F: Embeddings of different types (page, video, or word embeddings) can be combined to perform one task

A

True

65
Q

Vocabulary Reduction

A
  • Not all tokens likely for every input sequence
  • IBM alignment models use statistics to model translation probabilities
  • lexical probabilities can be used to predict the most likely output tokens for a given input
66
Q

Distributional Semantics

A

A word’s meaning is given by the words that frequently appear close-by

67
Q

Perplexity

A

Geometric mean of the inverse probability of a sequence of words according to the model.

The perplexity of a discrete uniform distribution over k outcomes is k.
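As a formula for a sequence w1, …, wN under model p:

```latex
\mathrm{PPL}(w_1, \dots, w_N) = \left( \prod_{i=1}^{N} \frac{1}{p(w_i \mid w_1, \dots, w_{i-1})} \right)^{1/N}
```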

68
Q

Loss function for distillation (knowledge distillation)

A

Cross-entropy between teacher and student or KL divergence

69
Q

Softmax is permutation ____

A

Softmax is permutation equivariant

A permutation of the input leads to the same permutation of the output

70
Q

Language Modeling

A

Allows us to estimate probabilities of sequences (e.g., p(“I eat an apple”)) and lets us perform comparisons

p(s) = p(w1, w2, …, wn)

= p(w1) p(w2 | w1) p(w3 | w1, w2) … p(wn | w1, …, wn-1)

71
Q

Per-word Cross-entropy

A

Cross-entropy averaged over all words in the sequence, where the reference distribution is the empirical distribution of words in the sequence. This is commonly used as a loss function.

72
Q

LSTM candidate update (ut)

A
  • result of affine transformation of previous hidden state and current input passed through tanh
  • new information coming from the input we’ve just seen
  • modulated by the input gate
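Collecting the four LSTM gate cards into one reference block (a common parameterization, using the deck's gate names ft, gt, ot, ut; the weight matrices are notational assumptions):

```latex
\begin{aligned}
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) && \text{forget gate} \\
g_t &= \sigma(W_g h_{t-1} + U_g x_t + b_g) && \text{input gate} \\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) && \text{output gate} \\
u_t &= \tanh(W_u h_{t-1} + U_u x_t + b_u) && \text{candidate update} \\
c_t &= f_t \odot c_{t-1} + g_t \odot u_t && \text{cell state} \\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```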
73
Q

Truncated backpropagation through time

A

Updates become expensive for long sequences, so the state is carried forward indefinitely, but gradients are only backpropagated through a fixed number of timesteps.
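A PyTorch-flavored sketch of the idea, assuming a simple RNN and a chunked training loop (names and hyperparameters are illustrative, not from the course):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 8)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

sequence = torch.randn(1, 1000, 8)   # (batch, time, features) toy data
targets = torch.randn(1, 1000, 8)
window = 50                          # only backpropagate through 50 timesteps
hidden = None

for start in range(0, sequence.size(1), window):
    chunk = sequence[:, start:start + window]
    target = targets[:, start:start + window]
    output, hidden = rnn(chunk, hidden)
    loss = nn.functional.mse_loss(readout(output), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    hidden = hidden.detach()         # state is carried forward, gradient is cut here
```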

74
Q

Non-Local Neural Networks

A

Networks where the output at each position is a weighted sum over all positions, with the weights computed by a data-dependent similarity function (a generalization of self-attention beyond local or fixed connectivity)

75
Q

Knowledge Distillation

A
  • Have pretrained model (teacher)
    • too slow or too expensive
  • Add a student model
  • Both teacher and student model perform (soft) prediction
  • Difference in loss between teacher and student is distillation loss
  • Student model
    • minimize distillation loss
    • minimize student loss
76
Q

Selecting a vector from a set

A

Inner product a query q with each vector in the set to score the matches; pick the best match (hard selection) or take the softmax and form a weighted average (soft selection)

77
Q

T/F: At each timestep for an RNN, the predicted word by the network is fed to the next timestep

A

False - The true word is fed to the next timestep, not the predicted word. “Teacher Forcing”

78
Q

RNN components

A

xt - input at time t

ht - state at time t

ht-1 - state at time t - 1

f_theta - cell

The RNN is recursive with state being passed at each time step to the next one

79
Q

How to feed words to an RNN?

A
  • One hot vector representation of all words in vocabulary
80
Q

The more general way to look at embeddings

A
  • Graphs
    • node > vector
    • optimize objective that connected nodes have more similar embeddings than unconnected nodes via gradient descent
  • Words
    • word > vector
81
Q

Vanilla (Elman) RNN

A
  • ht = tanh(W ht-1 + U xt + b)
  • the new state is the result of an affine transformation of the previous hidden state and the current input passed through a tanh nonlinearity
82
Q

T/F: Neural translation quality changes linearly

A

False - Small change can cause catastrophic error

83
Q

Translation is often modeled as a _____

A

Conditional language model

P(t | s) = P(t1 | s) · P(t2 | t1, s) · … · P(tn | t1, …, tn-1, s)

84
Q

How do the query q, vectors {u1,…,un}, and distributions in softmax attention differ from that in an MLP?

A
  • Softmax at the final layer of an MLP
    • q is the last hidden state
    • {u1,…,un} are the embeddings of the class labels
    • samples from the distribution correspond to labelings (outputs)
  • Softmax Attention
    • q is an internal hidden state
    • {u1,…,un} are the embeddings of an input (i.e., the previous layer)
    • the distribution corresponds to a summary of {u1,…,un}
      • a weighted summary of u
85
Q

What causes the vanishing gradient problem in Vanilla RNNs?

A

Backpropagation through time repeatedly multiplies gradients by the same recurrent weight matrix and by the derivatives of the squashing nonlinearities; when these factors are small, the gradient shrinks exponentially with sequence length

86
Q

Word2vec

A

Efficient Estimation of Word Representations in Vector Space

87
Q

Word embedding evaluation

a:b :: c: ?

Is an example of a ___ word embedding evaluation

A

intrinsic

Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions

88
Q

Graph Embeddings

A

Optimize the objective that connected nodes have more similar embeddings than unconnected nodes, via gradient descent

89
Q

Graph Data examples

A

Knowledge Graphs

Recommender Systems

Social graphs

90
Q

Word2vec: the skip-gram model

A
  • Word Embeddings
    • Idea - use words to predict their context words
    • context - a fixed window of size 2m
91
Q

GloVe

A

Global Vectors

Training of the embedding is performed on aggregated global word co-occurrence statistics from a corpus.

92
Q

What are less computationally expensive alternatives for the inner product softmax in Word2Vec?

A

Hierarchical Softmax

Negative Sampling
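For reference, the standard negative-sampling objective for one (center, context) pair with K sampled negatives (this is the usual skip-gram-with-negative-sampling formulation, not stated on the card):

```latex
J = -\log \sigma(u_o^\top v_c) \; - \; \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[ \log \sigma(-u_{w_k}^\top v_c) \right]
```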

93
Q

Attention-Based Networks can _________ in an ordered or arbitrary set

A

up or down weight a whole range of elements

94
Q

Language Models are ____ models of language

A

generative (can make new sequences of words based off of conditional probabilities)

95
Q

T/F: Graph embeddings are a task specific entity representation

A

False - Task-agnostic

96
Q

Sentence-level tasks

A
  • ex: sentiment analysis
  • input a sentence without any masked tokens + positions, go through the transformer encoder architecture, output the global meaning of the sentence
97
Q

Evaluating LM Performance

A
  • Cross Entropy
98
Q

Embedding

A

A learned map from entities to vectors of numbers that encodes similarity

99
Q

Collobert and Weston vectors

A

A word and its context is a positive training sample; a random word in that context gives a negative training sample

100
Q

T/F: The softmax composed of the inner product between two word vectors is expensive to compute

A

True.