Quiz 4 Flashcards

1
Q

RNN forward update rule

A
  1. At each time step, the input and previous hidden state are fed through a linear layer.
  2. Followed by a non-linearity (tanh) to produce the new hidden state.
  3. Multiplied by the output linear layer to get logits.
  4. Softmax to get probabilities.
  5. Compute the CE loss against the true target.

Formally:
U, V, W = weights for input-to-hidden, hidden-to-output, hidden-to-hidden
a_t = Ux_t + Wh_{t-1} + b
h_t = tanh(a_t)
o_t = Vh_t + c
y_hat_t = softmax(o_t)
L_t = CE(y_hat_t, y_t)

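A minimal NumPy sketch of this forward step (the sizes, random weights, and variable names are illustrative assumptions, not a reference implementation):

```python
import numpy as np

# One RNN forward step, following the update rule above.
rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3           # input, hidden, and output (vocab) sizes (assumed)

U = rng.normal(size=(d_hid, d_in))     # input-to-hidden
W = rng.normal(size=(d_hid, d_hid))    # hidden-to-hidden
V = rng.normal(size=(d_out, d_hid))    # hidden-to-output
b, c = np.zeros(d_hid), np.zeros(d_out)

def rnn_step(x_t, h_prev):
    a_t = U @ x_t + W @ h_prev + b     # linear layer over input and previous hidden
    h_t = np.tanh(a_t)                 # non-linearity
    o_t = V @ h_t + c                  # output linear layer -> logits
    y_hat = np.exp(o_t - o_t.max())
    y_hat /= y_hat.sum()               # softmax -> probabilities
    return h_t, y_hat

def ce_loss(y_hat, target_idx):
    return -np.log(y_hat[target_idx])  # cross-entropy against the true token

h, y_hat = rnn_step(rng.normal(size=d_in), np.zeros(d_hid))
print(ce_loss(y_hat, target_idx=1))
```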
2
Q

RNN: How is loss calculated over entire sequence

A

Loss at each time step is summed

3
Q

RNN: True or False. hidden weights are shared across the sequence.

A

True

4
Q

RNN: Advantage of sharing parameters across sequence

A

Sharing allows generalization to sequence lengths that did not appear in the training set.

5
Q

RNN: True or False. An RNN architecture must always pass the hidden state to the next time step.

A

False. The Goodfellow book shows examples where the output from t-1, rather than the hidden state, is passed to the hidden layer at t.

6
Q

RNN: RNN architecture that passes only the output to next time step is likely to be less powerful.

A

True. If the hidden state is not passed, the output alone lacks important information from the past.

7
Q

Vanishing gradient

A

Gradients diminish as they are backpropagated through time, leading to little or no learning.

8
Q

Exploding gradient

A

Gradients grow as they are backpropagated through time, leading to unstable learning.

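A toy NumPy illustration of both failure modes (the diagonal recurrent matrix and step count are assumptions chosen to make the effect obvious): the backpropagated gradient picks up a factor of roughly W^T per time step, so its norm collapses or blows up depending on whether the spectral radius of W is below or above 1.

```python
import numpy as np

# Repeatedly multiplying a gradient by W^T, as backprop through time does.
rng = np.random.default_rng(0)
d = 16
grad = rng.normal(size=d)

for scale, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    W = scale * np.eye(d)            # spectral radius < 1 vs. > 1
    g = grad.copy()
    for _ in range(50):              # 50 steps of backprop through time
        g = W.T @ g
    print(label, np.linalg.norm(g))  # ~1e-15 vs. ~1e+9
```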
9
Q

RNN cons

A
  1. Vanishing gradients
  2. Exploding gradients
  3. Limited “memory”: can handle short-term dependencies but not long-term ones
  4. Lack of control over memory: unlike an LSTM, it has no mechanism to control what information should be kept
10
Q

RNN: Advantage over LSTM

A

Fewer parameters (smaller model size)

11
Q

LSTM update rule - components and flow

A

Sequence (FICO):
1. Forget gate
2. Input gate
3. Cell Gate
4. Output gate

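A minimal NumPy sketch of one LSTM step in this FICO order (sizes, random weights, and helper names are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
W = {g: rng.normal(size=(d_hid, d_in + d_hid)) for g in "fico"}
b = {g: np.zeros(d_hid) for g in "fico"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate: what to drop from c_{t-1}
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate: what new info to admit
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde      # cell update: combine forget, input, candidate
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # next hidden state from updated cell state
    return h_t, c_t

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid))
```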
12
Q

LSTM: What state is passed throughout the layer

A

Cell state

13
Q

LSTM: What does forget gate do

A

Takes the input and hidden state and decides what information from the cell state should be thrown away or kept.

14
Q

LSTM: What does input gate do

A

Updates the cell state with new information

15
Q

LSTM: What does cell gate do

A

Combines information from the forget gate, input gate, and candidate cell state to output the new cell state at time step t.

16
Q

LSTM: What does output gate do

A

Decides the next hidden state based on the updated cell state.

17
Q

LSTM: Pros over RNN

A

Controls the flow of gradients so that they neither vanish nor explode.

18
Q

RNN: What is a recursive neural network (RecNNs, not RNN!), and what are its advantages

A
  1. Can handle hierarchical structure
  2. Reduces vanishing gradients, since the nested, shorter RNNs give shorter backpropagation paths
19
Q

GRU: Main difference to LSTM

A

A single update gate controls both the forgetting and the cell-state update.

20
Q

Gradient clipping: main use

A

Control exploding gradients

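A sketch of where clipping typically sits in a PyTorch training loop (the model, fake data, and max_norm=1.0 are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 8, 16)        # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()          # stand-in loss

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0, taming exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```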
21
Q

Why does MLP not work for modeling sequences

A
  1. Can’t support variable-sized inputs or outputs
  2. No inherent temporal structure
  3. Cannot maintain a “state” of the sequence
22
Q

LM: Define Language Models

A

Models that estimate probabilities of sequences, allowing us to compare them.

23
Q

LM: How are probabilities of an input sequence of words calculated?

A

Chain rule of probability.

Probability of sentence = Product of conditional probabilities over i, which indexes our words.

p(s) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_1, …, w_{n-1})

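A small Python sketch of scoring a sentence this way, assuming some model exposes a hypothetical cond_prob(word, history) function:

```python
import math

def sentence_log_prob(words, cond_prob):
    log_p = 0.0
    for i, w in enumerate(words):
        log_p += math.log(cond_prob(w, words[:i]))  # log p(w_i | w_1 ... w_{i-1})
    return log_p

# Toy usage with a made-up uniform model over a 10-word vocabulary:
uniform = lambda w, history: 1.0 / 10
print(math.exp(sentence_log_prob(["the", "cat", "sat"], uniform)))  # 0.001
```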
24
Q

LM: 3 applications of language modeling

A
  1. predictive typing
  2. ASR
  3. grammar correction
25
LM: How is "conditional" language modeling different
Adds an extra context "c" to the chain rule of probability: p(s | c) = product over i of p(w_i | c, w_1, …, w_{i-1})
26
Conditional LM: What is context and sequence for Topic-aware language model
C = topic S = text
27
Conditional LM: What is context and sequence for Text summarization
C = long document S = summary
28
Conditional LM: What is context and sequence for Machine Translation
C = French text S = English text
29
Conditional LM: What is context and sequence for Image captioning
C = image S = caption
30
Conditional LM: What is context and sequence for OCR
C = image of a line S = its content
31
Conditional LM: What is context and sequence for Speech Recognition
C = recording S = its transcription
32
Teacher forcing
During training, the model is fed the true target sequence (the ground-truth previous tokens), not its own generated output.
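A minimal PyTorch sketch of teacher forcing with a toy GRU language model (all sizes and the model itself are illustrative assumptions); the inputs at each step are the ground-truth previous tokens, not the model's own samples:

```python
import torch
import torch.nn as nn

vocab, d_emb, d_hid = 100, 32, 64
emb = nn.Embedding(vocab, d_emb)
rnn = nn.GRU(d_emb, d_hid, batch_first=True)
head = nn.Linear(d_hid, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 21))        # batch of ground-truth token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # feed the TRUE previous tokens
out, _ = rnn(emb(inputs))
logits = head(out)                               # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
```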
33
Knowledge distillation
Smaller model (student) learns to mimic the predictions of a larger, more complex model (teacher) by transferring its knowledge
34
T or F: Hard labels are passed from teacher model to student
False. Soft labels give more signal than hard labels.
35
Knowledge distillation loss components: Teacher-student loss
CE loss between the teacher's and the student's prediction scores; can use hard or soft labels
36
Knowledge distillation loss components: Student loss
CE loss between student prediction and ground truth
37
Knowledge distillation loss components: Combined loss
teacher-student loss × its weight + student loss × its weight (a weighted sum of the two)
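One common way to implement this combined loss, sketched in PyTorch (the temperature T, mixing weight alpha, and the KL term over softened distributions are assumptions, not the only formulation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Teacher-student loss on softened (soft-label) distributions.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    ts_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    # Student loss against the ground-truth hard labels.
    s_loss = F.cross_entropy(student_logits, targets)
    # Weighted combination of the two.
    return alpha * ts_loss + (1 - alpha) * s_loss

student_logits, teacher_logits = torch.randn(4, 10), torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, targets))
```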
38
Define cross-entropy loss
Expected number of bits required to represent an event from the reference distribution (P*) when using a coding scheme optimal for P.
39
Define per-word cross-entropy
Cross-entropy average over all the words in a sequence
40
What is the reference distribution (P*) in per-word cross entropy?
Empirical distribution of the words in the sequence
41
Define perplexity
Geometric mean of the inverse probability of a sequence of words.
42
What is the perplexity of a fair 10-sided die?
10 (the perplexity of a discrete uniform distribution over k outcomes is k)
43
Define perplexity using law of logarithms
Perplexity is the exponentiation of per-word cross-entropy (2^per-word cross-entropy); equivalently, the log of the perplexity is the per-word cross-entropy.
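A worked example connecting per-word cross-entropy and perplexity (the per-word probabilities below are made up):

```python
import math

word_probs = [0.2, 0.1, 0.25, 0.5]  # model's probability for each word of a sequence

per_word_ce = -sum(math.log2(p) for p in word_probs) / len(word_probs)
perplexity = 2 ** per_word_ce       # perplexity = 2^(per-word cross-entropy)

print(per_word_ce)   # ~2.16 bits per word
print(perplexity)    # ~4.47 = geometric mean of the inverse probabilities
```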
44
Define pretraining task
A task that is not the final task itself, but that helps us obtain better initial parameters for modeling the final task.
45
Masked language models: Key idea
Model learns to predict masked tokens of a sequence.
46
Embeddings: Distributional semantics
A word's meaning is given by the words that frequently appear close-by
47
Embeddings: Key idea of "A Neural Probabilistic Language Model" Bengio, 2003
Map a feature vector to each word in the vocabulary and use those vectors to predict the next word.
48
Embeddings: Key idea of "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning" Collobert & Weston, 2008 & "Natural Language Processing (Almost from Scratch)" Collobert et al., 2011
Use a CNN to create word embeddings; nearest neighbors (KNN) of the resulting vectors show similar syntax and semantics.
49
"Efficient Estimation of Word Representations in Vector Space" Mikolov et al., 2013
Word2Vec, continuous bag of words (CBOW), skip-gram
50
Collobert & Weston vectors. Key idea:
A word in its context is a positive training sample; replacing it with a random word gives a negative training sample. Positive: "cat chills on a mat"; negative: "cat chills Ohio a mat"
51
Skip gram objective function
(average) negative log-likelihood
52
Skip gram parameters
word vectors
53
Skip gram probability P(w_{t+j} | w_t; theta) - how is it defined?
Softmax over the inner product of the center-word and context-word vectors; it measures how likely the context word is to appear with the center word.
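A NumPy sketch of this probability (vocabulary size, embedding dimension, and random vectors are illustrative assumptions); the softmax normalizes over the full vocabulary, which is exactly what makes it expensive (next card):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100                       # vocabulary size, embedding dimension
center_vecs = rng.normal(size=(V, d))    # center-word ("input") vectors
context_vecs = rng.normal(size=(V, d))   # context-word ("output") vectors

def p_context_given_center(context_id, center_id):
    scores = context_vecs @ center_vecs[center_id]  # inner product with every word
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the FULL vocabulary
    return probs[context_id]

print(p_context_given_center(context_id=42, center_id=7))
```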
54
Skip gram probability - what makes computation expensive
size of vocabulary
55
Skip gram: Ways to reduce computation
1. Hierarchical softmax 2. Negative sampling
56
GloVe: Key idea
Training of the embeddings is done on global word co-occurrence statistics from a corpus.
57
fastText: Key idea
Handles OOV words (via subword embeddings) + multilingual
58
Intrinsic evaluation of word embeddings
Evaluated on a sub-task
59
Extrinsic evaluation of word embeddings
Evaluated on a downstream task
60
Intrinsic evaluation of word embeddings - example
word analogy task (man:woman, king:?)
61
Graph embedding definition
Learn features such that connected nodes are more similar than unconnected nodes
62
Hyperbolic embeddings - what is it good for?
Better at modeling hierarchical structures
63
t-SNE key idea
Measures pairwise similarities in the high-dimensional space and performs SGD to minimize the divergence between the high-dimensional and low-dimensional similarity distributions.
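Typical scikit-learn usage (the random matrix stands in for real learned embeddings; perplexity=30 is an assumed setting):

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.default_rng(0).normal(size=(200, 100))  # stand-in word vectors

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords_2d = tsne.fit_transform(embeddings)  # minimizes divergence between high-dim
print(coords_2d.shape)                      # and low-dim similarities -> (200, 2)
```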
64
Encoders can handle language modeling, T/F
False. Encoders see bidirectional context, so they can't be used directly for (left-to-right) language modeling.
65
Self-attention: Sizes of Query, Key, and Value matrices, given: 1. Hidden dimension: d_model 2. Q,K,V dimension: d_q, d_k, d_v
Query: d_model × d_k; Key: d_model × d_k; Value: d_model × d_v
66
Self-attention: Dimension of self-attention output (O), given: 1. Hidden dimension: d_model 2. Q,K,V dimension: d_q, d_k, d_v 3. Number of heads h
in_features: h · d_v, out_features: d_model (i.e., the output projection is h·d_v × d_model)
67
Define "cross-attention"
Combines two different input sequences: 1. the sequence returned by the encoder, and 2. the sequence processed by the decoder.
68
Attention: big O for each layer, given: 1. seq length "n" 2. representation dimension "d"
O (n^2 * d)
69
Attention: big O for sequential operations
O(1)
70
Attention: big O for maximum path length
O(1)
71
Why is self-attention layer's sequential operation O(1) compared to RNN's O(n)?
A self-attention layer connects all positions with a constant number of sequential operations, whereas an RNN must process the positions one after another.
72
Self-attention layers are faster than recurrent layers when the _____ is smaller than the _________
sequence length, representation dimensionality
73
Multi-head attention: How to get size of d_k and d_v given: 1. dimension of hidden: d_model 2. number of heads (h)
d_model / h
74
Multi-head attention: Given d_model (512) and number of heads (8), what is d_k?
64 (d_model / h)
75
Attention: What is the formula for scaled dot product?
Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
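A single-head NumPy sketch of this formula (sequence length and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 64
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n), scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v)

print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)
```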
76
Scaled dot product attention: As Query vector increases in dimension, magnitudes of dot product similarities increase. How do we mitigate this?
Normalize by dividing the dot products by the square root of the query/key dimension (sqrt(d_k)).
77
Cross-attention: Where do K, V values come from
Encoder output
78
Cross-attention: Where do Q value come from
Masked multi-head attention from decoder
79
Self Attention: Q, K, V matrix shapes (ie A * B, what are A and B)
Q: D_X × D_Q; K: D_X × D_K; V: D_X × D_V
80
Purpose of Key vectors
Compare inputs to Queries
81
Purpose of Value vectors
Return the information selected by the Q-K comparison back to the output (e.g., the decoder)
82
3 types of attention in Transformer
1. Cross-attention: K and V from the encoder, Q from the decoder. 2. Self-attention (encoder): K, V, Q from the word + positional embeddings. 3. Self-attention (decoder): K, V, Q from the (masked) output embeddings up to t-1.
83
Difference of Encoder-Decoder Attention vs self-attention
In E/D attention, Queries come externally (e.g., from the decoder). In self-attention, Queries come from the inputs themselves.
84
What happens if you permute the order of the inputs in a self-attention layer?
Permutation equivariant: the outputs contain the same values, just permuted in the same way as the inputs.
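A quick NumPy check of this property on a projection-free self-attention layer (random inputs and a random permutation; Q = K = V = X is an assumed simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X):
    scores = X @ X.T / np.sqrt(X.shape[-1])  # Q = K = V = X
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
    return w @ X

X = rng.normal(size=(5, 8))
perm = rng.permutation(5)
print(np.allclose(self_attention(X)[perm], self_attention(X[perm])))  # True: same values, reordered
```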
85
Since self-attention is permutation equivariant, what must be added beforehand to propagate the order of the input sequence?
Add position embedding to word embedding.