Quiz 4 Flashcards

1
Q

RNN forward update rule

A
  1. At each time step, the input and the previous hidden state are fed through a linear layer.
  2. Followed by a non-linearity (tanh) to produce the new hidden state.
  3. Multiplied by the output linear layer to get logits.
  4. Softmax to get probabilities.
  5. Compute the CE loss.

Formally:
U, V, W = weights for input-to-hidden, hidden-to-output, hidden-to-hidden
a_t = U x_t + W h_{t-1} + b
h_t = tanh(a_t)
o_t = V h_t + c
y_hat_t = softmax(o_t)
L_t = CE(y_hat_t, y_t)
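A minimal NumPy sketch of one forward step under this update rule; the shapes, the helper softmax, and treating y_t as a class index are illustrative assumptions, not from the quiz:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, V, b, c, y_t):
    """One RNN time step; y_t is the integer index of the true token."""
    a_t = U @ x_t + W @ h_prev + b   # input-to-hidden + hidden-to-hidden (linear)
    h_t = np.tanh(a_t)               # non-linearity
    o_t = V @ h_t + c                # output linear layer -> logits
    y_hat_t = softmax(o_t)           # probabilities
    loss_t = -np.log(y_hat_t[y_t])   # cross-entropy loss for this step
    return h_t, loss_t
```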

2
Q

RNN: How is loss calculated over entire sequence

A

Loss at each time step is summed

3
Q

RNN: True or False. Hidden weights are shared across the sequence.

A

True

4
Q

RNN: Advantage of sharing parameters across sequence

A

Sharing allows generalization to sequence lengths that did not appear in the training set.

5
Q

RNN: True or False. The RNN architecture must always pass the hidden state to the next time step.

A

False. The Goodfellow book shows examples where the output from t-1, rather than the hidden state, is passed to the hidden layer at t.

6
Q

RNN: True or False. An RNN architecture that passes only the output to the next time step is likely to be less powerful.

A

True. If the hidden state is not passed forward, the output alone must summarize the past, so important information is lost.

7
Q

Vanishing gradient

A

Gradients diminish as they are backpropagated through time, so earlier time steps receive little or no learning signal.

8
Q

Exploding gradient

A

Gradients grow as they are backpropagated through time, leading to unstable learning.

9
Q

RNN cons

A
  1. Vanishing gradients
  2. Exploding gradients
  3. Limited “memory”: can handle short-term dependencies but not long-term ones
  4. Lack of control over memory: unlike an LSTM, there is no mechanism to control what information should be kept
10
Q

RNN: Advantage over LSTM

A

Smaller parameter size

11
Q

LSTM update rule - components and flow

A

Sequence (FICO):
1. Forget gate
2. Input gate
3. Cell gate
4. Output gate
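A minimal NumPy sketch of one LSTM step in the FICO order above; the concatenated-input weight shapes and the names Wf, Wi, Wg, Wo are illustrative assumptions (g_t is the candidate cell state, and the c_t line is the “cell gate” update on this card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    z = np.concatenate([h_prev, x_t])   # gates see the previous hidden state and the input
    f_t = sigmoid(Wf @ z + bf)          # forget gate: what to keep from c_prev
    i_t = sigmoid(Wi @ z + bi)          # input gate: how much new information to write
    g_t = np.tanh(Wg @ z + bg)          # candidate cell state
    c_t = f_t * c_prev + i_t * g_t      # cell state update
    o_t = sigmoid(Wo @ z + bo)          # output gate
    h_t = o_t * np.tanh(c_t)            # next hidden state
    return h_t, c_t
```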

12
Q

LSTM: What state is passed throughout the layer

A

Cell state

13
Q

LSTM: What does forget gate do

A

Takes the input and hidden state and decides what information from the cell state should be thrown away or kept.

14
Q

LSTM: What does input gate do

A

Updates the cell state with new information

15
Q

LSTM: What does cell gate do

A

Combines information from the forget gate, the input gate, and the candidate cell state to output the new cell state at time step t.

16
Q

LSTM: What does output gate do

A

Decides the next hidden state based on the updated cell state.

17
Q

LSTM: Pros over RNN

A

Controls the flow of the gradient so that gradients neither vanish nor explode.

18
Q

RNN: What is a recursive neural network (RecNNs, not RNN!), and what are its advantages

A
  1. Can handle hierarchical structure
  2. Reduces vanishing gradients because the nested structure yields shorter paths than one long RNN chain
19
Q

GRU: Main difference to LSTM

A

A single update gate controls both forgetting and the state update, and the cell and hidden states are merged.

20
Q

Gradient clipping: main use

A

Control exploding gradients
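A minimal PyTorch sketch of clipping the global gradient norm before the optimizer step; the tiny linear model, random data, and max_norm=1.0 are illustrative only:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0, preventing exploding updates.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```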

21
Q

Why does MLP not work for modeling sequences

A
  1. Can’t support variable-sized inputs or outputs
  2. No inherent temporal structure
  3. Cannot maintain a “state” of the sequence
22
Q

LM: Define Language Models

A

Models that estimate the probabilities of word sequences, allowing us to compare and rank alternative sequences.

23
Q

LM: How are probabilities of an input sequence of words calculated?

A

Chain rule of probability.

Probability of sentence = Product of conditional probabilities over i, which indexes our words.

p(s) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) … p(w_n | w_{n-1}, …, w_1)
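A toy Python sketch of the same chain rule in log space; the conditional probabilities below are made up for illustration:

```python
import math

# p(w_1), p(w_2 | w_1), p(w_3 | w_1, w_2), p(w_4 | w_1..w_3) for a 4-word sentence
cond_probs = [0.2, 0.5, 0.1, 0.4]

log_p_sentence = sum(math.log(p) for p in cond_probs)  # sum of log conditionals
p_sentence = math.exp(log_p_sentence)                  # 0.2 * 0.5 * 0.1 * 0.4 = 0.004
```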

24
Q

LM: 3 applications of language modeling

A
  1. predictive typing
  2. automatic speech recognition (ASR)
  3. grammar correction
25
Q

LM: How is “conditional” language modeling different

A

Adds an extra context “c” to the chain rule of probability:

p(s | c) = product over i of p(w_i | c, w_{i-1}, …, w_1)

26
Q

Conditional LM: What is context and sequence for

Topic-aware language model

A

C = topic
S = text

27
Q

Conditional LM: What is context and sequence for

Text summarization

A

C = long document
S = summary

28
Q

Conditional LM: What is context and sequence for

Machine Translation

A

C = French text
S = English text

29
Q

Conditional LM: What is context and sequence for

Image captioning

A

C = image
S = caption

30
Q

Conditional LM: What is context and sequence for

OCR

A

C = image of a line
S = its content

31
Q

Conditional LM: What is context and sequence for

Speech Recognition

A

C = recording
S = its transcription

32
Q

Teacher forcing

A

During training, the model is fed the ground-truth target tokens from the previous time steps rather than its own generated output.
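A minimal PyTorch-style sketch; the step-wise `decoder` callable (returning per-step logits and a new hidden state) and the <bos> id of 0 are assumptions for illustration, not part of the quiz:

```python
import torch
import torch.nn.functional as F

def decode_with_teacher_forcing(decoder, targets, hidden):
    """targets: 1-D tensor of ground-truth token ids for one sequence."""
    loss, inp = 0.0, targets.new_tensor([0])      # assumed <bos> token id
    for t in range(len(targets)):
        logits, hidden = decoder(inp, hidden)     # one decoding step
        loss = loss + F.cross_entropy(logits, targets[t:t + 1])
        inp = targets[t:t + 1]                    # teacher forcing: feed the TRUE token,
                                                  # not logits.argmax()
    return loss / len(targets)
```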

33
Q

Knowledge distillation

A

Smaller model (student) learns to mimic the predictions of a larger, more complex model (teacher) by transferring its knowledge

34
Q

T or F: Hard labels are passed from teacher model to student

A

False. Soft labels (the teacher’s output distribution) are passed because they give more signal than hard labels.

35
Q

Knowledge distillation loss components: Teacher-student loss

A

CE loss between the teacher’s and the student’s prediction scores

(can be computed with hard or soft labels)

36
Q

Knowledge distillation loss components: Student loss

A

CE loss between student prediction and ground truth

37
Q

Knowledge distillation loss components: Combined loss

A

(teacher-student loss × its weight) + (student loss × its weight)
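A minimal PyTorch sketch of this combined loss. The temperature T and weight alpha are assumptions; the soft teacher-student term is written here with KL divergence, which differs from cross-entropy against the teacher’s softened distribution only by a constant (the teacher’s entropy), so the gradients are the same:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Teacher-student loss: match the teacher's temperature-softened (soft-label) distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Student loss: ordinary CE against the ground-truth hard labels.
    hard = F.cross_entropy(student_logits, labels)
    # Combined loss: weighted sum of the two components.
    return alpha * soft + (1.0 - alpha) * hard
```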

38
Q

Define cross-entropy loss

A

Expected number of bits required to represent an event from the reference distribution P* when using a coding scheme that is optimal for P.

39
Q

Define per-word cross-entropy

A

Cross-entropy averaged over all the words in a sequence.

40
Q

What is the reference distribution (P*) in per-word cross entropy?

A

Empirical distribution of the words in the sequence

41
Q

Define perplexity

A

The geometric mean of the inverse conditional word probabilities of a sequence (equivalently, the inverse probability of the sequence, normalized by its length).

42
Q

What is the perplexity of rolling a 10-sided die (each outcome has probability 1/10)?

A

10

(the perplexity of a discrete uniform distribution over k outcomes is k)

43
Q

Define perplexity using law of logarithms

A

Perplexity is the exponential of the per-word cross-entropy; equivalently, the log of the perplexity equals the per-word cross-entropy.
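A toy Python sketch of this relationship; the probabilities are made up for illustration, and natural logs are used so the exponential is base e:

```python
import math

cond_probs = [0.2, 0.5, 0.1, 0.4]                 # p(w_i | history) for each word
per_word_ce = -sum(math.log(p) for p in cond_probs) / len(cond_probs)
perplexity = math.exp(per_word_ce)                # ~3.97 here

# Sanity check: a uniform 1/10 probability per word gives perplexity 10 (the die card above).
assert abs(math.exp(-math.log(0.1)) - 10.0) < 1e-9
```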

44
Q

Define pretraining task

A

A task that is not the final task itself, but that helps us obtain better initial parameters for modeling the final task.

45
Q

Masked language models: Key idea

A

Model learns to predict masked tokens of a sequence.

46
Q

Embeddings: Distributional semantics

A

A word’s meaning is given by the words that frequently appear close-by

47
Q

Embeddings: Key idea of “A Neural Probabilistic Language Model” Bengio, 2003

A

Map each word in the vocabulary to a feature vector and use the vectors of the preceding words to predict the next word.

48
Q

Embeddings: Key idea of “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning” Collobert & Weston, 2008 & “Natural Language Processing (Almost from Scratch)” Collobert et al., 2011

A

Use a CNN to create word embeddings.
The nearest neighbors (KNN) of a word’s vector show similar syntax and semantics.

49
Q

“Efficient Estimation of Word Representations in Vector Space” Mikolov et al., 2013

A

Word2Vec, continuous bag of words (CBOW), skip-gram

50
Q

Collobert & Weston vectors. Key idea:

A

A word together with its context is a positive training sample; replacing the word with a random word in that same context gives a negative training sample.

positive: “cat chills on a mat”
negative: “cat chills Ohio a mat”

51
Q

Skip gram objective function

A

(average) negative log-likelihood

52
Q

Skip gram parameters

A

The word vectors themselves (a center vector and a context vector for every word).

53
Q

Skip gram probability P(w_{t+j} | w_t; θ) - how is it defined?

A

A softmax over the inner products of the center word’s vector with each context word’s vector; it measures how likely the context word is to appear near the center word.
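Written out in LaTeX (assuming v for center-word vectors and u for context-word vectors, notation added here for illustration):

```latex
P(w_{t+j} \mid w_t ; \theta) = \frac{\exp\!\left(u_{w_{t+j}}^{\top} v_{w_t}\right)}{\sum_{w \in V} \exp\!\left(u_{w}^{\top} v_{w_t}\right)}
```

The denominator sums over the entire vocabulary V, which is what the next card identifies as the expensive part.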

54
Q

Skip gram probability - what makes computation expensive

A

The size of the vocabulary: the softmax denominator sums over every word in it.

55
Q

Skip gram: Ways to reduce computation

A
  1. Hierarchical softmax
  2. Negative sampling
56
Q

GloVe: Key idea

A

Embeddings are trained on global word co-occurrence statistics from a corpus.

57
Q

fastText: Key idea

A

Uses subword (character n-gram) embeddings, so it handles OOV words and supports many languages.

58
Q

Intrinsic evaluation of word embeddings

A

Evaluated on a sub-task

59
Q

Extrinsic evaluation of word embeddings

A

Evaluated on a downstream task

60
Q

Intrinsic evaluation of word embeddings - example

A

word analogy task (man:woman, king:?)

61
Q

Graph embedding definition

A

Learn features such that connected nodes are more similar than unconnected nodes

62
Q

Hyperbolic embeddings - what is it good for?

A

Better at modeling hierarchical structures.

63
Q

t-SNE key idea

A

Measures pairwise similarities in the high-dimensional space, then uses gradient descent to minimize the divergence between the high-dimensional and low-dimensional similarity distributions.

64
Q

Encoders can handle language modeling, T/F

A

False. Encoders see bidirectional context, so they cannot be used directly for (left-to-right) language modeling.

65
Q

Self-attention: Sizes of Query, Key, and Value matrices, given:
1. Hidden dimension: d_model
2. Q,K,V dimension: d_q, d_k, d_v

A

Query: d_model × d_k (d_q = d_k, since queries and keys are dotted together)
Key: d_model × d_k
Value: d_model × d_v

66
Q

Self-attention: Dimension of self-attention output (O), given:
1. Hidden dimension: d_model
2. Q,K,V dimension: d_q, d_k, d_v
3. Number of heads h

A

in_features: h * d_v
out_features: d_model

67
Q

Define “cross-attention”

A

Combines two different input sequences:
1. The sequence returned by the encoder
2. The sequence processed by the decoder

68
Q

Attention: big O for each layer, given:
1. seq length “n”
2. representation dimension “d”

A

O(n^2 · d)

69
Q

Attention: big O for sequential operations

A

O(1)

70
Q

Attention: big O for maximum path length

A

O(1)

71
Q

Why is self-attention layer’s sequential operation O(1) compared to RNN’s O(n)?

A

A self-attention layer connects all positions in a constant number of sequential operations, whereas an RNN must step through the n positions one at a time.

72
Q

Self-attention layers are faster than recurrent layers when the _____ is smaller than the _________

A

sequence length,
representation dimensionality

73
Q

Multi-head attention:
How to get size of d_k and d_v given:
1. dimension of hidden: d_model
2. number of heads (h)

A

d_model / h

74
Q

Multi-head attention:
Given d_model (512) and number of heads (8), what is d_k?

A

64 (d_model / h)

75
Q

Attention:
What is the formula for scaled dot product?

A

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k (= d_q) is the query/key dimension.
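A minimal NumPy sketch of this formula; the sequence lengths and dimensions below are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # attention distribution over key positions
    return weights @ V                  # weighted sum of value vectors

Q = np.random.randn(5, 64)              # 5 query positions, d_k = 64
K = np.random.randn(7, 64)              # 7 key positions
V = np.random.randn(7, 32)              # matching values, d_v = 32
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 32)
```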

76
Q

Scaled dot product attention:
As Query vector increases in dimension, magnitudes of dot product similarities increase. How do we mitigate this?

A

Normalize by dividing the dot products by the square root of the query/key vector dimension.

77
Q

Cross-attention: Where do K, V values come from

A

Encoder output

78
Q

Cross-attention: Where do the Q values come from

A

Masked multi-head attention from decoder

79
Q

Self Attention:
Q, K, V matrix shapes
(ie A * B, what are A and B)

A

K = d_x × d_k
Q = d_x × d_q
V = d_x × d_v

80
Q

Purpose of Key vectors

A

Compare inputs to Queries

81
Q

Purpose of Value vectors

A

Carry the actual content: they are weighted by the Q–K attention scores and summed to produce the attention output that is returned (e.g., to the decoder).

82
Q

3 types of attention in Transformer

A
  1. Cross-attention - K and V from the encoder, Q from the decoder
  2. Self-attention (Encoder) - K, V, Q from the output of the word + positional embeddings
  3. Self-attention (Decoder) - K, V, Q from the (masked) output embeddings up to t-1
83
Q

Difference between encoder-decoder attention and self-attention

A

In E/D attention, queries come from an external source (e.g., the decoder). In self-attention, queries come from the inputs themselves.

84
Q

What happens if you permute the order of the inputs in a self-attention layer?

A

Permutation equivariant: the outputs contain the same values, permuted in the same way as the inputs.
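A quick NumPy check of this claim for a bare self-attention layer; using Q = K = V = X with no projections or masking is a simplification assumed here:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    scores = X @ X.T / np.sqrt(X.shape[-1])   # Q = K = V = X for simplicity
    return softmax(scores, axis=-1) @ X

X = np.random.randn(6, 8)
perm = np.random.permutation(6)
# Permuting the input rows permutes the output rows the same way.
assert np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```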

85
Q

Since self-attention is permutation equivariant, what must be added beforehand to propagate the order of the input sequence?

A

Add position embedding to word embedding.