09 - Recurrent Neural Networks Flashcards

1
Q

What is an RNN (roughly explained)?

A

Recurrent neural network.
RNNs are used to process sequential data $x = \{x^1, \dots, x^\tau\}$, where t is the time/dimension step index.

  • Even though other nets without sequence-based specialization could process this type of input as well, RNNs scale to much longer sequences than such nets, just like CNNs scale to large images.
  • They can also process sequences of variable length (and 2D data as well).
2
Q

RNN Parameter Sharing.

A

Key difference between multilayer nets & RNNs: parameter sharing across different parts of the model.

  • Enables the model to be applied to inputs of different lengths and to generalize across them
  • Enables extracting information that can occur at different positions in the input sequence
    • Example: “I went to Nepal in 2009” vs. “In 2009, I went to Nepal”
    • A traditional fully-connected feedforward net that processes fixed-length sentences would have to learn the relevant rules separately for every position.

The way it works:
- Each member of the output is a function of the previous members of the output, and each is produced using the same update rule applied to those previous outputs.

3
Q

Unfolding Computational graphs

A

Classical dynamical system: $s^t = f(s^{t-1}; \theta)$, where s is the state of the system.
- We call this system recurrent because the definition of the state s at time t refers back to the same definition at time t-1.
The graph can be unfolded for a finite number of steps τ by applying the definition τ-1 times. E.g. the equation above for τ=3: $s^3 = f(s^2; \theta) = f(f(s^1; \theta); \theta)$

Example of a system driven by an external signal: $h^t = f(h^{t-1}, x^t; \theta)$
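A minimal NumPy sketch of the unfolding idea (my own illustration; the tanh transition and the sizes are arbitrary assumptions, not from the card):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition function f for s^t = f(s^{t-1}; theta):
# theta is a single weight matrix W, f is a tanh of a linear map.
W = rng.normal(size=(3, 3))

def f(s):
    return np.tanh(W @ s)

s1 = rng.normal(size=3)             # initial state s^1

# Unfolded for tau = 3: s^3 = f(f(s^1; theta); theta)
s3_unfolded = f(f(s1))

# The same state computed via the recurrence, applying f tau-1 = 2 times:
s = s1
for t in range(2, 4):               # t = 2, 3
    s = f(s)

assert np.allclose(s, s3_unfolded)  # both forms give the same s^3
```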

4
Q

What are the advantages of unfolding computational graphs?

A

The unfolding process allows us to factorize g (the function that maps the whole past input sequence to the current state) into repeated applications of a single function f. This gives the following advantages:

  1. The learned model always has the same input size, regardless of sequence length, because it is specified in terms of a transition from one state to the next, not in terms of a variable-length history of states.
  2. The same transition function f with the same parameters can be applied at every step.

The model thus generalizes to sequence lengths not seen in the training set, and it can be estimated with far fewer training examples.

5
Q

What RNN designs/output modes are there?

A

Sequence to Sequence (A prototypical RNN)
Sequence to Single Output
Single Input to Sequence

6
Q

Explain the Sequence to Sequence RNN Design

A

Both the input x and the output o are sequences. The output at each step could be a classification vector with c elements for c classes; for example, with a dictionary of 100 words we get a 100-dim vector. There are recurrent connections between the hidden units.
Draw the RNN with the following:
x -> h -> o -> L -> y
U - Input to hidden (like in feedforward FCNs or CNNs)
W - Hidden to hidden, defining the transition to the next state
V - Hidden to output, like in regular NNs; transforms low/high-dim codes to the correct output type (e.g. classification or regression)
NOTE: the loss L and target y are only used for training

For each time step from t=1 to t=τ, the following update equations are used:
- Pre-nonlinearity activation: $a^t = Wh^{t-1}+Ux^t+b$ (like in the other nets, just with the extra recurrent term $Wh^{t-1}$)
- Non-linearity (get the state): $h^t = \tanh(a^t)$
- Transition to output: $o^t = Vh^t+c$ (c is an optional extra bias; normally you want to use it. The o are our logits)
- E.g. if building a classifier: $\hat{y}^t = \mathrm{softmax}(o^t)$

NOTE: the update equations are

  • differentiable (so backprop applies)
  • applied at every time step with the same parameters (a minimal sketch follows below)
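A minimal NumPy sketch of this forward pass (my own illustration; the sizes, the random initialization, and the toy input are assumptions, not from the card):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 5, 8, 3                       # assumed toy sizes
tau = 4                                               # sequence length

# Shared parameters, reused at every time step (parameter sharing)
U = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input  -> hidden
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output
b, c = np.zeros(n_hidden), np.zeros(n_out)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

x = rng.normal(size=(tau, n_in))                      # input sequence x^1 .. x^tau
h = np.zeros(n_hidden)                                # h^0

for t in range(tau):
    a = W @ h + U @ x[t] + b                          # pre-nonlinearity activation a^t
    h = np.tanh(a)                                    # hidden state h^t
    o = V @ h + c                                     # logits o^t
    y_hat = softmax(o)                                # class probabilities y_hat^t
    print(t + 1, y_hat)
```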
7
Q

Explain the Sequence to Single Output RNN Design

A

Recurrent connections between hidden units read the entire sequence and then produce a single output from it. That output (a tensor) can be seen as a summary of the input sequence.
Draw the RNN with the following:
x -> h -> o -> L -> y
U - Input to hidden (like in feedforward FCNs or CNNs)
W - Hidden to hidden, defining the transition to the next state
V - Hidden to output, like in regular NNs; transforms low/high-dim codes to the correct output type (e.g. classification or regression)
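A minimal sketch of the idea (my own illustration with arbitrary sizes): run the recurrence over the whole sequence, but only map the final hidden state to an output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 5, 8, 3                       # assumed toy sizes

U = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input  -> hidden
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

x = rng.normal(size=(6, n_in))                        # a length-6 input sequence
h = np.zeros(n_hidden)
for x_t in x:                                         # read the entire sequence
    h = np.tanh(W @ h + U @ x_t)

o = V @ h                                             # single output: a summary of the sequence
```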

8
Q

Explain the Single Input to Sequence RNN Design

A

We do not get an input at each time step, but only a single input (e.g. at the start). It can be fed in at every step, as in the graph, but mostly it is just applied at the start.
True generation of data.
The recurrence is both the feedback connection from the output to the next step's hidden layer and the connection from the previous hidden layer. During training it is y, not o, that is given to the next hidden layer (teacher forcing).
Straw hat example.
The special tokens are often <bos> (beginning of sentence) or <eos> (end of sentence).
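A minimal sketch of generating a sequence from a single starting token by feeding each output back in as the next input (my own illustration; the tiny vocabulary, greedy sampling, and untrained weights are assumptions, not from the card):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "the", "cat", "sat", "<eos>"]       # hypothetical tiny vocabulary
n_tok, n_hidden = len(vocab), 8

E = rng.normal(scale=0.1, size=(n_tok, n_hidden))     # embedding of the previous output token
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(n_tok, n_hidden))     # hidden -> logits over the vocabulary

h = np.zeros(n_hidden)
token = vocab.index("<bos>")                          # the single starting input
generated = []
for _ in range(10):
    h = np.tanh(W @ h + E[token])                     # previous output feeds the next hidden state
    token = int(np.argmax(V @ h))                     # greedy pick (untrained, so the output is nonsense)
    generated.append(vocab[token])
    if vocab[token] == "<eos>":
        break
print(generated)
```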

9
Q

RNN - Backpropagation Through Time

A

To compute the gradient in an RNN, the generalized back-propagation algorithm (section 6.5.6) is simply applied to the unrolled computational graph.

So, for each node N, the gradient $∇_N L$ is computed recursively, based on the gradients already computed for the nodes that follow it in the graph.

  1. As usual, we start with the delta at the output (the following is the componentwise notation from the book; Anders does not use that). It is simply the difference between the output and the (one-hot) target: $∇_{o^t}L = \hat{y}^t - y^t$
  2. Now go one step back to get the hidden units at the final time step τ (back in calculation steps, not in time); there $h^\tau$ has only $o^\tau$ as a descendant: $∇_{h^\tau}L = V^T ∇_{o^\tau}L$
  3. Now we can iterate backward through time to back-propagate the gradients from t=τ-1 down to t=1. Here $h^t$ has both $o^t$ and $h^{t+1}$ as descendants (see the recursion written out below).
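Written out for the tanh update from the sequence-to-sequence card (following the book's BPTT derivation), the recursion for the interior time steps t < τ is

$∇_{h^t}L = W^T \,\mathrm{diag}\big(1-(h^{t+1})^2\big)\, ∇_{h^{t+1}}L + V^T ∇_{o^t}L$

where the diag term is the Jacobian of the tanh at step t+1 (contributed by the descendant $h^{t+1}$) and the $V^T$ term comes from the descendant $o^t$.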
10
Q

Using sequence to sequence design in generative networks.

A

Most generative networks do not only take the input but also the last output (if guessing the next word, it makes sense to know what was said before). The connections between the hidden units are still there.

11
Q

Encoder-Decoder RNNs

A

A structure in which both input and output can have variable lengths. This is used in LLMs; for example, ChatGPT is decoder-only and BERT is encoder-only.
Encoder-only nets are often used for e.g. classification, while decoder-only nets are used for generating human-like content.

12
Q

Bidirectional RNNs

A

Uses a forward and a backward stream of hidden states.
For example, it can make sense to read sentences both backward and forward. This does not always help, but sometimes it gives a little boost.

Transformers always look in all directions (LLMs are a subclass of transformers).
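A minimal sketch of the bidirectional idea (my own illustration with arbitrary sizes): run one RNN left to right and another right to left, then concatenate the two hidden states at each position.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 4
x = rng.normal(size=(6, n_in))                        # a length-6 toy sequence

def run(seq, W, U):
    # Simple tanh RNN that returns the hidden state at every position
    h, states = np.zeros(n_hidden), []
    for x_t in seq:
        h = np.tanh(W @ h + U @ x_t)
        states.append(h)
    return states

Wf = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Uf = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wb = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Ub = rng.normal(scale=0.1, size=(n_hidden, n_in))

fwd = run(x, Wf, Uf)                                  # reads the sequence left to right
bwd = run(x[::-1], Wb, Ub)[::-1]                      # reads right to left, then realigned
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # each position sees both directions
```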

13
Q

Name strategies to make RNNs deeper

A
  1. Add more hidden states that connect to each other over time
  2. Add non-recurrent, regular layers
  3. Add skip connections to strategy 2 to avoid problems with backpropagation
14
Q

Why use LSTM? Pros and Cons?

A

Normal RNNs suffer from the long-term dependency problem: over time, as more and more information piles up, the nets become less effective at learning new things.

→ LSTM is a solution to that problem: it allows a net to remember the things it needs in order to hold on to the context, but forget things that no longer matter.

LSTMs are much more computationally heavy, because each gate is coupled to the current input and the previous state with its own weights and biases: $U, W, U^g, W^g, U^f, W^f, U^o, W^o, b, b^g, b^f, b^o$
LSTMs are much better for long sequences.

Unrolled over time, a normal RNN is very deep, a long chain of matrix multiplications, so there is a risk of the values exploding or converging to 0. Backpropagation in particular is hard.
Training is faster and better with LSTMs, so even though we have many more parameters (roughly 4x), it is worth it.

15
Q

Explain LSTMs.

A

The LSTM cell includes three gates:
1. Forget gate: what is no longer important for context and can be forgotten
2. Input gate: what new information should be added or updated
3. Output gate: what should be output

All gates take values between 0 and 1 (following a sigmoid)
- 1: let everything through
- 0: fully shut
- For each time step, we feed the input and the previous state into 4 sublayers

Input:
- the current input x and the previous state h are squashed by tanh
- the tanh function is standard, but any non-linearity can be used

State:
- a function of the previous state
- much like momentum → the current value is accumulated as a running average of the past history
- BUT the momentum coefficient is dynamic and is implemented by the forget gate (see the sketch below)
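A minimal NumPy sketch of one LSTM step (my own illustration; the shapes, initialization, and exact gate wiring follow the standard formulation, so treat the details as assumptions rather than the card's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each sublayer has its own input weights U, recurrent weights W and bias b,
# which is where the roughly 4x parameter count comes from.
params = {k: (rng.normal(scale=0.1, size=(n_hidden, n_in)),     # U
              rng.normal(scale=0.1, size=(n_hidden, n_hidden)), # W
              np.zeros(n_hidden))                               # b
          for k in ("f", "g", "o", "c")}   # forget gate, input gate, output gate, candidate input

def lstm_step(x_t, h_prev, c_prev):
    pre = {k: U @ x_t + W @ h_prev + b for k, (U, W, b) in params.items()}
    f = sigmoid(pre["f"])              # forget gate: what to drop from the cell state
    g = sigmoid(pre["g"])              # input gate: how much new information to let in
    o = sigmoid(pre["o"])              # output gate: what to expose as the hidden state
    c_tilde = np.tanh(pre["c"])        # candidate input, squashed by tanh
    c = f * c_prev + g * c_tilde       # cell state: running memory with a dynamic "momentum" (the forget gate)
    h = o * np.tanh(c)                 # new hidden state
    return h, c

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(4, n_in)):                # a length-4 toy sequence
    h, c = lstm_step(x_t, h, c)
```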

16
Q

Embeddings

A

Problem: representation of discrete inputs. Characters can be represented as ASCII codes, but what about words?
Solution: Embeddings

  1. Create an indexing of the N most frequent words
  2. Map these 0…N-1 integers to high-dimensional real-valued vectors

Example: vocabulary with 20k words, 128-dim vector space; initialize a random 20k-by-128 matrix and use the input word's index to find the corresponding row (feature vector). A minimal sketch follows below.

After training, things that are close in meaning end up close to each other in the new space.
But most importantly, every input is turned into a representation of the same size.
The initial values might not be good, so the matrix is trainable.
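A minimal sketch of the example above (20k-word vocabulary, 128-dim embeddings; the word-to-index mapping and the lookup code are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 20_000, 128
E = rng.normal(scale=0.01, size=(vocab_size, emb_dim))    # trainable embedding matrix, randomly initialized

word_to_index = {"the": 0, "cat": 1, "sat": 2}            # hypothetical indexing of the most frequent words

sentence = ["the", "cat", "sat"]
ids = [word_to_index[w] for w in sentence]                # words -> integer indices
vectors = E[ids]                                          # row lookup: every word becomes a 128-dim vector
print(vectors.shape)                                      # (3, 128): a fixed-size representation per token
```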

17
Q

RNN: what are V, W and U?

A

U - Input to hidden (like in feedforward FCNs or CNNs)

W - Hidden to hidden, defining the transition to the next state (we basically have a lot of fully connected nets, just put together in a different way than we are used to)

V - Hidden to output, like in regular NNs; transforms low/high-dim codes to the correct output type (e.g. classification or regression)