Topic 4: Sequence Learning with RNNs Flashcards
(20 cards)
What distinguishes sequence learning from one-shot learning in machine learning?
Sequence learning: machine learning models that take data sequences as input or output are known as sequence models. Text streams, audio clips, video clips, time-series data, and similar data are examples of sequential data. Recurrent Neural Networks (RNNs) are a well-known family of sequence models.
One-shot learning is the classification task where a model must predict the label of inputs without having been trained on the classes involved at all. We give one (or a few) examples of each possible class, and the model has to assign each input to one of those classes. Humans are very good at one-shot learning: someone who sees a giraffe for the first time will easily recognise the animal the next time they see one. Modelling this computationally, however, is much trickier.
What are five common input-output patterns in sequence modeling tasks?
one-to-one: e.g. image classification, handled by a normal multilayer perceptron or convolutional neural network. Single input → single output (simple vertical flow)
- Input is fixed-size (e.g., an image).
- Output is a single class label.
- No temporal or sequential data.
one-to-many: e.g. image captioning; image → sequence of words. One input → sequence of outputs
- RNN generates a sequence of words conditioned on one image.
- Output is a sequence, input is a static vector.
many-to-one: e.g. sentiment analysis; sequence of words → sentiment. Sequence of inputs → single output
- Input is a sequence (e.g., words).
- Output is one label (e.g., sentiment class).
- RNN must encode the whole sequence into a final state.
many-to-many (aligned): e.g. part-of-speech tagging or frame-level video classification. Sequence of inputs → sequence of outputs (same length).
- Inputs and outputs are aligned step-by-step.
- Useful for tagging and labelling tasks.
many-to-many (unaligned): e.g. machine translation; sequence of words → sequence of words. Sequence of inputs → sequence of outputs (different lengths)
- Requires encoder-decoder architecture.
- RNN encodes the source sequence into a context vector.
- A second RNN decodes the output sequence.
RNNs are used in all of these patterns except one-to-one.
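A minimal sketch of two of these patterns, assuming PyTorch; the layer sizes and variable names are illustrative only:

```python
import torch
import torch.nn as nn

# The same recurrent core supports different input-output patterns.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(4, 10, 8)           # batch of 4 sequences, 10 time steps, 8 features
outputs, h_last = rnn(x)            # outputs: (4, 10, 16), h_last: (1, 4, 16)

# many-to-one (e.g. sentiment analysis): classify from the final hidden state
clf = nn.Linear(16, 3)
sentiment_logits = clf(h_last[-1])  # (4, 3) - one label per sequence

# many-to-many, aligned (e.g. POS tagging): one prediction per time step
tagger = nn.Linear(16, 12)
tag_logits = tagger(outputs)        # (4, 10, 12) - one label per step
```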
What core mechanism allows RNNs to model sequences?
An RNN is a neural network which maps from an input space of sequences to an output space of sequences in a stateful way. So the prediction of output y_t depends not only on the input x_t, but also on the hidden state of the system, h_t, which gets updated over time, as the sequence is processed.
These models can be used for sequence generation, sequence classification, and sequence translation.
The use of a hidden state that is recurrently updated over time, allowing the network to retain information about previous time steps.
How is the hidden state of an RNN mathematically updated?
h_t = φ(W_xh x_t + W_hh h_t−1 + b_h)
x_t = input at time t
W_xh = input-to-hidden weights
W_hh = hidden-to-hidden weights (recurrence)
h_t-1 = previous hidden state
b_h = bias term
φ = activation function (e.g. tanh, ReLU, or sigmoid)
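A minimal sketch of this update in NumPy (sizes and the tanh choice are illustrative, not prescribed by the card):

```python
import numpy as np

# One step of h_t = phi(W_xh x_t + W_hh h_{t-1} + b_h), with phi = tanh.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

W_xh = rng.standard_normal((hidden_dim, input_dim)) * 0.1   # input-to-hidden
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden (recurrence)
b_h  = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """New state depends on the current input and the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)                      # initial hidden state
sequence = rng.standard_normal((10, input_dim))
for x_t in sequence:                          # process the sequence step by step
    h = rnn_step(x_t, h)                      # h carries information from earlier steps
```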
What is “teacher forcing” in RNN training?
When training an RNN language model we set the input to x_t = w_t−1 and the target output to y_t = w_t.
We condition on the ground-truth labels from the past, w_1:t−1, not on labels generated by the model. This is called **teacher forcing**, since the teacher's values are force-fed into the model as input at each step: x_t is set to w_t−1.
Teacher forcing can sometimes result in models that perform poorly at test time. The reason is that at test time the model conditions on an input sequence w_1:t−1 generated by its own previous predictions, which can deviate from what it saw during training.
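A hedged sketch of teacher forcing for next-token prediction, assuming PyTorch; the vocabulary size, layer sizes, and names are hypothetical:

```python
import torch
import torch.nn as nn

# At every step the *ground-truth* previous token w_{t-1} is fed as input x_t,
# rather than the model's own previous prediction.
vocab, hidden = 100, 32
embed = nn.Embedding(vocab, hidden)
rnn = nn.RNN(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (4, 21))          # batch of token sequences w_1..w_T
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # x_t = w_{t-1}, y_t = w_t

logits, _ = rnn(embed(inputs))
logits = head(logits)                              # (batch, T-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
```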
What is the remedy to teacher forcing?
A solution to this problem is known as scheduled sampling: training starts off with teacher forcing, but at random time steps the model's own samples are fed in instead, and the fraction of steps where this happens is gradually increased. An alternative is to use other kinds of models where MLE training works better, such as 1-d CNNs.
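A minimal sketch of scheduled sampling, again assuming PyTorch; the sizes, names, and the step-level coin flip are illustrative:

```python
import torch
import torch.nn as nn

# With probability p_model the model's own prediction is fed back in instead of
# the ground-truth token; p_model is increased over the course of training.
vocab, hidden = 100, 32
embed = nn.Embedding(vocab, hidden)
cell = nn.RNNCell(hidden, hidden)
head = nn.Linear(hidden, vocab)

def forward_scheduled(tokens, p_model):
    batch, T = tokens.shape
    h = torch.zeros(batch, hidden)
    x = tokens[:, 0]                              # start from the first ground-truth token
    logits_per_step = []
    for t in range(1, T):
        h = cell(embed(x), h)
        logits = head(h)
        logits_per_step.append(logits)
        if torch.rand(()) < p_model:              # coin flip at each step
            x = logits.argmax(dim=-1)             # feed the model's own sample
        else:
            x = tokens[:, t]                      # feed the ground truth (teacher)
    return torch.stack(logits_per_step, dim=1)    # (batch, T-1, vocab)

logits = forward_scheduled(torch.randint(0, vocab, (4, 21)), p_model=0.25)
```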
Why do RNNs suffer from exploding and vanishing gradients?
The main reasons are the following traits of BPTT:
- An unrolled RNN tends to be a very deep network.
- In an unrolled RNN the gradient in an early layer is a product that (also) contains many instances of the same term.
Exploding gradients: the gradients become excessively large.
Vanishing gradients: the gradients become extremely small and decay towards zero.
The activations in an RNN can decay or explode as we go forwards in time, since we multiply by the weight matrix W_hh at each time step. The gradients can likewise decay or explode as we go backwards in time, since we multiply by the Jacobians (partial derivatives) at each time step.
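A compact way to see this (standard chain-rule bookkeeping, using the update h_t = φ(W_xh x_t + W_hh h_t−1 + b_h) from the earlier card, with a_i denoting the pre-activation at step i):
∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_i−1 = ∏_{i=k+1}^{t} diag(φ′(a_i)) · W_hh
If the largest singular value of W_hh is smaller than 1, this product shrinks exponentially in t − k (vanishing gradients); if it is larger than 1, the product grows exponentially (exploding gradients).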
What architectural solutions help mitigate vanishing gradients?
Gated Recurrent Unit (GRU): a type of recurrent neural network introduced as a simpler alternative to the LSTM. GRUs address the vanishing-gradient problem with a gating mechanism that lets the model selectively remember and forget information, which allows it to capture long-term dependencies.
Long short-term memory (LSTM): a more elaborate gated architecture (the GRU is essentially a simplified LSTM). The additive update of the cell state gives a derivative that is much better behaved.
The gating functions allow the network to decide how much the gradient vanishes, and they can take on different values at each time step; the values they take are learned functions of the current input and hidden state.
Peephole connections: the cell state is passed as an additional input to the gates. Some of these variants work better than plain LSTMs and GRUs on particular tasks, but in general LSTMs do consistently well across most tasks.
What can solve the exploding and vanishing gradient problems?
To combat this, we can use gradient clipping. There are also methods that attempt to control the spectral radius λ of the forward mapping W_hh so that λ ≈ 1, and then keep W_hh fixed (i.e. we do not learn it).
In that case only the output matrix W_ho needs to be learned, which results in a convex optimisation problem. This is called an echo state network (ESN). A related model is the liquid state machine (LSM), which uses binary-valued (spiking) neurons instead of real-valued neurons.
ESNs and LSMs together form reservoir computing.
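A sketch of the gradient-clipping remedy mentioned above, assuming PyTorch; the model and loss here are placeholders just to produce gradients:

```python
import torch
import torch.nn as nn

# Rescale the gradient vector whenever its norm exceeds a threshold, so that
# exploding gradients cannot derail the parameter update.
model = nn.RNN(8, 16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 10, 8)
out, _ = model(x)
loss = out.pow(2).mean()          # dummy loss just to produce gradients
loss.backward()

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the step
optimizer.step()
```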
What components make up an LSTM cell?
The basic idea is to augment the hidden state H_t with a memory cell C_t. We need three gates to control this cell:
- output gate O_t determines what gets read out
- input gate I_t determines what gets in
- forget gate F_t determines when we should reset the cell
Finally, we compute the hidden state as a transformed version of the cell, provided the output gate is on:
H_t = O_t ⊙ tanh(C_t)
H_t is used as the output of the unit, as well as the hidden state for the next time step. This lets the model remember what it has just output (short-term memory), whereas the cell C_t acts as long-term memory.
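For reference, one standard formulation of the full LSTM update that these gates plug into (the weight names are illustrative; σ is the sigmoid):
I_t = σ(W_xi X_t + W_hi H_t−1 + b_i)
F_t = σ(W_xf X_t + W_hf H_t−1 + b_f)
O_t = σ(W_xo X_t + W_ho H_t−1 + b_o)
C̃_t = tanh(W_xc X_t + W_hc H_t−1 + b_c)    (candidate cell)
C_t = F_t ⊙ C_t−1 + I_t ⊙ C̃_t              (additive cell update)
H_t = O_t ⊙ tanh(C_t)
The additive C_t update is exactly what makes the cell-state derivative "well behaved", as noted in the gating card above.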
How do GRUs differ from LSTMs?
- LSTM: more accurate on longer sequences
- GRU: fewer trainable parameters
- Both: many hyperparameters, difficult to master!
GRUs combine the forget and input gates into a single update gate, and merge the cell and hidden state. They are simpler and often equally effective. LSTMs keep the forget and input gates as separate gates.
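For comparison, one standard formulation of the GRU update (two gates, no separate cell state; weight names are illustrative and sign conventions for the update gate vary between references):
Z_t = σ(W_xz X_t + W_hz H_t−1 + b_z)        (update gate)
R_t = σ(W_xr X_t + W_hr H_t−1 + b_r)        (reset gate)
H̃_t = tanh(W_xh X_t + W_hh (R_t ⊙ H_t−1) + b_h)
H_t = Z_t ⊙ H_t−1 + (1 − Z_t) ⊙ H̃_t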
What are bidirectional RNNs and when are they useful?
Sometimes there is a single fixed-length output vector y that we want to predict, given a variable-length sequence as input. This is sequence-to-vector (seq2vec).
We can often get better results if we let the hidden states of the RNN depend on the past and future context.
start by creating two RNNs:
- first one: recursively computes hidden states in the forwards direction
- second one: recursively computes hidden states in the backwards direction
This is called a bidirectional RNN.
h_t = [h^→_t ; h^←_t] is the representation of the state at time t, taking both past and future information into account. We can then average pool over these hidden states to get the final classifier:
h̄ = (1/T) Σ_{t=1}^{T} h_t,  p(y | x_1:T) = softmax(W h̄)
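A minimal sketch of a bidirectional RNN classifier, assuming PyTorch; sizes and names are illustrative:

```python
import torch
import torch.nn as nn

# One pass runs forwards, one backwards; the per-step states are concatenated,
# then average-pooled over time and fed to a linear classifier (seq2vec).
birnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
clf = nn.Linear(2 * 16, 5)          # forward + backward states are concatenated

x = torch.randn(4, 10, 8)           # (batch, time, features)
states, _ = birnn(x)                # (4, 10, 32): [h_fwd ; h_bwd] at every step
pooled = states.mean(dim=1)         # average pool over time -> (4, 32)
logits = clf(pooled)                # (4, 5)
```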
What is a “reservoir computing” model?
A model in which only the output weights are trained, while the recurrent layer (the "reservoir") is kept fixed and randomly initialized.
- Do not unfold the network in time (train only the output layer)
- Randomize the untrained connections (input & hidden layers)
- Use linear methods for training (e.g. linear regression)
Echo State Networks [H. Jaeger, 2001]
- Average firing-rate neurons (leaky)
Liquid State Machines [W. Maass 2002]
- Spiking neurons
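A minimal echo state network sketch in NumPy. Everything here is an assumption for illustration: a tanh reservoir with spectral radius rescaled to about 0.9, a ridge-regression readout, and a toy task of recalling the previous input.

```python
import numpy as np

# Only W_out is trained; W_in and W_res stay fixed and random.
rng = np.random.default_rng(0)
n_in, n_res = 1, 200

W_in  = rng.uniform(-0.5, 0.5, (n_res, n_in))             # fixed, random input weights
W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))            # fixed, random reservoir
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # set spectral radius ≈ 0.9

def run_reservoir(inputs):
    """Collect reservoir states for an input sequence of shape (T, n_in)."""
    h, states = np.zeros(n_res), []
    for u in inputs:
        h = np.tanh(W_in @ u + W_res @ h)
        states.append(h)
    return np.stack(states)

# Train only the linear readout with ridge regression (convex, no BPTT).
inputs = rng.standard_normal((500, n_in))
targets = np.roll(inputs[:, 0], 1)                         # toy task: recall previous input
H = run_reservoir(inputs)
ridge = 1e-6
W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_res), H.T @ targets)
predictions = H @ W_out
```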
Why might reservoir approaches be appealing?
They allow fast training and avoid backpropagation through time (BPTT), while still capturing dynamic temporal structure.
How does memory differ in RNNs vs CNNs or MLPs?
CNNs/MLPs treat inputs as independent, while RNNs maintain temporal memory using hidden states, making them suitable for time-series or sequential data.
What is the computational cost of training RNNs compared to feedforward networks?
Higher, due to backpropagation through time (BPTT) and sequential dependencies: the time steps must be processed in order, so the computation cannot be parallelised across the sequence.
What is multiplicity of time?
Multiple Timescale RNN (MTRNN): different parts of a sequence operate at different temporal speeds. MTRNNs exploit this by assigning different timescales to different parts of the model: layers of neurons update at different rates, controlled by a time constant.
Skip connections & Clockwork RNN (CW-RNN): introduce time-delayed or skip connections so that different parts of the network update at different rates or time intervals, improving efficiency and long-range memory.
What limitations of RNNs led to the development of Transformer models?
RNNs process sequences step by step, so training cannot be parallelised across time (see the BPTT card below), and even gated variants such as LSTMs and GRUs struggle with very long-range dependencies because information must pass through every intermediate hidden state. Transformers address both issues by replacing recurrence with attention, which connects all time steps directly and can be computed in parallel.
What is BPTT (backpropagation through time)?
BPTT is the training algorithm for RNNs. It extends standard backpropagation to sequential data by unfolding the recurrent network across time steps, computing gradients through each step, and summing them over all relevant time steps.
This allows the model to update its weights based on sequential dependencies and thereby learn complex temporal patterns.
When using BPTT we can train the model with batches of short sequences, usually created by extracting non-overlapping subsequences (windows) from the original sequence. If the previous subsequence ends at time t−1 and the current subsequence starts at time t, we can carry over the hidden state of the RNN across batch updates during training. If the subsequences are not ordered, we need to reset the hidden state instead.
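A sketch of this truncated-BPTT scheme with state carry-over, assuming PyTorch; the window length, sizes, and the toy reconstruction objective are all illustrative:

```python
import torch
import torch.nn as nn

# The long sequence is split into consecutive windows; the hidden state is carried
# across windows but detached, so gradients only flow within each window.
rnn = nn.RNN(8, 16, batch_first=True)
head = nn.Linear(16, 8)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

long_seq = torch.randn(1, 1000, 8)            # one long sequence
window = 50
h = None                                      # initial hidden state

for start in range(0, 1000, window):
    chunk = long_seq[:, start:start + window]
    out, h = rnn(chunk, h)
    h = h.detach()                            # carry the state, but cut the gradient
    loss = (head(out) - chunk).pow(2).mean()  # toy objective (reconstruct the input)
    opt.zero_grad()
    loss.backward()
    opt.step()
```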
What is the effect of the gating mechanisms in LSTM cells?
The gates control what information gets written into, kept in, and read out of the cell state, which lets the network retain information over long spans and keeps the gradients well behaved (see the LSTM and gating cards above). More generally, it is important to understand not only the model, the architecture, and the mechanisms, but also the effect of these mechanisms on a specific problem, and how they transfer to a different domain or type of problem.