Text Classification 4: Recurrent Neural Network Flashcards

(8 cards)

1
Q

What is an RNN?

A

A Recurrent Neural Network (RNN) is a neural network that accepts input of arbitrary length (and can produce output of arbitrary length as well). It can be used to embed a sequence of arbitrarily many tokens into a single vector, or into one vector per position (e.g. for POS tagging).

y_n = RNN(x_1:n)

From this we can also define
y_1:n = RNN*(x_1:n), where each y_i is the RNN applied to tokens 1 through i.

The resulting y vectors can then be fed into an MLP, a softmax, or whatever else the application requires.
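
A minimal numpy sketch of this interface (the step function and all names here are illustrative placeholders, not the actual learned recurrence):

import numpy as np

def rnn_step(s_prev, x):
    # Placeholder recurrence standing in for the learned update.
    return np.tanh(s_prev + x)

def RNN(xs, s0):
    # Acceptor view: return only the final vector y_n = RNN(x_1:n).
    s = s0
    for x in xs:
        s = rnn_step(s, x)
    return s

def RNN_star(xs, s0):
    # RNN*: return one vector per prefix, y_i = RNN(x_1:i).
    s, ys = s0, []
    for x in xs:
        s = rnn_step(s, x)
        ys.append(s)
    return ys

tokens = [np.random.randn(4) for _ in range(5)]   # five 4-dimensional token embeddings
y_n = RNN(tokens, np.zeros(4))                    # single sequence embedding
y_all = RNN_star(tokens, np.zeros(4))             # one vector per position, e.g. for POS tagging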

2
Q

Draw a diagram of RNN and explain how it works.

A

For each time step i, we compute the hidden state s_i from the current token and the previous hidden state (usually via two matrices, a bias, and an activation function). The current output y_i is computed from the current hidden state (it can simply be the hidden state itself).

Depending on the application, all outputs other than the last one can be discarded. The parameters (matrices) are shared across time steps.
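
In symbols (the matrix names W_s, W_x, the bias b and the output map O are illustrative, not fixed by the card):

s_i = g(W_s s_i-1 + W_x x_i + b)
y_i = O(s_i)

where g is an activation such as tanh or ReLU, and O is often just the identity (y_i = s_i).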

3
Q

Explain the RNN as an acceptor (encoder) and as a transducer.

A

Acceptor: only the final y_n is taken, something is done with it (softmax, MLP, …), and the loss is computed on that final value.

Transducer: every output y_1, …, y_n is passed through the computation (softmax, MLP, …), and the total loss is the sum of the per-step losses (POS tagging is an example).
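
A rough numpy sketch contrasting the two loss set-ups (all names, dimensions and class counts here are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, gold):
    return -np.log(probs[gold])

states = [np.random.randn(8) for _ in range(5)]   # one vector per step, e.g. from RNN*
W = np.random.randn(3, 8)                          # maps a state to scores over 3 classes

# Acceptor: classify the whole sequence from the final state only.
acceptor_loss = cross_entropy(softmax(W @ states[-1]), gold=1)

# Transducer: one prediction (and one loss term) per position, e.g. a POS tag per token.
gold_tags = [0, 2, 1, 1, 0]
transducer_loss = sum(cross_entropy(softmax(W @ s), g) for s, g in zip(states, gold_tags))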

4
Q

What is a bi-directional RNN?

A

Instead of a single forward pass over the tokens, we run two passes: one forward and one backward. The two passes usually have different weights, and the final representation is the concatenation of the vectors from the two passes.

Yesterday John ate an ice cream.

y_3 = [RNN(x_1:2), RNN(x_5:4)] (the representation for position 3, “ate”: a forward pass over its left context and a backward pass over its right context)
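
A numpy sketch of this concatenation for position 3, assuming the indices above mean the left context read forward and the right context read backward (weights and names are illustrative):

import numpy as np

def run_rnn(xs, W, U, b):
    # One directional pass; returns the final state over the tokens in the given order.
    s = np.zeros(b.shape)
    for x in xs:
        s = np.tanh(W @ x + U @ s + b)
    return s

d_in, d_h = 4, 6
Wf, Uf, bf = np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h)   # forward weights
Wb, Ub, bb = np.random.randn(d_h, d_in), np.random.randn(d_h, d_h), np.zeros(d_h)   # separate backward weights

tokens = [np.random.randn(d_in) for _ in range(5)]   # the five tokens of "Yesterday John ate an ice cream"

forward = run_rnn(tokens[:2], Wf, Uf, bf)        # left context x_1:2, read left to right
backward = run_rnn(tokens[:2:-1], Wb, Ub, bb)    # right context x_5:4, read right to left
y_3 = np.concatenate([forward, backward])        # concatenated representation for position 3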

5
Q

Explain how a simple RNN works.

A

To compute the current hidden state s_i, we multiply the previous hidden state s_i-1 by one matrix, multiply the current token embedding by a different matrix, add a bias, and pass the sum through an activation function such as ReLU or tanh.

The current output y_i is simply the hidden state itself, nothing else.
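
A minimal numpy sketch of this step (matrix names and dimensions are illustrative):

import numpy as np

d_in, d_h = 4, 6
W_x = np.random.randn(d_h, d_in)   # multiplies the current token embedding
W_s = np.random.randn(d_h, d_h)    # multiplies the previous hidden state
b = np.zeros(d_h)

def simple_rnn_step(s_prev, x):
    # s_i = tanh(W_s s_i-1 + W_x x_i + b); the output y_i is just s_i.
    return np.tanh(W_s @ s_prev + W_x @ x + b)

s = np.zeros(d_h)
for x in [np.random.randn(d_in) for _ in range(5)]:
    s = simple_rnn_step(s, x)   # y_i = s after each step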

6
Q

What is a common problem with RNNs?

A
  • Vanishing/exploding gradient problem
  • Because all information is compacted into a single state vector, everything gets merged and the RNN easily forgets earlier inputs; long-range dependencies are hard to capture.
7
Q

What changes in a gated RNN compared to the simple one? What problem remains?

A

The problem with the simple RNN is that, to compute the new hidden state, the entire memory is always read and then entirely overwritten. With gates, we limit what information is taken from the previous hidden state and what is taken from the current token embedding.

The idea is to take some elements from the current token embedding and some from the previous hidden state. This is done with the Hadamard product (element-wise product). Gates are vectors of the same size as the embedding/hidden state, and the new hidden state is computed as:

s’ = g⊙x + (1-g)⊙s

where s is computed as in the simple RNN (matrices times x and h, plus a bias)

Problem: hard gates cannot take only a partial amount of an element's memory (and cannot be learned by gradient descent); it is better to replace them with soft gates, as in the LSTM.
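
A tiny numpy sketch of this hard-gated update (the gate values and names are illustrative; in practice the gate would be learned and soft):

import numpy as np

d = 6
g = np.array([1., 1., 0., 0., 1., 0.])   # hard binary gate, one entry per coordinate
x = np.random.randn(d)                    # contribution of the current token
s = np.random.randn(d)                    # previous hidden state

s_new = g * x + (1 - g) * s               # Hadamard (element-wise) mix: gated-on coordinates come from x, the rest from s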

8
Q

Explain the LSTM and draw a diagram.

A

The state at each step is composed of two types of memory: short-term (the hidden state h_j) and long-term (the cell state c_j).

LSTM has 3 gates (each computed with a sigmoid):
- input: controls how much of the current candidate content we take in
- forget: controls how much of the long-term memory we keep
- output: controls how much of the current memory to pass on as the next short-term memory h_j

All gates have the same form, just with different matrices and biases. tanh is used as the activation for the candidate content (the amount of new knowledge), while the sigmoid is used for the gates.

MITIGATES (does not fully fix) THE VANISHING GRADIENT PROBLEM
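
A compact numpy sketch of one LSTM step under the standard formulation (weight names and dimensions are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_h = 4, 6
rand = lambda *shape: np.random.randn(*shape) * 0.1
# One (W, U, b) triple per gate plus one for the candidate content.
Wi, Ui, bi = rand(d_h, d_in), rand(d_h, d_h), np.zeros(d_h)   # input gate
Wf, Uf, bf = rand(d_h, d_in), rand(d_h, d_h), np.zeros(d_h)   # forget gate
Wo, Uo, bo = rand(d_h, d_in), rand(d_h, d_h), np.zeros(d_h)   # output gate
Wz, Uz, bz = rand(d_h, d_in), rand(d_h, d_h), np.zeros(d_h)   # candidate (new content)

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)   # how much of the new content to let in
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)   # how much of the long-term memory to keep
    o = sigmoid(Wo @ x + Uo @ h_prev + bo)   # how much of the memory to expose as h
    z = np.tanh(Wz @ x + Uz @ h_prev + bz)   # candidate content (tanh for the "amount of knowledge")
    c = f * c_prev + i * z                   # long-term (cell) memory
    h = o * np.tanh(c)                       # short-term memory / output
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [np.random.randn(d_in) for _ in range(5)]:
    h, c = lstm_step(x, h, c)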
