Lecture 11 Flashcards

Long Short-Term Memory and Gated Recurrent Units for NLP

1
Q

Long Short-Term Memory (LSTM)

A

Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) designed to handle sequential data such as time series, speech, and text. It is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. LSTM networks process data sequentially and keep their hidden state through time. They are used for tasks such as classification, speech recognition, and machine translation, as well as applications in healthcare.

2
Q

Feedforward

A

Simple, unidirectional predictive structures connecting input arrays to output arrays

3
Q

Convolutional

A

Sliding window moving across time or multi-dimensional structures to capture features

4
Q

Recurrent

A

Neurons with feedback loops creating memory structures with limited persistence

5
Q

Gated

A

Cell units containing multiple neurons and providing long-term memory

6
Q

Backpropagation in RNNs

A

A recurrent neural network can be imagined as multiple copies of the same network, each passing a message to a successor.
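
A minimal NumPy sketch of this picture (illustrative, not from the lecture): the same weights are reused at every "copy", and the hidden state is the message handed to the next time step.

import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    # one "copy" of the network per time step; h is the message passed forward
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]          # 5 time steps, 4 features each
hs = rnn_forward(xs, rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
print(len(hs), hs[-1].shape)                         # 5 (3,)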

7
Q

Vanishing Gradient Problem

A

Gradients shrink as they are propagated back through many time steps, so words from time steps far away are no longer as influential as they should be.
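
A quick numeric illustration (a sketch, not from the lecture) of why distant words lose influence: backpropagation through time multiplies one gradient factor per step, and factors below 1 shrink the product exponentially.

# each step back in time multiplies the gradient by a factor, often < 1 with tanh units
factor = 0.8
for steps in (1, 10, 50):
    print(steps, factor ** steps)
# 1  0.8
# 10 ~0.11
# 50 ~1.4e-05  -> a word 50 steps back barely affects the weight update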

8
Q

Forget gate:

A

How much information from the previous time step will be kept?

9
Q

Input gate:

A

Decides which values will be updated and proposes the new candidate values.
Sigmoid function: outputs a number between 0 and 1

10
Q

Tanh function

A

(hyperbolic tangent function): outputs a number between -1 and 1

11
Q

Cell state:

A

Update the old cell state, Ct-1, into the new cell state Ct.
* The new cell state Ct is composed of information kept from the past, ft * Ct-1, plus valuable new information

12
Q

elementwise multiplication

A

[8, 3, 2, 4, 2] ∘ [0, 1, 0.5, 1, 4] = [0, 3, 1, 4, 8]
(each element of the first vector is multiplied by the corresponding element of the second)
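
The same computation in NumPy, where * is elementwise; this is exactly how gate vectors (values in [0, 1]) are applied inside LSTM and GRU cells.

import numpy as np

a = np.array([8, 3, 2, 4, 2])
b = np.array([0, 1, 0.5, 1, 4])
print(a * b)    # [0. 3. 1. 4. 8.] - multiply corresponding elements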

13
Q

Based on the cell state, we will decide what the output will be

A
  • tanh function filters the new cell state to characterize stored information
  • Significant information in Ct -> ±1
  • Minor details -> 0
  • ht serves as the hidden state for the next time step
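
Putting the gate cards together, a minimal NumPy sketch of one LSTM time step (standard equations; the variable names W, U, b are illustrative, not the lecture's notation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate: how much of C_prev to keep
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate: which values to update
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])      # new candidate values in (-1, 1)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    C_t = f * C_prev + i * g          # kept past information + valuable new information
    h_t = o * np.tanh(C_t)            # filtered cell state becomes the next hidden state
    return h_t, C_t

rng = np.random.default_rng(1)
dx, dh = 4, 3
W = {k: rng.normal(size=(dh, dx)) for k in "figo"}
U = {k: rng.normal(size=(dh, dh)) for k in "figo"}
b = {k: np.zeros(dh) for k in "figo"}
h, C = lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), W, U, b)
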
14
Q

Gated Recurrent Units (GRU)

A

In 2014, Cho and his colleagues posted a paper entitled “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” In this paper, the researchers introduced a simplified LSTM model, which later became referred to as a GRU. They evaluated their approach on the English/French translation task of the WMT’14 workshop. In later papers, the GRU has often performed as well as LSTM, even though it is simpler.

15
Q

Gated Recurrent Unit (GRU)

A

GRU is a variation of LSTM that also adopts the gated design.
* Differences:
* GRU uses an update gate z to substitute for the input and forget gates
* Combines the cell state Ct and hidden state ht of LSTM into a single hidden state ht
* GRU obtains similar performance to LSTM with fewer parameters and faster convergence (Cho et al., 2014)

16
Q

Update gate:

A

controls the composition of the new state

17
Q

Reset gate:

A

determines how much old information is needed in the alternative state h̃t

18
Q

Alternative state:

A

contains new information

19
Q

New state:

A

Replaces selected old information with new information to form the new state ht
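
For comparison with the LSTM step above, a minimal NumPy sketch of one GRU time step covering cards 16-19 (standard equations; some texts swap z and 1 - z, the idea is identical):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])            # update gate: composition of the new state
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])            # reset gate: how much old information is needed
    h_alt = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])  # alternative state with new information
    return (1 - z) * h_prev + z * h_alt                             # new state: selected old info replaced by new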

20
Q

Text summarization using LSTM-CNN Song et al., 2018, Multimedia Tools & Apps

A
  • Abstractive text summarization generates readable summaries without being constrained to phrases from the original text
  • Training data: human-generated abstractive summary bullets from CNN and DailyMail stories
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) toolkit was used for evaluation
  • LSTM-CNN outperformed four previous models by 1-4%
21
Q

Extracting Temporal Relations from Korean Text
Lim & Choi, 2018, IEEE Big Data/Smart Computing

A
  • From the article: “difficult to correctly recognize the temporal relations from Korean text owing to the inherent linguistic characteristics of the Korean language”
  • Dataset: Korean TimeBank - 2393 annotated documents and 6190 Korean sentences
  • F1 scores ranged from 0.46 to 0.90 on various temporal relations
22
Q

Emotion Recognition in Online Comments (Li & Xiao, 2020)

A
  • The model consists of an embedding layer, a bidirectional LSTM layer, a feedforward attention layer, a concatenation layer, and an output layer
  • Training data: emotion-labelled Twitter data and blog data
  • F1 measure: 62.78%
23
Q

LSTM
Key Features

A
  • Long Short-Term Memory layer - Hochreiter 1997.
  • Based on available runtime hardware and constraints, this layer will choose different implementations (cuDNN-based or pure-TensorFlow) to maximize performance. If a GPU is available and all the arguments to the layer meet the requirements of the cuDNN kernel, the layer will use a fast cuDNN implementation.
  • When processing very long sequences (possibly infinite), you may want to use the pattern of cross-batch statefulness (see the sketch after this card).
  • Normally, the internal state of an RNN layer is reset every time it sees a new batch (i.e. every sample seen by the layer is assumed to be independent of the past). The layer will only maintain a state while processing a given sample.
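
A short tf.keras sketch of the cross-batch statefulness pattern mentioned above (dummy data, illustrative shapes): with stateful=True the layer keeps its state across batches until it is reset explicitly.

import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(32, stateful=True)

chunk1 = np.random.rand(4, 10, 8).astype("float32")   # (batch, time steps, features)
chunk2 = np.random.rand(4, 10, 8).astype("float32")

lstm(chunk1)          # the state left behind by chunk1 ...
lstm(chunk2)          # ... is used as the initial state for chunk2

lstm.reset_states()   # start a fresh, independent sequence
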
24
Q

LSTM
Key Arguments

A
  • units: Positive integer, dimensionality of the output space.
  • activation: Activation function to use. Default: hyperbolic tangent (tanh). If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).
  • recurrent_activation: Activation function to use for the recurrent step. Default: sigmoid. If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).
  • kernel_initializer: Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default: glorot_uniform.
  • unit_forget_bias: Boolean (default True). If True, add 1 to the bias of the forget gate at initialization. Setting it to True will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.
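
A usage sketch with these key arguments written out explicitly (they are the defaults, so the call is equivalent to tf.keras.layers.LSTM(64); the input shapes are illustrative):

import tensorflow as tf

layer = tf.keras.layers.LSTM(
    units=64,                           # dimensionality of the output space
    activation="tanh",
    recurrent_activation="sigmoid",     # keeping the defaults preserves cuDNN eligibility on GPU
    kernel_initializer="glorot_uniform",
    unit_forget_bias=True,              # add 1 to the forget-gate bias at initialization
)

x = tf.random.normal((32, 10, 8))       # (batch, time steps, features)
print(layer(x).shape)                   # (32, 64)
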
25
Q

GRU
Key Features

A
  • Gated Recurrent Unit based on Cho et al. (2014).
  • There are two variants of the GRU implementation. The default one is based on v3 and has the reset gate applied to the hidden state before the matrix multiplication. The other one is based on the original paper and has the order reversed.
  • The second variant is compatible with CuDNNGRU (GPU-only) and allows inference on CPU. Thus it has separate biases for kernel and recurrent_kernel. To use this variant, set reset_after=True and recurrent_activation='sigmoid'.
  • In TensorFlow 2.0, the built-in LSTM and GRU layers have been updated to leverage cuDNN kernels by default when a GPU is available. With this change, the prior CuDNN-specific layers have been deprecated, and you can build your model without worrying about the hardware it will run on.
26
Q

GRU
Key Arguments

A
  • units: Positive integer, dimensionality of the output space.
  • activation: Activation function to use. Default: hyperbolic tangent (tanh). If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).
  • recurrent_activation: Activation function to use for the recurrent step. Default: sigmoid. If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).
  • kernel_initializer: Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default: glorot_uniform.
  • reset_after: GRU convention - whether to apply the reset gate after or before the matrix multiplication. True ("after") is the default and is required for the cuDNN-compatible variant. (The unit_forget_bias argument belongs to the LSTM layer; the GRU has no forget gate.)
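
And the corresponding GRU configuration; as card 25 notes, reset_after=True together with recurrent_activation='sigmoid' selects the cuDNN-compatible variant (both are the TF2 defaults). A sketch with illustrative shapes:

import tensorflow as tf

layer = tf.keras.layers.GRU(
    units=64,
    activation="tanh",
    recurrent_activation="sigmoid",     # required for the cuDNN-compatible variant
    kernel_initializer="glorot_uniform",
    reset_after=True,                   # required for the cuDNN-compatible variant
)

x = tf.random.normal((32, 10, 8))       # (batch, time steps, features)
print(layer(x).shape)                   # (32, 64)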