Lecture 12 Flashcards

Sequence to Sequence, Attention, Transformer

1
Q

Encoder:

A

an LSTM that encodes the input sequence into a fixed-length internal
representation W.

2
Q

Decoder:

A

another LSTM that takes the internal representation W and extracts the output sequence from that vector.
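
Taken together with the Encoder card above, a minimal Keras sketch of this pairing (a teacher-forcing training graph) might look like the following; the vocabulary sizes, dimensions, and layer wiring are illustrative assumptions rather than the lecture's exact model:

    # Minimal Keras encoder-decoder sketch; all sizes are illustrative assumptions.
    from tensorflow import keras
    from tensorflow.keras import layers

    src_vocab, tgt_vocab, emb_dim, hidden = 5000, 5000, 128, 256

    # Encoder: an LSTM whose final states act as the fixed-length representation W.
    enc_inputs = keras.Input(shape=(None,), name="encoder_tokens")
    enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_inputs)
    _, state_h, state_c = layers.LSTM(hidden, return_state=True)(enc_emb)
    W = [state_h, state_c]

    # Decoder: another LSTM initialized with W; it unrolls the output sequence.
    dec_inputs = keras.Input(shape=(None,), name="decoder_tokens")
    dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_inputs)
    dec_seq = layers.LSTM(hidden, return_sequences=True)(dec_emb, initial_state=W)
    dec_probs = layers.Dense(tgt_vocab, activation="softmax")(dec_seq)

    model = keras.Model([enc_inputs, dec_inputs], dec_probs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")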

3
Q

Question Answering

A

We ingest a sentence with several words, processing it with the help of a recurrent unit such as an LSTM. We preserve the “state” that results from ingesting that sentence into our trained model. The resulting context vector then serves as the context for a decoder module, also made
of LSTMs or GRUs. If we prompt the decoder with a particular “state” vector and a “start of sentence” marker, we can generate output tokens (plus an end-of-sentence marker).

4
Q

Seq2Seq

A

At each time step in the encoder, the RNN takes a word vector (x_i) from the input sequence and the hidden state (H_{i-1}) from the previous time step, then updates the hidden state to (H_i).
The context vector passed to the decoder is the hidden state of the last encoder unit
(without the attention mechanism) or a weighted sum of the encoder's hidden states (with the attention mechanism).
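
A toy NumPy sketch of that per-step update and of the no-attention context vector; the single tanh cell here stands in for the LSTM's gating, and all sizes are illustrative assumptions:

    import numpy as np

    d_in, d_h = 4, 3                      # toy embedding and hidden sizes (assumptions)
    rng = np.random.default_rng(0)
    W_x = rng.normal(size=(d_h, d_in))
    W_h = rng.normal(size=(d_h, d_h))
    b = np.zeros(d_h)

    x = rng.normal(size=(5, d_in))        # five input word vectors x_1..x_5
    H = np.zeros(d_h)                     # H_0
    hidden_states = []
    for x_i in x:                         # one encoder time step per input word
        H = np.tanh(W_x @ x_i + W_h @ H + b)   # H_i from x_i and H_{i-1}
        hidden_states.append(H)

    context_no_attention = hidden_states[-1]   # last hidden state as the context vector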

5
Q

Inference

A

The task of applying a trained model to generate a translation is
called inference

6
Q

MT Keras Overview – Input/Output

A
  • Source sequence for the encoder:
    x = (x_1, x_2, …, x_|x|), initially one-hot encoded and usually fed into
    a word embedding layer
  • Target sequence:
    y = (y_1, y_2, …, y_|y|) exists in two versions: the decoder input begins
    with a start-of-sentence token, the decoder output ends with an
    end-of-sentence token, and the two sequences are offset by one time step
    (see the sketch below)
  • The final decoder output goes through a softmax layer that gives the
    probability of each entry in the vocabulary
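
A small sketch of that one-step offset between decoder input and decoder output; the token ids and special-token values are illustrative assumptions:

    # Target sentence as integer token ids, with assumed special-token ids.
    SOS, EOS = 1, 2                       # start/end-of-sentence ids (assumed)
    y = [17, 42, 99, 7]                   # an already integer-encoded target sentence

    decoder_input  = [SOS] + y            # [SOS, 17, 42, 99, 7]
    decoder_output = y + [EOS]            # [17, 42, 99, 7, EOS]
    # At step t the decoder sees decoder_input[t] and is trained (via the
    # softmax layer) to assign high probability to decoder_output[t].
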
7
Q

Greedy search:

A

At each time step, choose the output word with the highest probability.
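
A sketch of greedy decoding, assuming a hypothetical decoder_step(token, state) function that returns a probability distribution over the vocabulary plus the new state:

    import numpy as np

    def greedy_decode(decoder_step, state, sos_id, eos_id, max_len=50):
        """decoder_step(token_id, state) -> (probs over vocab, new state); assumed API."""
        token, output = sos_id, []
        for _ in range(max_len):
            probs, state = decoder_step(token, state)
            token = int(np.argmax(probs))     # pick the single highest-probability word
            if token == eos_id:
                break
            output.append(token)
        return output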

8
Q

Beam search:

A

At each time step, keep the k most probable candidate words (typically 5 <= k <= 10); assemble the overall sequence with the maximum probability.
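
A compact beam-search sketch under the same assumed decoder_step API as the greedy sketch above; it keeps the k best partial hypotheses ranked by summed log probability:

    import numpy as np

    def beam_search(decoder_step, state, sos_id, eos_id, k=5, max_len=50):
        """Keep the k best partial hypotheses (tokens, state, log-prob) at each step."""
        beams = [([sos_id], state, 0.0)]
        for _ in range(max_len):
            candidates = []
            for tokens, st, logp in beams:
                if tokens[-1] == eos_id:          # finished hypotheses carry over unchanged
                    candidates.append((tokens, st, logp))
                    continue
                probs, new_st = decoder_step(tokens[-1], st)
                for tok in np.argsort(probs)[-k:]:    # k best extensions of this beam
                    candidates.append((tokens + [int(tok)], new_st,
                                       logp + np.log(probs[tok])))
            beams = sorted(candidates, key=lambda b: b[2], reverse=True)[:k]
        return max(beams, key=lambda b: b[2])[0]      # sequence with max overall probability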

9
Q

Attention Mechanisms

A

Unfortunately, the single context vector passed from the encoder to the decoder is not always sufficient to produce a good result. Attention addresses this bottleneck.

10
Q

Drawback of the “Vanilla” Encoder-Decoder

A
  • In the “vanilla” seq2seq model shown earlier, the decoder takes the final
    hidden state of the encoder (the context vector) and uses that to produce
    the target sentence.
  • This fixed-size context vector represents the final time step; loosely
    speaking, the encoding process gives slightly more weight to each
    successive term in the input sentence.
  • Earlier terms may be more important than later ones, though, in driving
    the accuracy of the decoder's output.
11
Q

Attention Mechanism

A

An “attention mechanism” makes all hidden states of the encoder visible to
the decoder:
- embedding all the words in the input (represented by their hidden states)
  when creating the context vector
- a learning mechanism that helps the decoder identify where to pay
  attention in the encoding when predicting at each time step
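
A NumPy sketch of the simplest (dot-product) form of this idea: score every encoder hidden state against the current decoder state, softmax the scores, and use the weighted sum as the context vector; the shapes and random inputs are illustrative:

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """decoder_state: (d,), encoder_states: (T, d) -- all T encoder hidden states."""
        scores = encoder_states @ decoder_state          # one score per input position
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()                # softmax attention weights
        return weights @ encoder_states, weights         # weighted sum of hidden states

    d, T = 3, 5
    rng = np.random.default_rng(0)
    context, attn = attention_context(rng.normal(size=d), rng.normal(size=(T, d)))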

12
Q

Fully Attention-Based Approaches

A

What if we avoided recurrent layers (such as LSTMs) altogether and simply used attention layers instead? This was the insight behind BERT and other “transformer-based” approaches. Transformer-based models have improved performance on some tasks in comparison to recurrent networks with attention.
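
A sketch of what “attention instead of recurrence” can look like in Keras: the LSTM from the earlier cards is replaced by a self-attention layer over the embedded tokens. The sizes and the toy classification head are assumptions, and a real Transformer also adds positional encodings, feed-forward blocks, and residual connections:

    from tensorflow import keras
    from tensorflow.keras import layers

    vocab, emb_dim = 5000, 128            # illustrative sizes
    tokens = keras.Input(shape=(None,))
    x = layers.Embedding(vocab, emb_dim)(tokens)

    # Self-attention: every position attends to every other position in one shot,
    # with no recurrent state carried across time steps.
    x = layers.MultiHeadAttention(num_heads=4, key_dim=emb_dim // 4)(x, x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(2, activation="softmax")(x)   # toy classification head (assumed)

    model = keras.Model(tokens, outputs)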

13
Q

Transformer Architectures Proliferate

A
  • The basic transformer architecture from Vaswani et al. (2017, Advances in
    Neural Information Processing Systems) has morphed into numerous
    variations applied to a variety of tasks
  • In addition to NLP, transformers have been applied to genetics, computer
    vision, signal processing, and video analysis
  • In 2021 alone, more than 6,000 papers were published on applications of
    and improvements to BERT