Introduction to Transformers for NLP Flashcards

(37 cards)

1
Q

RNN

A

Recurrent Neural Network

In an RNN, information cycles through a loop: the hidden state produced at each step is fed back in at the next step, giving the network a memory of earlier inputs.
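
A minimal sketch of that loop, assuming toy NumPy sizes and random weights: the same weights are applied at every step, and each step's hidden state feeds the next.

    import numpy as np

    # Toy sizes and random weights, assumed purely for illustration.
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(8, 16))   # input -> hidden
    W_hh = rng.normal(size=(16, 16))  # hidden -> hidden: this is the loop
    b_h = np.zeros(16)

    def rnn_step(x_t, h_prev):
        # The new hidden state depends on the current input AND the previous state.
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    h = np.zeros(16)
    for x_t in rng.normal(size=(5, 8)):  # a sequence of 5 input vectors
        h = rnn_step(x_t, h)             # the same weights are reused at every step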

2
Q

Bag of Words

A
3
Q

n-grams

A
4
Q

trigram

A

A trigram model keeps the context of the last two words to predict the next word in the sequence
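
A quick sketch of the idea with a made-up toy corpus: counting three-word windows gives the most likely next word for each two-word context.

    from collections import Counter, defaultdict

    # Toy corpus, assumed purely for illustration.
    corpus = "the cat sat on the mat the cat sat on the rug".split()

    # Count which word follows each two-word context.
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
        counts[(w1, w2)][w3] += 1

    def predict(w1, w2):
        # Most frequent continuation of the two-word context, if one was seen.
        following = counts[(w1, w2)]
        return following.most_common(1)[0][0] if following else None

    print(predict("cat", "sat"))  # -> 'on'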

5
Q

LSTM

A

Long Short-Term Memory

6
Q

GRU

A

Gated Recurrent Unit

7
Q

Feed-forward Neural Network

A
8
Q

BP Mechanism

A

Back Propagation

9
Q

Gradient Descent

A
10
Q

T5 model

A
11
Q

seq2seq

A

sequence-to-sequence neural network

12
Q

multi-head attention

A

Multiple self-attention modules run in parallel, each capturing a different kind of attention.
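
A compact NumPy sketch with assumed toy dimensions: each head runs the same scaled dot-product attention with its own projections, and the head outputs are concatenated and mixed by a final linear layer.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, Wq, Wk, Wv):
        # One attention head: scaled dot-product attention over the sequence.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        return softmax(scores) @ v

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 16, 4
    d_head = d_model // n_heads
    x = rng.normal(size=(seq_len, d_model))

    # Each head has its own projections, so it can attend to different patterns.
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(self_attention(x, Wq, Wk, Wv))

    # Concatenate head outputs and mix them with a final linear projection.
    Wo = rng.normal(size=(d_model, d_model))
    out = np.concatenate(heads, axis=-1) @ Wo   # shape: (seq_len, d_model)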

13
Q

feed-forward

A
14
Q

masked multi-head attention

A
15
Q

linear

A
16
Q

softmax

A

The softmax function is a mathematical function that converts a vector of real numbers into a probability distribution, where each value is between 0 and 1 and all values sum to 1.
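
A small NumPy illustration (the scores are chosen arbitrarily) of raw scores being squashed into probabilities that sum to 1.

    import numpy as np

    def softmax(z):
        # Subtracting the max is a standard numerical-stability trick;
        # it does not change the result.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)
    print(probs)        # approximately [0.659, 0.242, 0.099]
    print(probs.sum())  # 1.0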

17

Q

input embeddings

A

18

Q

output embeddings

A

19

Q

tokenize

A

20

Q

vectorize

A

21

Q

positional encoding

A
22
Q

self-attention

A

Self-attention allows the model to associate each word in the input with the other words in the same sentence.
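
A toy NumPy sketch with made-up words, dimensions and random weights: each word's query is compared against every word's key, and the resulting weights say how strongly it is associated with the other words in the sentence.

    import numpy as np

    # Made-up sentence, dimensions and weights, assumed for illustration.
    words = ["the", "cat", "sat"]
    rng = np.random.default_rng(1)
    x = rng.normal(size=(len(words), 8))             # one embedding per word

    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv                 # query, key, value vectors

    scores = q @ k.T / np.sqrt(8)                    # word-to-word similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

    # weights[i, j]: how strongly word i attends to word j; each row sums to 1.
    print(weights.round(2))                          # rows/columns: the, cat, sat

    attended = weights @ v                           # context-mixed representations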

23

Q

query vector

A

24

Q

key vector

A

25

Q

value vector

A

26

Q

embedding vector

A

27

Q

residual connection

A

The multi-headed attention output vector is added back to the original positional input embedding, so the sub-layer's input is carried through alongside its transformation.
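
A minimal sketch of this add-and-norm pattern; the attention sub-layer is replaced by an assumed stand-in function for brevity.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalise each position's vector to zero mean and unit variance.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def add_and_norm(x, sublayer):
        # Residual connection: the sub-layer's input is added back to its output.
        return layer_norm(x + sublayer(x))

    x = np.random.default_rng(2).normal(size=(4, 16))    # positional input embeddings
    out = add_and_norm(x, lambda h: h @ np.eye(16))      # identity stands in for attention
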
28

Q

decoder

A

1. Multi-headed attention layer
2. Add and norm layers
3. Feed-forward layer

29

Q

encoder

A

30

Q

BERT

A

Bidirectional Encoder Representations from Transformers

31

Q

BERT-Base

A

BERT-Base has a total of 110 million parameters, 12 attention heads, 768 hidden nodes, and 12 layers.

32

Q

BERT-Large

A

BERT-Large has 24 layers, 1024 hidden nodes, 16 attention heads, and 340 million parameters.

33

Q

Masked-LM

A

MLM (Masked Language Modeling)

34

Q

NSP

A

Next Sentence Prediction

35

Q

CLS

A

The first token of every input sequence is a special classification token, abbreviated [CLS].

36

Q

SEP

A

The [SEP] token marks the break between the two sentences.
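
A toy illustration of how [CLS] and [SEP] frame a sentence pair; the tokens are hand-split here, whereas a real BERT tokenizer would use WordPiece sub-words.

    # Hand-split tokens, assumed for illustration only.
    sentence_a = "the cat sat"
    sentence_b = "it was tired"

    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    segment_ids = [0] * (len(sentence_a.split()) + 2) + [1] * (len(sentence_b.split()) + 1)

    print(tokens)
    # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
    print(segment_ids)
    # [0, 0, 0, 0, 0, 1, 1, 1, 1]
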
37