Introduction to Transformers for NLP Flashcards

(37 cards)

1
Q

RNN

A

Recurrent Neural Network

In an RNN, information cycles through a loop: the hidden state produced at each step is fed back in at the next step, giving the network a memory of earlier inputs.
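
A minimal sketch of that loop, assuming toy NumPy sizes and random weights: the same weights are applied at every step, and each step's hidden state feeds the next.

    import numpy as np

    # Toy sizes and random weights, assumed purely for illustration.
    rng = np.random.default_rng(0)
    W_xh = rng.normal(size=(8, 16))   # input -> hidden
    W_hh = rng.normal(size=(16, 16))  # hidden -> hidden: this is the loop
    b_h = np.zeros(16)

    def rnn_step(x_t, h_prev):
        # The new hidden state depends on the current input AND the previous state.
        return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

    h = np.zeros(16)
    for x_t in rng.normal(size=(5, 8)):  # a sequence of 5 input vectors
        h = rnn_step(x_t, h)             # the same weights are reused at every step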

2
Q

Bag of Words

A
3
Q

n-grams

A
4
Q

trigram

A

A trigram model keeps the context of the last two words to predict the next word in the sequence
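
A quick sketch of the idea with a made-up toy corpus: counting three-word windows gives the most likely next word for each two-word context.

    from collections import Counter, defaultdict

    # Toy corpus, assumed purely for illustration.
    corpus = "the cat sat on the mat the cat sat on the rug".split()

    # Count which word follows each two-word context.
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
        counts[(w1, w2)][w3] += 1

    def predict(w1, w2):
        # Most frequent continuation of the two-word context, if one was seen.
        following = counts[(w1, w2)]
        return following.most_common(1)[0][0] if following else None

    print(predict("cat", "sat"))  # -> 'on'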

5
Q

LSTM

A

Long Short-Term Memory

6
Q

GRU

A

Gated Recurrent Unit

7
Q

Feed-forward Neural Network

A
8
Q

BP Mechanism

A

Back Propagation

9
Q

Gradient Descent

A
10
Q

T5 model

A
11
Q

seq2seq

A

sequence-to-sequence neural network

12
Q

multi-head attention

A

Multiple self-attention modules run in parallel, each capturing a different kind of attention.
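
A compact NumPy sketch with assumed toy dimensions: each head runs the same scaled dot-product attention with its own projections, and the head outputs are concatenated and mixed by a final linear layer.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(x, Wq, Wk, Wv):
        # One attention head: scaled dot-product attention over the sequence.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        return softmax(scores) @ v

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 4, 16, 4
    d_head = d_model // n_heads
    x = rng.normal(size=(seq_len, d_model))

    # Each head has its own projections, so it can attend to different patterns.
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(self_attention(x, Wq, Wk, Wv))

    # Concatenate head outputs and mix them with a final linear projection.
    Wo = rng.normal(size=(d_model, d_model))
    out = np.concatenate(heads, axis=-1) @ Wo   # shape: (seq_len, d_model)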

13
Q

feed-forward

A
14
Q

masked multi-head attention

A
15
Q

linear

A
16
Q

softmax

A

The softmax function is a mathematical function that converts a vector of real numbers into a probability distribution, where each value is between 0 and 1 and all values sum to 1.
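
A small NumPy illustration (the scores are chosen arbitrarily) of raw scores being squashed into probabilities that sum to 1.

    import numpy as np

    def softmax(z):
        # Subtracting the max is a standard numerical-stability trick;
        # it does not change the result.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)
    print(probs)        # approximately [0.659, 0.242, 0.099]
    print(probs.sum())  # 1.0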

17

Q

input embeddings

A

18

Q

output embeddings

A

19

Q

tokenize

A

20

Q

vectorize

A

21

Q

positional encoding

A
22
Q

self-attention

A

Self-attention allows the model to associate each word in the input with the other words in the same sentence.
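
A toy NumPy sketch with made-up words, dimensions and random weights: each word's query is compared against every word's key, and the resulting weights say how strongly it is associated with the other words in the sentence.

    import numpy as np

    # Made-up sentence, dimensions and weights, assumed for illustration.
    words = ["the", "cat", "sat"]
    rng = np.random.default_rng(1)
    x = rng.normal(size=(len(words), 8))             # one embedding per word

    Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv                 # query, key, value vectors

    scores = q @ k.T / np.sqrt(8)                    # word-to-word similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

    # weights[i, j]: how strongly word i attends to word j; each row sums to 1.
    print(weights.round(2))                          # rows/columns: the, cat, sat

    attended = weights @ v                           # context-mixed representations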

23

Q

query vector

A

24

Q

key vector

A

25

Q

value vector

A

26

Q

embedding vector

A

27

Q

residual connection

A

The multi-headed attention output vector is added back to the original positional input embedding, so the sub-layer's input is carried through alongside its transformation.
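
A minimal sketch of this add-and-norm pattern; the attention sub-layer is replaced by an assumed stand-in function for brevity.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # Normalise each position's vector to zero mean and unit variance.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def add_and_norm(x, sublayer):
        # Residual connection: the sub-layer's input is added back to its output.
        return layer_norm(x + sublayer(x))

    x = np.random.default_rng(2).normal(size=(4, 16))    # positional input embeddings
    out = add_and_norm(x, lambda h: h @ np.eye(16))      # identity stands in for attention
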
28

Q

decoder

A

1. Multi-headed attention layer
2. Add and norm layers
3. Feed-forward layer

29

Q

encoder

A

30

Q

BERT

A

Bidirectional Encoder Representations from Transformers

31

Q

BERT-Base

A

BERT-Base has a total of 110 million parameters, 12 attention heads, 768 hidden nodes, and 12 layers.

32

Q

BERT-Large

A

BERT-Large has 24 layers, 1024 hidden nodes, 16 attention heads, and 340 million parameters.

33

Q

Masked-LM

A

MLM (Masked Language Modeling)

34

Q

NSP

A

Next Sentence Prediction

35

Q

CLS

A

The first token of every input sequence is a special classification token, abbreviated [CLS].

36

Q

SEP

A

The [SEP] token marks the break between the two sentences.
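
A toy illustration of how [CLS] and [SEP] frame a sentence pair; the tokens are hand-split here, whereas a real BERT tokenizer would use WordPiece sub-words.

    # Hand-split tokens, assumed for illustration only.
    sentence_a = "the cat sat"
    sentence_b = "it was tired"

    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    segment_ids = [0] * (len(sentence_a.split()) + 2) + [1] * (len(sentence_b.split()) + 1)

    print(tokens)
    # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
    print(segment_ids)
    # [0, 0, 0, 0, 0, 1, 1, 1, 1]
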
37