Contextualized Word Embeddings Flashcards

1
Q

Static embeddings vs. contextualized embeddings

A

Word embeddings such as word2vec or GloVe learn a single vector for each type (unique word) in the vocabulary V. These are also called static embeddings.

By contrast, contextualized embeddings (also called dynamic embeddings) represent each token (word occurrence) with a different vector, depending on the context the token appears in; they are produced by pre-trained (large) language models.
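
A minimal sketch of the difference (my own illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the sentences are just examples): the same word type receives different contextual vectors in different sentences, whereas a static lookup table would always return the same row.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

for sent in ["I sat on the river bank.", "I deposited cash at the bank."]:
    inputs = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # one vector per token occurrence
    # Locate the token "bank" and inspect its vector: it differs across the two
    # sentences, unlike a static word2vec/GloVe embedding for the type "bank".
    bank_pos = inputs.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    print(sent, hidden[0, bank_pos, :5])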

2
Q

How can we learn to represent words along with the context they occur in?

A

We train a neural network using ideas from language modeling (the contrast is sketched after this list):

  • predict a word from left context (GPT-3)
  • predict a word from left and right context independently (ELMo)
  • predict a word from left and right context jointly (BERT)
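
A minimal sketch of the contrast (my own illustration, using PyTorch): a left-to-right model restricts self-attention with a causal mask, while a jointly bidirectional model attends to the whole sequence.

import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))  # GPT-style: position i sees only positions <= i
bidirectional_mask = torch.ones(seq_len, seq_len)       # BERT-style: every position sees the whole input
# ELMo-style: a left-to-right and a right-to-left model, each with its own causal mask,
# trained independently and combined afterwards.
print(causal_mask)
print(bidirectional_mask)
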
3
Q

ELMo (Embeddings from Language Models)

A

ELMo looks at the entire sentence before assigning each word its embedding.

  1. Character-level inputs are processed by a CNN (convolutional neural network), producing context-independent word-level embeddings.
  2. These embeddings are processed by a left-to-right and a right-to-left 2-layer LSTM.
  3. For each word, the output embeddings at each layer (including the CNN layer) are combined with learned weights, producing contextual embeddings (see the sketch below).

think about the structure…
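
A minimal sketch of step 3 (my own illustration in PyTorch with toy dimensions, not ELMo's released code): the per-layer outputs are averaged with learned softmax-normalized weights and a task-specific scale.

import torch

num_layers = 3                                   # CNN output + 2 biLSTM layers
num_tokens, dim = 7, 1024                        # toy sizes
layer_outputs = torch.randn(num_layers, num_tokens, dim)

s = torch.softmax(torch.randn(num_layers), dim=0)   # learned per-layer weights
gamma = torch.tensor(1.0)                           # learned task-specific scale
elmo_embeddings = gamma * (s[:, None, None] * layer_outputs).sum(dim=0)  # (num_tokens, dim)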

4
Q

BERT (Bidirectional Encoder Representations from Transformers)

A

BERT (created by researchers at Google AI Language) produces word representations by jointly conditioning on both left and right context, thanks to a self-attention mechanism that ranges over the entire input.

The model is based on the encoder component of the well-known Transformer neural network.

  1. The input is segmented using subword tokenization and combined with positional embeddings.
  2. The input is passed through a stack of standard transformer blocks, each consisting of self-attention and feedforward layers, augmented with residual connections and layer normalization (one block is sketched below).

think about the structure…
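
A minimal sketch of one such block (my own illustration with PyTorch building blocks; the dimensions follow BERT-base but are otherwise illustrative):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, tokens, d_model)
        attn_out, _ = self.attn(x, x, x)        # self-attention ranges over the entire input
        x = self.norm1(x + attn_out)            # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))          # feedforward + residual + layer normalization
        return x

x = torch.randn(1, 7, 768)                      # a toy batch of 7 subword embeddings
contextual = EncoderBlock()(x)                  # BERT applies a stack of such blocks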

5
Q

How is BERT pretrained?

A

Pretraining is performed by training on 2 unsupervised learning tasks simultaneously:

  • Masked language modeling
  • Next sentence prediction

The result of these 2 pre-training processes is the set of encoder parameters, which is then used to produce contextual embeddings for novel sentences or sentence pairs.

6
Q

BERT 1st learning goal: Masked language modeling

A

The model learns to perform a fill-in-the-blank task, technically called the cloze task.

A random sample of 15% of the input tokens is selected; each selected token is (see the sketch after this list):

  • replaced with the unique vocabulary token [MASK] (80%)
  • replaced with another token randomly sampled with unigram probabilities (10%)
  • left unchanged (10%)
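
A minimal sketch of this 80/10/10 rule (my own illustration over toy token ids; for simplicity the random replacement samples uniformly rather than with unigram probabilities):

import random

def mask_for_mlm(token_ids, vocab_ids, mask_id, select_prob=0.15):
    """Return the corrupted input; the original token_ids serve as the prediction targets."""
    corrupted = list(token_ids)
    for i in range(len(token_ids)):
        if random.random() < select_prob:        # pick ~15% of the tokens
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_id           # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(vocab_ids)  # 10%: random vocabulary token
            # remaining 10%: leave the token unchanged
    return corrupted
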
7
Q

BERT 2nd learning goal: Next sentence prediction, input and output type

A

NSP training is used for tasks involving the relationship between pairs of sentences, such as:

  • paraphrase detection: detecting if two sentences have similar meanings
  • sentence entailment: detecting if the meanings of two sentences entail or contradict each other
  • discourse coherence: deciding if two neighboring sentences form a coherent discourse

INPUT:

In NSP, the model is presented with pairs of sentences with (the resulting tokenization is sketched after the OUTPUT item):

  • the token [CLS] prepended to the first sentence
  • the token [SEP] placed between the two sentences and after the last token of the second sentence
  • segment embeddings marking the first and the second sentence, added to the token and positional embeddings (remember the image)

OUTPUT:

  • The output embedding associated with the [CLS] token is used to predict whether the second sentence actually follows the first.
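
A minimal sketch of the input format (assuming the Hugging Face tokenizer for bert-base-uncased; the sentence pair is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The dog barked.", "It was hungry.")

print(tokenizer.convert_ids_to_tokens(enc.input_ids))
# something like: ['[CLS]', 'the', 'dog', ..., '[SEP]', 'it', 'was', ..., '[SEP]']
print(enc.token_type_ids)
# segment ids: 0 for [CLS] and the first sentence (up to its [SEP]), 1 for the second sentence
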
8
Q

GPT-n (Generative Pre-Training for language understanding)

A

Used for learning contextualized word embeddings.

It is a left-to-right language model based on the decoder component of the Transformer.

GPT-n can be used for (generation is sketched after this list):

  • token prediction / generation (LM)
  • sequence labelling
  • single-sentence classification
  • sentence-pair classification
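
A minimal sketch of left-to-right generation (my own illustration, using the publicly available gpt2 checkpoint from Hugging Face as a stand-in for GPT-n):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Contextualized embeddings are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)   # each new token conditions only on left context
print(tokenizer.decode(outputs[0]))
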
9
Q

Adaptation

A

To make practical use of contextualized embeddings, we need to interface these models with downstream applications.

This process is called adaptation, and uses labeled data for the task of interest.

The two most common forms of adaptation are (both are sketched after this list):

  • feature extraction: freeze the pre-trained parameters of the language model and train only the parameters of the task-specific model
  • fine-tuning: make (possibly minimal) adjustments to the pre-trained parameters.
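
A minimal sketch of the two strategies (my own illustration, assuming a BERT encoder from the transformers library plus a hypothetical task-specific linear head):

import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)   # task-specific head (e.g. 2 classes)

# Feature extraction: freeze every pre-trained parameter, train only the head.
for param in encoder.parameters():
    param.requires_grad = False

# Fine-tuning: simply leave requires_grad = True everywhere, so the optimizer
# also makes (possibly small) adjustments to the pre-trained parameters.
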
10
Q

Adapters

A

When working with huge pre-trained models, fine-tuning may still be inefficient.

Alternatively, one can freeze the pre-trained model and train only small, very simple components called adapters (sketched below).

With adapter modules, transfer becomes very efficient: the largest part of the pre-trained model is shared across all downstream tasks.
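
A minimal sketch of an adapter module (my own illustration: a small bottleneck with a residual connection, inserted into the otherwise frozen model; dimensions are illustrative):

import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck with a residual connection; only these parameters are trained."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project down to a small dimension
        self.up = nn.Linear(bottleneck, d_model)     # project back up
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual: starts close to the identity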

11
Q

Contextualized embeddings ethics

A
  • Contextual language models can generate toxic language and misinformation, and can be used for radicalization and other socially harmful activities.
  • Contextual language models can leak information about their training data: an adversary can extract individuals' data from a language model and misuse it (e.g., for phishing).

Mitigating these harms remains an unsolved research problem in NLP.
