C7 Flashcards

1
Q

Recurrent Neural Networks (RNN)

A
  • connections between the hidden layers of subsequent ‘time steps’ (words in a text)
  • internal state that is updated in every time step
  • hidden layer weights determine how the network should make use of past context in calculating the output for the current input (trained via backpropagation)
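A minimal NumPy sketch (not from the card; toy random weights) of how the internal state is updated at every time step:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One RNN time step: the new hidden state mixes the current input
    with the previous hidden state, i.e. the past context."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: 4-dimensional word embeddings, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                      # initial internal state
for x in rng.normal(size=(5, 4)):    # five 'words' of a toy text
    h = rnn_step(x, h, W_x, W_h, b)  # state updated in every time step
print(h)
```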
2
Q

LSTM

A

Long Short-Term Memory: more powerful (and more complex) RNNs that take longer contexts into account by removing information no longer needed from the context and adding information likely to be needed for later decision making
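A sketch (my own illustration, toy weights) of one LSTM step showing how the gates remove old information from the context and add new information:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the parameters of the forget,
    input and output gates plus the candidate update, stacked row-wise."""
    f, i, o, g = np.split(W @ x_t + U @ h_prev + b, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gate values in (0, 1)
    c = f * c_prev + i * np.tanh(g)  # drop no-longer-needed info, add new info
    h = o * np.tanh(c)               # expose part of the cell state
    return h, c

d, e = 3, 4                          # hidden size, embedding size (toy)
rng = np.random.default_rng(1)
W, U, b = rng.normal(size=(4 * d, e)), rng.normal(size=(4 * d, d)), np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, e)):
    h, c = lstm_step(x, h, c, W, U, b)
```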

3
Q

bi-LSTM

A

bidirectional neural model for NER:
- first, word and character embeddings are computed for input word w_i and the context words
- these are passed through a bidirectional LSTM, whose outputs are concatenated to produce a single output layer at position i

Simplest approach: direct pass to softmax layer to choose tag t_i

But for NER the softmax approach is insufficient: strong constraints on the tags of neighboring tokens are needed (e.g., the tag I-PER must follow I-PER or B-PER) => use a CRF layer on top of the bi-LSTM output: biLSTM-CRF
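An illustrative PyTorch sketch of the simplest (softmax) variant; character embeddings and the CRF layer are left out, and all sizes are made-up toy values:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Embeddings -> bidirectional LSTM -> per-token tag scores.
    A CRF layer on top of the scores would turn this into biLSTM-CRF."""

    def __init__(self, vocab_size, num_tags, emb_dim=50, hidden_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # forward and backward outputs are concatenated -> 2 * hidden_dim
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)            # (batch, seq_len, num_tags)

tagger = BiLSTMTagger(vocab_size=1000, num_tags=9)  # e.g. BIO tags for 4 types + O
scores = tagger(torch.randint(0, 1000, (1, 6)))     # one toy sentence, 6 tokens
tags = scores.argmax(-1)                            # softmax/argmax decoding, no CRF
```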

4
Q

transformer models

A
  • encoder-decoder architecture
  • much more efficient than bi-LSTMs and other RNNs because the input is processed in parallel instead of sequentially
  • can model longer-term dependencies because the complete input is processed at once
  • but they use a lot of memory because of quadratic complexity: O(n^2) for an input of n items
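A tiny illustration of the quadratic memory cost: the pairwise score matrix alone already has n × n entries.

```python
import numpy as np

n, d = 512, 64                                  # n input items, d-dim vectors
X = np.random.default_rng(2).normal(size=(n, d))
scores = X @ X.T                                # one score per pair of items
print(scores.shape)                             # (512, 512): grows as O(n^2)
```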
5
Q

the attention mechanism

A

When processing each item in the input, the model has access to all of the input items

Self-attention: each input token is compared to all other input tokens
=> comparison: dot product of the two vectors (the larger the value, the more similar the compared vectors)

  • Self-attention represents how words contribute to the representation of longer inputs and how strongly words are related to each other => allows us to model longer-distance relations between words

Disadvantage: attention is quadratic in the length of the input (computing dot products between each pair of tokens in the input at each layer)
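A simplified NumPy sketch of self-attention (my own illustration; real transformers add learned query/key/value projections and scaling by sqrt(d)):

```python
import numpy as np

def self_attention(X):
    """Each token is compared to all other tokens via dot products; the
    scores become weights via a softmax, and each output is a weighted sum."""
    scores = X @ X.T                                        # pairwise dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax per row
    return weights @ X                                      # context-aware representations

X = np.random.default_rng(3).normal(size=(6, 8))            # 6 tokens, 8-dim embeddings
print(self_attention(X).shape)                              # (6, 8)
```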

6
Q

BERT

A

Pre-training of Deep Bidirectional Transformers for Language Understanding

  1. Pre-training: language modelling
  2. Bidirectional: predicting randomly masked words in context
  3. Transformers: efficient neural architectures with self-attention
  4. Language understanding: encoding, not decoding (not generation)

BERT uses the encoder half of the transformer.
Core idea of BERT: self-supervised pre-training based on language modelling
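A sketch of using the pre-trained encoder with the Hugging Face transformers library (assumes the package is installed and the "bert-base-uncased" checkpoint can be downloaded):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")   # encoder half only

inputs = tokenizer("BERT encodes text, it does not generate it.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number_of_wordpieces, 768)
```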

7
Q

masked language modelling

A
  1. Predicting randomly masked words in context to capture the meaning of words
  2. Next-sentence classification to capture the relationship between sentences

both objectives are trained jointly
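The masked-word objective can be tried out with the fill-mask pipeline from the transformers library (assumed installed; the example sentence is my own):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The doctor prescribed a new [MASK]."):
    print(prediction["token_str"], prediction["score"])  # top candidate words
```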

8
Q

WordPiece

A

specific type of tokenization used by BERT

Fixed-size vocabulary is defined to model huge corpora
The WordPiece vocabulary is optimized to cover as many words as possible
- frequent words are single tokens, e.g. “walking” and “talking”
- less frequent words are split into subwords, e.g. “bi” + “##king”, “bio” + “##sta” + “##tist” + “##ics”
- this is not linguistically motivated, but purely computational
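A quick way to inspect WordPiece splits with the transformers library (assumed installed); the exact splits depend on the checkpoint's vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("walking"))        # frequent word: expected single token
print(tokenizer.tokenize("biostatistics"))  # rarer word: expected '##' subwords
```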

9
Q

success of BERT

A
  • achieves state-of-the-art results on a large range of tasks and even in a large range of domains
  • pre-trained models can easily be fine-tuned
  • pre-trained models are available for many languages, as well as domain-specific pre-trained BERT models: bioBERT etc.
10
Q

BERT for similarity

A

With BERT, if we want to compute the similarity (or some other relation) between two sentences, we concatenate them in the input and then feed them to the BERT encoder

Finding the most similar pair in a collection of 10,000 sentences takes about 65 hours with BERT.
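A sketch of this pair-input setup with the transformers library; the sequence-classification head here is untrained (it would need fine-tuning on a similarity dataset), so only the input format is illustrated:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)      # head is randomly initialized here

inputs = tokenizer("A man is playing a guitar.",
                   "Someone plays an instrument.", return_tensors="pt")
score = model(**inputs).logits              # one relatedness score per pair
print(tokenizer.decode(inputs["input_ids"][0]))  # [CLS] ... [SEP] ... [SEP]
```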

11
Q

SBERT

A
  • independent encoding of two sentences with a BERT encoder
  • then measure similarity between the two embeddings

=> reduces the effort for finding the most similar pair from 65 hours with BERT to about 5 seconds with SBERT, while maintaining the accuracy from BERT

12
Q

transfer learning with neural language models

A

Inductive transfer learning: transfer the knowledge from pretrained language models to any NLP task

  1. During pre-training, the model is trained on unlabeled data (self-supervision) over different pre-training tasks
  2. For fine-tuning, the BERT model is first initialized with the pre-trained parameters
  3. All of the parameters are fine-tuned using labeled data from the downstream tasks (supervised learning)

Each downstream task has its own fine-tuned model, even though all of them are initialized with the same pre-trained parameters.
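A minimal fine-tuning sketch with transformers and PyTorch (toy labelled data, illustrative hyperparameters): the model is initialized with the pre-trained parameters and all parameters are then updated with supervised learning:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # initialized from pre-training

texts, labels = ["great movie", "terrible plot"], [1, 0]   # toy labelled data
batch = tokenizer(texts, padding=True, return_tensors="pt")
batch["labels"] = torch.tensor(labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for _ in range(3):                            # a few supervised training steps
    loss = model(**batch).loss                # classification loss from the head
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```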

13
Q

zero-shot use

A

using a pre-trained model without fine-tuning

We also use the term ‘zero-shot’ for the use of models that were fine-tuned by someone else or on a different task, e.g.
- trained on a newspaper benchmark, applied to Twitter data
- trained on English, used for Dutch
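A sketch of this kind of zero-shot use with the transformers pipeline: a sentiment model fine-tuned by someone else on English data, applied as-is to a Dutch sentence (model name is the standard public SST-2 checkpoint):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Deze film was verrassend goed."))  # no further fine-tuning
```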

14
Q

few-shot learning

A

fine-tuning with a small number of samples

15
Q

challenges of state-of-the-art methods

A

time and memory expensive:
- pre-training takes time (days) and computing power
- fine-tuning takes time (hours) and computing power
- use of a fine-tuned model (inference) needs computing power

Hyperparameter tuning:
- optimization on development set takes time
- adoption of hyperparameters from pre-training task might be suboptimal

16
Q

BERT input

A

input embeddings are the sum of the token embeddings, segmentation embeddings and position embeddings
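An illustrative NumPy reconstruction with toy lookup tables: every input position sums three embeddings of the same size.

```python
import numpy as np

vocab_size, max_len, dim = 100, 16, 8          # toy sizes
rng = np.random.default_rng(4)
token_emb = rng.normal(size=(vocab_size, dim))
segment_emb = rng.normal(size=(2, dim))        # sentence A vs. sentence B
position_emb = rng.normal(size=(max_len, dim))

token_ids = np.array([5, 17, 42, 3])           # toy wordpiece ids
segment_ids = np.array([0, 0, 1, 1])
positions = np.arange(len(token_ids))

input_emb = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_emb.shape)                         # (4, 8): one summed vector per token
```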

17
Q

Semi-supervised relation extraction via bootstrapping

A

example: we want to find airline/hub pairs and only know that Ryanair has a hub at Charleroi
1. search for terms “Ryanair”, “Charleroi” and “hub” in some proximity to find example sentences
2. extract general patterns from these examples, e.g.
/ [ORG], which uses [LOC] as a hub /
3. use these patterns to search for additional tuples
4. assign confidence values to new tuples to avoid semantic drift (avoid erroneous patterns and tuples)
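A toy sketch of one bootstrapping iteration (corpus, pattern and pairs are invented; step 4, confidence scoring, is only hinted at in the comments):

```python
import re

corpus = [
    "Ryanair, which uses Charleroi as a hub, expanded its fleet.",
    "Wizz Air, which uses Budapest as a hub, opened new routes.",
]
seed_pairs = {("Ryanair", "Charleroi")}

# Step 2: a general pattern derived from sentences containing the seed pair.
pattern = re.compile(r"([\w ]+), which uses ([\w ]+) as a hub")

# Step 3: apply the pattern to find additional tuples; a real system would
# now assign confidence values to new tuples and patterns (step 4).
new_pairs = {m.groups() for text in corpus
             for m in pattern.finditer(text)} - seed_pairs
print(new_pairs)  # {('Wizz Air', 'Budapest')}
```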

18
Q

two parts that BERT consists of

A
  1. pre-training: language modelling (takes a long time)
  2. fine-tuning: training the model specific to a task (sentiment analysis, NER, question answering)