Topic 8: Transfer Learning & Autoregressive LLMs Flashcards

(21 cards)

1
Q

What are common data limitations in deep learning tasks?

A

deep learning success rests on big architectures, mature frameworks, powerful hardware, and the ability to work with big data

some limitations of the data are:
- there's not enough variance in the data, which leads to overfitting
- there's not enough data for a specific task
- only a small amount of the data is labelled

2
Q

How do we overcome the data limitations?

A

to overcome the dataset limitations, what we can do is:

  • data augmentation: Introduce variations to existing data
  • transfer learning: Leverage learned features from other domains
  • self-supervised learning: Generate labels from the structure of the data itself
  • semi-supervised learning: Use small labeled sets with large unlabeled data
2
Q

What is the effect of data augmentation?

A

remember: overfitting is when our model has little to no generalisation.

Data augmentation synthetically expands the training set by modifying data instances (e.g., rotation, cropping, synonym replacement). This helps models generalise better and prevents overfitting by exposing them to a broader data distribution.

the general idea is that we want to replace the empirical distribution with a smoothed distribution instead.

the practical approach is to build automated augmentation into the data loader: each batch then contains slightly modified instances, so the model rarely sees exactly the same training example twice (a minimal sketch follows below)
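A minimal sketch of automated augmentation in the data loader, assuming PyTorch/torchvision (the dataset path and the particular transforms are just illustrative):

```python
import torch
from torchvision import datasets, transforms

# each epoch sees a slightly different version of every image
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),        # random crop + resize
    transforms.RandomHorizontalFlip(),        # random flip
    transforms.ColorJitter(0.4, 0.4, 0.4),    # random brightness/contrast/saturation
    transforms.ToTensor(),
])

# hypothetical image-folder dataset; augmentation happens on the fly in the loader
train_set = datasets.ImageFolder("data/train", transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```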

2
Q

What is the difference between supervised and unsupervised pre-training?

A

pre-training: the general approach is to first teach the model the basic structure of a domain from a large dataset, so it can be reused for a similar downstream task. this helps the model learn useful patterns or representations that transfer well.

supervised pre-training: this is very straightforward: we use a large labelled dataset, but the fine-tuning data must have sufficiently similar characteristics. so when the model is pre-trained on, say, ImageNet and the downstream task is a vision task, it is easy to fine-tune on that kind of data; if the downstream task is an NLP task, it is much more difficult

unsupervised pre-training: we train a model on data without using any labels. this is less straightforward: we optimise the network for a reconstruction error without labels, which can be done using generative models (see the autoencoder sketch below)
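A minimal sketch of unsupervised pre-training via reconstruction, assuming PyTorch (a tiny autoencoder; the sizes are arbitrary):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())    # learns a compressed representation
decoder = nn.Linear(64, 784)                               # reconstructs the input

x = torch.rand(32, 784)                                    # a batch of unlabelled inputs
loss = nn.functional.mse_loss(decoder(encoder(x)), x)      # reconstruction error, no labels needed
loss.backward()
# the trained encoder can then be reused to initialise a model for a downstream task
```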

2
Q

How does self-supervised learning work?

A

It constructs pseudo-labels from the input data, such as masking tokens in text or patches in images, and trains the model to predict them. This enables the model to learn meaningful representations without manual annotation.

2
Q

What are possible difficulties in Transfer Learning?

A

One of the main challenges of Transfer Learning is the domain difference between the source dataset and the target dataset. When the distribution of data is significantly different, the pre-trained model may not be able to transfer knowledge effectively, resulting in lower-than-expected performance.

Another challenge in transfer learning is finding the right balance between overfitting and generalisation. Transferring too much knowledge from the source domain may lead to overfitting on the target domain, while transferring too little may hinder generalisation.

3
Q

What is the goal of transfer learning?

A

transfer learning is when a model has been pre-trained on one domain and then fine-tuned on another. say our model is pre-trained for feature extraction on ImageNet, and then fine-tuned for bird detection on a special dataset

so the pre-training is object detection on ImageNet, which:
- is huge
- has diverse contexts
- has sufficient labels

the task would be to do some rare bird classification, which has:
- few examples
- sparse context
- limited data

So a model that has been pre-trained on one domain and fine-tuned for another is transfer learning

3
Q

What is contrastive learning?

A

contrastive learning is when the model learns a representation by comparing pairs of inputs. it learns to:
- recognise similar objects, by pulling them closer in the representation space
- contrast dissimilar objects, by pushing them apart

An example of this is SimCLR, a framework for contrastive learning of visual representations:
- you apply stochastic, semantics-preserving data augmentations to an image to create two views

SimCLR uses what's also called a siamese network: you feed two inputs to the network, and conceptually it's like there are two identical networks that share the same weights.

  • The goal is to (see the loss sketch below):
    • Maximize agreement between the two views (positive pairs)
    • Minimize agreement with other images (negative pairs)
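A minimal sketch of the SimCLR-style contrastive (NT-Xent) loss, assuming PyTorch; z1 and z2 are the embeddings of the two augmented views of the same batch:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1[i] and z2[i] are a positive pair; all other samples act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                  # (2N, d)
    sim = z @ z.t() / temperature                   # pairwise cosine similarities
    sim.masked_fill_(torch.eye(len(z), dtype=torch.bool), float("-inf"))  # ignore self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # each view's positive is its twin
    return F.cross_entropy(sim, targets)            # pull positives together, push negatives apart
```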
3
Q

What is the CLIP model?

A

Contrastive Language-Image Pre-training (CLIP)
- it combines a ResNet-based image encoder and a Transformer-based text encoder
- it's trained on a huge, noisily labelled set of (image, text) pairs

CLIP jointly trains the image encoder and the text encoder to embed images and their corresponding captions into the same space. During training, CLIP uses a large set of (image, text) pairs and learns to associate matching pairs by pulling them closer in embedding space, while pushing apart mismatched pairs. This enables CLIP to generalize across a wide variety of vision-language tasks, such as zero-shot image classification, where no task-specific fine-tuning is performed.

In short: CLIP trains image and text encoders to produce aligned embeddings for paired (image, caption) inputs, using a contrastive loss that encourages matching pairs to be closer than mismatched ones.
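A minimal sketch of this symmetric contrastive objective, assuming PyTorch; image_emb and text_emb come from the two encoders for a batch of matching (image, caption) pairs:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature    # (N, N): every image vs. every caption
    targets = torch.arange(image_emb.size(0))          # matching pairs sit on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # image -> correct caption
    loss_texts = F.cross_entropy(logits.t(), targets)  # caption -> correct image
    return (loss_images + loss_texts) / 2
```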

4
Q

What is domain adaptation in transfer learning?

A

in many cases the source domain and the target domain are different from each other:
- example: graphical images vs. real photos
- example: movie reviews vs. product reviews

the goal is to fit the model to the source domain, and then modify the parameters to be compatible with the target domain

if the model is trained on the source domain and then applied directly to a specific target domain, it will be a bad fit. we therefore need adaptation, e.g. modelling the shift between the domains

domain adaptation is a vast, but largely task-specific, field of how we can let a classifier learn from a source domain and thereafter generalise to a target domain

5
Q

What kinds of learning are there (zero-shot, one-shot, few-shot)?

A

Zero-shot: the ability to do many tasks with no examples, and no gradient updates, by simply:

  • comparing the probabilities of sequences
  • specifying the right sequence prediction problem (e.g. question answering)

Zero-shot: You only give the task description and the query, like "Translate English to French: cheese ⇒". The model must generalize without any examples.

One-shot: You give one example along with the prompt, e.g., "sea otter ⇒ loutre de mer", then ask it to translate "cheese ⇒".

Few-shot: You give multiple examples, like a mini training set in the prompt, then ask for a new translation.
We specify a task by simply prepending examples of the task before the example we want answered (see the prompt sketch below).

  • in-context learning: no gradient updates are performed when learning a new task
  • this ability is an emergent property of model scaling
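A small illustration of the three prompt styles (the translation pairs follow the GPT-3 paper's example; the exact wording is just illustrative):

```python
zero_shot = "Translate English to French:\ncheese =>"

one_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "cheese =>")

few_shot = ("Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "peppermint => menthe poivrée\n"
            "cheese =>")
# no gradient updates: the examples only ever live in the prompt (in-context learning)
```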
5
Q

What is fine-tuning in transfer learning?

A

we have frozen layers:
- we reuse the same layers (dimensionality, activation functions and so on)
- we copy the trained weight matrices from the source model
- often we apply different learning rates; for smaller pre-trained networks, the learning rate should be higher

for downstream tasks:
- we can adapt to specialised tasks, such as related domains
- but this sometimes needs transformations, such as resizing, colour conversion and so on, to match the expected input format of the model

there are some difficulties:
- we can encounter catastrophic forgetting, where the model forgets what it learned earlier, if it is fine-tuned too aggressively
- training can be slow, especially if there are many layers that need to be fine-tuned

(a minimal freezing/fine-tuning sketch follows below)
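A minimal sketch, assuming PyTorch and a recent torchvision: freeze the pre-trained backbone, replace the head for the downstream task, and give the parameter groups different learning rates:

```python
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)   # pre-trained on ImageNet
for p in model.parameters():
    p.requires_grad = False                                        # frozen layers

model.fc = torch.nn.Linear(model.fc.in_features, 10)               # new head, e.g. 10 bird classes

for p in model.layer4.parameters():                                # optionally unfreeze the last block
    p.requires_grad = True

optimizer = torch.optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-5},             # tiny lr for pre-trained weights
    {"params": model.fc.parameters(), "lr": 1e-3},                 # larger lr for the new head
])
```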

5
Q

What are some advanced fine-tuning methods?

A

fine-tuning can be seen as a continuation of pre-training
- fine-tuning happens on the downstream task
- weights are partially frozen

parameter-efficient fine-tuning (PEFT)
- we create a small number of new parameters and adapt them to the new domain
- the pre-trained model remains frozen (see the adapter sketch below)

masked language modelling and multi-task fine-tuning
- we add a new specialised head (e.g. for classification)
- it can adapt to a domain
- the pre-trained model may be slightly trained as well

instruction tuning:
- there's modest additional training on instruction data
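A hand-rolled sketch of the parameter-efficient idea (a LoRA-style low-rank adapter, assuming PyTorch): the pre-trained weight stays frozen and only the small new matrices A and B are trained:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False                       # frozen pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, pretrained.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(pretrained.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # original output plus a trainable low-rank correction
        return self.pretrained(x) + self.scale * (x @ self.A.t() @ self.B.t())
```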

5
Q

What is the difference between BERT and GPT in architecture and pretraining?

A

BERT and GPT are both trained using unsupervised (self-supervised) learning on large corpora of plain text. While both models use the transformer architecture introduced by Vaswani et al., they focus on different parts of it and follow different pretraining strategies.

BERT:
- focuses on transformer encoder blocks
- its pretraining involves a task called Masked Language Modeling (MLM), where some tokens in the input sentence are randomly masked and the model is trained to predict them, alongside other multi-task learning objectives (e.g. next sentence prediction)

GPT:
- focuses on transformer decoder blocks
- its pretraining is autoregressive language modelling: predicting the next token given all previous tokens. This makes GPT particularly well-suited for generative tasks, like text completion or story generation.
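A tiny sketch of the structural difference, assuming PyTorch: a decoder block (GPT) uses a causal attention mask so each token only sees the past, while an encoder block (BERT) attends bidirectionally:

```python
import torch

T = 5  # sequence length
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # GPT: token i attends only to tokens <= i
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)        # BERT: every token sees the whole sentence
```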

“BERT and GPT are word2vec and CBoW embedding models with the depth of all used texts”

6
Q

What is in-context learning and induction heads?

A

our insight so far → in-context learning:
- prompting the model with demonstrations can teach the model to do a new task
- the further into a prompt the model gets, the better it becomes at predicting the upcoming tokens
- TODO

induction head hypothesis
- predicts repeated sequences: A B … A → B ⇒ pattern completion
- an induction head has a prefix-matching component (find an earlier occurrence of A) and a copying component (predict the B that followed it)

6
Q

What is RLHF (Reinforcement Learning from Human Feedback)?

A

A training method where human preferences are used to create a reward model. This reward guides fine-tuning of language models via reinforcement learning to improve helpfulness and safety.

  1. Collect demonstration data and train a supervised policy
  2. Collect comparison data and train a reward model
  3. Optimise a policy against the reward model using reinforcement learning
6
Q

What is multi-task learning?

A

Multi-task learning is when we simultaneously train one model on several specialised target tasks/domains.

how do we condition the model for individual tasks?

should each task:
- Use completely independent heads? (no shared parameters)
- Share a base and only differ in the heads (multi-head setup)?

How to define the training objective?
How to combine losses from different tasks?
- Use fixed weights for each task?
- Or dynamically learn how much each task contributes (e.g., based on uncertainty or task difficulty)?

How to optimize this?
Options:
- sample a mini-batch of tasks, each with its own data
- or sample a mini-batch of datapoints from each task
(a minimal multi-head sketch follows below)
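A minimal sketch of the shared-base, multi-head setup with fixed loss weights, assuming PyTorch (the dimensions and weights are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim=128, hidden=256, classes_per_task=(5, 3)):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())   # shared parameters
        self.heads = nn.ModuleList(nn.Linear(hidden, c) for c in classes_per_task)

    def forward(self, x, task_id):
        return self.heads[task_id](self.base(x))       # task-specific head

task_weights = [1.0, 0.5]                               # fixed per-task loss weights

def combined_loss(model, task_batches):
    """task_batches: one (x, y) mini-batch per task."""
    total = 0.0
    for task_id, (x, y) in enumerate(task_batches):
        total = total + task_weights[task_id] * F.cross_entropy(model(x, task_id), y)
    return total
```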

6
Q

Explain the masking approach e.g. in BERT pre-training

A
  • encoders have bidirectional context and can condition on the future
  • but how do we train them to build strong representations?
    • the standard pre-training of language models is to predict the next token
    • because encoders get bidirectional context, we can't do ordinary language modelling
      • Encoders "see" the whole sentence (past and future). That breaks traditional language modeling, which requires predicting words only from left to right. So, BERT uses a different approach: masking random words instead.

idea 1: replace some fraction of the words in the input with a special [MASK] token, and train the model to predict these masked words using the surrounding context:

h_1, …, h_T = Encoder(w_1, …, w_T)
y_i ~ A h_i + b

Only the words that are "masked out" add a loss term. if ~x is the masked version of x, we're learning p_θ(x|~x), which is called a masked LM

Idea 2 (what BERT does): predict a random fraction (15% in BERT) of the (subword) tokens. Of the selected tokens, replace the input word with [MASK] 80% of the time, replace it with a random token 10% of the time, and leave it unchanged 10% of the time (but still predict it).

Leaving some selected words unchanged avoids the weakness of idea 1, where the model does not build strong representations of non-masked words. (See the masking sketch below.)
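A minimal sketch of the 80/10/10 masking scheme, assuming PyTorch (mask_id and vocab_size are whatever the tokenizer provides):

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob        # positions the model must predict
    labels[~selected] = -100                                  # all other positions are ignored by the loss

    input_ids = input_ids.clone()
    masked = selected & (torch.rand(input_ids.shape) < 0.8)   # 80% of selected -> [MASK]
    input_ids[masked] = mask_id

    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random token
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    # the remaining 10% stay unchanged but are still predicted
    return input_ids, labels
```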

7
Q

How does Self-Supervised Pre-Training work?

A

self-supervision is when we supervise the model using labels that are generated from the data itself, without any external label sources. so we generate the labels automatically from the data.

an example of this is learning object features:
- we use data augmentation to teach the model invariances
- these pretext tasks help the model build a rich feature map, so it can recognise a cat even if the image is rotated

word embeddings:
- we can use language modelling as a pretext task, such as predicting the next word, a masked word and so on
- this creates word embeddings that capture semantic meaning
- these embeddings can therefore be reused in downstream NLP tasks

we can initialise the first layer with word2vec embeddings so we start from rich features (see the sketch below)
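A minimal sketch of reusing pre-trained word embeddings, assuming PyTorch (pretrained_vectors stands in for a real (vocab_size, embed_dim) word2vec matrix):

```python
import torch
import torch.nn as nn

pretrained_vectors = torch.randn(10_000, 300)     # placeholder for real word2vec vectors

# the first layer of the downstream model starts from the pre-trained embeddings
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
# freeze=False lets the embeddings keep adapting during fine-tuning on the downstream task
```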

7
Q

What is chain-of-thought prompting?

A

A prompting technique where the model is encouraged to generate intermediate reasoning steps before producing a final answer, improving performance on tasks requiring logical or multi-step thinking.
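A small illustration (the arithmetic example is the well-known one from the chain-of-thought paper; the wording is just illustrative):

```python
# the model is prompted to produce intermediate reasoning steps before the final answer
prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Let's think step by step."
)
# expected style of output: "Roger started with 5 balls. 2 cans of 3 balls is 6. 5 + 6 = 11."
```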

7
Q

Explain post-training and model alignment

A

prompting has its limits: sometimes it doesn't work and the LLM fails, because the model is not sufficiently helpful

consequences so far:

  • LLMs can simultaneously be unhelpful and harmful
  • it can generate text that is false
  • it can generate text that is toxic

reasons for this?

  • the pre-training objective is misaligned with the human need for models to be helpful and non-harmful