Week 8: Transfer Learning & Autoregressive LLMs Flashcards
(15 cards)
What are the main solutions to dataset limitations in deep learning, and what specific problems do they address?
Not enough variance in data → Data Augmentation
Not enough data for specific task → Transfer Learning
Small amount of labeled data → Self-supervised Learning & Semi-supervised Learning
What is data augmentation and what are the main image augmentation techniques? How does it relate to overfitting?
Data Augmentation:
Purpose:
Increase training data to combat overfitting
Replace empirical distribution with smoothed distribution
Build automated augmentation in data loader
Connection to Overfitting:
Overfitting occurs when model memorizes training data
Data augmentation creates variations of existing data
Forces model to learn more generalizable features
Main Image Augmentation Techniques (see the sketch after this list):
Rotation - Rotate image by degrees
Crop - Take sections of original image
Colour-shift - Modify color channels/hue
Noise - Add random noise to pixels
Loss - Remove/mask parts of image
Mirror - Horizontal/vertical flipping
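A minimal sketch of such an automated augmentation pipeline in the data loader, assuming PyTorch/torchvision; the transform choices and parameter values are illustrative, not prescribed by the slides:

```python
# Illustrative torchvision pipeline covering the techniques listed above.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # Rotation
    transforms.RandomResizedCrop(224),                # Crop
    transforms.ColorJitter(hue=0.1, saturation=0.2),  # Colour-shift
    transforms.RandomHorizontalFlip(p=0.5),           # Mirror
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # Noise
    transforms.RandomErasing(p=0.25),                 # Loss (mask out a patch)
])
# Attach to a dataset so every epoch sees fresh variations of each image,
# e.g. torchvision.datasets.ImageFolder("data/train", transform=augment).
```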
What is the long tail problem in data distribution?
Long Tail Problem:
Most data follows a long-tail distribution: a few categories have abundant data, while many categories have only a little
Examples: Objects/subjects encountered, interactions with people, words heard, driving scenarios
What are the different approaches to pre-training and how do supervised and unsupervised pre-training differ?
General Principle:
Teach basic structure of domain similar to downstream task
Supervised Pre-training:
Method: Use large labelled dataset
Requirement: Need sufficiently similar characteristics between pre-training and target data
Examples:
Vision tasks: ImageNet (image → “Cat” label)
NLP tasks: ASR, NLU (English corpora)
Unsupervised Pre-training:
Conventional: Optimize networks for reconstruction error without labels
Leads to: Generative models (covered later)
Modern Approach: Self-supervised learning on discriminative tasks
Self-supervised Learning:
Concept: Create supervision signal from the data itself
Process: Self-supervised model → Downstream layers → Final prediction
Key: “Downstream” refers to later layers in the processing stream
Advantage: Can leverage unlabeled data while still learning useful representations for discriminative tasks.
What are the three parameter freezing strategies in transfer learning based on dataset size?
Small Dataset:
Strategy: Freeze all layers
Training: Only train weights on output layer
Rationale: Prevent overfitting with limited data
Medium Dataset:
Strategy: Partial freezing
Training: Train weights on last layers & output
Rationale: Balance adaptation with overfitting prevention
Large Dataset:
Strategy: Minimal freezing
Training: Train weights on all layers using pre-trained weight initialization
Rationale: Sufficient data allows full network adaptation
Key Principle: More data allows training more parameters without overfitting.
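A minimal sketch of the three strategies in PyTorch, using a torchvision ResNet-18 as an assumed pre-trained backbone (the layer names are specific to that architecture; the three alternatives are shown in sequence):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 10)    # new output layer (10 classes, illustrative)

# Small dataset: freeze all layers, train only the new output layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

# Medium dataset: additionally unfreeze the last block ("layer4" in ResNet-18).
for p in model.layer4.parameters():
    p.requires_grad = True

# Large dataset: train all layers, using pre-trained weights only as initialization.
for p in model.parameters():
    p.requires_grad = True
```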
What is the fine-tuning process in transfer learning and what are its main challenges?
Fine-Tuning Process:
Technical Steps:
Architecture: Use same layers (dimensionality, activation functions)
Weights: Copy exported weight matrices from source model
Learning rates: Often use different (lower) rates for pre-trained layers
Preprocessing: May need transformations (resize, color conversion)
Downstream Tasks:
Specialized task in similar domain
Example: ImageNet → Rare bird classification
Main Difficulties:
Catastrophic forgetting: Loss of valuable pre-trained knowledge during fine-tuning
Slow training: Fine-tuning process can be computationally expensive
Key Insight: Balance preserving pre-trained knowledge while adapting to new task.
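A minimal sketch of per-layer learning rates during fine-tuning, reusing the ResNet-18 model from the previous card's sketch (the rates are illustrative):

```python
import torch

# A lower learning rate on pre-trained layers limits catastrophic forgetting;
# the freshly initialized head gets a larger rate so it can adapt quickly.
optimizer = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-5},  # pre-trained layers
    {"params": model.fc.parameters(), "lr": 1e-3},      # new output head
])
```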
What is self-supervised pre-training and what are the main categories of self-supervised tasks?
Self-Supervised Pre-Training:
Definition:
Supervise using labels generated from data without external label sources
Main Categories:
1. Imputation Tasks:
Method: Partition the input into x = (x_h, x_v), a hidden part and a visible part
Goal: Train the model to predict the hidden part, x̂_h = f(x_v, x_h = 0) (“fill-in-the-blank”)
Examples: Cloze task (NLP), image inpainting
Reference: Pathak et al. 2016
2. Proxy/Pretext Tasks:
Method: Pair input into tuples with semantic relationship r
Goal: Learn function of relationship p(y|x1, x2) = p(y|r[f(x1), f(x2)])
Example: Learning “rotation” relationships between images
Reference: Gidaris et al. 2018
3. Contrastive Learning:
Method: Recognize similar objects, contrast to other objects
Example: SimCLR - maximize agreement for similar views, minimize for different views
How does contrastive learning work and what is the SimCLR approach?
Contrastive Learning:
Core idea: Learn representations by pulling similar things together and pushing dissimilar things apart in the embedding space.
The Basic Setup:
Anchor: Original image of a cat
Positive: Same cat image with different augmentation (rotation, crop, etc.)
Negatives: Images of dogs, cars, etc.
Embed everything: Pass through neural network to get embeddings
Contrastive loss (sketched in code below):
Pull anchor and positive embeddings closer
Push anchor and negative embeddings apart
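A minimal sketch of an NT-Xent-style loss in the spirit of SimCLR, assuming z1 and z2 hold the embeddings of two augmented views of the same batch of images:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Each example's positive is its other augmented view; all others are negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N unit-length embeddings
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # an example never matches itself
    n = z1.size(0)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each positive
    return F.cross_entropy(sim, targets)                # pull positives, push negatives
```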
What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a multimodal model that learns a joint embedding space for images and text by contrastive training on image-text pairs collected from the internet.
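A minimal sketch of zero-shot classification with CLIP via the Hugging Face transformers library (assumed installed; the image path is hypothetical):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=1)              # which caption best matches the image
```
Because images and text share one embedding space, classification reduces to comparing an image against arbitrary captions, with no task-specific training.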
What is domain adaptation and what are the main approaches in the domain adaptation taxonomy?
Source domain & target domains are different
Example: Graphical images vs. real photos, movie reviews vs. product reviews
Result: Model trained on source domain has bad fit when applied to target domain
Goal: Fit model to source domain, then modify parameters to be compatible with target domain
Main Approaches:
1. Model-centric:
Feature-centric: Pivot-based, SFA, PBLM, autoencoders
Loss-centric: Adversarial losses, DANN, DSN, reweighting
2. Data-centric:
Pseudo-labeling: Self-labeling, adaptive ensembling, co-training
Data selection: Jensen-Shannon, perplexity, dynamic selection
3. Hybrid:
Pre-training: AdaptaBERT, DAPT, TAPT, STILT
Key Need: Adaptation techniques that handle domain shift (e.g., change over time) and make domain-specific adjustments
Reference: Ramponi & Plank 2020
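A minimal sketch of data-centric adaptation by pseudo-labeling (self-labeling); model, target_loader, and the 0.9 confidence threshold are illustrative assumptions:

```python
import torch

model.eval()
pseudo_examples = []
with torch.no_grad():
    for x in target_loader:            # unlabeled target-domain batches
        probs = model(x).softmax(dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf > 0.9              # trust only confident predictions
        pseudo_examples.append((x[keep], pred[keep]))
# Mix the pseudo-labeled target data into training and re-fit, gradually
# shifting the model from the source domain toward the target domain.
```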
What is Multi-task Learning (MTL) and how does it work with shared architectures?
Multi-task Learning (MTL):
Definition:
Simultaneously train a single model on multiple related tasks or specialized target domains
Examples:
Vision: Different objects, occlusions, lighting, translations
NLP: Syntactic or semantic relationships
BERT Multi-task Example:
Sentence pair classification
Single sentence classification
Question answering
Single sentence tagging
Architecture:
Shared parameters (θs): Common feature extraction layers
Task-specific parameters (θ1, θ2, θn): Individual output heads for each task
Multiple heads: Same model with different task-specific inputs & outputs
Key Benefit: Shared representations learn generalizable features across related tasks
Reference: BERT/Devlin et al. 2019
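A minimal sketch of hard parameter sharing: one shared encoder (θs) feeding one small head per task (θ1 … θn); the layer sizes and task count are illustrative:

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim=300, hidden=768, classes_per_task=(2, 5, 3)):
        super().__init__()
        self.encoder = nn.Sequential(          # shared parameters (theta_s)
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(            # task-specific heads (theta_1..n)
            [nn.Linear(hidden, c) for c in classes_per_task]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.encoder(x))  # shared features, per-task output
```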
Why can’t traditional autoregressive language modeling be used for BERT encoder pre-training, and how does Masked Language Model (MLM) solve this?
Problem with Traditional Language Modeling:
Traditional LM: Autoregressive prediction (predict next token from left-to-right only)
Requires: Causal attention masking (can only see previous tokens)
Encoder needs: Bidirectional attention for downstream tasks
Mismatch: Pre-training (unidirectional) vs. fine-tuning (bidirectional)
Masked Language Model Solution:
Keep: Bidirectional attention (no causal masking)
Create prediction task: Replace words with [MASK] tokens, predict original words
Formula: h₁, …, hₜ = Encoder(w₁, …, wₜ), then yᵢ ∼ Ahᵢ + b
Loss: Only from masked positions: pθ(x|x̃) where x̃ is masked version
Key Insight: MLM enables bidirectional pre-training that matches bidirectional fine-tuning usage.
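A minimal sketch of the MLM loss, assuming encoder is any bidirectional Transformer encoder and output_proj implements yᵢ ∼ Ahᵢ + b; mask marks the corrupted positions:

```python
import torch
import torch.nn.functional as F

def mlm_loss(encoder, output_proj, x_corrupted, x_original, mask):
    h = encoder(x_corrupted)    # (batch, T, d) hidden states, bidirectional attention
    logits = output_proj(h)     # (batch, T, vocab): y_i ~ A h_i + b
    labels = x_original.clone()
    labels[~mask] = -100        # compute loss only at masked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```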
How does BERT implement the masking strategy and what are its three-level embeddings?
BERT Three-Level Embeddings:
Token Embeddings: Subword vectors of size 768
Segment Embeddings: 2 labels (sentence A vs. sentence B, for sentence-pair tasks)
Position Embeddings: 512 positions (maximum sequence length)
Masking Strategy (applied to a random 15% of tokens; see the sketch below):
80%: Replace with [MASK] token
10%: Replace with random token
10%: Leave unchanged
Example: “I pizza to the [M]”
“pizza” = replaced with a random token
“to” = selected but left unchanged
“[M]” = replaced with the [MASK] token
Why the Mixed Strategy:
If only [MASK] replacements were used, the model could neglect representations of non-masked words; mixing in random and unchanged tokens forces it to learn from all positions and contexts.
Result: Strong bidirectional representations for downstream tasks.
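A minimal sketch of the 80/10/10 corruption of a random 15% of tokens; mask_id and vocab_size are assumed to come from the tokenizer:

```python
import torch

def corrupt(tokens, mask_id, vocab_size, p_select=0.15):
    x = tokens.clone()
    selected = torch.rand_like(tokens, dtype=torch.float) < p_select  # the 15%
    roll = torch.rand_like(tokens, dtype=torch.float)
    x[selected & (roll < 0.8)] = mask_id                   # 80%: [MASK] token
    swap = selected & (roll >= 0.8) & (roll < 0.9)         # 10%: random token
    x[swap] = torch.randint_like(tokens, vocab_size)[swap]
    # The remaining 10% of selected tokens stay unchanged but are still predicted.
    return x, selected                                     # selected = loss positions
```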
What is the alignment problem in language models and how does human feedback help solve it?
Language Modelling ≠ Assisting Users
Alignment Problem:
Language models are not aligned with user intent
Example Issues:
Prompt: “Explain the moon landing to a 6 year old in a few sentences”
GPT-3 Completions:
“Explain the theory of gravity to a 6 year old”
“Explain the theory of relativity to a 6 year old in a few sentences”
“Explain the big bang theory to a 6 year old”
“Explain evolution to a 6 year old”
Problem: Model continues the pattern rather than following the instruction
Solution: Human Feedback
Example: Human response about moon landing with rocket ship, astronauts, lunar surface exploration
Method: Fine-tuning with human feedback to the rescue!
Goal: Align model outputs with user intentions rather than just pattern completion
Key Challenge: Standard language modeling objective doesn’t match user assistance goals
What is the RLHF (Reinforcement Learning from Human Feedback) pipeline used in ChatGPT?
RLHF Pipeline (ChatGPT Method):
Three-Stage Process:
1. Collect Demonstration Data & Train Supervised Policy:
Method: Instruction tuning
Data: Human demonstrations of desired outputs
Goal: Basic instruction-following capability
2. Collect Comparison Data & Train Reward Model:
Method: Supervised training of a reward model on ranked outputs
Process: Humans rank multiple model outputs
Goal: Learn human preference function R(x,y)
3. Optimize Policy Against Reward Model:
Method: Reinforcement learning (PPO)
Goal: Maximize expected reward: E_{y~π_θ}[R(x,y)]
Constraint: Stay close to supervised model
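A minimal sketch of stage 2's pairwise reward-model loss from Ouyang et al. 2022, pushing the scalar reward of the human-preferred output y_w above the dispreferred y_l; reward_model is an assumed module returning one scalar per (prompt, output) pair:

```python
import torch.nn.functional as F

def reward_loss(reward_model, x, y_w, y_l):
    r_w = reward_model(x, y_w)              # reward for the preferred output
    r_l = reward_model(x, y_l)              # reward for the dispreferred output
    return -F.logsigmoid(r_w - r_l).mean()  # -log sigma(r_w - r_l)
```
Stage 3 then maximizes the learned reward with PPO while penalizing KL divergence from the supervised policy, which keeps the optimized model close to the stage-1 model.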
Key Innovation:
RL: handles problems that are only partially observable and lack a differentiable objective
Reward function: difficult to define by hand, so it is learned from human comparisons
Human feedback: comparatively cheap to collect at scale
PPO: optimizes in policy space within a trust region, via stochastic gradient updates
Reference: Ouyang et al./OpenAI 2022
Result: ChatGPT, built from Microsoft-licensed OpenAI GPT models, highly curated & augmented text, and roughly 40 trained labellers.