Week 8: Transfer Learning & Autoregressive LLMs Flashcards
(15 cards)
What are the main solutions to dataset limitations in deep learning, and what specific problems do they address?
Not enough variance in data → Data Augmentation
Not enough data for specific task → Transfer Learning
Small amount of labeled data → Self-supervised Learning & Semi-supervised Learning
What is data augmentation and what are the main image augmentation techniques? How does it relate to overfitting?
Data Augmentation:
Purpose:
Increase training data to combat overfitting
Replace empirical distribution with smoothed distribution
Build automated augmentation in data loader
Connection to Overfitting:
Overfitting occurs when model memorizes training data
Data augmentation creates variations of existing data
Forces model to learn more generalizable features
Main Image Augmentation Techniques (see the sketch after this list):
Rotation - Rotate image by degrees
Crop - Take sections of original image
Colour-shift - Modify color channels/hue
Noise - Add random noise to pixels
Loss - Remove/mask parts of image
Mirror - Horizontal/vertical flipping
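A minimal sketch of such an automated augmentation pipeline in the data loader, assuming PyTorch/torchvision; the transform choices and parameter values are illustrative, not prescribed by the slides:

```python
# Illustrative torchvision pipeline covering the techniques listed above.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # Rotation
    transforms.RandomResizedCrop(224),                # Crop
    transforms.ColorJitter(hue=0.1, saturation=0.2),  # Colour-shift
    transforms.RandomHorizontalFlip(p=0.5),           # Mirror
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # Noise
    transforms.RandomErasing(p=0.25),                 # Loss (mask out a patch)
])
# Attach to a dataset so every epoch sees fresh variations of each image,
# e.g. torchvision.datasets.ImageFolder("data/train", transform=augment).
```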
What is the long tail problem in data distribution?
Long Tail Problem:
Most data follows a long-tail distribution: a few categories have abundant data, while many categories have only a little
Examples: Objects/subjects encountered, interactions with people, words heard, driving scenarios
What are the different approaches to pre-training and how do supervised and unsupervised pre-training differ?
General Principle:
Teach basic structure of domain similar to downstream task
Supervised Pre-training:
Method: Use large labelled dataset
Requirement: Need sufficiently similar characteristics between pre-training and target data
Examples:
Vision tasks: ImageNet (image → “Cat” label)
NLP tasks: ASR, NLU (English corpora)
Unsupervised Pre-training:
Conventional: Optimize networks for reconstruction error without labels
Leads to: Generative models (covered later)
Modern Approach: Self-supervised learning on discriminative tasks
Self-supervised Learning:
Concept: Create supervision signal from the data itself
Process: Self-supervised model → Downstream layers → Final prediction
Key: “Downstream” refers to later layers in the processing stream
Advantage: Can leverage unlabeled data while still learning useful representations for discriminative tasks.
What are the three parameter freezing strategies in transfer learning based on dataset size?
Small Dataset:
Strategy: Freeze all layers
Training: Only train weights on output layer
Rationale: Prevent overfitting with limited data
Medium Dataset:
Strategy: Partial freezing
Training: Train weights on last layers & output
Rationale: Balance adaptation with overfitting prevention
Large Dataset:
Strategy: Minimal freezing
Training: Train weights on all layers using pre-trained weight initialization
Rationale: Sufficient data allows full network adaptation
Key Principle: More data allows training more parameters without overfitting.
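A minimal sketch of the three strategies in PyTorch, using a torchvision ResNet-18 as an assumed pre-trained backbone (the layer names are specific to that architecture; the three alternatives are shown in sequence):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 10)    # new output layer (10 classes, illustrative)

# Small dataset: freeze all layers, train only the new output layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

# Medium dataset: additionally unfreeze the last block ("layer4" in ResNet-18).
for p in model.layer4.parameters():
    p.requires_grad = True

# Large dataset: train all layers, using pre-trained weights only as initialization.
for p in model.parameters():
    p.requires_grad = True
```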
What is the fine-tuning process in transfer learning and what are its main challenges?
Fine-Tuning Process:
Technical Steps:
Architecture: Use same layers (dimensionality, activation functions)
Weights: Copy exported weight matrices from source model
Learning rates: Often use different (lower) rates for pre-trained layers
Preprocessing: May need transformations (resize, color conversion)
Downstream Tasks:
Specialized task in similar domain
Example: ImageNet → Rare bird classification
Main Difficulties:
Catastrophic forgetting: Loss of valuable pre-trained knowledge during fine-tuning
Slow training: Fine-tuning process can be computationally expensive
Key Insight: Balance preserving pre-trained knowledge while adapting to new task.
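A minimal sketch of per-layer learning rates during fine-tuning, reusing the ResNet-18 model from the previous card's sketch (the rates are illustrative):

```python
import torch

# A lower learning rate on pre-trained layers limits catastrophic forgetting;
# the freshly initialized head gets a larger rate so it can adapt quickly.
optimizer = torch.optim.AdamW([
    {"params": model.layer4.parameters(), "lr": 1e-5},  # pre-trained layers
    {"params": model.fc.parameters(), "lr": 1e-3},      # new output head
])
```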
What is self-supervised pre-training and what are the main categories of self-supervised tasks?
Self-Supervised Pre-Training:
Definition:
Supervise using labels generated from data without external label sources
Main Categories:
1. Imputation Tasks:
Method: Partition the input into x = (x_h, x_v), a hidden part and a visible part
Goal: Train the model to predict the hidden part, x̂_h = f(x_v, x_h = 0) (“fill-in-the-blank”)
Examples: Cloze task (NLP), image inpainting
Reference: Pathak et al. 2016
2. Proxy/Pretext Tasks:
Method: Pair input into tuples with semantic relationship r
Goal: Learn function of relationship p(y|x1, x2) = p(y|r[f(x1), f(x2)])
Example: Learning “rotation” relationships between images
Reference: Gidaris et al. 2018
3. Contrastive Learning:
Method: Recognize similar objects, contrast to other objects
Example: SimCLR - maximize agreement for similar views, minimize for different views
How does contrastive learning work and what is the SimCLR approach?
Contrastive Learning:
Core idea: Learn representations by pulling similar things together and pushing dissimilar things apart in the embedding space.
The Basic Setup:
Anchor: Original image of a cat
Positive: Same cat image with different augmentation (rotation, crop, etc.)
Negatives: Images of dogs, cars, etc.
Embed everything: Pass through neural network to get embeddings
Contrastive loss (sketched in code below):
Pull anchor and positive embeddings closer
Push anchor and negative embeddings apart
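A minimal sketch of an NT-Xent-style loss in the spirit of SimCLR, assuming z1 and z2 hold the embeddings of two augmented views of the same batch of images:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Each example's positive is its other augmented view; all others are negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N unit-length embeddings
    sim = z @ z.t() / temperature                       # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # an example never matches itself
    n = z1.size(0)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each positive
    return F.cross_entropy(sim, targets)                # pull positives, push negatives
```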
What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a multimodal model that learns a joint embedding space for images and text by contrastive training on image-text pairs collected from the internet.
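A minimal sketch of zero-shot classification with CLIP via the Hugging Face transformers library (assumed installed; the image path is hypothetical):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=1)              # which caption best matches the image
```
Because images and text share one embedding space, classification reduces to comparing an image against arbitrary captions, with no task-specific training.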
What is domain adaptation and what are the main approaches in the domain adaptation taxonomy?
Source domain & target domains are different
Example: Graphical images vs. real photos, movie reviews vs. product reviews
Result: Model trained on source domain has bad fit when applied to target domain
Goal: Fit model to source domain, then modify parameters to be compatible with target domain
Main Approaches:
1. Model-centric:
Feature-centric: Pivot-based, SFA, PBLM, autoencoders
Loss-centric: Adversarial losses, DANN, DSN, reweighting
2. Data-centric:
Pseudo-labeling: Self-labeling, adaptive ensembling, co-training
Data selection: Jensen-Shannon, perplexity, dynamic selection
3. Hybrid:
Pre-training: AdaptaBERT, DAPT, TAPT, STILT
Key Need: Adaptation techniques that handle domain shift (e.g., change over time) and make domain-specific adjustments
Reference: Ramponi & Plank 2020
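A minimal sketch of data-centric adaptation by pseudo-labeling (self-labeling); model, target_loader, and the 0.9 confidence threshold are illustrative assumptions:

```python
import torch

model.eval()
pseudo_examples = []
with torch.no_grad():
    for x in target_loader:            # unlabeled target-domain batches
        probs = model(x).softmax(dim=1)
        conf, pred = probs.max(dim=1)
        keep = conf > 0.9              # trust only confident predictions
        pseudo_examples.append((x[keep], pred[keep]))
# Mix the pseudo-labeled target data into training and re-fit, gradually
# shifting the model from the source domain toward the target domain.
```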
What is Multi-task Learning (MTL) and how does it work with shared architectures?
Multi-task Learning (MTL):
Definition:
Simultaneously train a single model on multiple related tasks or specialized target domains
Examples:
Vision: Different objects, occlusions, lighting, translations
NLP: Syntactic or semantic relationships
BERT Multi-task Example:
Sentence pair classification
Single sentence classification
Question answering
Single sentence tagging
Architecture:
Shared parameters (θs): Common feature extraction layers
Task-specific parameters (θ1, θ2, θn): Individual output heads for each task
Multiple heads: Same model with different task-specific inputs & outputs
Key Benefit: Shared representations learn generalizable features across related tasks
Reference: BERT/Devlin et al. 2019
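A minimal sketch of hard parameter sharing: one shared encoder (θs) feeding one small head per task (θ1 … θn); the layer sizes and task count are illustrative:

```python
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, in_dim=300, hidden=768, classes_per_task=(2, 5, 3)):
        super().__init__()
        self.encoder = nn.Sequential(          # shared parameters (theta_s)
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(            # task-specific heads (theta_1..n)
            [nn.Linear(hidden, c) for c in classes_per_task]
        )

    def forward(self, x, task_id):
        return self.heads[task_id](self.encoder(x))  # shared features, per-task output
```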
Why can’t traditional autoregressive language modeling be used for BERT encoder pre-training, and how does Masked Language Model (MLM) solve this?
Problem with Traditional Language Modeling:
Traditional LM: Autoregressive prediction (predict next token from left-to-right only)
Requires: Causal attention masking (can only see previous tokens)
Encoder needs: Bidirectional attention for downstream tasks
Mismatch: Pre-training (unidirectional) vs. fine-tuning (bidirectional)
Masked Language Model Solution:
Keep: Bidirectional attention (no causal masking)
Create prediction task: Replace words with [MASK] tokens, predict original words
Formula: h₁, …, hₜ = Encoder(w₁, …, wₜ), then yᵢ ∼ Ahᵢ + b
Loss: Only from masked positions: pθ(x|x̃) where x̃ is masked version
Key Insight: MLM enables bidirectional pre-training that matches bidirectional fine-tuning usage.
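A minimal sketch of the MLM loss, assuming encoder is any bidirectional Transformer encoder and output_proj implements yᵢ ∼ Ahᵢ + b; mask marks the corrupted positions:

```python
import torch
import torch.nn.functional as F

def mlm_loss(encoder, output_proj, x_corrupted, x_original, mask):
    h = encoder(x_corrupted)    # (batch, T, d) hidden states, bidirectional attention
    logits = output_proj(h)     # (batch, T, vocab): y_i ~ A h_i + b
    labels = x_original.clone()
    labels[~mask] = -100        # compute loss only at masked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```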
How does BERT implement the masking strategy and what are its three-level embeddings?
BERT Three-Level Embeddings:
Token Embeddings: Subword vectors of size 768
Segment Embeddings: 2 labels (sentence A vs. sentence B, for sentence-pair tasks)
Position Embeddings: 512 positions (maximum sequence length)
Masking Strategy (applied to a random 15% of tokens; see the sketch below):
80%: Replace with [MASK] token
10%: Replace with random token
10%: Leave unchanged
Example: “I pizza to the [M]”
“pizza” = replaced with a random token
“to” = selected but left unchanged
“[M]” = replaced with the [MASK] token
Why the Mixed Strategy:
If only [MASK] replacements were used, the model could neglect representations of non-masked words; mixing in random and unchanged tokens forces it to learn from all positions and contexts.
Result: Strong bidirectional representations for downstream tasks.
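A minimal sketch of the 80/10/10 corruption of a random 15% of tokens; mask_id and vocab_size are assumed to come from the tokenizer:

```python
import torch

def corrupt(tokens, mask_id, vocab_size, p_select=0.15):
    x = tokens.clone()
    selected = torch.rand_like(tokens, dtype=torch.float) < p_select  # the 15%
    roll = torch.rand_like(tokens, dtype=torch.float)
    x[selected & (roll < 0.8)] = mask_id                   # 80%: [MASK] token
    swap = selected & (roll >= 0.8) & (roll < 0.9)         # 10%: random token
    x[swap] = torch.randint_like(tokens, vocab_size)[swap]
    # The remaining 10% of selected tokens stay unchanged but are still predicted.
    return x, selected                                     # selected = loss positions
```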
What is the alignment problem in language models and how does human feedback help solve it?
Language Modelling ≠ Assisting Users
Alignment Problem:
Language models are not aligned with user intent
Example Issues:
Prompt: “Explain the moon landing to a 6 year old in a few sentences”
GPT-3 Completions:
“Explain the theory of gravity to a 6 year old”
“Explain the theory of relativity to a 6 year old in a few sentences”
“Explain the big bang theory to a 6 year old”
“Explain evolution to a 6 year old”
Problem: Model continues the pattern rather than following the instruction
Solution: Human Feedback
Example: Human response about moon landing with rocket ship, astronauts, lunar surface exploration
Method: Fine-tuning with human feedback to the rescue!
Goal: Align model outputs with user intentions rather than just pattern completion
Key Challenge: Standard language modeling objective doesn’t match user assistance goals
What is the RLHF (Reinforcement Learning from Human Feedback) pipeline used in ChatGPT?
RLHF Pipeline (ChatGPT Method):
Three-Stage Process:
1. Collect Demonstration Data & Train Supervised Policy:
Method: Instruction tuning
Data: Human demonstrations of desired outputs
Goal: Basic instruction-following capability
2. Collect Comparison Data & Train Reward Model:
Method: Supervised training of a reward model on ranked outputs
Process: Humans rank multiple model outputs
Goal: Learn human preference function R(x,y)
3. Optimize Policy Against Reward Model:
Method: Reinforcement learning (PPO)
Goal: Maximize expected reward: E_{y~π_θ}[R(x,y)]
Constraint: Stay close to supervised model
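A minimal sketch of stage 2's pairwise reward-model loss from Ouyang et al. 2022, pushing the scalar reward of the human-preferred output y_w above the dispreferred y_l; reward_model is an assumed module returning one scalar per (prompt, output) pair:

```python
import torch.nn.functional as F

def reward_loss(reward_model, x, y_w, y_l):
    r_w = reward_model(x, y_w)              # reward for the preferred output
    r_l = reward_model(x, y_l)              # reward for the dispreferred output
    return -F.logsigmoid(r_w - r_l).mean()  # -log sigma(r_w - r_l)
```
Stage 3 then maximizes the learned reward with PPO while penalizing KL divergence from the supervised policy, which keeps the optimized model close to the stage-1 model.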
Key Innovation:
RL: handles problems that are only partially observable and lack a differentiable objective
Reward function: difficult to define by hand, so it is learned from human comparisons
Human feedback: comparatively cheap to collect at scale
PPO: optimizes in policy space within a trust region, via stochastic gradient updates
Reference: Ouyang et al./OpenAI 2022
Result: ChatGPT, built from Microsoft-licensed OpenAI GPT models, highly curated & augmented text, and roughly 40 trained labellers.