Week 10: Generative Adversarial & Diffusion Learning Flashcards

(8 cards)

1
Q

What are GANs and how do they work?

A

GAN (Generative Adversarial Network): Builds generative models through competition between two neural networks
Components:

Generator: Takes random noise z → generates fake data x to fool discriminator

Discriminator: Takes real data + fake data → classifies as real (1) or fake (0)
Training process: Adversarial game where:

Generator goal: Maximize discriminator’s mistakes (generate realistic fakes)

Discriminator goal: Minimize mistakes (correctly identify real vs fake)

Key insight: Through competition, generator learns to produce increasingly realistic data without explicit reconstruction loss
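
For reference, this game is the minimax objective from the original GAN paper (Goodfellow et al., 2014), written in the D_φ / G_θ notation used on the next card:

min_θ max_φ V(θ, φ) = E_{x~p*(x)}[log D_φ(x)] + E_{z~q(z)}[log(1 - D_φ(G_θ(z)))]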

2
Q

Describe the step-by-step training procedure for a GAN.

A

Training loop:

Initialize both networks (once, before training)
Train Discriminator (for k steps per iteration):

Sample a noise minibatch: z_m ~ q(z)
Sample a real-data minibatch: x_m ~ p*(x)
Update φ by stochastic gradient ascent on: ∇_φ [log D_φ(x_m) + log(1 - D_φ(G_θ(z_m)))]

Train Generator (one step):

Sample a fresh noise minibatch: z_m ~ q(z)
Update θ by stochastic gradient ascent on the non-saturating objective: ∇_θ log D_φ(G_θ(z_m))

Repeat for many epochs
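
A minimal PyTorch-style sketch of this loop, assuming predefined networks G (noise → data) and D (data → probability per example, i.e. with a sigmoid head), their optimizers, and a batch of real data; all names here are illustrative, not from the card:

    import torch
    import torch.nn as nn

    bce = nn.BCELoss()  # bce(p, 1) = -log p, bce(p, 0) = -log(1 - p)

    def gan_iteration(G, D, opt_G, opt_D, real, z_dim=100, k=1):
        m = real.size(0)
        ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)

        # Discriminator: k ascent steps on log D(x) + log(1 - D(G(z)))
        for _ in range(k):
            z = torch.randn(m, z_dim)          # z_m ~ q(z)
            fake = G(z).detach()               # block gradients into G
            loss_D = bce(D(real), ones) + bce(D(fake), zeros)
            opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator: one ascent step on the non-saturating log D(G(z))
        z = torch.randn(m, z_dim)
        loss_G = bce(D(G(z)), ones)            # push D's output toward 1
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()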

3
Q

Why is GAN training difficult and unstable?

A

Problem: GANs involve a minimax game between two competing networks:

Discriminator: Tries to minimize classification error (maximize ability to detect fakes)
Generator: Tries to maximize discriminator’s error (minimize discriminator’s ability to detect fakes)
Training instability: This creates oscillating dynamics instead of convergence to a solution
Unlike normal optimization (minimizing a single loss), GANs have two opposing objectives
Can lead to mode collapse (the generator covers only a few modes of the data), vanishing gradients, or failure to converge
A Nash equilibrium is hard to reach in practice: the networks keep “chasing” each other
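
A toy illustration of the oscillation (my example, not from the card): even in the simplest bilinear minimax game f(x, y) = x·y, whose equilibrium is (0, 0), simultaneous gradient descent on x and ascent on y spirals outward instead of converging:

    lr = 0.1
    x, y = 1.0, 1.0
    for _ in range(2000):
        gx, gy = y, x                        # ∂f/∂x = y, ∂f/∂y = x
        x, y = x - lr * gx, y + lr * gy      # simultaneous update
    print(x, y)  # magnitudes grow by orders of magnitude: no convergence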

4
Q

What’s the fundamental difference between how VAEs and Flow-based models approach generative modeling?

A

VAE - Indirect approach through latent semantics:

Assumption: Data comes from meaningful latent factors (age, smile, etc.)
Method: Learn encoder q(z|x) + decoder p(x|z) → optimize a lower bound (the ELBO) on p(x)
Goal: Learn interpretable latent representations
E[log p(x|z)]: “Can latent factors reconstruct original data?”

Flow - Direct distribution transformation:

No assumptions about latent structure or semantics
Method: Direct transformation z₀ → z₁ → … → x via invertible functions
Goal: Perfect density estimation p(x)
Advantage: Exact likelihood computation, no approximations needed
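
The contrast in two standard formulas, written in the card's notation:

VAE (approximate): log p(x) ≥ E_{q(z|x)}[log p(x|z)] - KL(q(z|x) ‖ p(z)) (the ELBO, a lower bound rather than the exact likelihood)
Flow (exact): log p(x) = log p(z₀) + Σᵢ log |det ∂zᵢ₋₁/∂zᵢ|, where x = z_K = f_K(… f₁(z₀) …) and every fᵢ is invertible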

5
Q

What’s the intuitive idea behind how flow-based models work?

A

Core intuition: Learn to “bend and stretch” space to transform simple distributions into complex ones

Start: Simple shape (Gaussian blob), easy to sample from
Transform: Apply a series of invertible “bends/stretches” step by step
End: Complex target distribution matching real data

Learning process:

Try transformations → check if the result matches training data → adjust the transformations
Training data acts as the “teacher” showing what the final shape should look like
Key insight: Instead of learning “what does this look like?” (VAE), flows learn “how do I transform this simple shape into that complex shape?”
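
A minimal sketch of the idea with a single affine “flow” x = a·z + b (a toy example of mine, not from the card): the transform stretches and shifts a standard Gaussian, and the change-of-variables formula gives the exact density of the result:

    import numpy as np

    a, b = 2.0, 1.0                    # hypothetical "stretch" and "shift"

    def forward(z):  return a * z + b          # invertible transform
    def inverse(x):  return (x - b) / a

    def log_px(x):
        z = inverse(x)                               # map back to base space
        log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # log N(z; 0, 1)
        return log_pz - np.log(abs(a))               # + log|det dz/dx| = -log|a|

    samples = forward(np.random.randn(10_000))       # samples from N(1, 4)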

6
Q

What’s the intuitive idea behind how diffusion models work?

A

Core intuition: Learn to reverse a gradual noise corruption process
The process:

Forward (corruption): Gradually add noise to real data until it becomes pure random noise

Reverse (generation): Learn to gradually remove noise step-by-step to recover clean data

Like photo restoration:

Imagine watching a photo slowly fade into static noise
Train a model to reverse this process: static → slightly cleaner → cleaner → … → perfect photo

Key insight: Instead of generating data directly, learn the “denoising recipe” - how to gradually clean up noise into realistic data

Training: Show the model noisy versions of real data and teach it to predict the previous (less noisy) step
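
In standard DDPM notation (Ho et al., 2020), the two processes the card describes can be written as:

Forward (fixed, not learned): q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ) xₜ₋₁, βₜI), with a small variance schedule βₜ
Closed form from the clean image: q(xₜ | x₀) = N(xₜ; √ᾱₜ x₀, (1-ᾱₜ)I), where ᾱₜ = (1-β₁)(1-β₂)…(1-βₜ)
Reverse (learned): p_θ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), σₜ²I)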

7
Q

How are diffusion models trained?

A

Training objective: Learn to predict what was removed at each denoising step

Training procedure:

Take real image x₀ from dataset
Pick random timestep t (how much noise to add)

Add noise: Create noisy version xₜ using known forward process
Train model: Learn to predict the noise that was added

Loss: Compare predicted noise vs. actual noise added

Key insight: Instead of learning to reconstruct the image directly, learn to predict the noise

Why this works: If you know the noise, you can subtract it to recover the clean image
Generation: Start with pure noise → repeatedly predict and remove noise → get a clean image
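
A minimal DDPM-style training step sketch, assuming a noise-prediction network model(x_t, t) and batches of clean images x0; the schedule values are common defaults, and all names here are mine:

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # ᾱ_t = ∏ (1 - β_s)

    def training_step(model, x0):
        m = x0.size(0)
        t = torch.randint(0, T, (m,))               # random timestep per image
        eps = torch.randn_like(x0)                  # the noise to be predicted
        ab = alphas_bar[t].view(m, *([1] * (x0.dim() - 1)))
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # known forward process
        return torch.nn.functional.mse_loss(model(x_t, t), eps)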

8
Q

What is Stable Diffusion and why is it more efficient than regular diffusion models?

A

Key innovation: Combines VAE + Diffusion by doing diffusion in latent space instead of pixel space
Architecture:

VAE Encoder: Image → compressed latent representation
Diffusion U-Net: Add/remove noise in latent space (with text conditioning via CLIP)
VAE Decoder: Clean latent → high-quality image
Efficiency gain:

Pixel space: 512×512×3 = ~786k dimensions
Latent space: 64×64×4 = ~16k dimensions (~48x smaller)
Benefits: Dramatically faster training/generation while maintaining image quality
Bottom line: “Best of both worlds” - VAE compression + diffusion quality + computational efficiency
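
A quick sanity check of the dimension arithmetic (64×64×4 is Stable Diffusion v1's latent shape for 512×512 images):

    pixel_dims  = 512 * 512 * 3      # 786,432 values per RGB image
    latent_dims = 64 * 64 * 4        # 16,384 values in the latent
    print(pixel_dims / latent_dims)  # 48.0 → the "~48x" figure above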
