Week 9: Variational Generative Models Flashcards
(13 cards)
What is the difference between a variational autoencoder (VAE) and a standard autoencoder (AE)?
An autoencoder:
A blurrier reconstruction indicates a stronger bottleneck: the encoding is forced to keep only what is most crucial.
We usually train this end to end in a very simple way: the target output is simply the input itself.
We compute the reconstruction loss: the difference between the input image and the reconstructed output, typically measured with mean squared error (MSE), as in the sketch below.
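A minimal PyTorch sketch of such an end-to-end autoencoder (the 784-dimensional flattened input, layer sizes, and 32-dimensional bottleneck are illustrative assumptions, not from the slides):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Plain autoencoder: deterministic bottleneck, trained end to end on MSE."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)       # compressed latent code (the bottleneck)
        return self.decoder(z)    # reconstruction of the input

x = torch.rand(16, 784)           # dummy batch of flattened images
recon = AutoEncoder()(x)
loss = F.mse_loss(recon, x)       # reconstruction loss: MSE between input and output
```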
Variational autoencoder (probabilistic):
Keep the architecture.
The latent code (the compressed representation) is now treated as a set of Gaussian distributions, defined by a mean vector μ and a variance vector σ.
From this latent code we can now sample, rather than use a single fixed vector.
Training: again we take the reconstruction loss, and add a regularization (KL) term.
What would happen if we were to sample something new with an AE and with a VAE?
With a VAE we can generate new samples; with an AE we cannot (or the output would be garbage).
Q: How does the density (dimensionality) of latent coding affect reconstruction quality?
Less dense (lower dimension): More compression → More blurry reconstructions (less room to store details)
More dense (higher dimension): Less compression → Sharper reconstructions (more room for fine details)
This is the fundamental compression vs. quality trade-off
What is posterior probability and why do we want it?
Posterior probability p(θ|data) represents what we know about model parameters θ after observing the data. We want it because it helps us make better predictions by:
Quantifying uncertainty about parameters
Considering ALL plausible parameter values (not just one “best” guess)
Making more robust, less overconfident predictions
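In standard Bayesian notation (not specific to these slides), the posterior and the prediction it enables are:

```latex
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'},
\qquad
p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
```

The prediction averages over all plausible θ, weighted by how well each value explains the data.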
Why is exact inference often intractable?
A: Exact inference requires evaluating integrals such as the posterior predictive p(y|x,D) = ∫ p(y|x,θ) p(θ|D) dθ (and the evidence integral in the denominator of Bayes' rule), which:
Requires checking ALL possible values of θ
Becomes exponentially expensive for complex models
Results in NP-hard computational problems
Is impossible to compute exactly in practice
Why do VAEs typically use Gaussian priors?
Easy sampling: Simple to generate random samples
Mathematical convenience: KL between Gaussians has closed-form solution
Theoretical justification: Central Limit Theorem, common in Bayesian ML
Nice properties: Differentiable, well-understood mathematically
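A sketch of that closed-form KL for a diagonal Gaussian q = N(μ, σ²) against the prior N(0, I), assuming (as is common) that the encoder outputs log σ² for numerical stability:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    # Closed form: -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)   # mu = 0, sigma = 1
print(kl_to_standard_normal(mu, logvar))            # zeros: q already equals the prior
```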
What is the ELBO in VAEs and what are its components?
ELBO (Evidence Lower BOund) = E[log p(x|z)] - KL(q(z|x)||p(z))
Reconstruction term E[log p(x|z)]: How well decoder reconstructs input
KL term KL(q(z|x)||p(z)): How close encoder output is to prior
Together they approximate the intractable posterior
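A sketch of the negative ELBO as a training loss, assuming a Bernoulli decoder (so the reconstruction term becomes binary cross-entropy) and the diagonal-Gaussian encoder from above; the names are illustrative:

```python
import torch
import torch.nn.functional as F

def negative_elbo(decoder_out, x, mu, logvar):
    # Reconstruction term: -E[log p(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy(decoder_out, x, reduction="sum")
    # KL term: KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl    # minimizing this sum = maximizing the ELBO
```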
How do VAEs solve the intractable posterior problem?
Instead of computing intractable p(z|x), VAEs:
Approximate: Use encoder to output q(z|x) ≈ p(z|x)
Optimize ELBO: Balance reconstruction quality + prior matching
Mathematical trick: Maximizing ELBO indirectly approximates the true posterior through the combination of both terms
How does generation work in a trained VAE?
Sample: Draw random z ~ N(0,I) from prior
Decode: Pass z through decoder network
Output: Get generated data sample
Why it works: Training forced encoder outputs to match this same prior distribution, so decoder has seen similar latent codes before
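A sketch of the three generation steps; the decoder here is an untrained stand-in with assumed sizes, just to show the mechanics:

```python
import torch
import torch.nn as nn

latent_dim = 8
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())   # stand-in for a trained decoder

z = torch.randn(16, latent_dim)      # 1) Sample: z ~ N(0, I) from the prior
with torch.no_grad():
    samples = decoder(z)             # 2) Decode: pass z through the decoder
print(samples.shape)                 # 3) Output: torch.Size([16, 784]) generated samples
```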
Why can’t we backpropagate gradients through sampling layers in VAEs? (stochasticity)
Problem: The sampling operation z ~ N(μ, σ) involves randomness, which is not differentiable.
Regular autoencoder: Input → z (deterministic) → Output ✅ Gradients flow
VAE: Input → μ, σ → z ~ N(μ, σ) → Output ❌ Sampling breaks gradient flow
Result: Network can’t learn how to adjust μ, σ to improve loss because gradients can’t pass through the random sampling step
We use the reparameterization trick to handle stochasticity in VAE training.
The Problem: In VAEs, the sampling operation z ~ N(μ, σ) is stochastic, which breaks gradient flow during backpropagation. Gradients cannot pass through random sampling operations, so the network cannot learn to optimize the encoder parameters μ and σ.
The Solution: Instead of sampling z directly from N(μ, σ), we reparameterize as:
Sample ε ~ N(0, I) (standard Gaussian noise)
Compute z = μ + ε ⊙ σ (deterministic transformation)
Why This Works:
z still follows the same distribution N(μ, σ)
But now z is a deterministic function of μ and σ
The randomness is isolated in ε (which doesn’t need gradients)
Gradients can flow through the deterministic path: ∂z/∂μ = 1, ∂z/∂σ = ε
This transforms z from a stochastic node into a deterministic function, enabling proper backpropagation while preserving the probabilistic nature of the VAE.
Key Points to Hit:
Problem: Stochastic sampling breaks gradients
Solution: Reparameterization trick
How: z = μ + ε ⊙ σ instead of z ~ N(μ, σ)
Why it works: Makes z deterministic while keeping same distribution
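A minimal sketch of the trick, assuming the encoder outputs μ and log-variance; the backward pass confirms gradients reach both:

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)    # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)      # eps ~ N(0, I): all randomness lives here
    return mu + eps * std            # z ~ N(mu, sigma^2), deterministic in mu and sigma

mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                                     # gradients flow through the deterministic path
print(mu.grad is not None, logvar.grad is not None)    # True True
```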
How can we disentangle the latent space?
We can add a weight β to the KL term (the β-VAE approach).
What the KL Term Really Does:
KL(q(z|x)||p(z)) measures how far your encoded latent codes are from the standard Gaussian prior N(0,I).
Think of β as Your “Deviation Budget”:
Low β (β = 1):
“You have a BIG budget to deviate from N(0,I)”
Model can spread features all over the latent space
Result: Wasteful, messy encoding - multiple features mixed in each dimension
High β (β = 250):
“You have a TINY budget to deviate from N(0,I)”
Model must be very careful about how it uses latent space
Forced efficiency: Can’t afford to waste dimensions
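A sketch of how the weight enters the objective (β-VAE style); recon and kl are assumed to be the two ELBO terms computed as above, and the numbers are made up for illustration:

```python
def beta_vae_loss(recon, kl, beta=1.0):
    # beta scales the "deviation budget": larger beta -> stronger pressure
    # to stay near N(0, I), which encourages disentangled latent dimensions
    return recon + beta * kl

print(beta_vae_loss(recon=120.0, kl=3.5, beta=1.0))     # standard VAE objective
print(beta_vae_loss(recon=120.0, kl=3.5, beta=250.0))   # heavily constrained (beta-VAE)
```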
How do VAEs compare to normalizing flow models?
A:
VAE - Indirect approach through latent semantics:
Assumption: Data comes from meaningful latent factors (age, smile, etc.)
Method: Learn encoder q(z|x) + decoder p(x|z) → combine to get p(x)
Goal: Learn interpretable latent representations
E[log p(x|z)]: “Can latent factors reconstruct original data?”
Flow - Direct distribution transformation:
No assumptions: About latent structure or semantics
Method: Direct transformation z₀ → z₁ → … → x via invertible functions
Goal: Perfect density estimation p(x)
Advantage: Exact likelihood computation, no approximations needed
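The "exact likelihood" claim for flows comes from the change-of-variables formula; for an invertible map x = f(z) with base density p(z):

```latex
\log p(x) = \log p(z) - \log \left| \det \frac{\partial f}{\partial z} \right|,
\qquad z = f^{-1}(x)
```

Every transformation must be invertible with a tractable Jacobian determinant, which is what makes exact density evaluation possible without an ELBO bound.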