Week 9: Variational Generative Models Flashcards

(13 cards)

1
Q

What is the difference between a variational autoencoder and a standard autoencoder?

A

An autoencoder:

The blurrier the reconstruction, the stronger the bottleneck: we force the encoding to keep only what is most crucial.

Training is done in a very simple way, end to end, by comparing the output against the input.

We calculate the reconstruction loss: the difference between how the input image looked and how the output looks. Mean squared error is typically used here.

Variational autoencoder (probabilistic):

Keep the same architecture.
The latent code (the compressed image) is now a bunch of Gaussian distributions, each defined by a mean vector μ and a variance vector σ.
From this latent code we can now sample, and decode the sample back into an image.

Training again uses the reconstruction loss, plus a regularization (KL) term. A sketch of both forward passes follows below.
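A minimal PyTorch-style sketch of the two forward passes; the layer sizes and toy modules here are assumptions for illustration, not the course's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical toy modules; sizes are assumptions, not the course's architecture.
encoder     = nn.Sequential(nn.Flatten(), nn.Linear(784, 32))       # AE: one code vector
vae_encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 2 * 32))   # VAE: mu and log-variance
decoder     = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())

x = torch.rand(8, 1, 28, 28)  # a batch of fake 28x28 images

# Standard autoencoder: deterministic latent code.
z_ae     = encoder(x)
x_hat_ae = decoder(z_ae)

# VAE: the encoder outputs a Gaussian per input; the latent code is a sample from it.
mu, log_var = vae_encoder(x).chunk(2, dim=-1)
z_vae       = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
x_hat_vae   = decoder(z_vae)
```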

2
Q

What would happen if we were to sample something new from an AE and a VAE?

A

With a VAE we can generate: decode a draw from the prior. With a plain AE we cannot; its latent space has no enforced structure, so decoding a random latent vector mostly produces garbage.

3
Q

How does the density (dimensionality) of the latent code affect reconstruction quality?

A

Less dense (lower dimension): More compression → More blurry reconstructions (less room to store details)
More dense (higher dimension): Less compression → Sharper reconstructions (more room for fine details)
This is the fundamental compression vs. quality trade-off

4
Q

What is posterior probability and why do we want it?

A

Posterior probability p(θ|data) represents what we know about model parameters θ after observing the data. We want it because it helps us make better predictions by:

Quantifying uncertainty about parameters
Considering ALL plausible parameter values (not just one “best” guess)
Making more robust, less overconfident predictions
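In symbols, using the card's notation with D for the data (standard Bayes' rule and posterior predictive):

p(θ|D) = p(D|θ) p(θ) / p(D)
p(y|x, D) = ∫ p(y|x,θ) p(θ|D) dθ — the prediction averages over all plausible θ, weighted by the posterior.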

5
Q

Why is exact inference often intractable?

A

Computing the exact posterior p(θ|D), and the predictions built on it, requires evaluating integrals like ∫ p(y|x,θ) p(θ|D) dθ, which:

Requires integrating over ALL possible values of θ
Becomes exponentially expensive for complex, high-dimensional models
Leads to NP-hard computational problems in general
Is intractable to compute exactly in practice
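Written out, the bottleneck is that the posterior's normalizing constant (the evidence) is itself an integral over every possible θ:

p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ′) p(θ′) dθ′

For a high-dimensional θ there is no closed form, and grid-style numerical integration grows exponentially with the number of dimensions.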

6
Q

Why do VAEs typically use Gaussian priors?

A

Easy sampling: Simple to generate random samples
Mathematical convenience: KL between Gaussians has closed-form solution
Theoretical justification: Central Limit Theorem, common in Bayesian ML
Nice properties: Differentiable, well-understood mathematically
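The closed form the card refers to, for a diagonal Gaussian q(z|x) = N(μ, σ²) against the prior p(z) = N(0, I):

KL(N(μ, σ²) || N(0, I)) = ½ Σᵢ (μᵢ² + σᵢ² − log σᵢ² − 1)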

7
Q

What is the ELBO in VAEs and what are its components?

A

ELBO (Evidence Lower BOund) = E[log p(x|z)] - KL(q(z|x)||p(z))

Reconstruction term E[log p(x|z)]: How well decoder reconstructs input
KL term KL(q(z|x)||p(z)): How close encoder output is to prior
Maximizing the two together makes q(z|x) approximate the intractable true posterior p(z|x)
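A minimal sketch of the negative ELBO as a training loss, using a summed MSE for the reconstruction term (the course may use a different likelihood, e.g. binary cross-entropy):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO for a Gaussian encoder q(z|x) = N(mu, sigma^2) and prior N(0, I)."""
    # Reconstruction term E[log p(x|z)], approximated here with a summed MSE.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL term KL(q(z|x) || p(z)), closed form for diagonal Gaussians.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
    return recon + kl  # minimizing this maximizes the ELBO
```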

8
Q

How do VAEs solve the intractable posterior problem?

A

Instead of computing intractable p(z|x), VAEs:

Approximate: Use encoder to output q(z|x) ≈ p(z|x)
Optimize ELBO: Balance reconstruction quality + prior matching
Mathematical trick: Maximizing ELBO indirectly approximates the true posterior through the combination of both terms
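The identity behind this trick (standard result):

log p(x) = ELBO + KL(q(z|x) || p(z|x))

Since log p(x) is fixed for a given x, pushing the ELBO up necessarily pushes KL(q(z|x) || p(z|x)) down, i.e. q(z|x) moves toward the true posterior.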

9
Q

How does generation work in a trained VAE?

A

Sample: Draw random z ~ N(0,I) from prior

Decode: Pass z through decoder network

Output: Get generated data sample

Why it works: Training forced encoder outputs to match this same prior distribution, so decoder has seen similar latent codes before
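A minimal sketch of generation; the decoder architecture and latent size here are placeholder assumptions standing in for the trained model:

```python
import torch
import torch.nn as nn

latent_dim = 32
# Stand-in for the trained VAE decoder (the real one comes from training).
decoder = nn.Sequential(nn.Linear(latent_dim, 784), nn.Sigmoid())

z     = torch.randn(16, latent_dim)  # Sample: z ~ N(0, I), the same prior used in training
x_gen = decoder(z)                   # Decode: 16 generated samples, one per row
```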

10
Q

Why can’t we backpropagate gradients through sampling layers in VAEs? (stochasticity)

A

Problem: The sampling operation z ~ N(μ, σ) involves randomness, which is not differentiable.

Regular autoencoder: Input → z (deterministic) → Output ✅ Gradients flow
VAE: Input → μ, σ → z ~ N(μ, σ) → Output ❌ Sampling breaks gradient flow
Result: Network can’t learn how to adjust μ, σ to improve loss because gradients can’t pass through the random sampling step

11
Q

How does the reparameterization trick handle stochasticity in VAE training?

A

The Problem: In VAEs, the sampling operation z ~ N(μ, σ) is stochastic, which breaks gradient flow during backpropagation. Gradients cannot pass through random sampling operations, so the network cannot learn to optimize the encoder parameters μ and σ.
The Solution: Instead of sampling z directly from N(μ, σ), we reparameterize as:

Sample ε ~ N(0, I) (standard Gaussian noise)
Compute z = μ + ε ⊙ σ (deterministic transformation)

Why This Works:

z still follows the same distribution N(μ, σ)
But now z is a deterministic function of μ and σ
The randomness is isolated in ε (which doesn’t need gradients)
Gradients can flow through the deterministic path: ∂z/∂μ = 1, ∂z/∂σ = ε

This transforms z from a stochastic node into a deterministic function, enabling proper backpropagation while preserving the probabilistic nature of the VAE.
Key Points to Hit:

Problem: Stochastic sampling breaks gradients
Solution: Reparameterization trick
How: z = μ + ε ⊙ σ instead of z ~ N(μ, σ)
Why it works: Makes z deterministic while keeping same distribution
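A minimal sketch of the trick as it is commonly coded; parameterizing the encoder output as a log-variance is a common convention assumed here:

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + eps * sigma with eps ~ N(0, I): same distribution as N(mu, sigma^2),
    but z is now a deterministic function of mu and sigma, so gradients can flow."""
    std = torch.exp(0.5 * log_var)  # sigma
    eps = torch.randn_like(std)     # randomness isolated here; needs no gradient
    return mu + eps * std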

12
Q

How can we disentangle the latent space?

A

We can add a weight β to the KL term (the β-VAE approach).

What the KL Term Really Does:
KL(q(z|x)||p(z)) measures how far your encoded latent codes are from the standard Gaussian prior N(0,I).
Think of β as Your “Deviation Budget”:
Low β (β = 1):

“You have a BIG budget to deviate from N(0,I)”
Model can spread features all over the latent space
Result: Wasteful, messy encoding - multiple features mixed in each dimension

High β (β = 250):

“You have a TINY budget to deviate from N(0,I)”
Model must be very careful about how it uses latent space
Forced efficiency: Can’t afford to waste dimensions
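A minimal sketch of the β-weighted loss; the default β value is just an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """VAE loss with the KL term weighted by beta; beta = 1 recovers the standard VAE."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)
    return recon + beta * kl  # larger beta = smaller budget to deviate from N(0, I)
```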

13
Q

What is the difference between a VAE and a normalizing flow model?

A

VAE - Indirect approach through latent semantics:

Assumption: Data comes from meaningful latent factors (age, smile, etc.)
Method: Learn encoder q(z|x) + decoder p(x|z) → combine to get p(x)
Goal: Learn interpretable latent representations
E[log p(x|z)]: “Can latent factors reconstruct original data?”

Flow - Direct distribution transformation:

No assumptions: About latent structure or semantics
Method: Direct transformation z₀ → z₁ → … → x via invertible functions
Goal: Perfect density estimation p(x)
Advantage: Exact likelihood computation, no approximations needed
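The exactness comes from the change-of-variables formula for invertible transformations: for a flow x = f_K(…f₁(z₀)…) with z₀ ~ N(0, I),

log p(x) = log p(z₀) − Σₖ log |det ∂fₖ/∂zₖ₋₁|

so p(x) can be evaluated exactly, with no lower bound or approximate posterior needed.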
