Topic 9: Variational Generative Models Flashcards

(27 cards)

1
Q

What is encoding (PCA and FA) and latent variables?

A

Encoding in a latent space means reducing the dimensionality of the data (using e.g. PCA or Factor Analysis), learning a compressed latent representation, or learning the structure of manifolds in the data.

PCA is a way to describe the data. It creates new variables, the principal components, which are linear combinations of the original variables and are chosen to capture as much of the variance in the data as possible.

Factor Analysis is inherently different. It assumes that the observed variables are generated by underlying latent variables: the latent variables are not directly observed, but are inferred as the hidden causes of the patterns in the data.
So we end up with components that can describe or generate data points.

While PCA describes data by capturing directions of highest variance, Factor Analysis models the hidden causes behind the data using latent variables. This makes FA better suited for building models that reflect real-world generative processes, including mixtures of observable and unobservable factors.

Latent variables: underlying, unobserved features that capture the structure of the data. We care about capturing the latent variables that give rise to the observations we make.

2
Q

What is clustering by partitioning?

A

We have arbitrary data points scattered in a space of arbitrary dimensionality.
The goal is to find a partitioning of these data points, based on how they might cluster.
Partitioning approach: we assign each point to exactly one cluster. Each cluster is defined by a centroid, the central point of the cluster. We want to minimise the sum of squared distances between each point and its cluster centroid, m_k; this sum is the error function E.
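Written out (a standard form of this objective; C_k denotes the set of points assigned to cluster k and x_n a data point, notation not from the card itself):

E = \sum_{k=1}^{K} \sum_{x_n \in C_k} \lVert x_n - m_k \rVert^2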
We can evaluate the clustering using:
- inter-cluster distance: the distance between data points belonging to different clusters
- intra-cluster distance: the distance between data points within the same cluster

This approach is used in k-means clustering

2
Q

What is k-means clustering?

A

Each cluster is represented by its centroid μ_k, which is the mean of the cluster’s points. Each data point is assigned to the closest centroid.
1. Choose k
2. Partition the points into k non-empty subsets
3. Compute the cluster centroids
4. Assign each object to the cluster with the nearest centroid
5. Did the assignments change? If so, continue from step 3; if not, we’re done
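A minimal NumPy sketch of this loop (illustrative only; the function name and the assumption that no cluster ends up empty are mine, not from the card):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive k-means sketch; X is an (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and initialise the centroids from k random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        # Step 4: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Step 5: stop once the assignments no longer change
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
    return centroids, assignments
```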

3
Q

What is Gaussian Mixture Model clustering with EM?

A

Instead of k-means clustering, where the centroid determines which cluster a point belongs to, we use distributions; a point can then belong to multiple distributions, each with a different probability.

Expectation-Maximisation:
- Each cluster is represented by a gaussian distribution
- The cluster assignment is “soft”, meaning each data point has a probability of belonging to each cluster. These clusterings are also inferred from the data
- The true cluster assignments are latent variables, so they’re hidden, but inferred from the data.

Then we can reuse the k-means algorithm (changing it a bit):
1. Choose k
2. Initialise k random components
3. E-step: compute a distribution on labels of the points, using current parameters
4. M-step: update the parameters using current guess of label distribution
5. Did the log-likelihood change? if so, back to step 3, if not, you’re done

We then end up with distributions that are clearly clustered

3
Q

How do you find probabilistic clusters?

A

To find probabilistic clusters, we assume that the data is generated from an underlying mixture of probability distributions, such as a Gaussian Mixture Model (GMM). This approach is related to Factor Analysis, since both model the data using latent (hidden) variables.

Approach: You model the complex data with less complex probability density distributions, e.g. GMM

The idea:
- A probability distribution can describe how much a point belongs to one or the other cluster
- This means a point can belong to one or more clusters at once, so we model the complex data with a probability distribution

The modelling:
- We define the data distribution as a GMM
- Each Gaussian has its own mean, μ, and its own variance, σ². The model is parametrised by θ = all the means, variances, and mixture weights.
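Concretely (the standard GMM density; π_k are the mixture weights, a notation I am assuming rather than one stated on the card):

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2), \qquad \theta = \{\pi_k, \mu_k, \sigma_k\}_{k=1}^{K}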

Estimating the model:
- We use maximum likelihood estimation (MLE) to fit the model to the data
- This is intractable, because the cluster assignments are latent variables (not directly observed)
- We’d also have to search over all the possible parameter combinations in the continuous space, which is computationally infeasible

The solution:
- We use the EM-algorithm to overcome the intractability
- E-step: Estimate how likely each point belongs to each cluster (i.e., compute soft assignments using the current parameters).
- M-step: Update the Gaussian parameters to better fit the data, weighted by those soft assignments.

3
Q

How do you derive the GMM using EM?

A

GMMs are derived using the EM algorithm by treating cluster assignments as latent variables. In the E-step, we compute soft assignments (responsibilities), and in the M-step, we use them to re-estimate the model parameters. This iterative process maximises the likelihood of the observed data and forms the basis for more advanced models like VAEs.

At the expectation step:
- We find the responsibility of cluster k for generating data point n, i.e. we evaluate the posterior probability that each data point came from a particular cluster

At the maximisation step:
- We maximise the log-likelihood for the complete data, using the posterior probabilities as cluster-specific weights on the data points to separately re-estimate each cluster model (mixing weight π, mean and variance)
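In formulas (the standard GMM-EM updates; r_nk denotes the responsibility of cluster k for data point x_n, N the number of points, and the variances are updated analogously):

E-step: \quad r_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j^2)}

M-step: \quad \pi_k = \frac{1}{N}\sum_{n} r_{nk}, \qquad \mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}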

This is the underlying method behind the VAE.

3
Q

What is an autoencoder?

A

An AutoEncoder tries to self encode the input data it sees.

Example using MNIST:
We feed the image in as raw data and pass it through a deep neural network (say, convolutional layers). The AE takes these successive layers and outputs a low-dimensional latent space (a representation of the data), which is what we are trying to model and predict. The data x is compressed to a representation z (the latent code, or bottleneck), and we then use z as a signal to learn these features in an unsupervised way, by reconstructing the input.
Pipeline:
Input —> Encoder —> Bottleneck —> Decoder —> Reconstructed output
Encoder: Maps the input into a set of low-dimensional latent variables (a compressed version).
Bottleneck (latent code): This is z, the compressed representation from the encoder. This is the set of low-dimensional latent variables, that we will use to feed the decoder to reconstruct the output. The lower the dimensionality we get in, the more we are going to compress and the worse we are going to do in the reconstruction/decoder step.
Decoder: Reconstructs the original input from the latent code.
Reconstruction loss: A comparison of the original input to the reconstructed output. We want to minimise the distance between the input and the reconstructed output. We can use a mean squared error between the input and the reconstructed output

NB: the AE is deterministic; what gets fed into the network determines exactly what we get out. It only reconstructs the given sample, it does not generate new ones.
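A minimal sketch of this pipeline in PyTorch (illustrative; the layer sizes and class name are arbitrary choices of mine, not from the card):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps the input to a low-dimensional latent code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),          # bottleneck
        )
        # Decoder: reconstructs the input from z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # deterministic: same x always gives same z
        return self.decoder(z)

# Reconstruction loss: mean squared error between input and reconstruction, e.g.
# x_hat = model(x); loss = nn.functional.mse_loss(x_hat, x)
```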

3
Q

What is the point of generative modelling?

A

Generative modelling aims to learn the underlying data distribution so we can generate new samples, detect structures, and perform intelligent reasoning about data. This includes:

Generating data for purposes like augmentation, cross-domain learning, or even creating novel content (e.g., art or faces).

Density estimation, helping with dimensionality reduction and outlier detection, e.g., identifying rare driving conditions in car data.

Representation learning, where the model learns useful internal features from data, often in an unsupervised way.

Imputation, which includes filling in missing values (interpolation) or projecting onto learned structures.

Latent space interpolation, which allows us to smoothly transition between data points in a compressed representation—useful for de-biasing, synthesis, and data balancing.

3
Q

What types of generative models exist?

A

Probabilistic graphical models: by using neural networks as our representation space, we would have deep generative models, such as:
- Autoregressive models (Transformers, RNNs): Learn the conditional of each variable given the past (generate data one step at a time, by predicting each next element based on what has come before)
- Variational Autoencoders (VAEs): Maximises the variational lower bound (learns to compress data into a lower-dimensional latent space, then reconstruct it. They don’t compute exact probabilities, but instead approximate them using a technique called the variational lower bound.)
- Generative Adversarial Networks (GANs): Adversarial training (A generator that creates fake data. A discriminator that tries to distinguish real from fake.
They compete against each other, improving until the generator produces highly realistic data.)
- Flow-based models: Invertible transform of distributions (learn an invertible mapping from simple noise (like a Gaussian) to complex data. Because the transformation is reversible and smooth, you can compute the exact probability density of any sample.)
- Diffusion models: Gradually add Gaussian noise and then reverse the process (these models add noise step-by-step to data until it becomes pure noise, like a blurred-out image, then learn how to reverse that process, step-by-step, to recover the original data.)

The key features are:
- All models support sampling (can generate data).
- Some give exact density estimates (Autoregressive, Flow), others use approximation (VAE, GAN).
- The choice depends on trade-offs between sample quality, training stability, and likelihood estimation.

4
Q

What is a Variational Autoencoder (VAE)?

A

A VAE introduces diversity in the sense that we can generate new data samples. VAEs differ from AEs by introducing some randomness (stochasticity): the VAE is a probabilistic twist on the AE.

Instead of z being purely deterministic, we introduce some sampling (a notion of stochasticity). Instead of learning the latent variables directly, we parametrise each latent variable as a probability distribution, defined by a mean, μ, and a standard deviation, σ. We learn these vectors of means and standard deviations separately, so that we get a distribution over each of the latent variables in our latent space.

We can then sample from these μ and σ to produce new data instances.

Encoder: q_φ(z|x), compute the prob. dist. of the latent variables given the input data (weights are φ)
Decoder: p_θ(x|z), compute the prob. dist. of the data given the latent variables (weights are θ)

The training can be done end-to-end with one loss function. It’s a function of the input data and these two sets of weights:
L(φ, θ, x) = reconstruction loss + regularisation term

Regularisation term: we’re trying to learn a probability distribution over the latent variables, given the data x (the encoder). We make an initial prior guess about what the distribution of these latent variables should look like, and then infer and enforce latent variables that follow this prior. To do so, we need a distance between the distribution of the latent variables and our prior for what that distribution should look like:
D(q_φ(z|x)||p(z))
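Putting the two terms together (one standard way to write this loss, i.e. the negative ELBO; the first term is the reconstruction loss, the second the regularisation term):

L(\phi, \theta, x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)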

4
Q

What are good priors?

A

A prior is the assumed distribution of the latent variables. Since we typically don’t know the exact structure of the latent space beforehand, we often use a default prior, most commonly a standard Normal distribution: 𝑝(𝑧)=𝑁(μ=0, σ^2=1)
- Encourages the encoder to locate the latent variables evenly and smoothly around the centre of the latent space
- We use a distance metric (KL-divergence) to penalise the network if it tries to memorise the data by making tight clusters instead of encouraging a smooth continuous distribution

The goal of using a prior is to:
- Regularise the latent space
- Encourage smooth and continuous encodings
- Ensure even spread around the center of the latent space.

A good prior prevents the model from collapsing all latent points into small clusters and instead encourages them to be evenly distributed. The regularization term in the VAE loss measures how different the inferred latent distribution is from the prior and penalizes deviation from it.

This helps ensure that when we later sample new latent vectors (e.g., for generation), they fall in a region of the latent space that the decoder has learned to handle well.

4
Q

Why do we use regularisation in VAEs?

A

The Purpose of Regularisation:
Continuity: points that are close to each other in the latent space -> correspond to similar content after decoding

Completeness: sampling from anywhere in the latent space -> we should still be able to decode some meaningful content

Without regularisation:
Discontinuity: two points that are close in the latent space can have completely different meanings and are not decoded similarly in the original data space, which we don’t want.
Incompleteness: drawing a sample from the latent space, we could end up with data that is not meaningful

With regularisation:
Encourage the closeness, where points that end up close in the latent space, are semantically related.
We can sample from anywhere in this grid of latent space, and still get a meaningful data instance out.

This makes VAEs particularly powerful for generative modeling, interpolation, and representation learning.

4
Q

Give an example of a regularised VAE

A

An example of a regularized VAE is one trained on face images. In this model:
- The encoder maps the input image 𝑥 to a probabilistic latent code by predicting a mean 𝜇 and standard deviation 𝜎 of a Gaussian distribution.
- A latent variable 𝑧 is sampled from this distribution using the reparameterization trick, and then passed into the decoder to reconstruct the image.
- The training objective includes regularization, via the KL divergence between the learned distribution 𝑞(𝑧∣𝑥)
and a simple prior 𝑝(𝑧)=𝑁(0,𝐼)

This regularisation forces the latent space to be smooth and continuous, allowing:
- Interpolation between latent codes (e.g., morphing one face into another)
- Sampling meaningful new images

Compared to a standard Autoencoder (AE), a VAE is a true generative model. If we sample from an AE’s latent space, it often produces garbage because it has no structured latent prior. But a VAE’s structure ensures that sampled latent vectors decode into plausible, coherent images.

While reconstructions from VAEs tend to look blurry, this is due to the probabilistic smoothing imposed by the Gaussian assumption in the latent space. For example, it may struggle to sharply model discrete attributes like hats or glasses, which don’t follow a clean Gaussian distribution.

4
Q

What does the latent code represent?

A

The latent code in a VAE represents hidden or abstract features that capture the essential characteristics of the input data. These are not directly observed but are learned during training to compress and encode important factors of variation.

For example, in the case of images, the latent code might represent visual attributes such as texture smoothness, color intensity (e.g., redness), or object shape. These features are not explicitly labeled, but the model learns to organize them in a way that allows accurate reconstruction of the input from the latent space.

Each dimension of the latent code can correspond to a specific feature or combination of features, and the overall vector forms a compact, meaningful representation of the data instance.

4
Q

Why is exact inference intractable?

A

Inference is what we actually want; it is why we build a model. We would like to build a model of the characteristics of the world, describing the latent factors, which hopefully lets us extrapolate and generalise from the very few observations we made during data collection (that’s the point of the model).
Posterior probability: with a posterior probability we can infer, from a given observation, data point or data collection, a specific latent variable θ. This lets us, e.g.:
- summarise what we actually know from our domain, in particular the data collection we have done
- quantify the uncertainty for a specific data point

But computing the posterior predictive distribution by marginalisation is intractable: it requires an exponential amount of time and is therefore an NP-hard problem.
Therefore, instead of computing exact inference, we use approximate inference techniques such as MLE, MAP and KL-divergence-based (variational) approximation

4
Q

What is KL-Divergence?

A

In training a VAE, KL divergence helps us compare the distribution of the latent variables learned from the data with a clean, simple prior distribution. It provides a reference point, “how far off is what I’ve learned from what I wanted?”, and that helps keep the latent space organized, smooth, and interpretable.
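As a formula (the standard definition; the closed form on the second line assumes a one-dimensional Gaussian posterior q = N(μ, σ²) and a standard Normal prior p = N(0, 1), the usual VAE setup):

D_{KL}(q \,\|\, p) = \mathbb{E}_{z \sim q}\!\left[\log \frac{q(z)}{p(z)}\right]

D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)\big) = \tfrac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right)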

4
Q

Talk about marginal likelihood being intractable

A

In a VAE, the ideal goal is to maximise the marginal likelihood of the data by integrating over all possible latent variables. However, this requires computing the true posterior 𝑝(𝑧∣𝑥), which is intractable due to the complexity of marginalizing over latent space.
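In symbols (the standard marginal likelihood and posterior; both require the integral over all of z, which is what makes them intractable):

p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad p(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}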

To overcome this, VAEs introduce an approximate posterior 𝑞(𝑧∣𝑥), implemented by a neural network called the encoder or inference network. This approximation makes training tractable by optimizing a lower bound (the ELBO), which includes a reconstruction term and a KL divergence regularization.

The encoder learns to approximate the true posterior distribution, enabling efficient inference and allowing the decoder to generate meaningful samples from the latent space.

Clustering analogy: finding the cluster centroids is a simple way of finding a decision boundary; the E-step in EM computes the probability of each cluster, which describes how the latent factors align with the data.

5
Q

What is ELBO?

A

Evidence lower bound (ELBO): we use this when we want to maximise the log-likelihood.
Instead of deriving the exact posterior, we maximise a tractable lower bound on the log-likelihood.
Maximising this bound balances two components:
- maximising the (reconstruction) log-likelihood
- minimising the KL divergence between the approximate posterior and the prior

The second part of the ELBO minimises the KL divergence between the posterior and the prior. Since we usually assume the prior is a standard Gaussian distribution, minimising the KL makes the posterior more similar to the prior: we push the posterior towards a smooth Gaussian that spreads evenly through the entire latent space, which gives the model more randomness.
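In its standard form (reconstruction term minus KL term):

\log p_\theta(x) \;\ge\; \mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)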

5
Q

Intractability in marginal likelihood

A

Problem: We want to model 𝑝(𝑥), but it’s intractable.
Trick: Use an encoder 𝑞(𝑧∣𝑥) to approximate the true posterior.
Objective: Maximize the ELBO, a tractable surrogate for log𝑝(𝑥).
Tool: Use Monte Carlo sampling to estimate gradients and optimize with SGD.

5
Q

What is Monte Carlo sampling?

A

A numerical technique that uses random samples to approximate values that are hard to compute exactly, especially integrals in high dimensions.
In a VAE, it gives a simplified ELBO: instead of computing the expectation exactly, we draw samples (of the latent variable, over the data) and average, i.e. we use Monte Carlo sampling.
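A tiny, self-contained illustration (my own example, not from the slides): estimating an expectation under a Gaussian by averaging over random samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(z)] for z ~ N(0, 1) by averaging f over random samples.
# Here f(z) = z**2, whose true expectation is 1.0.
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
estimate = np.mean(samples ** 2)
print(estimate)  # ~1.0; the estimate improves as the number of samples grows
```

The same idea is what lets a VAE approximate the expectation inside the ELBO with just one (or a few) sampled z per data point.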

5
Q

How does VAE handle stochasticity?

A

In a traditional autoencoder, the operation is deterministic. We have normal neural-network layers in our encoder, and normal layers in the decoder that take the latent variables and decode them back out to generate an output. What this means is that if we feed in an input, we are going to get the same output reconstructed every time. Deterministic encoding.

We need to introduce some randomness into the autoencoder.

Instead of having a purely deterministic layer z, we are going to introduce some sampling, some notion of stochasticity. Instead of learning the latent variables directly, we parameterise each of these latent variables as a probability distribution defined by a mean μ and a standard deviation σ; we can then learn these vectors of means and stds separately, such that we get a distribution over each of the latent variables in our latent space. It is a probabilistic twist on autoencoders!

Now we can sample from these means and stds to produce new data instances.
Same concept, but with a probabilistic twist running through it.

The encoder in a VAE outputs a distribution, and to generate a latent vector z we sample from that distribution. This sampling is done at random, so backpropagation is broken by the sampling step (we cannot take gradients through a random node).

Idea: consider a sampled latent vector z as a sum of:
- a fixed vector μ
- a fixed vector σ, scaled by random constants ε drawn from a prior distribution

To achieve this, we use the reparameterisation trick.
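In symbols (the standard reparameterisation; ε carries all the randomness, while μ and σ stay deterministic and differentiable):

z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)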

6
Q

What is the reparameterisation trick?

A

We need the autoencoder to have a stochastic element, but not as part of the structure that backpropagation flows through. We can add an external node that introduces the stochasticity without breaking backpropagation: reparameterisation rewrites the stochastic sampling as a deterministic function of noise, which allows the gradients to flow through the sampling step.
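A minimal PyTorch sketch (illustrative only; the function name and the log-variance parameterisation are common conventions, not taken from the card):

```python
import torch

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) so that gradients can flow to mu and log_var."""
    sigma = torch.exp(0.5 * log_var)   # standard deviation
    eps = torch.randn_like(sigma)      # external noise node, eps ~ N(0, I)
    return mu + sigma * eps            # deterministic function of (mu, sigma, eps)
```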

7
Q

What are latent perturbations?

A

We can increase or decrease a single latent variable while keeping all others fixed.
The different dimensions of z encode different interpretable latent features.
The ideal: the latent variables are uncorrelated, so we can enforce a diagonal prior on the latent variables to encourage independence (disentanglement).
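A minimal sketch of such a latent traversal (illustrative; the decoder and latent code here are placeholder stand-ins, not a trained model):

```python
import torch

# Stand-ins for a trained decoder and an encoded latent vector (assumptions for illustration).
latent_dim = 8
decoder = torch.nn.Linear(latent_dim, 784)   # placeholder decoder
z = torch.zeros(1, latent_dim)               # placeholder latent code

# Traverse one latent dimension while keeping all others fixed.
dim = 3
for delta in torch.linspace(-2.0, 2.0, 7):
    z_perturbed = z.clone()
    z_perturbed[0, dim] += delta             # change only dimension `dim`
    x_generated = decoder(z_perturbed)       # decode and inspect how the output changes
```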

8
Q

How can we decorrelate (thus disentangle) the latent variable?

A

Decorrelating latent variables: we want to keep the variables independent; this is what disentangled means. We want disentanglement because it makes the model better at generalising.

Disentanglement means that each latent variable in a model controls an independent, interpretable aspect of the data. This makes the latent space easier to understand and manipulate. In VAEs, disentanglement can be encouraged through stronger regularization (like in β-VAE).

β-VAE: adds extra pressure/penalty on the latent space to encourage disentanglement. It has a stronger KL penalty (β > 1) that forces the model to learn more disentangled latent representations. This helps separate out different factors of variation, making the latent space more interpretable and controllable.
β-VAE is a VAE trained with a weighted KL term, adjusted by a constant in its loss.
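As a formula (the usual way β-VAE is written, i.e. the VAE loss with the KL term weighted by a constant β > 1):

L_{\beta\text{-VAE}}(\phi, \theta, x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \beta\, D_{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)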

9

Q

What are sequences with VAE?

A

VAE also works for sequential data, instead of only images. This is a Variational Autoencoder for sequences. The encoder RNN summarizes a whole sequence into a latent vector z, and the decoder RNN generates a new sequence conditioned on z. It allows meaningful generation, compression, and interpolation of sequential data.
10

Q

What is CLIP (contrastive learning & autoregressive models)?

A

The CLIP architecture is a training scheme that trains image and text embeddings together. It uses contrastive learning to align image and text representations.

Encoders:
- Text encoder (e.g., Transformer)
- Image encoder (e.g., ViT or ResNet)

Training goal:
- Bring matching image–text pairs closer in embedding space
- Push non-matching pairs apart
- Trains on huge, noisy datasets with many image–caption pairs

The latent space is multimodal (image and text):
- Allows for zero-shot recognition (it can do something it wasn't trained on)
- Enables interpolation between modalities
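A minimal sketch of the contrastive objective (a CLIP-style symmetric cross-entropy over a batch of paired embeddings; the function name and fixed temperature are my own simplifications, not the exact CLIP training code):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of matching image-text pairs."""
    # Normalise so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarities between every image and every text in the batch
    logits = image_emb @ text_emb.T / temperature
    # The matching pair for row i is column i
    labels = torch.arange(len(image_emb), device=image_emb.device)
    # Pull matching pairs together, push non-matching pairs apart, in both directions
    loss_i = F.cross_entropy(logits, labels)
    loss_t = F.cross_entropy(logits.T, labels)
    return (loss_i + loss_t) / 2
```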
11

Q

What is the difference between AE and VAE?

A

AE: use it when you want to compress features or denoise
- Reduce dimensionality (e.g., image compression)
- Pretrain encoder layers for a classifier
- Denoise corrupted inputs (Denoising AE)
You don't need to sample or generate new data:
- An AE doesn't define a proper generative model.
- Its latent space is not regularized, so interpolation or sampling doesn't work well.

VAE: use it when you need a proper generative model
- When you want to generate new samples similar to the training data.
- You need smooth interpolation between data points.
- You want a structured, continuous latent space.
Useful for:
- Interpolating between faces, objects, etc.
- Doing vector arithmetic in latent space (e.g., in NLP or image generation)
- Visualizing clusters or disentangled representations