Topic 10: Generative Adversarial & Diffusion Learning Flashcards

(18 cards)

1
Q

What is an adversarial attack?

A

A slight modification of the input data can disrupt the model; models are not safe (robust) by default.
The idea is to harvest, i.e. gather, inputs that intentionally fool the model (the fooling is the adversarial attack).

You craft a procedure that turns simple noise (or a real example) into realistic-looking, fooling examples, instead of learning the full data distribution directly.

Instead of just generating real-looking things, you generate inputs that look real BUT confuse the model (e.g., trick it into thinking a panda is a gibbon).
You "harvest" these weird examples by (see the sketch below):
- Starting with noise or a real image
- Applying tiny changes
- Fooling the model and collecting these examples for analysis or attacks
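A minimal sketch of the "tiny changes" step, using a fast-gradient-sign-style perturbation (FGSM is one standard method; the card does not name a specific one). `model`, `x`, and `y` are assumed to be a PyTorch classifier, an input image tensor in [0, 1], and its true label:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.01):
    """One fast-gradient-sign step: a tiny perturbation that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each pixel a tiny amount in the direction that most confuses the model
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()
```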

2
Q

What is a Generative Adversarial Network (GAN)?

A

A GAN builds a generative model by having two neural networks compete with each other.
The generator turns noise into an imitation of the data, to try and trick the discriminator.
The discriminator tries to tell the real data apart from the fake data created by the generator.

The point is to make the generator so good that the discriminator can't tell the difference between generated and real data. The discriminator is the "critic" that helps the generator improve by telling it how close (or far) its output is from real data.

https://docs.google.com/document/d/141P_1nL0AJxg27OALqR3ovPok0YUgyCkAFnV3_ufhlA/edit?tab=t.0

3
Q

Describe the training procedure for a GAN?

A

You train with adversarial objectives (so, fake images) for the Discriminator and the Generator.
The Discriminator tries to identify the synthesised (fake) instances.
The Generator tries to synthesise fake instances that fool the Discriminator.
It's a min-max game (written out below):
- D tries to maximise the probability of correctly identifying real vs. fake instances
- G tries to minimise the probability that D classifies its output as fake
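The standard GAN objective that this min-max game corresponds to (the usual notation from the GAN literature, not taken from the card):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$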

The training process leads to a G whose generated distribution is as close as possible to the true data distribution.

Algorithm (a code sketch follows this list):
1. Initialise G and D
2. Discriminator (repeat for k steps)
- Sample a small batch of points from the noise prior
- Sample a small batch from the real data distribution
- Update the discriminator by SGD, minimising the loss of classifying real instances as real and fake instances as fake; we continue for k steps in the D loop
3. Generator
- Sample a small batch of points from the noise prior
- Update the generator by SGD so that its samples fool the discriminator; this alternation is repeated over the epochs
4. At convergence, the two distributions (generated and real) look alike
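A minimal sketch of this alternating loop, assuming PyTorch, hypothetical `G`, `D`, `real_loader`, and `noise_dim`, and a D that outputs sigmoid probabilities; binary cross-entropy stands in for the min-max objective above:

```python
import torch
import torch.nn.functional as F

def train_gan(G, D, real_loader, noise_dim, k=1, epochs=10, lr=2e-4):
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(epochs):
        for real in real_loader:
            # Discriminator: k steps of pushing D(real) -> 1 and D(fake) -> 0
            for _ in range(k):
                z = torch.randn(real.size(0), noise_dim)
                fake = G(z).detach()                  # don't backprop into G here
                d_real, d_fake = D(real), D(fake)
                loss_d = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
                          F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
                opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # Generator: push D(G(z)) -> 1 (non-saturating variant of the min-max game)
            z = torch.randn(real.size(0), noise_dim)
            d_gen = D(G(z))
            loss_g = F.binary_cross_entropy(d_gen, torch.ones_like(d_gen))
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```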

4
Q

GAN vs VAE?

A

A VAE learns a structured representation of the data, but sacrifices sharpness.
A GAN learns to mimic the data convincingly, but doesn't care about structure, only realism.

In VAEs, we intentionally regularize the latent space, making sure it follows a simple Gaussian prior. This makes the model easier to work with, but also forces some over-smoothing, which is why the outputs can be blurry.

In GANs, we don’t control the latent space with a strict prior. Instead, we let it emerge by training the generator to map noise to good outputs, meaning only the parts of the latent space that lead to realistic images get used, and the rest can be ignored.

5
Q

Explain how GANs can transform distributions

A

GANs are distribution transformers because they learn to transform simple noise into complex, structured, realistic data. They do this by training a generator to produce outputs that match the real data distribution, without ever needing to write down that distribution explicitly.

6
Q

Describe progressive GANs (ProGAN)

A

We need ProGAN because regular GANs are too unstable for high-resolution image generation. By growing both the generator and discriminator progressively, ProGAN makes training smoother, more stable, and more capable of producing sharp, realistic images.

It's different because it starts small and then grows over time by adding layers: both G and D start at a tiny image resolution (producing blurry faces), and new layers are progressively added to reach higher resolutions with sharper detail.

7
Q

What is a conditional GAN (paired translations)?

A

The key is that G and D are conditioned on auxiliary information (e.g., class labels or domains).
For example, we can give the generator an image that has been segmented, and G then learns to produce output for that very specific case; it is conditioned on the details. We force G to come up with very specific solutions for very specific cases. This is one way of conditioning the generator on what it should do (you can condition in many ways; a sketch of label conditioning follows).
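A minimal sketch of one conditioning mechanism (class labels embedded and concatenated with the noise vector). The PyTorch architecture and layer sizes are illustrative assumptions, not from the card:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, num_classes=10, embed_dim=32, out_dim=784):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        cond = self.label_embed(labels)               # auxiliary information
        return self.net(torch.cat([z, cond], dim=1))  # G is conditioned on it
```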

8
Q

What is cycleGAN (unpaired transformation between two GANs)?

A

The key is to have two GANs that cycle between two unpaired image collections. A downstream task could be applying a specific style (e.g., we want a zebra in the shape/masking of a horse), and CycleGAN lets us do this. We take an instance, give it to one generator, pass the result to the corresponding discriminator, and obtain the transformed output; THEN, because it is a cycle, we also do it the other way around and require that translating back recovers the original. Conceptually this cycle-consistency term is not that different from a reconstruction loss (or from CLIP-style alignment of paired representations). A sketch of the loss follows.
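A minimal sketch of the cycle-consistency term, assuming two hypothetical generators `G` (X → Y) and `Fgen` (Y → X) and unpaired image batches `x`, `y`:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G, Fgen, x, y, lam=10.0):
    loss_x = F.l1_loss(Fgen(G(x)), x)   # x -> Y -> back to X should recover x
    loss_y = F.l1_loss(G(Fgen(y)), y)   # y -> X -> back to Y should recover y
    return lam * (loss_x + loss_y)
```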

9
Q

What is StyleGAN?

A

StyleGAN improves on traditional GANs by adding a mapping step and injecting style information at multiple layers of the generator. This enables high-resolution, realistic image synthesis with better control, sharper details, and disentangled latent features like pose, expression, and texture.

10
Q

What are the disadvantages of training GANs

A

While GANs are powerful tools for generating realistic data, they are challenging to train, evaluate, scale, and implement.

Training Instability: GANs are hard to train due to the adversarial setup between generator and discriminator, often leading to unstable convergence.

Mode Collapse: The generator may produce only a few repeated outputs instead of capturing the full data diversity.

Hyperparameter Sensitivity: Small changes in learning rates, batch size, or architecture can heavily affect performance.

High Computational Cost: GANs require significant hardware and long training times.

Lack of Standard Evaluation Metrics: It’s hard to judge how good a GAN is, as standard metrics like accuracy don’t apply.

Mode Dropping: some regions of the data distribution may be ignored entirely (the generator keeps following the same patterns instead of covering the full distribution), leading to biased results.

Difficulty in Scaling: Generating high-res or complex data increases instability and computational demands.

Ethical & Security Concerns: GANs can be misused to create deepfakes or deceptive media.

Need for Large Datasets: GANs generally require a lot of training data to perform well.

Complex Implementation: Building and training GANs requires expertise and deep understanding of neural networks.

11
Q

What is a flow-based model?

A

Flow-based models learn to explicitly model the probability density of your data using invertible transformations, unlike VAEs and GANs, which learn distributions implicitly.
Flow-based models use a series of invertible transformations to map between a simple latent space and complex data. This lets them explicitly model data probability, avoiding mode collapse and enabling exact likelihood computation, something VAEs and GANs can’t do.
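The change-of-variables formula behind this exact likelihood (standard notation, not taken from the card), where f is the invertible map from data x to latent z:

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|$$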

12
Q

What is the Markov process?

A

Markov assumption: the probability of a future event depends only on the immediately preceding event, not on the whole history.
From this we can derive the joint distribution of any finite-length sequence.
Markov chain: a sequence with this property, for which the joint distribution factorises into a product of one-step transition probabilities (written out below).
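The factorisation implied by the Markov assumption (standard notation):

$$p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$$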

13
Q

What are Variational Diffusion Models (VDMs)?

A

Key idea:
1. Slowly add random noise to the data until we end up with an image of pure Gaussian noise (forward diffusion process)
2. Learn to reverse this random diffusion (reverse diffusion process)

As in a GAN, we draw from the noise distribution and turn the noisy image into a perfectly detailed image. The Gaussian form of the forward step is written out below.
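The forward (noising) process in the standard DDPM parameterisation, with noise schedule β_t (notation not taken from the card):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big), \quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$$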

14
Q

How are diffusion models trained?

A

We fit a VDM similarly to how we fit a VAE: we maximise the evidence lower bound (ELBO), because the marginal likelihood is intractable. So we instead take the best guess (a lower bound) and push it as high as possible.

We make use of the Markov property and Bayes' rule when deriving the log-likelihood of a data point under the model, parameterised by theta.

Forward process: take a data point and add more and more Gaussian noise until we have something that is pure Gaussian noise.
Reverse process: take the fully Gaussian data point and turn it back into the original data point.
Residual ε (drift term): the model's prediction of the noise added during diffusion. It tells you how to move each point back toward the data distribution. In practice, the neural network learns to predict this residual noise, so we can subtract it step by step and denoise the sample (see the training-step sketch below).
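A minimal sketch of one simplified training step (the widely used ε-prediction objective), assuming a PyTorch denoiser `model(x_t, t)`, a data batch `x0`, and a precomputed cumulative schedule `alpha_bar`:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alpha_bar, num_steps=1000):
    t = torch.randint(0, num_steps, (x0.size(0),))          # random timestep per sample
    eps = torch.randn_like(x0)                               # the residual noise to predict
    a_bar = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast to the data shape
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # forward (noising) process in closed form
    eps_hat = model(x_t, t)                                  # network predicts the residual
    return F.mse_loss(eps_hat, eps)                          # simplified ELBO objective
```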

15
Q

Explain the semantic latent space of diffusion models

A

The semantic latent space (h-space) is spanned by the activations in the bottleneck of the U-Net.
- We can derive PCA directions in h-space (see the sketch below):
- Sample N images and save the bottleneck activations
- For each timestep t, calculate the PCA of the collection

The semantic latent space (h-space) in diffusion models is captured in the bottleneck layer of the U-Net. By analyzing this space (e.g., using PCA), we can discover meaningful directions that correspond to semantic attributes of the image, enabling controlled generation and editing based on pose, age, or gender
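A minimal sketch of that recipe. `unet_bottleneck_activations(image, t)` is a hypothetical hook that returns the flattened bottleneck activation for one image at timestep t:

```python
import numpy as np
from sklearn.decomposition import PCA

def h_space_directions(images, timesteps, n_components=10):
    directions = {}
    for t in timesteps:
        # Collect bottleneck activations for N sampled images at this timestep
        H = np.stack([unet_bottleneck_activations(img, t) for img in images])  # shape (N, d)
        pca = PCA(n_components=n_components).fit(H)
        directions[t] = pca.components_   # candidate semantic directions at timestep t
    return directions
```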

16
Q

What is stable diffusion?

A

The key idea: it combines a progressive diffusion process on a VAE's latent space with ViT-L/14 transformer-based CLIP representations for conditioning.

The input and output data points can be of different domains (e.g., text in, image out). The whole model is based on a variational diffusion process; instead of treating the diffusion process and its inversion as the final, separate product, the inversion is built in through architectural constraints.

Stable Diffusion uses a VAE to compress images, then applies a diffusion model in latent space to generate high-quality, controllable images. It supports conditioning with CLIP-based prompts (like text or images), making it efficient, flexible, and ideal for tasks like text-to-image generation, editing, and inversion.

This architecture shows how Stable Diffusion efficiently generates images by:
- Compressing them to latent space with a VAE
- Applying a denoising diffusion model using a U-Net with cross-attention
- Decoding the cleaned latent back into an image
- Conditioning inputs (like text) guide the generation process, making the model versatile and powerful for various tasks.
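A hedged, schematic sketch of that pipeline in code. `vae`, `unet`, `text_encoder`, and `scheduler_step` are hypothetical stand-ins for the real Stable Diffusion components:

```python
import torch

def generate(prompt, vae, unet, text_encoder, scheduler_step, num_steps=50):
    cond = text_encoder(prompt)                  # CLIP text embedding, used via cross-attention
    z = torch.randn(1, 4, 64, 64)                # start from Gaussian noise in latent space
    for t in reversed(range(num_steps)):
        eps_hat = unet(z, t, cond)               # predict the noise, conditioned on the prompt
        z = scheduler_step(z, eps_hat, t)        # one reverse-diffusion (denoising) update
    return vae.decode(z)                         # decode the cleaned latent back to an image
```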

17
Q

What are conditional image generation with VDMs (Imagen, DALL-E 2)?

A

Imagen: uses large-scale diffusion models for conditional image generation. It conditions on text or partial images and refines results progressively through resolution stages. This makes it capable of tasks like inpainting, colorisation, and high-quality text-to-image generation with state-of-the-art photorealism.

DALL-E 2: combines a diffusion model, text and image encoders, and a CLIP-alignment objective to generate high-quality, semantically accurate images from text. It improves control (via negative prompts), realism (via diffusion), and efficiency (via latent decoding). Limitation: fine details were not always that good.

18
Q

Explain the singular vectors of the Jacobian in diffusion models

A
The singular vectors of the Jacobian (of the generative mapping from the latent space to the output) give semantic editing directions. They let you:
- Interpolate smoothly between attributes
- Target specific regions using masks
- Improve control via disentangled editing, where semantic changes don't leak into unrelated parts of the image

Masking makes it possible to "inject" a generated example locally: if we mask the mouth, we can control how it should look across different images.

This approach bridges the gap between raw generative power and controllable generation
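A minimal sketch of extracting such directions, assuming PyTorch and a hypothetical flat mapping `h_fn` from a latent vector `z` to a feature (or masked-pixel) vector:

```python
import torch

def editing_directions(h_fn, z, top_k=5):
    # Jacobian of the mapping at z: shape (output_dim, latent_dim) for a vector-to-vector h_fn
    J = torch.autograd.functional.jacobian(h_fn, z)
    U, S, Vh = torch.linalg.svd(J, full_matrices=False)
    return Vh[:top_k]   # right singular vectors: candidate editing directions in latent space
```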