Deepfake Flashcards
(26 cards)
How do you use training to remove noise in an image?
- Use original, high resolution image as ground truth
- Corrupt the original and use as input
- Run the neural network to generate a prediction
- Compare prediction to ground truth
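A minimal sketch of this corrupt-then-compare loop, assuming a tiny PyTorch model (the layer sizes, noise level, and names like `denoiser` are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Illustrative denoiser: a tiny convolutional network, not any specific model.
denoiser = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

clean_batch = torch.rand(8, 3, 64, 64)                            # ground truth: original images
noisy_batch = clean_batch + 0.1 * torch.randn_like(clean_batch)   # corrupted copies used as input

prediction = denoiser(noisy_batch)               # run the neural network to generate a prediction
loss = nn.MSELoss()(prediction, clean_batch)     # compare the prediction to the ground truth
loss.backward()                                  # send the error back so the weights can adjust
optimizer.step()
```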
Origin of generative AI
trying to de-noise images
What type of learning is generative AI?
Unsupervised learning
What is downsampling and how is it used?
Reducing the number of pixels (used for corrupting images) – take the mean of all pixels within a block and make it the color of a single new pixel
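A sketch of block-mean downsampling with NumPy, assuming the image height and width divide evenly by the block size:

```python
import numpy as np

def downsample(image: np.ndarray, block: int = 4) -> np.ndarray:
    """Replace each block x block patch with the mean of its pixels (one new pixel per block)."""
    h, w, c = image.shape                        # assumes h and w divide evenly by `block`
    patches = image.reshape(h // block, block, w // block, block, c)
    return patches.mean(axis=(1, 3))

low_res = downsample(np.random.rand(64, 64, 3), block=4)   # 64x64 -> 16x16
```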
Is 8K resolution likely to happen?
The prediction is that 8K won't happen because there isn't a need for it
Image inpainting
The task of filling in holes of an image
Explain the process for training AI with image inpainting
-Training uses the actual image as the ground truth; the input is the blocked image
-Corrupt the image by adding white blocks
-The model will never see the ground truth but you use the ground truth to compare the prediction
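A sketch of the white-block corruption step in NumPy (the block position, size, and the [0, 1] pixel scale are assumptions for illustration):

```python
import numpy as np

def add_white_block(image: np.ndarray, top: int, left: int, size: int) -> np.ndarray:
    """Return a copy of the image with a square region blanked out in white."""
    corrupted = image.copy()
    corrupted[top:top + size, left:left + size, :] = 1.0   # white, for pixel values in [0, 1]
    return corrupted

ground_truth = np.random.rand(64, 64, 3)                   # stands in for a real photo
model_input = add_white_block(ground_truth, top=20, left=20, size=16)
```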
Reconstruction error
Get the error of each individual pixel and then add all the pixel wise errors together
If you simply add the raw differences together, the positive and negative errors cancel out and you can get 0
The way to solve this problem is to square the differences so they'll never be negative – this new error is the signal sent back to the neural network so it can adjust its weights
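A tiny NumPy example of why the differences are squared (the two pixel values are made up):

```python
import numpy as np

truth = np.array([0.2, 0.8])        # two example pixel values
prediction = np.array([0.3, 0.7])   # off by +0.1 and -0.1

raw_error = np.sum(prediction - truth)               # ~0: the positive and negative errors cancel
squared_error = np.sum((prediction - truth) ** 2)    # ~0.02: squared differences never go negative
```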
How would you corrupt an image for training a neural network to remove watermarks?
Collect a non-watermarked image as the ground truth and corrupt it by adding a watermark to create the input image
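One possible sketch of the watermark corruption in NumPy, assuming a simple alpha blend (the stripe-shaped watermark and the blend factor are illustrative):

```python
import numpy as np

def add_watermark(image: np.ndarray, mark: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend a watermark over the original image to create the corrupted input."""
    return (1 - alpha) * image + alpha * mark    # both arrays in [0, 1] and the same shape

ground_truth = np.random.rand(64, 64, 3)                 # non-watermarked image
watermark = np.zeros((64, 64, 3))
watermark[28:36, :, :] = 1.0                             # a crude white stripe as the "watermark"
model_input = add_watermark(ground_truth, watermark)
```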
Explain the Midjourney zoom out feature and how it is trained.
-The model zooms out of an image by adding more content around the original image
Training:
-Collect a lot of images for ground truth
-Corrupt the image by zooming in (input)
-Run the input through the neural network
-Compare the output to the ground truth
-Calculate the error rate
-Model takes the error rate to adjust weights
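A sketch of the "corrupt by zooming in" step using Pillow, assuming a center crop scaled back up to the original size (the zoom factor is arbitrary):

```python
from PIL import Image

def zoom_in(image: Image.Image, factor: float = 2.0) -> Image.Image:
    """Corrupt by cropping the center and scaling it back up to the original size."""
    w, h = image.size
    cw, ch = int(w / factor), int(h / factor)
    left, top = (w - cw) // 2, (h - ch) // 2
    crop = image.crop((left, top, left + cw, top + ch))
    return crop.resize((w, h))

ground_truth = Image.new("RGB", (512, 512))    # stands in for a real photo
model_input = zoom_in(ground_truth)            # the network learns to "zoom out" back to the ground truth
```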
How does auto-encoding work?
-Neural network is wide on the outside and narrow in the middle
-The input and output layers have roughly the same number of neurons because you want as many pixels as possible in an image, which helps to generate a high-resolution image
In terms of auto-encoding, what would you want in order to improve image resolution?
You’d want more neurons in the output layer
What does each neuron represent in auto-encoding?
Each neuron in the input and output layers represents a pixel value
Why does the middle need to be narrow for autoencoding?
When it becomes narrow, you force the model to learn the code of an image
The autoencoder forces the neural network to compress an image into a short code, from which the original image can be regenerated
What is an autoencoder?
the combination of a discriminative network (encoding) and a generative network (decoding)
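Tying the last few cards together, a toy PyTorch autoencoder with wide input/output layers and a narrow middle (all layer sizes are illustrative):

```python
import torch.nn as nn

# Toy autoencoder for 28x28 grayscale images; the 16-number bottleneck is the "code".
encoder = nn.Sequential(                 # discriminative half: image -> code
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 16),                  # narrow middle forces compression
)
decoder = nn.Sequential(                 # generative half: code -> image
    nn.Linear(16, 128), nn.ReLU(),
    nn.Linear(128, 28 * 28), nn.Sigmoid(),
)
autoencoder = nn.Sequential(encoder, decoder)   # wide on the outside, narrow in the middle
```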
What does the code of an autoencoder include?
Latent attributes
E.g. encoding a face; the latent attributes may be the smile, skin tone, gender, beard, glasses, and hair color
How might you edit a person’s smile (or something else) in an image?
You tamper with the code – specifically, in this case, the smile code
Tampering with the code is how you get a deepfake
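A sketch of that tampering step, reusing the `encoder` and `decoder` from the autoencoder sketch above; which code dimension corresponds to "smile" is purely hypothetical:

```python
import torch

face_image = torch.rand(28 * 28)   # stands in for a flattened face image
SMILE_DIM = 3                      # hypothetical: the code dimension that learned "smile"

with torch.no_grad():
    code = encoder(face_image)                   # encoder/decoder from the sketch above
    code[SMILE_DIM] += 1.5                       # tamper only with the smile attribute
    edited_face = decoder(code).reshape(28, 28)  # regenerate the image with a bigger smile
```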
What is the key to training an autoencoder?
*To learn a disentangled representation
For example, age may be entangled with gender
Where do you put the encoder for neural networks?
The encoder goes from the input to the code; the decoder goes from the code to the output
How has deepfake been used in Hollywood?
De-aging in The Irishman (2019) – used deepfake tech to make the actor appear younger (altering the age code)
Creating deepfakes by swapping the codes of different faces (facial identity is merely a combination of codes)
How can AI be used to reconstruct or digitize voice?
-Can reconstruct voice or remove noise
-Every time you speak, you change the air pressure that hits the sensor, so the recording is just a sequence of numbers representing air pressure
How might you build a neural network to remove noise in audio?
The ground truth is the actual recording; distort it by adding noise. Based on the ground truth, the weights and biases attached to the numbers representing air pressure are adjusted
The key elements of the voice are condensed into code
Components might be pitch, frequency, accent, etc.
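A sketch of an audio denoising setup in PyTorch, assuming short fixed-length waveform chunks (chunk length, code size, and noise level are made up):

```python
import torch
import torch.nn as nn

# Illustrative 1-D denoising autoencoder over short waveform chunks.
audio_denoiser = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),   # encoder: condense the chunk
    nn.Linear(256, 64),                # the code: pitch, accent, etc. end up mixed into these numbers
    nn.Linear(64, 256), nn.ReLU(),     # decoder: regenerate the clean air-pressure numbers
    nn.Linear(256, 1024),
)

clean_chunk = torch.rand(8, 1024) * 2 - 1                          # air-pressure samples in [-1, 1]
noisy_chunk = clean_chunk + 0.05 * torch.randn_like(clean_chunk)   # distorted input

loss = nn.MSELoss()(audio_denoiser(noisy_chunk), clean_chunk)
loss.backward()   # the error adjusts the weights and biases, just as with images
```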
Why do AI voices fail to connect with the listener?
-Lack of emotionality
-Monotone (though this is an inaccurate stereotype; everything we consider "human" can be captured by statistics)
How is AI image synchronized with voice?
A neural network where some neurons take the text script and some neurons handle the image, combined into one input layer
Output layer has a bunch of neurons corresponding to both the image and the voice
Take original voice, extract the code, and change the voice or mouth code
E.g. removing a swear word from a movie scene, or changing the language
This is a very hard task; there has been significant progress. The result is still a bit rigid, but the human image is very realistic
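A rough sketch of the combined input/output idea in PyTorch (the feature sizes and the way the text is represented are assumptions, not the actual method):

```python
import torch
import torch.nn as nn

TEXT_DIM, IMAGE_DIM, VOICE_DIM = 128, 28 * 28, 1024   # made-up sizes

# Some input neurons take the text script, others take the face image, all in one input layer;
# the output layer has neurons for both the mouth image and the voice samples.
joint_model = nn.Sequential(
    nn.Linear(TEXT_DIM + IMAGE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 64),                                # shared code
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, IMAGE_DIM + VOICE_DIM),
)

text_features = torch.rand(1, TEXT_DIM)                # stands in for an encoded script
face_pixels = torch.rand(1, IMAGE_DIM)                 # stands in for a flattened face image
output = joint_model(torch.cat([text_features, face_pixels], dim=1))
mouth_image, voice = output[:, :IMAGE_DIM], output[:, IMAGE_DIM:]
```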