10 Advanced Deep Learning Concepts: Flashcards
By which two factors is the performance of a model limited?
- architecture-driven limitations:
  - limited model capacity
  - improper model initialization
  - appropriateness of the architecture (inductive biases)
- data-driven limitations:
  - limited amount of data
  - data quality
What is meant by “data quality”?
1) appropriateness
(e.g. a highly pixelated image would not be suitable for an image classification task)
2) cleanliness
(how accurately was the labeling done? are there outliers in the dataset?)
3) generalizability
(are there domain shifts? e.g. greyscale training images are of little use for training models that will see RGB inputs)
How can we improve the data quality?
- only use appropriate data
- clean the data
- carefully check the data for domain shifts
Why do we need data augmentation?
to synthetically increase the size of the training dataset
What kinds of data augmentation are there?
- original (unchanged image)
- horizontal flip
- vertical flip
- contrast variations
- image blocking
–> these can also be combined, as in the sketch below
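A minimal sketch of how such augmentations could be combined, assuming PyTorch/torchvision; the transforms match the list above, and the parameter values are illustrative:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # horizontal flip
    T.RandomVerticalFlip(p=0.5),     # vertical flip
    T.ColorJitter(contrast=0.4),     # contrast variations
    T.ToTensor(),                    # convert PIL image to tensor
    T.RandomErasing(p=0.5),          # image blocking (random erasing)
])
# augmented = augment(pil_image)  # each call samples a new random combination
```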
How can models be pre-trained?
Through transfer learning
–> initialize the model parameters with those of a model of the same architecture that was previously trained on similar data
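A sketch of this in PyTorch/torchvision, assuming a recent torchvision (>= 0.13) for the weights API; `num_classes` is a hypothetical placeholder for the new task:

```python
import torch.nn as nn
import torchvision.models as models

# Initialize with parameters pre-trained on ImageNet instead of random values.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the classification head for the new task.
num_classes = 10  # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_classes)
```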
How can the capacity of the model be improved?
- make the model deeper: add more layers
- make the model wider: add more neurons per layer
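A toy PyTorch sketch contrasting the two options; all layer sizes are arbitrary:

```python
import torch.nn as nn

deeper = nn.Sequential(            # more layers (depth)
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
wider = nn.Sequential(             # more neurons per layer (width)
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
```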
What are issues when training large networks?
backpropagation becomes harder as the number of network layers grows:
- gradients can vanish (e.g. with the sigmoid function, gradients go to zero for inputs of large magnitude)
- gradients can explode
How can we avoid vanishing gradients?
batch normalization (BatchNorm) on every layer:
take the layer outputs, normalize them, and rescale them before they go through the activation function
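A minimal PyTorch sketch with BatchNorm placed before the activation, as described above (the feature size is illustrative):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 128),
    nn.BatchNorm1d(128),  # normalize, then rescale with learned parameters
    nn.ReLU(),            # activation sees well-scaled inputs
)
```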
How do we get rid of exploding gradients?
residual connections
–> the layer only has to learn the residual, i.e. the difference/delta between the layer input and the desired output; residuals typically have less extreme gradients, and the skip connection lets gradients flow through the network unchanged
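A minimal sketch of a residual block in PyTorch (the inner layers are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The block only learns the residual F(x); the skip connection
    passes the input x through unchanged."""
    def __init__(self, dim: int):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.layer(x)  # output = input + learned residual
```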
What are ResNets?
networks that take advantage of residual connections as well as BatchNorm
–> they can be very deep, e.g. up to 101 layers
How do most supervised tasks work?
discriminatively (the model discriminates between different choices)
How is the U-Net built?
as an encoder-decoder (autoencoder) architecture, with skip connections between corresponding encoder and decoder levels
What is the Code (= bottleneck) between encoder and decoder layers?
goal: a meaningful representation of the data
What is one way to perform representation learning?
autoencoder
What are autoencoders used for?
- representation learning
- data denoising
- anomaly detection
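A minimal autoencoder sketch in PyTorch; the dimensions (784 down to a 32-dimensional code) are illustrative, e.g. for flattened MNIST images:

```python
import torch.nn as nn

# Encoder compresses the input into a low-dimensional code (the bottleneck);
# the decoder reconstructs the input from that code.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
autoencoder = nn.Sequential(encoder, decoder)
# Trained with a reconstruction loss, e.g. nn.MSELoss()(autoencoder(x), x)
```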
What can you do with decoders if they are trained successfully?
use them to generate data from noise
–> standalone decoders can be called generators
What are adversarial attacks?
overlay a barely visible RGB noise image on top of another image to make the model misclassify it
–> a robust model should not be confused by this
What does GAN stand for?
generative adversarial network
What is the general idea behind a GAN?
have generator G create fake samples and try to trick discriminator D into thinking they are real samples
–> two-player minimax game
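The corresponding two-player minimax objective, written in standard GAN notation (not from the flashcards themselves):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]
  + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
```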
How do GANs work? (steps)
1) D tries to maximize the objective function, i.e. succeeds in identifying real samples (a classification task)
2) G tries to minimize the objective function, i.e. succeeds in generating seemingly real samples
Training: iterate between training D and G (with backprop) until D assigns a 50% chance of a sample being real or fake
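A condensed sketch of one such iteration, assuming PyTorch; `G`, `D`, their optimizers, and the batch `real` are hypothetical names, and D is assumed to end in a sigmoid so its outputs are probabilities:

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real, z_dim=100):
    # 1) Train D: reward it for scoring real samples as 1 and fakes as 0.
    fake = G(torch.randn(real.size(0), z_dim)).detach()  # no backprop into G
    d_real, d_fake = D(real), D(fake)
    loss_D = (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # 2) Train G: reward it for fooling D into scoring fakes as real.
    d_fake = D(G(torch.randn(real.size(0), z_dim)))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```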
How do diffusion models work?
the generator is trained to denoise increasingly noisy data
–> it learns to turn highly noisy latent representations into realistic images, step by step
for text-to-image diffusion models, the conditioning latent representation is created by a large language model
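A minimal sketch of the forward (noising) step that produces the training data, using the common DDPM notation; `alpha_bar_t`, the cumulative noise-schedule value at step t, is an assumed input:

```python
import torch

def add_noise(x0: torch.Tensor, alpha_bar_t: float):
    # Blend the clean sample x0 with Gaussian noise according to the schedule.
    noise = torch.randn_like(x0)
    xt = alpha_bar_t ** 0.5 * x0 + (1 - alpha_bar_t) ** 0.5 * noise
    return xt, noise  # the denoising network is trained to predict `noise`
```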
What is meant by the concept “attention” for CNNs?
which parts of the input data are important?
What does attention in NLP enable?
it enables each element of the input sequence to attend to any other element of the sequence (and, in encoder-decoder models, to elements of the other sequence)
–> transformer models implement this attention mechanism
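A minimal sketch of the scaled dot-product attention that transformers implement; shapes are illustrative, with q, k, v of shape (sequence_length, d_model):

```python
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # similarity of every element to every other
    weights = F.softmax(scores, dim=-1)          # how much each element attends to the others
    return weights @ v                           # weighted combination of values
```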