Chapter 11 Flashcards
What are common challenges when training deep neural networks?
Vanishing/exploding gradients, insufficient data, slow training, and overfitting.
What causes the vanishing gradient problem in deep neural networks?
Gradients shrink as they flow backward through the layers, so the lower layers' weights barely change; saturating activations such as the sigmoid combined with poor weight initialization are common causes.
What is the exploding gradient problem?
Gradients increase exponentially during backpropagation, causing divergence during training.
How do Glorot and He initialization help in training?
They maintain variance across layers, helping stabilize the signal during forward and backward passes.
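A minimal Keras sketch (layer sizes are illustrative) showing how an initializer is chosen per layer:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    # He initialization pairs with ReLU-family activations
    tf.keras.layers.Dense(300, activation="relu",
                          kernel_initializer="he_normal"),
    # Glorot (the Keras default) suits tanh/sigmoid/softmax layers
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer="glorot_uniform"),
])
```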
What activation function is commonly used to avoid vanishing gradients?
ReLU (Rectified Linear Unit), though it can suffer from the ‘dying ReLU’ problem.
What is Leaky ReLU?
A ReLU variant that allows a small gradient when the neuron is not active, preventing dying neurons.
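A minimal Keras sketch (sizes are illustrative): LeakyReLU is added as its own layer after a Dense layer with no activation.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),  # no activation here
    tf.keras.layers.LeakyReLU(),  # small default negative slope (0.3 in Keras)
    tf.keras.layers.Dense(10, activation="softmax"),
])
```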
What is ELU and its advantage?
Exponential Linear Unit: it allows negative outputs, which pushes average activations closer to zero and helps alleviate vanishing gradients, often speeding up convergence (though it is slower to compute than ReLU).
What is SELU?
Scaled ELU that makes a network self-normalize (each layer's output keeps mean 0 and standard deviation 1), provided the inputs are standardized, the weights use LeCun normal initialization, and the architecture is a plain stack of dense layers.
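A hedged sketch of a SELU network in Keras, assuming standardized inputs and illustrative layer sizes; swap in activation="elu" for a plain ELU network.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(784,))])
for _ in range(5):
    # self-normalization requires LeCun normal initialization
    model.add(tf.keras.layers.Dense(100, activation="selu",
                                    kernel_initializer="lecun_normal"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
```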
What is batch normalization?
A technique that normalizes layer inputs to reduce vanishing/exploding gradients and speed up training.
What does batch normalization do during training?
It zero-centers and scales each input using the mean and standard deviation of the current mini-batch, then rescales and shifts the result with two learned parameter vectors (a scale and an offset) per layer.
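A minimal Keras sketch placing BatchNormalization before the activation; use_bias=False is optional, since BN's learned offset makes the Dense bias redundant.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),  # learns a scale and an offset per input
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```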
What is gradient clipping used for?
To address exploding gradients by limiting the gradient values during backpropagation.
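A hedged sketch using Keras optimizer arguments for clipping (the threshold of 1.0 is illustrative):

```python
import tensorflow as tf

# clipvalue caps each gradient component at +/-1.0; clipnorm=1.0 would instead
# rescale the whole gradient vector whenever its L2 norm exceeds 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
# model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
```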
What is transfer learning?
Reusing layers from a pretrained network on a similar task to reduce training time and improve performance.
How does transfer learning work in Keras?
By reusing the lower layers of a trained model, freezing them (setting trainable = False), and training new upper layers on the new task; the frozen layers can later be unfrozen for fine-tuning at a low learning rate.
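A hedged sketch of that workflow; the saved-model file name and the single-unit output head are hypothetical. Note that reusing model_A.layers directly shares weights with model_A (clone_model avoids this).

```python
import tensorflow as tf

model_A = tf.keras.models.load_model("my_model_A.keras")  # hypothetical pretrained model

# Reuse all layers except the old output layer, then add a new task-specific head
model_B = tf.keras.Sequential(model_A.layers[:-1])
model_B.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Freeze the reused layers at first so the new head cannot wreck their weights
for layer in model_B.layers[:-1]:
    layer.trainable = False

model_B.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])
# After a few epochs, unfreeze the reused layers, recompile, and fine-tune
# with a lower learning rate.
```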
What is unsupervised pretraining?
Training each layer of a network sequentially using unlabeled data, followed by supervised fine-tuning.
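A hedged sketch using an autoencoder for the unsupervised phase (modern practice often trains the whole unsupervised model in one go rather than layer by layer); dataset names and shapes are placeholders.

```python
import tensorflow as tf

# Phase 1: unsupervised -- train an autoencoder on unlabeled data
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10)

# Phase 2: supervised fine-tuning -- reuse the encoder on the small labeled set
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(10, activation="softmax"),
])
classifier.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                   metrics=["accuracy"])
# classifier.fit(X_labeled, y_labeled, epochs=10)
```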
What are examples of faster optimizers?
Momentum, Nesterov Accelerated Gradient, AdaGrad, RMSProp, Adam, and Nadam.
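For reference, a sketch of how each can be instantiated in Keras; the hyperparameter values are illustrative defaults, not recommendations.

```python
import tensorflow as tf

momentum_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
nesterov_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
adagrad_opt  = tf.keras.optimizers.Adagrad(learning_rate=0.001)
rmsprop_opt  = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam_opt     = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
nadam_opt    = tf.keras.optimizers.Nadam(learning_rate=0.001)
```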
What is momentum optimization?
An optimizer that accelerates gradient descent by building momentum from past gradients.
What is Nesterov Accelerated Gradient (NAG)?
A variation of momentum that looks ahead in the direction of momentum for more accurate gradient estimation.
What is AdaGrad?
An optimizer that adapts the learning rate per parameter, shrinking it faster along steep dimensions; on deep networks it often stops too early because the learning rate decays too much.
What is RMSProp?
An optimizer that fixes AdaGrad's premature slowdown by accumulating only recent gradients through exponential decay, avoiding an ever-shrinking learning rate.
What is Adam optimizer?
Combines momentum and RMSProp by tracking both past gradients and squared gradients with bias correction.
What is Nadam?
Adam optimizer with the Nesterov momentum technique.
Why are second-order optimizers rarely used in deep learning?
They require Hessians, which are computationally expensive for large networks.
What is a learning rate schedule?
A strategy to change the learning rate during training to balance speed and convergence.
Name some learning rate scheduling strategies.
Power scheduling, exponential scheduling, piecewise constant scheduling, and performance scheduling.
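A hedged sketch of two of these strategies as Keras callbacks; the decay function and its constants are illustrative.

```python
import tensorflow as tf

# Exponential scheduling: divide the learning rate by 10 every 20 epochs, starting at 0.01
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)

# Performance scheduling: halve the learning rate when the validation loss plateaus
plateau_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
#                     epochs=25, callbacks=[lr_scheduler, plateau_scheduler])
```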