Chapter 11 Flashcards
What are common challenges when training deep neural networks?
Vanishing/exploding gradients, insufficient data, slow training, and overfitting.
What causes the vanishing gradient problem in deep neural networks?
Gradients shrink as they flow backward through the layers, so the lower layers' weights barely change; saturating activations such as the sigmoid combined with poor weight initialization are common causes.
What is the exploding gradient problem?
Gradients increase exponentially during backpropagation, causing divergence during training.
How do Glorot and He initialization help in training?
They maintain variance across layers, helping stabilize the signal during forward and backward passes.
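A minimal Keras sketch (layer sizes are illustrative) showing how an initializer is chosen per layer:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    tf.keras.layers.Flatten(),
    # He initialization pairs with ReLU-family activations
    tf.keras.layers.Dense(300, activation="relu",
                          kernel_initializer="he_normal"),
    # Glorot (the Keras default) suits tanh/sigmoid/softmax layers
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer="glorot_uniform"),
])
```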
What activation function is commonly used to avoid vanishing gradients?
ReLU (Rectified Linear Unit), though it can suffer from the ‘dying ReLU’ problem.
What is Leaky ReLU?
A ReLU variant that allows a small gradient when the neuron is not active, preventing dying neurons.
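A minimal Keras sketch (sizes are illustrative): LeakyReLU is added as its own layer after a Dense layer with no activation.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(100, kernel_initializer="he_normal"),  # no activation here
    tf.keras.layers.LeakyReLU(),  # small default negative slope (0.3 in Keras)
    tf.keras.layers.Dense(10, activation="softmax"),
])
```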
What is ELU and its advantage?
Exponential Linear Unit: it allows negative outputs, which pushes average activations closer to zero and helps alleviate vanishing gradients, often speeding up convergence (though it is slower to compute than ReLU).
What is SELU?
Scaled ELU that makes a network self-normalize (each layer's output keeps mean 0 and standard deviation 1), provided the inputs are standardized, the weights use LeCun normal initialization, and the architecture is a plain stack of dense layers.
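A hedged sketch of a SELU network in Keras, assuming standardized inputs and illustrative layer sizes; swap in activation="elu" for a plain ELU network.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(784,))])
for _ in range(5):
    # self-normalization requires LeCun normal initialization
    model.add(tf.keras.layers.Dense(100, activation="selu",
                                    kernel_initializer="lecun_normal"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
```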
What is batch normalization?
A technique that normalizes layer inputs to reduce vanishing/exploding gradients and speed up training.
What does batch normalization do during training?
It zero-centers and scales each input using the mean and standard deviation of the current mini-batch, then rescales and shifts the result with two learned parameter vectors (a scale and an offset) per layer.
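A minimal Keras sketch placing BatchNormalization before the activation; use_bias=False is optional, since BN's learned offset makes the Dense bias redundant.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),  # learns a scale and an offset per input
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```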
What is gradient clipping used for?
To address exploding gradients by limiting the gradient values during backpropagation.
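A hedged sketch using Keras optimizer arguments for clipping (the threshold of 1.0 is illustrative):

```python
import tensorflow as tf

# clipvalue caps each gradient component at +/-1.0; clipnorm=1.0 would instead
# rescale the whole gradient vector whenever its L2 norm exceeds 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
# model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)
```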
What is transfer learning?
Reusing layers from a pretrained network on a similar task to reduce training time and improve performance.
How does transfer learning work in Keras?
By reusing the lower layers of a trained model, freezing them (setting trainable = False), and training new upper layers on the new task; the frozen layers can later be unfrozen for fine-tuning at a low learning rate.
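A hedged sketch of that workflow; the saved-model file name and the single-unit output head are hypothetical. Note that reusing model_A.layers directly shares weights with model_A (clone_model avoids this).

```python
import tensorflow as tf

model_A = tf.keras.models.load_model("my_model_A.keras")  # hypothetical pretrained model

# Reuse all layers except the old output layer, then add a new task-specific head
model_B = tf.keras.Sequential(model_A.layers[:-1])
model_B.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Freeze the reused layers at first so the new head cannot wreck their weights
for layer in model_B.layers[:-1]:
    layer.trainable = False

model_B.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])
# After a few epochs, unfreeze the reused layers, recompile, and fine-tune
# with a lower learning rate.
```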
What is unsupervised pretraining?
Training each layer of a network sequentially using unlabeled data, followed by supervised fine-tuning.
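A hedged sketch using an autoencoder for the unsupervised phase (modern practice often trains the whole unsupervised model in one go rather than layer by layer); dataset names and shapes are placeholders.

```python
import tensorflow as tf

# Phase 1: unsupervised -- train an autoencoder on unlabeled data
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer="adam")
# autoencoder.fit(X_unlabeled, X_unlabeled, epochs=10)

# Phase 2: supervised fine-tuning -- reuse the encoder on the small labeled set
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(10, activation="softmax"),
])
classifier.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
                   metrics=["accuracy"])
# classifier.fit(X_labeled, y_labeled, epochs=10)
```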
What are examples of faster optimizers?
Momentum, Nesterov Accelerated Gradient, AdaGrad, RMSProp, Adam, and Nadam.
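For reference, a sketch of how each can be instantiated in Keras; the hyperparameter values are illustrative defaults, not recommendations.

```python
import tensorflow as tf

momentum_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
nesterov_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
adagrad_opt  = tf.keras.optimizers.Adagrad(learning_rate=0.001)
rmsprop_opt  = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam_opt     = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
nadam_opt    = tf.keras.optimizers.Nadam(learning_rate=0.001)
```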
What is momentum optimization?
An optimizer that accelerates gradient descent by building momentum from past gradients.
What is Nesterov Accelerated Gradient (NAG)?
A variation of momentum that looks ahead in the direction of momentum for more accurate gradient estimation.
What is AdaGrad?
An optimizer that adapts the learning rate per parameter, shrinking it faster along steep dimensions; on deep networks it often stops too early because the learning rate decays too much.
What is RMSProp?
An optimizer that fixes AdaGrad's premature slowdown by accumulating only recent gradients through exponential decay, avoiding an ever-shrinking learning rate.
What is Adam optimizer?
Combines momentum and RMSProp by tracking both past gradients and squared gradients with bias correction.
What is Nadam?
Adam optimizer with the Nesterov momentum technique.
Why are second-order optimizers rarely used in deep learning?
They require Hessians, which are computationally expensive for large networks.
What is a learning rate schedule?
A strategy to change the learning rate during training to balance speed and convergence.
Name some learning rate scheduling strategies.
Power scheduling, exponential scheduling, piecewise constant scheduling, and performance scheduling.
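A hedged sketch of two of these strategies as Keras callbacks; the decay function and its constants are illustrative.

```python
import tensorflow as tf

# Exponential scheduling: divide the learning rate by 10 every 20 epochs, starting at 0.01
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)

# Performance scheduling: halve the learning rate when the validation loss plateaus
plateau_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

# history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
#                     epochs=25, callbacks=[lr_scheduler, plateau_scheduler])
```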