# Training Deep Neural Networks Flashcards

When training a deep neural network (DNN), is it OK to initialize all the weights to the same value, as long as that value is selected randomly using He initialization?

No. All weights should be sampled independently; they should not all have the same initial value. One important goal of sampling weights randomly is to break symmetry: if all weights start at the same value, every neuron in a layer computes the same output and receives the same gradient update, so the neurons stay identical and the layer behaves like a single neuron.
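A minimal NumPy sketch of He (Kaiming) normal initialization, where each weight is drawn independently with standard deviation sqrt(2 / fan_in); the helper name `he_init` is just illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def he_init(fan_in, fan_out):
    """He normal initialization: zero mean, std = sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_init(256, 128)
# Every entry is drawn independently, so no two neurons start identical
# (symmetry is broken), and the empirical std matches sqrt(2 / 256).
```

In Keras this corresponds to passing `kernel_initializer="he_normal"` to a layer rather than rolling your own.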

When training a DNN is it OK to initialize the bias terms to 0?

It is perfectly fine to initialize the bias terms to zero. Some people like to initialize them just like weights, and that is okay too; it does not make much difference.

When training a DNN, name three advantages of the SELU activation function over ReLU.

A few advantages of the SELU function over the ReLU function are:

1) It can take on negative values, so the average output of the neurons in any given layer is typically closer to zero than when using the ReLU activation function. This helps alleviate the vanishing gradients problem.

2) It always has a nonzero derivative, which avoids the dying-units issue that can affect ReLU units.

3) When the conditions are right, the SELU activation function ensures the model is self-normalizing, which solves the exploding/vanishing gradients problem.
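The self-normalizing property can be sketched numerically: with standardized inputs and LeCun normal weight initialization (std = sqrt(1 / fan_in)), activations pushed through many SELU layers keep mean near 0 and std near 1. This is a toy NumPy demonstration, not a training setup:

```python
import numpy as np

# SELU constants (alpha and scale) from Klambauer et al., 2017
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(1000, 100))  # standardized inputs

# 30 dense layers with LeCun normal weights, the setup under which
# SELU self-normalizes.
for _ in range(30):
    w = rng.normal(0.0, np.sqrt(1.0 / x.shape[1]),
                   size=(x.shape[1], x.shape[1]))
    x = selu(x @ w)

# Mean stays near 0 and std near 1 even after many layers, which is
# what keeps the gradients from vanishing or exploding.
```

With ReLU and the same depth, the activation statistics would drift layer by layer instead of staying pinned near (0, 1).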

When training a DNN in which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

1) SELU: a good default for hidden layers.

2) Leaky ReLU (and its variants): use one of these instead if you need the neural network to be as fast as possible.

3) ReLU: its simplicity makes it many people’s preferred option, even though it is generally outperformed by SELU and leaky ReLU. However, its ability to output precisely zero can be useful in some cases, and it can sometimes benefit from optimized implementations as well as from hardware acceleration.

4) Tanh: useful in the output layer if you need to output a number between -1 and 1, but nowadays it is not used much in hidden layers (except in recurrent nets).

5) Logistic: useful in the output layer when you need to estimate a probability, but rarely used in hidden layers.

6) Softmax: useful in the output layer to output probabilities for mutually exclusive classes, but rarely used in hidden layers.
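The output-layer choices above can be sketched in plain NumPy; the function names here are illustrative helpers, not a specific library's API:

```python
import numpy as np

def softmax(z):
    """Probabilities over mutually exclusive classes (sums to 1)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """A single probability estimate in (0, 1) -- the logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])
class_probs = softmax(logits)  # multiclass output layer
prob = sigmoid(0.7)            # binary/probability output layer
bounded = np.tanh(0.7)         # output constrained to (-1, 1)
```

Note how each activation matches its output range to the task: a distribution over classes, a single probability, or a bounded real number.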

When training a DNN, what may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, then the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum will carry it right past the minimum. Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.
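The overshooting behavior is easy to reproduce on a toy 1-D quadratic; this is a hand-rolled momentum update for illustration, with made-up hyperparameter values:

```python
import numpy as np

def sgd_momentum(beta, lr=0.05, steps=200, w0=10.0):
    """Minimize f(w) = w**2 with momentum SGD; return the iterate path."""
    w, v, path = w0, 0.0, []
    for _ in range(steps):
        grad = 2.0 * w              # gradient of f(w) = w**2
        v = beta * v - lr * grad    # momentum update
        w = w + v
        path.append(w)
    return np.array(path)

calm = sgd_momentum(beta=0.9)     # moderate momentum: settles near 0
wild = sgd_momentum(beta=0.9999)  # near-1 momentum: keeps overshooting
# `wild` repeatedly swings past the minimum at 0 (its sign flips),
# while `calm` has essentially converged within the same step budget.
```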

When training a DNN, what are three ways you can produce a sparse model?

One way to produce a sparse model is to train the model normally, then zero out tiny weights. For more sparsity, you can apply ℓ1 regularization during training, which pushes the optimizer toward sparsity. A third option is to use the TensorFlow Model Optimization Toolkit.
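The first two options can be sketched in NumPy. The threshold value and the `soft_threshold` helper are illustrative; soft-thresholding is the proximal step through which ℓ1 regularization drives weights exactly to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(0.0, 0.1, size=1000)  # stand-in for trained weights

# 1) Magnitude pruning: train normally, then zero out the tiny weights.
threshold = 0.05
pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
sparsity = float((pruned == 0.0).mean())   # fraction of exact zeros

# 2) l1 regularization's proximal update (soft-thresholding): weights
#    whose magnitude falls below lam are set to exactly zero.
def soft_threshold(w, lam):
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```

The third option, the TensorFlow Model Optimization Toolkit, wraps a trained Keras model with magnitude-based pruning during fine-tuning rather than requiring you to prune by hand.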

When training a DNN, does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC dropout?

Yes, dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference speed, since it is only turned on during training. MC dropout is exactly like dropout during training, but it is still active during inference, so each forward pass is slowed down slightly. More importantly, when using MC dropout you generally want to run inference 10 times or more to get better predictions. This means that making predictions is slowed down by a factor of 10 or more.
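A minimal NumPy sketch of MC dropout on a single (made-up) linear layer: dropout stays active at inference, the stochastic pass is repeated many times, and the samples are aggregated. In Keras the equivalent is calling the model with `training=True` at prediction time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, w, drop_rate=0.5):
    """One stochastic forward pass with dropout kept ON at inference."""
    mask = rng.random(x.shape) >= drop_rate
    return (x * mask / (1.0 - drop_rate)) @ w  # inverted-dropout scaling

x = rng.normal(size=(1, 64))   # one input instance
w = rng.normal(size=(64, 3))   # stand-in for a trained layer's weights

# 100 stochastic passes: prediction is roughly 100x slower than a single
# pass, but the spread of the samples gives an uncertainty estimate.
samples = np.stack([dropout_forward(x, w) for _ in range(100)])
mean_pred = samples.mean(axis=0)
uncertainty = samples.std(axis=0)
```

Averaging the stochastic predictions is what improves accuracy over a single deterministic pass, and the per-output standard deviation is the uncertainty signal you get in the bargain.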