Mention ways to make the cost J decrease faster
Normalize the input data
Use gradient descent with momentum
RMSprop
Initialize the weights randomly, choosing an initialization scheme that keeps the weights from being too large
Mini batch gradient descent
The Adam optimization algorithm, which combines gradient descent with momentum and RMSprop (the first and second moments of the gradient)
In very high dimensional spaces, gradient descent is more likely to end up at a local minimum than at a saddle point of the cost function. True/False?
False, in high dimensions most points where the gradient is zero are saddle points, not local minima
What are the steps required to create the mini batches
First shuffle the dataset, then partition it into mini batches of the chosen size. The last mini batch may be smaller: it holds the remaining examples up to m, the total number of examples
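The two steps above can be sketched as follows (a minimal sketch; the column-major layout with the m examples as columns, and the function and argument names, are assumptions):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle (X, Y) and partition into mini batches.

    Assumes X has shape (n_features, m) and Y has shape (1, m);
    the last mini batch may hold fewer than batch_size examples.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)                # step 1: shuffle
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    mini_batches = []
    for k in range(0, m, batch_size):        # step 2: partition
        mini_batches.append((X_shuf[:, k:k + batch_size],
                             Y_shuf[:, k:k + batch_size]))
    return mini_batches
```

With m = 10 and batch_size = 4 this yields batches of 4, 4 and 2 examples, the last one being the smaller remainder.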
What is the usual size of mini batches
Powers of 2, like 64, 128 or 256
What does momentum do in gradient descent
It computes an exponentially weighted average of the gradients from previous steps, so the updates oscillate less
What are the steps required to use momentum in gradient descent
First, initialize a velocity v to zeros, one for each dW and db
What are the usual recommended values of the hyperparameters alpha, beta 1, beta 2 and epsilon
Alpha (the learning rate) needs to be tuned
Beta 1 (momentum) is usually 0.9
Beta 2 (RMSprop) is usually 0.999
Epsilon is a small number, usually 10^-8
What are the steps required to update the parameters with momentum
After initializing the velocities to zero:
1. Compute the new velocity for each parameter using beta
2. Update the parameter with that velocity
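The two update steps can be sketched like this (a sketch only; the dict layout with keys like "W1"/"dW1" and the helper names are assumptions):

```python
import numpy as np

def initialize_velocity(params):
    """One zero velocity per parameter (each dW and db)."""
    return {"d" + k: np.zeros_like(v) for k, v in params.items()}

def update_with_momentum(params, grads, v, beta=0.9, alpha=0.01):
    """One momentum step over dicts like {"W1": ...} and {"dW1": ...}."""
    for k in params:
        # 1. new velocity = exponentially weighted average of the gradients
        v["d" + k] = beta * v["d" + k] + (1 - beta) * grads["d" + k]
        # 2. update the parameter with that velocity
        params[k] = params[k] - alpha * v["d" + k]
    return params, v
```

On the first step the velocity is just (1 - beta) times the gradient; over many steps it smooths out oscillations across updates.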
How do you implement Adam optimization
First, initialize v and s to zeros
Then compute the velocity v, then its bias-corrected version
Then compute s, then its bias-corrected version
Finally, update the parameters using the corrected v and s, with epsilon for numerical stability
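The Adam steps above can be sketched as follows (a sketch under assumed dict layouts and names; the default hyperparameter values match the usual recommendations):

```python
import numpy as np

def adam_update(params, grads, v, s, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. v and s are dicts of zeros on the first call,
    t is the 1-based step count used for bias correction."""
    for k in params:
        g = grads["d" + k]
        # momentum part: first moment and its bias-corrected version
        v["d" + k] = beta1 * v["d" + k] + (1 - beta1) * g
        v_corr = v["d" + k] / (1 - beta1 ** t)
        # RMSprop part: second moment and its bias-corrected version
        s["d" + k] = beta2 * s["d" + k] + (1 - beta2) * g ** 2
        s_corr = s["d" + k] / (1 - beta2 ** t)
        # update with the corrected v and s; epsilon avoids division by zero
        params[k] = params[k] - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return params, v, s
```

At t = 1 the bias correction exactly cancels the (1 - beta) factors, so the first update has magnitude close to alpha times the sign of the gradient.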
You need to make the model run faster and converge faster, what are different options to use
Mini batch gradient descent
Momentum in gradient descent
Adam (momentum + RMSprop)
What is learning rate decay
It makes the learning rate decrease over time, so we take smaller steps as we do more iterations and get closer to convergence
What is a problem that can occur if we add learning rate decay
The learning rate can go to zero: since it decreases with every iteration, it can quickly become near zero and stop the learning
What is a fix so the learning rate does not decay to zero too quickly
Add fixed interval scheduling. This is done in the same decay formula by dividing the epoch number by the time interval, so the rate only drops once per interval
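A sketch of this fix, assuming the common decay formula alpha = alpha0 / (1 + decay_rate * epoch_num) as the starting point (the function and parameter names are assumptions):

```python
import math

def decayed_lr(alpha0, decay_rate, epoch_num, time_interval=1):
    """Learning rate decay with fixed interval scheduling.

    With time_interval = 1 this is plain decay; with a larger
    time_interval, floor(epoch_num / time_interval) stays constant
    within each interval, so the rate only drops every
    time_interval epochs instead of every epoch.
    """
    return alpha0 / (1 + decay_rate * math.floor(epoch_num / time_interval))
```

For example, with alpha0 = 0.2, decay_rate = 1 and time_interval = 2, epochs 0 and 1 both use 0.2, and epochs 2 and 3 both use 0.1.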