C4 Flashcards

1
Q

exploding/vanishing gradients

A

the deeper you go in the network, the more multiplications the gradient passes through: roughly 2 × depth multiplications (one weight factor and one activation-derivative factor per layer)

products of many numbers smaller than 1 become vanishingly small; products of many numbers bigger than 1 explode

one solution: alternative activation functions
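
A minimal Python sketch of the effect, using made-up per-layer factors (0.25 and 1.5 are illustrative values, not from the card): multiplying many factors smaller than 1 drives the gradient toward zero, while factors larger than 1 make it blow up.

```python
# Illustrative sketch: repeated multiplication of per-layer gradient
# factors shows why deep products vanish or explode.
depth = 50

small_factor = 0.25   # e.g. the maximum derivative of the logistic sigmoid
large_factor = 1.5    # a per-layer factor slightly above 1

vanishing = small_factor ** depth   # ~7.9e-31: gradient effectively zero
exploding = large_factor ** depth   # ~6.4e8:  gradient blows up

print(f"product of {depth} small factors: {vanishing:.3e}")
print(f"product of {depth} large factors: {exploding:.3e}")
```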

2
Q

alternative activation functions

A
  • Logistic sigmoid
  • TanH
  • Linear (identity)
  • ReLU (Rectified Linear Unit)
  • LReLU (Leaky Rectified Linear Unit)
  • ELU (Exponential Linear Unit)
  • SELU (Scaled Exponential Linear Unit)
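
As an illustration, the listed activations can be written in NumPy as below (a sketch of my own, not the course's reference code; the alpha and scale constants are the commonly used defaults).

```python
import numpy as np

def logistic_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                          # squashes to (-1, 1)

def linear_identity(x):
    return x                                   # no squashing at all

def relu(x):
    return np.maximum(0.0, x)                  # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope for negatives

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```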
3
Q

batch normalization

A

When training a network with batches of data, the network “gets confused” because the statistical properties of the batches vary from batch to batch

Idea 1: normalize each batch => subtract the batch mean and divide by the batch standard deviation

Idea 2: assume that it is beneficial to scale and shift each normalized batch by parameters gamma and beta, chosen to minimize the network loss (error) on the whole training set

Idea 3: the optimal gamma and beta can be found with SGD (gradient descent)

Batch Normalization allows higher learning rates, reducing the number of epochs needed; consequently, training converges much faster
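
A minimal forward-pass sketch of the three ideas (my own illustration; x is assumed to be a (batch_size, features) array, and gamma/beta are the per-feature parameters that SGD would learn).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                      # Idea 1: per-feature batch mean
    var = x.var(axis=0)                        # ...and batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize the batch
    return gamma * x_hat + beta                # Idea 2: learned scale and shift

# toy usage: a batch whose features are far from zero mean / unit variance
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # approx. 0 and 1
```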

4
Q

advantages Batch Normalization

A
  • superior accuracy
  • reduces the risk of vanishing/exploding gradients
  • much faster training than plain backpropagation without it
  • allows for using “big learning rates” => fewer epochs needed for convergence (see the sketch below)
  • allows for training much deeper networks
  • increases regularization: lower risk of overfitting
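
A hedged Keras sketch of my own (the layer sizes, depth, and the 0.5 learning rate are arbitrary choices, not from the card) showing how these advantages are typically exploited: BatchNormalization interleaved in a deeper stack, trained with a comparatively large SGD learning rate.

```python
import tensorflow as tf

# A deeper MLP becomes practical when each Dense layer is followed by
# BatchNormalization; this large a learning rate would often diverge without it.
model = tf.keras.Sequential([tf.keras.Input(shape=(784,))])
for _ in range(10):
    model.add(tf.keras.layers.Dense(128))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation("relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.5),  # "big" learning rate
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```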
5
Q

regularization

A

add an additional mechanism to prevent overfitting
L1 or L2 regularization: a penalty on overly large weight values
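
A small sketch of my own (lam, the regularization strength, is an assumed hyperparameter): the penalty term grows with the magnitude of the weights and is simply added to the data loss.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-3):
    penalty = lam * sum(np.sum(w ** 2) for w in weights)     # L2: sum of squares
    return data_loss + penalty

def l1_regularized_loss(data_loss, weights, lam=1e-3):
    penalty = lam * sum(np.sum(np.abs(w)) for w in weights)  # L1: sum of |w|
    return data_loss + penalty

# toy usage with one weight matrix
weights = [np.array([[0.5, -2.0], [3.0, 0.1]])]
print(l2_regularized_loss(1.0, weights))   # data loss plus the weight penalty
```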
