Deep Learning Flashcards
(46 cards)
What is an activation function? Why is it important?
The activation function is the function a neuron applies to the weighted sum of its inputs. It is important because it is how you introduce nonlinearity: without it, stacked layers collapse into a single linear transformation, no matter how many layers you add.
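A minimal sketch (NumPy assumed) of why the nonlinearity matters: with no activation between them, two linear layers compose into one linear map, while adding ReLU breaks that collapse.

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# With no activation between layers, the composition is just one linear map.
two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x
assert np.allclose(two_linear_layers, one_linear_layer)

# A nonlinearity (here ReLU) between the layers breaks this equivalence,
# which is what lets extra layers add expressive power.
relu = lambda z: np.maximum(0.0, z)
nonlinear_output = W2 @ relu(W1 @ x)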
What is a step function?
A function that produces binary output based on some threshold.
What is a perceptron?
A perceptron is a neuron whose activation is the step function: it outputs 1 if the weighted sum z > 0, and 0 otherwise.
What does MLP stand for?
Multi-layer perceptron.
What is a multi-layer perceptron?
It’s when you chain together multiple layers of perceptron-like units. In the logic-gate view, each perceptron acts as a gate, and chaining layers lets you represent complex logical functions (such as XOR) that a single perceptron cannot.
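A sketch of that logic-gate view (hand-picked weights, NumPy assumed): a single perceptron implements AND, OR, or NOT, and chaining two layers gives XOR.

import numpy as np

def perceptron(x, w, b):
    # Step activation on a weighted sum: fires only if w.x + b > 0.
    return 1 if np.dot(w, x) + b > 0 else 0

def AND(a, b): return perceptron([a, b], w=[1, 1], b=-1.5)
def OR(a, b):  return perceptron([a, b], w=[1, 1], b=-0.5)
def NOT(a):    return perceptron([a],    w=[-1],   b=0.5)

def XOR(a, b):
    # Second layer combines first-layer outputs; one perceptron alone cannot do this.
    return AND(OR(a, b), NOT(AND(a, b)))

print([XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]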
What is “universality”?
If a network has universality, then it can approximate any continuous function to arbitrary accuracy with just one hidden layer, given enough units.
The definition of universality talks about one hidden layer. If that’s the case, why would you want multiple layers?
You can represent some functions far more compactly with multiple layers: a deep network may need far fewer units than a shallow one to express the same function.
What does “differentiable” mean?
A function is differentiable if you can find the slope of its tangent at any point.
What is ReLU?
ReLU (rectified linear unit) is the activation function ReLU(x) = max(0, x): it outputs 0 for x < 0 and x for x ≥ 0.
Why do we like to use ReLU instead of step function?
The step function is flat everywhere away from the threshold, so its gradient is zero and gives learning no signal. ReLU has a useful gradient of 1 wherever the input is positive.
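A small sketch (NumPy assumed) comparing the two gradients: the step function's derivative is zero wherever it is defined, while ReLU's is 1 wherever the unit is active.

import numpy as np

def step(z):      return np.where(z > 0, 1.0, 0.0)
def step_grad(z): return np.zeros_like(z)            # flat on both sides of the threshold

def relu(z):      return np.maximum(0.0, z)
def relu_grad(z): return np.where(z > 0, 1.0, 0.0)   # useful gradient for z > 0

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(step_grad(z))   # [0. 0. 0. 0.]  -> no signal to learn from
print(relu_grad(z))   # [0. 0. 1. 1.]  -> nonzero where the unit is active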
What is risk? What is empirical risk?
Risk is the expected loss over the true, underlying distribution of the data. We don’t know that distribution; we only have our training set, so we can only compute empirical risk: the average loss over the training examples.
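A minimal sketch of empirical risk (NumPy assumed; squared error and the tiny dataset are purely illustrative): average the loss over the training set because the true distribution is unavailable.

import numpy as np

def empirical_risk(model, xs, ys, loss=lambda yhat, y: (yhat - y) ** 2):
    # The average loss over the training set stands in for the true expected loss.
    return np.mean([loss(model(x), y) for x, y in zip(xs, ys)])

# Hypothetical tiny training set and model, just to make the call concrete.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.1, 0.9, 2.2])
print(empirical_risk(lambda x: x, xs, ys))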
What’s the point of the VC Dimension?
The VC dimension lets you prove that the generalization bound is finite, even for an infinite hypothesis set. The generalization bound says risk <= empirical risk + error, and the VC dimension gives a finite bound on that error term.
What is the objective function?
It’s the loss function that we want to minimize.
What is the vanishing gradient problem?
It’s when the gradient shrinks towards zero as it is backpropagated, which often happens because the chain rule multiplies many small factors (small weights and flat activation slopes) together.
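A rough numeric sketch of the mechanism: backpropagation multiplies one local derivative per layer, so if each factor is below 1 the product decays exponentially with depth (0.25, the maximum slope of a sigmoid, is used here as an illustrative factor).

local_derivative = 0.25   # e.g. the largest slope a sigmoid can contribute
for depth in (5, 10, 20, 50):
    print(f"depth {depth:2d}: gradient factor ~ {local_derivative ** depth:.2e}")
# By depth 50 the factor is astronomically small, so early layers barely update.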
What is it called when the gradient goes to zero?
The vanishing gradient problem.
What is “over-saturation”?
It’s when a unit’s activation function sits in a flat region (such as the x < 0 region of ReLU), so the local gradient is zero, which contributes to the vanishing gradient problem.
Give three solutions to the vanishing gradient problem
Any of these four: Better initialization, Better activation function, Regularization to rescale the gradient, Gradient clipping
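A sketch of one of the listed fixes, gradient clipping by norm (NumPy assumed; the threshold is arbitrary): rescale the gradient whenever its norm exceeds a limit so a single update cannot blow up.

import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

g = np.array([3.0, 4.0])               # norm 5
print(clip_by_norm(g, max_norm=1.0))   # rescaled to norm 1: [0.6 0.8]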
What does it mean if a model is “non-identifiable”?
It means that the model can get the same minimum loss with multiple different settings of the weights.
Why is it a problem if a model is non-identifiable?
If we were using the weights to draw conclusions about the model, having multiple different sets of weights that are all equally optimal would make that difficult.
What does it mean if a function is “convex”?
It means the function is bowl-shaped with a single basin, so any local minimum is also a global minimum. Formally, if you pick any two points on the function and draw the line segment between them, that segment never dips below the function.
Are deep neural networks convex?
Absolutely not.
Give three examples of problems caused by nonconvexity
Any of: Local minima, Saddle points, Cliff structures, Asymptotes
Why don’t you want to just initialize all the weights to zero?
For ReLU, zero weights give zero pre-activations and hence zero gradients. More generally, every hidden unit in a layer would compute the same output and receive the same gradient, so the units would all stay identical (the symmetry problem).
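A tiny sketch (NumPy assumed) of that symmetry problem with an all-zero init: every hidden unit gets the same pre-activation, so with ReLU both the outputs and the local gradients are zero.

import numpy as np

x = np.array([0.7, -1.2, 3.1])
W = np.zeros((4, 3))                     # 4 hidden units, all weights zero
pre_activations = W @ x                  # [0. 0. 0. 0.] -> identical units
relu_outputs = np.maximum(0.0, pre_activations)
relu_grads = (pre_activations > 0).astype(float)
print(pre_activations, relu_outputs, relu_grads)   # all zeros: nothing to learn from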
What is the most common way to initialize?
Sampling the weights randomly from a distribution, typically a zero-mean Gaussian or uniform distribution whose scale depends on the layer size (e.g. Xavier/Glorot or He initialization).
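A sketch of sampling weights from a distribution (NumPy assumed), using He initialization as one common choice: a zero-mean Gaussian whose standard deviation sqrt(2 / fan_in) suits ReLU layers.

import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    std = np.sqrt(2.0 / fan_in)                      # He et al. scaling for ReLU
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = he_init(fan_in=256, fan_out=128)
print(W.shape, round(W.std(), 3))                    # std close to sqrt(2/256) ≈ 0.088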