Gradient Descent

Gradient Descent is an algorithm which is used in deep learning such as in Neural Networks.

Gradient descent works by taking the current weights and subtracting the partial derivative of the current loss w.r.t. the weights which is multiplied by some learning rate \alphaα.

As the weights change, the loss changes as well. This is often somewhat smooth locally, so small changes in wrights produce small changes in loss. We can think about iterative algorithms that take current values of weights and modify them a little bit to reduce loss and then repeats this until finding the local minimum.

We can find the steepest descent by computing the derivative

f’(a) = lim f(a+h) - f(a) / h

The steepest descent direction is the negative gradient

Intuitively: Measures how the function changes as the argument changes by a small step size, as the step goes to 0

Examples of non-parametric model

KNN and decision tree

Components of a parametric learning algorithm

Input(and representation) Functional form of the model Performance measure to improve Loss or objective function Algorithm for finding the best parameters Optimization algorithm

WHat are methods for finding best set of weights

Random search, genetic, algorithms, greaident based learning

Full batch gradient descent

we calculate loss as loss of average loss across all items in our dataset.

Small batch gradient descent

we take a small subset of data and calculate the loss over those examples and then we computer gradient and take a step in the weight space and then we take another minibatch. We often average loss over mini - batches to prevent large changes in he learning rate.

WHen is gradient descent guarenteed to converge?

GD is guaranteed to converge in limited circumstances, The LR has t be appropriately reduced. GD converges to local minima. SOme of the local minima it finds are pretty strong.

Describe distributed representation as it pertains to deep learning:

NO single neuron encodes everything and groups of neurons work together.

SImilarities between LInear classifier and Neural Network

A neuron takes input(firings) from other neurons(-> input to linear classifier)

The inputs are summed up in a weighed manner(-> weighted sum).

Learning is through a modification of the weights

If it received enough input, it fires(threshold or if weighted sum plus bias is high enough

How many layers does it take to learn any continuous function

Two layered networks can represent any continuous function

HOw many layers does ti take to learn any function?

3

Examples of activation functions

sigmoid -> 1/1+e^-x

reLU -> max(0, h to the l-1)

WHat is regularization for?

To help with overfitting

L1 Regularization

L1 Regularization uses the regularization term |W|∣W∣ which is the L1 norm of the weight matrix. L1 Regularization will encourage the network to learn weights that are small. L1 regularization promotes the network to not rely on any particular weight while also promoting sparsity due to the weights being very small, very close to zero and only a few non zero values.

∣y −Wxj| ^2 + |W|

When is gradient descent guaranteed to converge?

Gradient Descent is guaranteed to converge IF the learning rate is adequately decayed. Gradient descent does not guarantee convergence to the global minima, rather some local minima, which could be the global minima.

Local Minima vs Global Minima

Gradient Descent Properties such as the local minima can be defined as the point (locally) which produces results in which the loss is no longer reducing or has been reduced sufficiently (sufficiently being subjective), which is another way of saying that small changes in the weights would not decrease the loss.

The global minima does not mean the same thing as the local minima, but the global minima can equal the local minima.

The global minima is the lowest point of the entire space. Local minima is the lowest spot in a neighborhood (sub-region).

When is something considered to be differentiable?

A function is differentiable if the function is continuous, meaning the graph of the function must not be broken. There cannot be ‘corners’ in the function as well as cannot having any points which have undefined slopes.

Every point is differentiable

Derivative of sin(x)

cos(x)

Derivative of cos(x)

-sin(x)

Derivative of tan(x)

sec^2(x)

Derivative of cot x

-csc^2x

Derivative of sec x

tan x sec x

Derivative of csc x

-cot x csc x

What are the 3 ways we know a function is differentiable?

1) Cusp or corner(sharp turn)

2) Discontinuity(jump, point, or infinite)

3) Vertical Tangent(undefined slope)

Piecewise differentiable

You are just checking each piece for differentiability

How to prove a function is continuous

1) f(a) is defined

2) Show that the limit of f(x) as a approaches 0 exists

3) Show that the limit of f(x) is equal to the limit of f(a)

L2 regularization

L2 Regularization used the regularization term |W|^2 , which is the sum of the squares of the weights. L2 regularization adds “squared magnitude” of coefficient as penalty term to the loss function.

|y - Wxj|^2 + |W|^2

Backpropagation computational graph

1) Calculate the model current outputs(called the forward pass) on some mini batch

2) Calculate the gradients for each module(called the backward pass) w.r.t parameters

2a) FOr the backward pass, we start with the loss function where we know how to calculate the loss. We calc the change in loss with respect to the parameters in order to perform gradient descent. We calculate the change in loss with respect to the inputs in order to send us back to the previous layer

Now the previous layer knows how the loss changes when the output changes. It uses the chain rule to know how the loss changes if our inputs change or if our weights or parameters change. The upstream gradient will then be passed back to our first layer

2a) We pass back gradients to each layer, specifically the partial derivative of the Loss with respect to the models parameters. The partial derivative with respect to the input from the prior layer - dl with respect to d h ^l -1, dl with respect to the dh^l. These derivatives will give us a sense of how our loss will change if we change these parameters

2b) We end at the input layer

3) Use gradient descent to update parameters at the end. Since we have the partial derivatives for the weights with respect to all the weights, we can simply perform wi - wi - alpha * dl/dwi where alpha is equal to some learning rate.

Back prop is just the application of gradient descent to a computation graph via gradient descent

Upstream gradient

The derivative of the loss with respect to h^l - dl/dh^l

Calculating local gradients

It is the change of outputs with respect to our inputs or the change in outputs with respect to our weights. We need the first derivative so we can pass it back to previous modules. We need the second derivative for gradient descent so we can update the weights themselves.

{dh^l/dh^l-1 , dh^l/dw}

Chain rule to compute the gradient of the loss w.r.t the inputs, otherwise known as the upstream gradient

Chain rule: dz / dx= dz/dy * dy/dx

The gradient of the loss with respect to it’s inputs is equal to the gradient of the loss with respect to it’s outputs * by the gradient of its outputs with respect to its inputs.

In order for us to know how the loss changes with respect to the inputs, we also need to know how the loss changes with respect to the outputs.

This is considered the upstream gradient because it is what we receive from the subsequent layer.

dl/dh^l-1 = dl/dh^l-1 * dh^l/dh^l-1

Chain rule to compute the gradient of the loss w.r.t the weights,

dl/dW = al/ah^l * ah^l/aW

The change in loss with respect to our inputrs is multiplied by change in our outputs with respect to our models parameters