Quiz 1 - Linear Classifiers, Gradient Descent, Neural Networks Flashcards

1
Q

Feedforward Neural Network

A

Approximates some function f* by defining a mapping y = f(x; θ) and learning the value of the parameters θ that results in the best function approximation. The layers of a feedforward neural network form a Directed Acyclic Graph (DAG)

2
Q

Default recommendation for activation function of modern neural networks

A

rectified linear unit (ReLU)

3
Q

What is the major difference between neural networks and basic linear models?

A

The nonlinearity of a neural network causes most loss functions to become non-convex. As a result, neural networks are usually trained by iterative, gradient-based optimizers that merely drive the cost to a low value (rather than drive it to 0).

For stochastic gradient descent applied to non-convex loss functions, there is no convergence guarantee, and the result is sensitive to the initial parameter values.

4
Q

What is regularization

A

Modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Example: weight decay to prevent weight parameters from getting too large (and potentially overfitting to train data).

5
Q

Neural Network cost function (most common)

A

Neural networks are most often trained using maximum likelihood, and the cost function is the negative log-likelihood.

6
Q

Negative log-likelihood cost function

A

The cross-entropy between the training data and the model distribution (measure of difference between two probability distributions)
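
A minimal sketch (NumPy; the 3-class probabilities and labels are made up) of the negative log-likelihood as the average cross-entropy between the one-hot training labels and the model distribution:

    import numpy as np

    # Hypothetical model probabilities for 4 training examples, 3 classes each.
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.2, 0.5, 0.3]])
    labels = np.array([0, 1, 2, 1])   # true class indices

    # Negative log-likelihood = average cross-entropy with the one-hot labels.
    nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    print(nll)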

7
Q

___ and ___ often lead to poor results when used with gradient-based optimization

A

1) mean squared error
2) mean absolute error

8
Q

What is the purpose of the sigmoid output

A

Sigmoid activation function converts the output to a probability (0,1) while ensuring that there is a strong gradient for wrong answers.

9
Q

Many objective functions other than log-likelihood do not work as well with the softmax function. Why?

A

Objective functions that do not use a log to undo the exponent of the softmax fail to learn when the argument to the exponent becomes very negative, causing the gradient to vanish.
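
A quick numeric illustration (NumPy; the logits are made up) of why the log matters: with a very negative logit the raw softmax output is effectively 0, but the log-softmax, computed as z_i - logsumexp(z), stays finite, so its gradient does not vanish:

    import numpy as np

    z = np.array([-100.0, 0.0, 10.0])          # one very negative logit

    softmax = np.exp(z) / np.sum(np.exp(z))
    print(softmax[0])                           # ~1.7e-48: effectively zero, no learning signal

    # Stable log-softmax: z - logsumexp(z), with logsumexp computed via the max trick.
    log_softmax = z - np.log(np.sum(np.exp(z - z.max()))) - z.max()
    print(log_softmax[0])                       # ~ -110: the log undoes the exp, gradient survives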

10
Q

Common hidden layer unit

A

ReLU

  • Differentiable (except at 0)
  • Outputs zero across half its domain (see the sketch below)
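
A minimal NumPy sketch of the ReLU and its (sub)derivative; the function names are just for illustration:

    import numpy as np

    def relu(z):
        # Identity for z > 0, zero across half its domain (z <= 0).
        return np.maximum(0.0, z)

    def relu_grad(z):
        # Derivative is 1 for z > 0 and 0 for z < 0; undefined at exactly 0
        # (a subgradient of 0 or 1 is used in practice).
        return (z > 0).astype(float)

    print(relu(np.array([-2.0, 0.0, 3.0])))       # [0. 0. 3.]
    print(relu_grad(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
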
11
Q

Different types of ReLU functions

A
  • ReLU variants with a non-zero slope for z < 0 (see the sketch below)
    • Absolute Value Rectification
      • slope for z < 0 is set to -1, giving g(z) = |z|
      • used for object recognition from images
    • Leaky ReLU
      • slope for z < 0 is non-zero and small, around 0.01
    • Parametric ReLU (PReLU)
      • slope for z < 0 is a learnable parameter
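
A sketch of the three variants, all of the form g(z) = max(0, z) + alpha * min(0, z) with different choices of the slope alpha (NumPy; names are illustrative):

    import numpy as np

    def generalized_relu(z, alpha):
        # alpha is the slope used for z < 0.
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

    z = np.array([-3.0, -0.5, 2.0])
    print(generalized_relu(z, alpha=-1.0))   # absolute value rectification: |z|
    print(generalized_relu(z, alpha=0.01))   # leaky ReLU: small fixed slope
    alpha = 0.25                             # parametric ReLU: alpha is learned by gradient descent
    print(generalized_relu(z, alpha))
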
12
Q

What is the problem with using the sigmoid function in a hidden layer?

A

Sigmoidal units saturate across most of their domain.

When z is strongly positive, they saturate to a high value and when z is strongly negative, they saturate to a low value.

The sigmoidal function is only strongly sensitive near z = 0.
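
A quick numeric check (NumPy) that the sigmoid's gradient sigmoid(z)*(1 - sigmoid(z)) is only appreciable near z = 0 and effectively vanishes where the unit saturates:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
        s = sigmoid(z)
        print(z, s, s * (1 - s))   # gradient ~4.5e-5 at |z| = 10, 0.25 at z = 0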

13
Q

What is typically true of deeper neural networks?

A

They are often able to use far fewer units per layer and far fewer parameters

14
Q

universal approximation theorem

A

a feedforward network with a linear output layer and at least one hidden layer with any squashing activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.

15
Q

Forward Propagation

A

The neural network input provides information that propagates forward through the hidden units at each layer and finally produces the output ŷ
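
A minimal forward pass for a 1-hidden-layer network (NumPy; the shapes, ReLU hidden layer, and sigmoid output are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    x  = rng.normal(size=3)            # input
    W1 = rng.normal(size=(4, 3))       # input -> hidden weights
    W2 = rng.normal(size=(1, 4))       # hidden -> output weights

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Information propagates forward: input -> hidden activations -> output y_hat.
    h     = np.maximum(0.0, W1 @ x)    # hidden layer (ReLU)
    y_hat = sigmoid(W2 @ h)            # output layer
    print(y_hat)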

16
Q

back-propagation “backprop”

A

Allows information from the cost to flow backward through the network in order to compute the gradient

17
Q

Scalar derivative rules

(fill out the table)

A
18
Q

What is the gradient of a function f(x,y)?

A

The vector of its partial derivatives

[∂f(x,y)/∂x, ∂f(x,y)/∂y]

19
Q

What is the Jacobian of two functions f(x,y) and g(x,y)?

A

The matrix of the partial derivatives (the gradients for each function are rows)
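
A concrete example (NumPy, with made-up functions f(x,y) = 3x²y and g(x,y) = x + 2y) of stacking the two gradients as rows of the Jacobian:

    import numpy as np

    x, y = 2.0, 5.0
    grad_f = np.array([6 * x * y, 3 * x**2])   # [df/dx, df/dy] for f = 3x^2 y
    grad_g = np.array([1.0, 2.0])              # [dg/dx, dg/dy] for g = x + 2y
    jacobian = np.vstack([grad_f, grad_g])     # each function's gradient is a row
    print(jacobian)                            # [[60. 12.], [ 1.  2.]]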

20
Q

for neural networks, what is the dot product of vector:

w . x

A

The dot product w · x is the sum of the element-wise products of the two vectors' elements

(i.e., w · x = wᵀx)
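
A tiny check (NumPy, made-up values) that the dot product is the sum of element-wise products:

    import numpy as np

    w = np.array([1.0, 2.0, 3.0])
    x = np.array([4.0, 5.0, 6.0])

    print(np.sum(w * x))   # 32.0: sum of element-wise products
    print(w @ x)           # 32.0: same value, written as w^T x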

21
Q

What does a linear classifier consist of?

A
  • an input
  • a function of the input
  • a loss function

* One function decomposed into building blocks

22
Q

What modulates the output of the “neuron”

A

A non-linear function (e.g., the sigmoid) modulates the neuron output into an acceptable range of values.

23
Q

The Linear Algebra View of a 2-layer neural network

A

The second (hidden) layer between the input and output layers corresponds to adding another weight matrix to the network

f(x, W1, W2) = sigmoid(W2 · sigmoid(W1 · x))
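
The same two-layer network written out in NumPy (shapes and random values are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def two_layer_net(x, W1, W2):
        # f(x, W1, W2) = sigmoid(W2 @ sigmoid(W1 @ x))
        return sigmoid(W2 @ sigmoid(W1 @ x))

    rng = np.random.default_rng(1)
    x, W1, W2 = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(2, 5))
    print(two_layer_net(x, W1, W2))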

24
Q

Two layered networks can represent any _____ function

A

continuous

25
Q

Three layered neural networks can represent any ____ functions

A

(leave blank)

theoretically, 3-layer neural networks can represent any function, although in practice this may require an exponentially large number of nodes

26
Q

What is the general framework for NN Computation graphs

A
  • Directed Acyclic graphs (DAG)
  • Modules in graph must be differentiable for gradient descent
  • Training algorithm processes the graph one module at a time
  • compositionality is achieved by this process
27
Q

Computation Graph example for NN

-log(1 / (1 + e^(-w·x)))

A
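
A NumPy sketch (scalar w and x with made-up values) of this graph decomposed into modules, with the forward pass and the chain-rule backward pass written out by hand:

    import numpy as np

    w, x = 2.0, 0.5

    # Forward pass, one module at a time: u = w*x, p = sigmoid(u), L = -log(p)
    u = w * x
    p = 1.0 / (1.0 + np.exp(-u))
    L = -np.log(p)

    # Backward pass via the chain rule, module by module.
    dL_dp = -1.0 / p
    dp_du = p * (1 - p)
    du_dw = x
    dL_dw = dL_dp * dp_du * du_dw
    print(L, dL_dw)            # dL/dw simplifies to -(1 - sigmoid(w*x)) * x
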
28
Q

Overview of backpropagation

A
  • Calculate the current model’s outputs (outputs from hidden layer l - 1)
    • aka forward pass
  • Calculate the gradients for each module
    • aka backward pass
29
Q

Backward pass algorithm “backpropagation”

A
  • start at a loss function to calculate gradients
    • calculate gradients of the loss w.r.t. module’s parameters
  • progress backwards through modules
    • given gradient of the output, compute gradient of the input and pass it back
  • end in the input layer where no gradient needs to be computed.
30
Q

Backpropagation is the application of ___ to a ___ via the ___

A
  1. gradient descent
  2. computation graph
  3. chain rule
31
Q

How do you compute the gradients of the loss (circled below)?

A
32
Q

reverse-mode automatic differentiation

A
  • Given an ordering (i.e., a DAG), iterate from the last module backwards, applying the chain rule, and store each node's gradient outputs for efficient computation (see the sketch below)
    • forward pass: store the activations
    • backward pass: store the gradient outputs
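
A toy module-based sketch of reverse-mode autodiff for the loss -log(sigmoid(u)): each module stores its activation on the forward pass and, given the gradient of its output, returns the gradient of its input on the backward pass (NumPy; the class names are illustrative):

    import numpy as np

    class Sigmoid:
        def forward(self, u):
            self.out = 1.0 / (1.0 + np.exp(-u))   # store activation for the backward pass
            return self.out

        def backward(self, grad_out):
            return grad_out * self.out * (1 - self.out)

    class NegLog:
        def forward(self, p):
            self.p = p                             # store activation for the backward pass
            return -np.log(p)

        def backward(self, grad_out):
            return grad_out * (-1.0 / self.p)

    # Iterate forward through the DAG, then backwards applying the chain rule.
    sig, nll = Sigmoid(), NegLog()
    u = 1.0
    loss = nll.forward(sig.forward(u))
    grad_u = sig.backward(nll.backward(1.0))       # dLoss/du
    print(loss, grad_u)                            # grad_u = -(1 - sigmoid(u))
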
33
Q

Auto-Diff

A

A family of algorithms for implementing chain-rule on computation graphs

34
Q

What computation is performed for gradients from multiple paths?

A

Gradients arriving at a node from multiple paths are summed (added together), following the multivariable chain rule.
35
Q

Patterns of Gradient Flow: Addition

A

Addition operation distributes gradients along all paths

36
Q

Patterns of Gradient Flow: Multiplication

A

Multiplication operation is a gradient switcher (multiplies it by the value of the other term)

37
Q

Patterns of Gradient Flow: Max Operator

A

Gradient flows along the path that was selected to be the max (which must be recorded in the forward pass)
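
A small pure-Python illustration of the three patterns, with made-up values and the gradients written out by hand:

    # Addition distributes the upstream gradient to all inputs;
    # multiplication swaps in the other factor; max routes it to the winner.
    a, b, c = 2.0, 5.0, -3.0
    upstream = 1.0

    # z = a + b : both inputs receive the upstream gradient unchanged.
    dz_da_add, dz_db_add = upstream, upstream

    # z = a * c : each input's gradient is the upstream gradient times the other term.
    dz_da_mul, dz_dc_mul = upstream * c, upstream * a

    # z = max(a, b) : gradient flows only to the input selected in the forward pass.
    dz_da_max, dz_db_max = (upstream if a > b else 0.0), (upstream if b > a else 0.0)

    print(dz_da_add, dz_db_add, dz_da_mul, dz_dc_mul, dz_da_max, dz_db_max)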

38
Q

What is one of the most important aspects in deep neural networks that can cause learning to slow or stop if not done properly?

A

* the flow of gradients *

39
Q

What is forward- mode automatic differentiation

A

start from the inputs and propagate gradients forward (no backward pass)

* Not common in deep learning because the inputs are large (e.g., images) while the output (the loss) is low-dimensional
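
A minimal forward-mode sketch using dual numbers: each value carries its derivative forward alongside it, so no backward pass is needed (pure Python; the Dual class is illustrative):

    import math

    class Dual:
        # Carries a value and its derivative forward together.
        def __init__(self, val, dot):
            self.val, self.dot = val, dot

        def __mul__(self, other):
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)

        def __add__(self, other):
            return Dual(self.val + other.val, self.dot + other.dot)

    def sin(d):
        return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

    # d/dx of x*x + sin(x) at x = 2, seeded with dx/dx = 1.
    x = Dual(2.0, 1.0)
    y = x * x + sin(x)
    print(y.val, y.dot)    # derivative = 2*2 + cos(2)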

40
Q

Differentiable programming

A
  • Computational graphs are not limited to mathematical functions
    • can have control flows (statements, loops)
    • backpropagate through algorithms
    • done dynamically, so nodes are added to the graph and gradients are computed repeatedly as the program executes (see the sketch below)
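
A small sketch assuming PyTorch is available (an assumption, not part of these notes): the graph is built dynamically as the loop and if-statement execute, so backpropagation works through the control flow:

    import torch  # assumed dependency for this illustration

    x = torch.tensor(3.0, requires_grad=True)

    # Control flow (a loop and an if) decides the computation; the graph is
    # built dynamically, node by node, as the statements execute.
    y = x
    for _ in range(4):
        if y.item() < 50.0:
            y = y * x
        else:
            y = y + x

    y.backward()
    print(y.item(), x.grad.item())
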
41
Q

Derivative of sigmoid?

A

sigmoid(x)*(1 - sigmoid(x))

42
Q

derivative of cos(x)

A

-sin(x)

43
Q

derivative of tanh(x)

A

1 - tanh^2(x)  (equivalently, sech^2(x))
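
A quick finite-difference check (NumPy, arbitrary point x = 0.7) of the derivative rules on cards 41-43:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x, eps = 0.7, 1e-6
    checks = [
        (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
        (np.cos,  lambda x: -np.sin(x)),
        (np.tanh, lambda x: 1 - np.tanh(x) ** 2),
    ]
    for f, df in checks:
        numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
        print(numeric, df(x))    # the two columns agree to ~6 decimal places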
