Quiz 1 - Linear Classifiers, Gradient Descent, Neural Networks Flashcards

1
Q

Feedforward Neural Network

A

Approximates some function f* by defining a mapping y = f(x; θ) and learning the value of the parameters θ that results in the best function approximation. The layers of a feedforward neural network form a Directed Acyclic Graph (DAG)

2
Q

Default recommendation for activation function of modern neural networks

A

rectified linear unit (ReLU)

3
Q

What is the major difference between neural networks and basic linear models?

A

The nonlinearity of a neural network causes most loss functions to become non-convex. As a result, neural networks are usually trained by iterative, gradient-based optimizers that merely drive the cost to a low value (rather than drive it to 0).

For stochastic gradient descent applied to non-convex loss functions, there is no convergence guarantee, and the result is sensitive to the initial parameter values.

4
Q

What is regularization

A

Modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Example: weight decay to prevent weight parameters from getting too large (and potentially overfitting to train data).

5
Q

Neural Network cost function (most common)

A

Neural networks are most often trained using maximum likelihood, and the cost function is the negative log-likelihood.

6
Q

Negative log-likelihood cost function

A

The cross-entropy between the training data and the model distribution (measure of difference between two probability distributions)
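
A minimal sketch (NumPy; the 3-class probabilities and labels are made up) of the negative log-likelihood as the average cross-entropy between the one-hot training labels and the model distribution:

    import numpy as np

    # Hypothetical model probabilities for 4 training examples, 3 classes each.
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.2, 0.5, 0.3]])
    labels = np.array([0, 1, 2, 1])   # true class indices

    # Negative log-likelihood = average cross-entropy with the one-hot labels.
    nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    print(nll)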

7
Q

___ and ___ often lead to poor results when used with gradient-based optimization

A

1) mean squared error
2) mean absolute error

8
Q

What is the purpose of the sigmoid output

A

Sigmoid activation function converts the output to a probability (0,1) while ensuring that there is a strong gradient for wrong answers.

9
Q

Many objective functions other than log-likelihood do not work as well with the softmax function. Why?

A

Objective functions that do not use a log to undo the exponent of the softmax fail to learn when the argument to the exponent becomes very negative, causing the gradient to vanish.
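
A quick numeric illustration (NumPy; the logits are made up) of why the log matters: with a very negative logit the raw softmax output is effectively 0, but the log-softmax, computed as z_i - logsumexp(z), stays finite, so its gradient does not vanish:

    import numpy as np

    z = np.array([-100.0, 0.0, 10.0])          # one very negative logit

    softmax = np.exp(z) / np.sum(np.exp(z))
    print(softmax[0])                           # ~1.7e-48: effectively zero, no learning signal

    # Stable log-softmax: z - logsumexp(z), with logsumexp computed via the max trick.
    log_softmax = z - np.log(np.sum(np.exp(z - z.max()))) - z.max()
    print(log_softmax[0])                       # ~ -110: the log undoes the exp, gradient survives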

10
Q

Common hidden layer unit

A

ReLU

  • Differentiable (except at 0)
  • Outputs zero across half its domain (see the sketch below)
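
A minimal NumPy sketch of the ReLU and its (sub)derivative; the function names are just for illustration:

    import numpy as np

    def relu(z):
        # Identity for z > 0, zero across half its domain (z <= 0).
        return np.maximum(0.0, z)

    def relu_grad(z):
        # Derivative is 1 for z > 0 and 0 for z < 0; undefined at exactly 0
        # (a subgradient of 0 or 1 is used in practice).
        return (z > 0).astype(float)

    print(relu(np.array([-2.0, 0.0, 3.0])))       # [0. 0. 3.]
    print(relu_grad(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
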
11
Q

Different types of ReLU functions

A
  • ReLU variants with a non-zero slope for z < 0 (see the sketch below)
    • Absolute Value Rectification
      • slope for z < 0 is set to -1, giving g(z) = |z|
      • used for object recognition from images
    • Leaky ReLU
      • slope for z < 0 is non-zero and small, around 0.01
    • Parametric ReLU (PReLU)
      • slope for z < 0 is a learnable parameter
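
A sketch of the three variants, all of the form g(z) = max(0, z) + alpha * min(0, z) with different choices of the slope alpha (NumPy; names are illustrative):

    import numpy as np

    def generalized_relu(z, alpha):
        # alpha is the slope used for z < 0.
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

    z = np.array([-3.0, -0.5, 2.0])
    print(generalized_relu(z, alpha=-1.0))   # absolute value rectification: |z|
    print(generalized_relu(z, alpha=0.01))   # leaky ReLU: small fixed slope
    alpha = 0.25                             # parametric ReLU: alpha is learned by gradient descent
    print(generalized_relu(z, alpha))
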
12
Q

What is the problem with using the sigmoid function in a hidden layer?

A

Sigmoidal units saturate across most of their domain.

When z is strongly positive, they saturate to a high value and when z is strongly negative, they saturate to a low value.

The sigmoidal function is only strongly sensitive near z = 0.
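
A quick numeric check (NumPy) that the sigmoid's gradient sigmoid(z)*(1 - sigmoid(z)) is only appreciable near z = 0 and effectively vanishes where the unit saturates:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
        s = sigmoid(z)
        print(z, s, s * (1 - s))   # gradient ~4.5e-5 at |z| = 10, 0.25 at z = 0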

13
Q

What is typically true of deeper neural networks?

A

They are often able to use far fewer units per layer and far fewer parameters

14
Q

universal approximation theorem

A

a feedforward network with a linear output layer and at least one hidden layer with any squashing activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.

15
Q

Forward Propagation

A

The neural network input provides information that propagates forward through the hidden units at each layer and finally produces the output ŷ
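
A minimal forward pass for a 1-hidden-layer network (NumPy; the shapes, ReLU hidden layer, and sigmoid output are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    x  = rng.normal(size=3)            # input
    W1 = rng.normal(size=(4, 3))       # input -> hidden weights
    W2 = rng.normal(size=(1, 4))       # hidden -> output weights

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Information propagates forward: input -> hidden activations -> output y_hat.
    h     = np.maximum(0.0, W1 @ x)    # hidden layer (ReLU)
    y_hat = sigmoid(W2 @ h)            # output layer
    print(y_hat)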

16
Q

back-propagation “backprop”

A

Allows information from the cost to flow backward through the network in order to compute the gradient

17
Q

Scalar derivative rules

(fill out the table)

A
18
Q

What is the gradient of a function f(x,y)?

A

The vector of its partial derivatives

[∂f(x,y)/∂x, ∂f(x,y)/∂y]

19
Q

What is the Jacobian of two functions f(x,y) and g(x,y)?

A

The matrix of the partial derivatives (the gradients for each function are rows)
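
A concrete example (NumPy, with made-up functions f(x,y) = 3x²y and g(x,y) = x + 2y) of stacking the two gradients as rows of the Jacobian:

    import numpy as np

    x, y = 2.0, 5.0
    grad_f = np.array([6 * x * y, 3 * x**2])   # [df/dx, df/dy] for f = 3x^2 y
    grad_g = np.array([1.0, 2.0])              # [dg/dx, dg/dy] for g = x + 2y
    jacobian = np.vstack([grad_f, grad_g])     # each function's gradient is a row
    print(jacobian)                            # [[60. 12.], [ 1.  2.]]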

20
Q

for neural networks, what is the dot product of vector:

w . x

A

The dot product w · x is the sum of the element-wise products of the two vectors' elements

(i.e., w · x = wᵀx)
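
A tiny check (NumPy, made-up values) that the dot product is the sum of element-wise products:

    import numpy as np

    w = np.array([1.0, 2.0, 3.0])
    x = np.array([4.0, 5.0, 6.0])

    print(np.sum(w * x))   # 32.0: sum of element-wise products
    print(w @ x)           # 32.0: same value, written as w^T x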

21
Q

What does a linear classifier consist of?

A
  • an input
  • a function of the input
  • a loss function

* One function decomposed into building blocks

22
Q

What modulates the output of the “neuron”

A

A non-linear function (e.g., the sigmoid) modulates the neuron output into an acceptable range of values.

23
Q

The Linear Algebra View of a 2-layer neural network

A

The second (hidden) layer between the input and output layers corresponds to adding another weight matrix to the network

f(x, W1, W2) = sigmoid(W2 · sigmoid(W1 · x))
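
The same two-layer network written out in NumPy (shapes and random values are illustrative assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def two_layer_net(x, W1, W2):
        # f(x, W1, W2) = sigmoid(W2 @ sigmoid(W1 @ x))
        return sigmoid(W2 @ sigmoid(W1 @ x))

    rng = np.random.default_rng(1)
    x, W1, W2 = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(2, 5))
    print(two_layer_net(x, W1, W2))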

24
Q

Two layered networks can represent any _____ function

A

continuous

25
Q

Three layered neural networks can represent any ____ functions

A

(leave blank)

theoretically, 3-layer neural networks can represent any function, although in practice this may require an exponentially large number of nodes

26
Q

What is the general framework for NN Computation graphs

A
  • Directed Acyclic graphs (DAG)
  • Modules in graph must be differentiable for gradient descent
  • Training algorithm processes the graph one module at a time
  • compositionality is achieved by this process
27
Q

Computation Graph example for NN

-log(1 / (1 + e^(-w·x)))

A
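
A NumPy sketch (scalar w and x with made-up values) of this graph decomposed into modules, with the forward pass and the chain-rule backward pass written out by hand:

    import numpy as np

    w, x = 2.0, 0.5

    # Forward pass, one module at a time: u = w*x, p = sigmoid(u), L = -log(p)
    u = w * x
    p = 1.0 / (1.0 + np.exp(-u))
    L = -np.log(p)

    # Backward pass via the chain rule, module by module.
    dL_dp = -1.0 / p
    dp_du = p * (1 - p)
    du_dw = x
    dL_dw = dL_dp * dp_du * du_dw
    print(L, dL_dw)            # dL/dw simplifies to -(1 - sigmoid(w*x)) * x
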
28
Q

Overview of backpropagation

A
  • Calculate the current model’s outputs (outputs from hidden layer l - 1)
    • aka forward pass
  • Calculate the gradients for each module
    • aka backward pass
29
Q

Backward pass algorithm “backpropagation”

A
  • start at a loss function to calculate gradients
    • calculate gradients of the loss w.r.t. module’s parameters
  • progress backwards through modules
    • given gradient of the output, compute gradient of the input and pass it back
  • end in the input layer where no gradient needs to be computed.
30
Q

Backpropagation is the application of ___ to a ___ via the ___

A
  1. gradient descent
  2. computation graph
  3. chain rule
31
Q

How do you compute the gradients of the loss (circled below)?

A
32
Q

reverse-mode automatic differentiation

A
  • Given an ordering (i.e., a DAG), iterate from the last module backwards, applying the chain rule, and store each node's gradient outputs for efficient computation (see the sketch below)
    • forward pass: store the activations
    • backward pass: store the gradient outputs
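
A toy module-based sketch of reverse-mode autodiff for the loss -log(sigmoid(u)): each module stores its activation on the forward pass and, given the gradient of its output, returns the gradient of its input on the backward pass (NumPy; the class names are illustrative):

    import numpy as np

    class Sigmoid:
        def forward(self, u):
            self.out = 1.0 / (1.0 + np.exp(-u))   # store activation for the backward pass
            return self.out

        def backward(self, grad_out):
            return grad_out * self.out * (1 - self.out)

    class NegLog:
        def forward(self, p):
            self.p = p                             # store activation for the backward pass
            return -np.log(p)

        def backward(self, grad_out):
            return grad_out * (-1.0 / self.p)

    # Iterate forward through the DAG, then backwards applying the chain rule.
    sig, nll = Sigmoid(), NegLog()
    u = 1.0
    loss = nll.forward(sig.forward(u))
    grad_u = sig.backward(nll.backward(1.0))       # dLoss/du
    print(loss, grad_u)                            # grad_u = -(1 - sigmoid(u))
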
33
Q

Auto-Diff

A

A family of algorithms for implementing chain-rule on computation graphs

34
Q

What computation is performed for gradients from multiple paths?

A

Gradients arriving at a node from multiple paths are summed (added together), following the multivariable chain rule.
35
Q

Patterns of Gradient Flow: Addition

A

Addition operation distributes gradients along all paths

36
Q

Patterns of Gradient Flow: Multiplication

A

Multiplication operation is a gradient switcher (multiplies it by the value of the other term)

37
Q

Patterns of Gradient Flow: Max Operator

A

Gradient flows along the path that was selected to be the max (which must be recorded in the forward pass)
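
A small pure-Python illustration of the three patterns, with made-up values and the gradients written out by hand:

    # Addition distributes the upstream gradient to all inputs;
    # multiplication swaps in the other factor; max routes it to the winner.
    a, b, c = 2.0, 5.0, -3.0
    upstream = 1.0

    # z = a + b : both inputs receive the upstream gradient unchanged.
    dz_da_add, dz_db_add = upstream, upstream

    # z = a * c : each input's gradient is the upstream gradient times the other term.
    dz_da_mul, dz_dc_mul = upstream * c, upstream * a

    # z = max(a, b) : gradient flows only to the input selected in the forward pass.
    dz_da_max, dz_db_max = (upstream if a > b else 0.0), (upstream if b > a else 0.0)

    print(dz_da_add, dz_db_add, dz_da_mul, dz_dc_mul, dz_da_max, dz_db_max)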

38
Q

What is one of the most important aspects in deep neural networks that can cause learning to slow or stop if not done properly?

A

* the flow of gradients *

39
Q

What is forward- mode automatic differentiation

A

start from the inputs and propagate gradients forward (no backward pass)

* Not common in deep learning because the inputs are large (e.g., images) while the output (the loss) is low-dimensional
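
A minimal forward-mode sketch using dual numbers: each value carries its derivative forward alongside it, so no backward pass is needed (pure Python; the Dual class is illustrative):

    import math

    class Dual:
        # Carries a value and its derivative forward together.
        def __init__(self, val, dot):
            self.val, self.dot = val, dot

        def __mul__(self, other):
            return Dual(self.val * other.val,
                        self.dot * other.val + self.val * other.dot)

        def __add__(self, other):
            return Dual(self.val + other.val, self.dot + other.dot)

    def sin(d):
        return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

    # d/dx of x*x + sin(x) at x = 2, seeded with dx/dx = 1.
    x = Dual(2.0, 1.0)
    y = x * x + sin(x)
    print(y.val, y.dot)    # derivative = 2*2 + cos(2)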

40
Q

Differentiable programming

A
  • Computational graphs are not limited to mathematical functions
    • can have control flows (statements, loops)
    • backpropagate through algorithms
    • done dynamically, so nodes are added to the graph and gradients are computed repeatedly as the program executes (see the sketch below)
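
A small sketch assuming PyTorch is available (an assumption, not part of these notes): the graph is built dynamically as the loop and if-statement execute, so backpropagation works through the control flow:

    import torch  # assumed dependency for this illustration

    x = torch.tensor(3.0, requires_grad=True)

    # Control flow (a loop and an if) decides the computation; the graph is
    # built dynamically, node by node, as the statements execute.
    y = x
    for _ in range(4):
        if y.item() < 50.0:
            y = y * x
        else:
            y = y + x

    y.backward()
    print(y.item(), x.grad.item())
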
41
Q

Derivative of sigmoid?

A

sigmoid(x)*(1 - sigmoid(x))

42
Q

derivative of cos(x)

A

-sin(x)

43
Q

derivative of tanh(x)

A

1 - tanh^2(x)  (equivalently, sech^2(x))
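
A quick finite-difference check (NumPy, arbitrary point x = 0.7) of the derivative rules on cards 41-43:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x, eps = 0.7, 1e-6
    checks = [
        (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
        (np.cos,  lambda x: -np.sin(x)),
        (np.tanh, lambda x: 1 - np.tanh(x) ** 2),
    ]
    for f, df in checks:
        numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
        print(numeric, df(x))    # the two columns agree to ~6 decimal places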
