Neural Networks and Deep Learning Flashcards

1
Q

Neuron, activation function, network architecture, point of view of one node, hypothesis set, matrix notation

A

Neuron: a function x -> sigma(<v, x>), where v is the neuron's weight vector

Activation function: sigma: R -> R
Examples:
- sign
- threshold
- sigmoid

Network architecture: (V, E, sigma)
vertices (nodes), edges, activation function

Point of view of one node: a node receives the outputs of the nodes feeding into it, computes their weighted sum according to the edge weights, and applies sigma to the result

Hypothesis set: H_(V,E,sigma) = { h_(V,E,sigma,w) : w is a mapping from E to R }
w are the weights

Matrix notation: the output of layer t+1 is o^(t+1) = sigma(W^(t+1) o^(t)), where W^(t+1) holds the weights of the edges between layers t and t+1 and sigma is applied elementwise
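A minimal sketch in NumPy (the helper names neuron and layer_forward are illustrative, not from the course):

import numpy as np

def sigmoid(z):
    # Activation function sigma: R -> R, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def neuron(v, x):
    # A single neuron: x -> sigma(<v, x>).
    return sigmoid(np.dot(v, x))

def layer_forward(W, o_prev):
    # Matrix notation: o^(t+1) = sigma(W^(t+1) o^(t)),
    # where row i of W holds the weights on the edges into node i.
    return sigmoid(W @ o_prev)

print(neuron(np.array([0.5, -1.0]), np.array([1.0, 2.0])))
print(layer_forward(np.random.randn(4, 3), np.array([1.0, -1.0, 0.5])))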

2
Q

General construction of NN for a given Boolean formula

A

Take an arbitrary Boolean function f: {-1,1}^d -> {-1,1}

Goal: build a NN that implements f (if the input is x, then the prediction of the NN is f(x))

  • consider the x_i such that f(x_i) = 1: for each such x_i there is a neuron in the only hidden layer that corresponds to it. The neuron implements

g_i(x) = sign(<x, x_i> - d + 1)

which outputs +1 exactly when x = x_i.

Output node: implements h(x) = sign( SUM_i g_i(x) + k - 1 ),
where k = # of vectors x_i such that f(x_i) = 1, so h(x) = +1 iff at least one g_i fires. A sketch of this construction follows below.
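A minimal sketch in NumPy (boolean_nn and the XOR example are illustrative, not from the course):

import numpy as np
from itertools import product

def boolean_nn(f, d):
    # One hidden neuron per x_i in {-1,1}^d with f(x_i) = 1.
    positives = [np.array(x) for x in product([-1, 1], repeat=d)
                 if f(np.array(x)) == 1]
    k = len(positives)

    def h(x):
        # g_i(x) = sign(<x, x_i> - d + 1): equals +1 exactly when x == x_i.
        g = np.array([np.sign(x @ xi - d + 1) for xi in positives])
        # Output node: sign(SUM_i g_i(x) + k - 1) = +1 iff some g_i fired.
        return int(np.sign(g.sum() + k - 1))

    return h

# Example: XOR in the {-1,1} encoding, f(x) = -x1*x2.
f = lambda x: int(-x[0] * x[1])
h = boolean_nn(f, d=2)
assert all(h(np.array(x)) == f(np.array(x))
           for x in product([-1, 1], repeat=2))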

3
Q

Expressiveness of NNs

A

Every Boolean function can be implemented by a neural network of depth 2 (possibly with exponentially many hidden neurons). More generally, NNs are universal approximators.

4
Q

Sample complexity, runtime of learning NNs

A

Sample complexity: how much data is needed to learn with a NN
- VC-dim of H_(V,E,sign) = O(|E| log |E|)
- VC-dim of H_(V,E,sigmoid) = O(|V|^2 |E|^2)
Large NNs therefore require a lot of data.

Runtime of learning: applying the ERM rule with respect to H_(V,E,sign) is NP-hard,
so in practice we train NNs using Stochastic Gradient Descent.

5
Q

Forward propagation algorithm

A

Pseudocode (structure): given input x, set o^(0) = x; for each layer t = 1, ..., T compute a^(t) = W^(t) o^(t-1) and o^(t) = sigma(a^(t)); the output of the network is o^(T).
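A minimal sketch in NumPy (the list-of-weight-matrices representation is an assumption, not the course's exact notation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    # weights[t] is the matrix W^(t+1) of edge weights between layers t and t+1.
    # Returns the outputs of all layers (backpropagation will need them).
    outputs = [x]
    for W in weights:
        a = W @ outputs[-1]          # weighted sums a^(t+1) = W^(t+1) o^(t)
        outputs.append(sigmoid(a))   # o^(t+1) = sigma(a^(t+1))
    return outputs

weights = [np.random.randn(4, 3), np.random.randn(1, 4)]
print(forward(weights, np.array([1.0, -1.0, 0.5]))[-1])   # network prediction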

6
Q

SGD and Backpropagation algorithm (pseudocode: only structure)

A

Based on SGD

Pseudocode (structure only): at each iteration pick a training example, run forward propagation to compute the outputs of all layers, run backpropagation (from the output layer back to the input) to compute the gradient of the loss with respect to every weight, and update the weights by a gradient step.
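A minimal structural sketch in NumPy (sgd_backprop is illustrative; it assumes sigmoid activations and the squared loss, which determine the delta formulas used below):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_backprop(data, layer_sizes, eta=0.1, epochs=100):
    # Random initialization of the weight matrices.
    weights = [np.random.randn(m, n) * 0.1
               for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
    for _ in range(epochs):
        for x, y in data:                       # SGD: one example at a time
            # Forward pass: store every layer's output.
            outs = [x]
            for W in weights:
                outs.append(sigmoid(W @ outs[-1]))
            # Backward pass: propagate the error term delta layer by layer.
            delta = (outs[-1] - y) * outs[-1] * (1 - outs[-1])
            for t in reversed(range(len(weights))):
                grad = np.outer(delta, outs[t])                          # dL/dW
                delta = (weights[t].T @ delta) * outs[t] * (1 - outs[t])
                weights[t] -= eta * grad                                 # SGD step
    return weights

The backward loop reuses the layer outputs stored during the forward pass; only the structure matters here, since the delta formulas change with the choice of loss and activation.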

7
Q

Regularized NNs

A

Instead of training a NN by minimizing L_S(h), find h that minimizes

L_S(h) + (lambda/2) SUM_(e in E) w(e)^2

where lambda is the regularization parameter

We find h by SGD or one of its improved variants.

This regularizer (the squared norm of the weight vector) is called weight decay.
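A minimal sketch in NumPy (sgd_step_weight_decay and the pre-computed grads are illustrative placeholders for one update inside the training loop):

import numpy as np

def sgd_step_weight_decay(weights, grads, eta, lam):
    # Objective: L_S(w) + (lambda/2) * SUM_e w(e)^2.
    # Its gradient adds lam * W to the data-loss gradient, so every step
    # also shrinks ("decays") each weight toward zero.
    return [W - eta * (G + lam * W) for W, G in zip(weights, grads)]

weights = [np.random.randn(4, 3), np.random.randn(1, 4)]
grads = [np.zeros_like(W) for W in weights]   # stand-in for backprop gradients
weights = sgd_step_weight_decay(weights, grads, eta=0.1, lam=0.01)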
