Exam Flashcards

1
Q

softmax vs sigmoid

A

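softmax is used when the output is categorical (one of several mutually exclusive classes); sigmoid is used when the output is a single continuous value in (0, 1), e.g. the probability of the positive class in binary classification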
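
A minimal sketch of the two functions, using only the standard library. The example scores are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash a single score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p_binary = sigmoid(0.0)                # one output unit
p_classes = softmax([2.0, 1.0, 0.1])   # three mutually exclusive classes
```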

2
Q

Cover’s theorem

A

in a high-dimensional space, if the number of points is small relative to the dimensionality and you color the points randomly in two colors, the data set will be linearly separable

  1. if the number of points in a d-dimensional space is smaller than 2*d, they are almost always linearly separable
    if n/(d+1) < 2, then linearly separable
  2. if the number of points in a d-dimensional space is bigger than 2*d, they are almost always NOT linearly separable
    if n/(d+1) > 2, then not linearly separable
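
The threshold at n/(d+1) = 2 can be checked with Cover's counting function. The sketch below assumes points in general position and a separating hyperplane with a bias term (d + 1 free parameters); the function name is hypothetical.

```python
from math import comb

def fraction_separable(n, d):
    """Fraction of all two-colorings of n points (general position, d dimensions)
    that can be separated by a hyperplane with bias."""
    return 2.0 ** (1 - n) * sum(comb(n - 1, k) for k in range(d + 1))

d = 10
at_most_d_plus_1 = fraction_separable(d + 1, d)     # every coloring is separable
at_threshold = fraction_separable(2 * (d + 1), d)   # exactly half are separable
far_above = fraction_separable(4 * (d + 1), d)      # almost none are separable
```

The fraction drops from 1 to nearly 0 around n = 2(d+1), and the drop becomes sharper as d grows.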
3
Q

recommended setup for regression problems

A

linear activation function in the output layer with the sum-of-squared-errors (SSE) loss function

4
Q

Gradient descent with momentum

A

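combine the current weight update with the previous update
p_new = p_old - a*gradient + b*last_direction

or equivalently

x_(k+1) = x_k - α*g(x_k) + β*d(x_k)

where g(x_k) is the gradient at x_k and d(x_k) is the previous update direction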
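
A minimal sketch of the momentum update on a one-dimensional function. The test function (x - 3)^2, the values of alpha and beta, and the step count are assumptions chosen for illustration.

```python
def gradient_descent_momentum(grad, x0, alpha=0.1, beta=0.9, steps=500):
    """Minimize a 1-D function given its gradient, using momentum."""
    x = x0
    d = 0.0  # previous update (last direction); zero before the first step
    for _ in range(steps):
        d = -alpha * grad(x) + beta * d  # current gradient step plus previous update
        x = x + d
    return x

def grad(x):
    """Gradient of f(x) = (x - 3)^2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)

x_momentum = gradient_descent_momentum(grad, x0=0.0)
x_plain = gradient_descent_momentum(grad, x0=0.0, beta=0.0)  # beta = 0 is plain gradient descent
```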

5
Q

Gradient descent

A

p_new = p_old - a*gradient

6
Q

Biggest advantage of LSTM networks over Vanilla Recurrent Networks

A

The ability to learn which remote and recent information is relevant for the given task, and to use this information to generate output

7
Q

batch normalization

A

The process of finding an optimal transformation of each batch, layer after layer, that is optimized during the training process.

When training a network with batches of data the network “gets confused” by the fact that statistical properties of batches vary from batch to batch

Idea 1: normalize each batch => subtract the mean and divide by the std deviation

Idea 2: assume that it is beneficial to scale and to shift each batch by a certain gamma and beta, to minimize network loss (error) on the whole training set

Idea 3: Finding optimal gamma and beta can be achieved with SGD (gradient descent)

Batch normalization allows higher learning rates, which reduces the number of epochs needed; training is therefore typically much faster than without it
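
Ideas 1 and 2 can be sketched for a single feature as follows. Here gamma and beta are passed in as fixed numbers, whereas in a real network they are learned by SGD (idea 3); the small eps term that guards against division by zero is an assumption of this sketch.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((y - out_mean) ** 2 for y in out) / len(out))
```

After the transformation the batch has mean beta and standard deviation (approximately) gamma, whatever its original statistics were.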

8
Q

NetTalk

A

The task is to learn to pronounce English text from examples (text-to-speech)
- Training data: list of <phrase, phonetic representation>
- Input: 7 consecutive characters from written text presented in a moving window that scans text
- Output: phoneme code giving the pronunciation of the letter at the center of the input window
- Network topology: 7x29 binary inputs (26 chars + punctuation marks), 80 hidden units and 26 output units (phoneme code). Sigmoid units in hidden and output layer

9
Q

recommended setup for solving binary classification problems with MLP

A

sigmoid and cross-entropy

10
Q

recommended setup for solving multiclass classification problems with MLP

A

softmax and cross-entropy

11
Q

LeNet 5: layer C1

A

Convolutional layer with 6 feature maps of size 28x28
- Each unit of C1 has a 5x5 receptive field in the input layer
- Shared weights (5x5+1)x6=156 parameters to learn
- Connections: 28x28x(5x5+1)x6=122304
- If it were fully connected, we would have:
(32x32+1)x(28x28)x6 = 4,821,600 parameters
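
The three counts above can be reproduced with a few lines of arithmetic. The variable names are only for illustration.

```python
weights_per_unit = 5 * 5 + 1   # 5x5 receptive field plus one bias
feature_maps = 6
units_per_map = 28 * 28
input_pixels = 32 * 32

shared_params = weights_per_unit * feature_maps                  # weights shared within each map
connections = units_per_map * weights_per_unit * feature_maps    # every unit uses all 26 weights
fully_connected_params = (input_pixels + 1) * units_per_map * feature_maps
```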

12
Q

LeNet 5: layer S2

A

Subsampling layer with 6 feature maps of size 14x14
- 2x2 nonoverlapping receptive fields in C1
- 6x2=12 trainable parameters.
- Connections: 14x14x(2x2+1)x6=5880

13
Q

LeNet 5: total

A

The whole network has:
– 1256 nodes
– 64,660 connections
– 9,760 trainable parameters (and not millions!)
– trained with the Backpropagation algorithm

14
Q

ALVINN

A

Neural network that drives a car
30x32 inputs, 4 hidden units, 30 outputs => 30x32x4 + 4x30 = 3,960 tunable parameters (not counting biases)

15
Q

network type most suitable for removing noise from images

A

autoencoder

16
Q

Linear Separability for multi-class problems

A

There exist c linear discriminant functions
y_1(x), ..., y_c(x) such that each x is assigned to class C_k if and only if y_k(x) > y_j(x) for all j ≠ k

17
Q

when do functions not necessarily discriminate sets?

A

Check whether the functions are monotonic; if they are not, they do not necessarily discriminate the sets

18
Q

number of weights between input layer and first convolutional layer C

A

each node in C receives its input through the convolutional filter, so: size_conv_filter x number of nodes in C

19
Q

learning to play the Atari Breakout game

A

train convolutional network to play Breakout

  • the network takes as input 4 consecutive frames (preprocessed to 4x84x84 pixels) + “reward”;
  • 4 frames are needed to contain info about ball direction, speed, acceleration, etc.
  • output consists of 18 nodes that correspond to all possible positions of the joystick
20
Q

What network architecture was used to generate the “word to vector” mapping?

A

multi-layer perceptron

21
Q

What network architecture was used by the AlphaGo program?

A

ResNet (residual network)
Key idea: it’s easier to learn “the modification of the original image than the modified image” => add identity shortcuts between 2 or more layers

22
Q

What network architecture was used to generate the Google DeepDream video(s)?

A

convolutional network

23
Q

What is the difference between gradient descent and backpropagation?

A

Gradient descent is a general technique for finding (local) minima of a function which involves calculating gradients (or partial derivatives) of the function, while backpropagation is a very efficient method for calculating gradients of “well-structured” functions such as multi-layered networks.

24
Q

AlphaGo Zero

A
  • trained solely by self-play generated data (no human knowledge!)
  • use a SINGLE Convolutional ResNet with two “heads” that model policy and value estimates:
    policy = probability distribution over all possible next moves
    value = probability of winning from the current position
  • extensive use of Monte Carlo Tree Search to get better estimates
  • a tournament to select the best network to generate fresh training data
25
Q

DQN introduces a ‘replay buffer’ to store observations obtained during training. What is this buffer used for?

A

To avoid correlation between training examples

26
Q

GANs

A
  • Generator: generates fake samples, tries to fool the Discriminator
  • Discriminator: tries to distinguish between real and fake samples

Formulated as a minimax game where:
- The Discriminator tries to maximize its reward
- The Generator tries to minimize the Discriminator’s reward (or maximize its loss)

27
Q

Mode Collapse GANs

A

Outputs of the Generator gradually become less diverse: the Generator produces good samples, but very few of them, thus the Discriminator can’t tag them as fake.

28
Q

dropout

A

at every training step, every neuron has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
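
A minimal sketch of one dropout step on a list of activations. Dividing the surviving activations by (1 - p), often called inverted dropout, is an assumption of this sketch and is not stated on the card; the function name is hypothetical.

```python
import random

def dropout(activations, p, rng):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
layer = [1.0] * 8
step_1 = dropout(layer, 0.5, rng)   # a random subset is dropped in this step
step_2 = dropout(layer, 0.5, rng)   # a fresh subset is drawn in the next step
unchanged = dropout(layer, 0.0, rng)
```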

29
Q

L1 regularization

A

adds the sum of the absolute values of the weights (coefficients) as a penalty term to the loss function, to reduce overfitting

30
Q

Monte Carlo dropout

A

make, say, 100 predictions over the test set while dropout is active (so all predictions will be different) and stack them; averaging over the stack gives the final prediction

31
Q

why is adding layers to a neural network harmful?

A
  • the increased number of weights quickly leads to data overfitting (lack of generalization)
  • huge number of (bad) local minima trap the gradient descent algorithm
  • vanishing or exploding gradients (the update rule involves products of many numbers) cause additional problems
32
Q

why would we need many layers?

A
  • in theory, one hidden layer is sufficient to model any function with arbitrary accuracy, but the number of required nodes and weights grows exponentially fast
  • the deeper the network, the fewer nodes are required to model “complicated” functions
  • consecutive layers “learn” features of the training patterns, from simplest (lower layers) to more complicated (top layers)
33
Q

perceptron learning algorithm

A
  1. initialize w randomly
  2. while there are misclassified examples:
    - select misclassified example (x,d)
    - w_new = w_old + theta * x*(desired - out)
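
A minimal sketch of the algorithm on the AND function. It deviates from the card in two labeled ways: the weights start at zero instead of random values (to keep the run deterministic), and the examples are swept in order, updating on each misclassified one, rather than picking a misclassified example at random. The bias is handled by appending a constant 1 to every input.

```python
def train_perceptron(examples, theta=0.1, max_epochs=100):
    """Train on (x, desired) pairs, with x a tuple and desired in {0, 1}."""
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)  # last weight acts as the bias

    def out(x):
        total = sum(wi * xi for wi, xi in zip(w, x + (1.0,)))
        return 1 if total > 0 else 0

    for _ in range(max_epochs):
        mistakes = 0
        for x, desired in examples:
            error = desired - out(x)
            if error != 0:  # misclassified example
                mistakes += 1
                for i, xi in enumerate(x + (1.0,)):
                    w[i] += theta * xi * error  # w_new = w_old + theta * x * (desired - out)
        if mistakes == 0:  # no misclassified examples left
            break
    return w, out

and_gate = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 1)]
weights, predict = train_perceptron(and_gate)
```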
34
Q

neuron learning rule

A

w = w + theta * x * (desired - out) * out’(x)

35
Q

support vector machine idea

A

the decision boundary should be as far away from the data of both classes as possible, i.e. maximize the margin

support vectors are the training points that are nearest to the separating hyperplane

36
Q

3 weight update strategies

A

full batch mode: weights are updated after all the inputs are processed

(mini) batch mode: weights are updated after a small random sample of inputs is processed (Stochastic Gradient Descent)

on-line mode: weights are updated after processing single inputs

37
Q

types of decision regions

A
  1. network with a single node (a line)
  2. one-hidden layer network that realizes the convex region: each node realizes one line bounding this region
  3. two-hidden layer network that realizes the union of 3 convex regions: each box represents a one-hidden layer network realizing one region

see also slide 6 of week 3

38
Q

gradient descent method

A

walk in the direction yielding the maximum decrease of the network error E, which is the opposite of the gradient of E

39
Q

3 phases of backpropagation

A
  1. computing the output of the network with corresponding error
  2. computing the contribution of each weight to the error
  3. adjusting the weights accordingly
40
Q

forward pass of backpropagation

A

The network is activated on one example: the activations of all hidden nodes are computed, followed by the error of each neuron of the output layer.

41
Q

backward pass of backpropagation

A

The network error is used for updating the weights. Starting at the output layer, the error is propagated backwards through the network, layer by layer with help of the generalized delta rule. Finally, all weights are updated.

43
Q

advantages SGD

A
  • additional randomness helps to avoid local minima
  • huge savings of CPU time
  • easy to execute on GPU cards
44
Q

early stopping

A

stop training as soon as the error on the validation set increases

45
Q

exploding/vanishing gradients

A

the deeper you go, the more multiplications you have => products of small numbers are very small, products of big numbers very big
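
A small numeric illustration. The sigmoid derivative is at most 0.25, so a chain of 10 sigmoid layers scales the gradient by at most 0.25^10; the factor 1.5 and the depth 50 in the second case are arbitrary values chosen for illustration.

```python
def chain_product(factor, depth):
    """Product of `depth` identical factors, as in a chain of per-layer derivatives."""
    product = 1.0
    for _ in range(depth):
        product *= factor
    return product

vanishing = chain_product(0.25, 10)   # best case for 10 sigmoid layers
exploding = chain_product(1.5, 50)    # factors slightly above 1 blow up
```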

46
Q

how to fix exploding/vanishing gradients?

A
  1. Instead of sigmoid or TanH, use alternative activation functions of which derivatives do not vanish, or only very slowly (ReLU, LReLU, ELU, SELU)
  2. avoid the variance of the outputs growing the deeper you go, as happens with traditional initialization => use alternative initialization strategies that use fan_in and fan_out (Glorot, He, LeCun)
  3. batch normalization
47
Q

fan_in and fan_out

A

fan_in = number of connections to the given layer

fan_out = number of connections from the given layer

fan_avg = (fan_in + fan_out) / 2
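
A sketch of how these quantities are typically used to set the spread of the initial weights. The variances 1/fan_avg (Glorot), 2/fan_in (He) and 1/fan_in (LeCun) are the usual choices for normally distributed weights; the function names are hypothetical.

```python
import math
import random

def init_std(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of the initial weights for the given scheme."""
    fan_avg = (fan_in + fan_out) / 2
    variance = {
        "glorot": 1.0 / fan_avg,
        "he": 2.0 / fan_in,
        "lecun": 1.0 / fan_in,
    }[scheme]
    return math.sqrt(variance)

def init_weights(fan_in, fan_out, scheme="glorot", seed=0):
    """Draw a fan_in x fan_out weight matrix from a zero-mean normal distribution."""
    rng = random.Random(seed)
    std = init_std(fan_in, fan_out, scheme)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = init_weights(3, 4, scheme="he")
```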

48
Q

Stochastic Gradient Descent (SGD)

A

evaluate gradients and update the weights with every training example or in mini batches

normal GD takes fewer steps, but each step takes much longer to compute

advantages:
+ Fewer redundant gradient computations, i.e., faster
+ Parallelizable, optional asynchronous updates
+ High-variance updates can hop out of local minima
+ Can encourage convergence by annealing the learning rate

49
Q

Gradient descent Nesterov momentum

A

(to be filled in)

50
Q

gradient clipping

A

clip the gradients during backpropagation so that they never exceed some threshold, to avoid exploding gradients (each value is clipped to a fixed interval, e.g. [-1, 1])
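
A minimal sketch of clipping by value; the threshold of 1.0 is an arbitrary choice for illustration.

```python
def clip_by_value(gradients, threshold):
    """Clip every gradient component to the interval [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]

clipped = clip_by_value([0.5, -3.0, 10.0], threshold=1.0)
```

Clipping by value can change the direction of the gradient vector, because large components are cut while small ones are kept.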

51
Q

random surfer model as Markov Process

A

page importance = frequency with which the surfer visits the page

transition matrix used for iterative calculation of the page probability distribution

Markov Process converges if the graph is strongly connected and there are no dead ends
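
The iterative calculation can be sketched on a tiny graph. The three-page graph below is made up for illustration; it is strongly connected and has no dead ends. The sketch uses the plain random walk, without the teleportation step of full PageRank.

```python
def page_probabilities(links, steps=100):
    """links[i] lists the pages that page i links to (every page needs at least one link)."""
    n = len(links)
    p = [1.0 / n] * n  # start with the surfer equally likely to be on any page
    for _ in range(steps):
        nxt = [0.0] * n
        for page, outgoing in enumerate(links):
            share = p[page] / len(outgoing)  # the surfer picks an outgoing link uniformly
            for target in outgoing:
                nxt[target] += share
        p = nxt
    return p

links = [[1, 2], [2], [0]]   # page 0 -> 1 and 2, page 1 -> 2, page 2 -> 0
probs = page_probabilities(links)
```

For this graph the distribution settles at 0.4, 0.2 and 0.4, so pages 0 and 2 are equally important and page 1 is half as important.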

52
Q

digit recognition problem

what accuracy can be achieved if we randomly permute all pixels

A

the same as on the original data, whether we work with a single-layer perceptron or a multi-layer perceptron: if we permute the weights from the input to the hidden nodes in the same way, the network behaves exactly like the originally trained network

53
Q

key idea behind convolutional networks

A

a filter (feature detector) returns high values when the corresponding patch is similar to the filter matrix

how do we know what the filters should look like? => instead of hand-crafting, specify each filter with (very few) parameters and find values of these parameters by backpropagation
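
A sketch of a single filter sliding over a small image (stride 1, no padding). The vertical-line filter is hand-crafted here purely for illustration; in a convolutional network its nine values would be found by backpropagation.

```python
def filter_response(image, kernel):
    """Slide the kernel over the image and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

image = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
vertical_line = [
    [-1, 2, -1],
    [-1, 2, -1],
    [-1, 2, -1],
]
feature_map = filter_response(image, vertical_line)
```

The response is high (6) where the patch contains a centered vertical line and low (-3) where the line sits off-center.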

54
Q

SVM margin gamma

A

the distance of the closest example from the decision line or hyperplane

gamma_i = (w * x_i + b) * y_i

we want to maximize the margin for each data point (optimization problem)

55
Q

SVM what if the data is not separable

A

introduce a penalty: if point x_i is on the wrong side of the margin, it gets a penalty ksi_i, which is the distance of x_i to the closest point on the right side of the line

minimize 1/2*|w|^2 plus the slack penalty C times the sum of the penalties ksi_i

56
Q

SVM problem when optimizing with w

A

scaling w increases the margin, so the largest margin could be obtained simply by making w larger (w can be arbitrarily large)

solution: work with normalized w
=> gamma = (w/|w| * x + b) * y
max gamma <=> max 1/|w| <=> min |w| <=> min 1/2*|w|^2

57
Q

SVM hinge loss

A

if the classified point is too close to the separating line or on the wrong side, we incur a penalty proportional to how far away the point is from the decision boundary

58
Q

SVM how do we estimate w?

A

minimize f(w,b)
f(w,b) = 1/2*|w|^2 + C * sum_i max(0, 1 - (w * x_i + b) * y_i)

compute gradients with respect to w_j
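
A minimal sketch of minimizing this objective with (sub)gradient descent. The four training points, the learning rate and the number of steps are made up for illustration, and full-batch updates are an assumption; labels are +1 and -1.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_objective(w, b, data, C):
    """1/2*|w|^2 plus C times the summed hinge loss."""
    hinge = sum(max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data)
    return 0.5 * dot(w, w) + C * hinge

def train_svm(data, C=1.0, learning_rate=0.01, steps=1000):
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(steps):
        grad_w = list(w)  # gradient of 1/2*|w|^2 with respect to each w_j
        grad_b = 0.0
        for x, y in data:
            if y * (dot(w, x) + b) < 1.0:  # inside the margin or on the wrong side
                for j, xj in enumerate(x):
                    grad_w[j] -= C * y * xj
                grad_b -= C * y
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b

data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((-2.0, -2.0), -1), ((-3.0, -1.0), -1)]
w, b = train_svm(data)
```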

59
Q

tensor

A

an array, can be of any dimension (single number, tuple, image, stack of images, etc.)

60
Q

CNN feature map

A

the result of applying a convolutional layer to the data

61
Q

padding settings

A

artificially increasing the size of the input to preserve the original input size in the feature map

“same”: add zeros when needed
“valid”: accept the loss of some input => no padding and ignore parts of the input that don’t fit because of the stride
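
The effect of the two settings on the feature map size can be sketched in one dimension. The formulas follow the convention used by Keras and TensorFlow; the function name is hypothetical.

```python
import math

def output_size(input_size, filter_size, stride, padding):
    """Width of the feature map along one dimension."""
    if padding == "same":
        return math.ceil(input_size / stride)  # zeros are added as needed
    if padding == "valid":
        return (input_size - filter_size) // stride + 1  # leftover inputs are ignored
    raise ValueError("padding must be 'same' or 'valid'")

same_width = output_size(13, 6, 5, "same")
valid_width = output_size(13, 6, 5, "valid")
lenet_c1 = output_size(32, 5, 1, "valid")
```

With stride 1, "same" preserves the input size exactly, while "valid" shrinks it by filter_size - 1 (32 becomes 28 for a 5x5 filter).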

63
Q

AlexNet

A

first to stack convolutional layers directly on top of one another without pooling layers in between

regularization techniques:
- dropout
- data augmentation: increase the size of the training set by generating many realistic variants of each training instance (e.g. shift, rotate, resize)
- local response normalization

64
Q

local response normalization LRN

A

the most strongly activated neurons inhibit other neurons located at the same position in neighboring feature maps, encouraging different feature maps to specialize, which improves generalization