Deep Learning Flashcards

Question 1

Q

data for convolutional networks

Answer

A

grid-like topology (1D time series and 2D images)

Question 2

Q

distinguishing feature of convolutional networks

Answer

A

CNNs use convolution in place of general matrix multiplication in at least one layer

Question 3

Q

convolution function

Answer

A

integral of the product of two functions (after one is reversed and shifted)

(f * g)(t) = ∫ f(a)g(t-a) da

think of f as a measurement and g as a weighting function that gives the most weight to the most recent measurements
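
For sampled data the integral becomes a sum. The following is a minimal pure-Python sketch of that discrete version; the function name `convolve1d` and the example values are illustrative and not part of the original card.

```python
def convolve1d(f, g):
    """Full discrete convolution: out[t] = sum over a of f[a] * g[t - a]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for a, fa in enumerate(f):
        for b, gb in enumerate(g):
            # With b = t - a, the product f[a] * g[b] contributes to out[a + b].
            out[a + b] += fa * gb
    return out


measurements = [1.0, 2.0, 3.0]   # f: the signal
weights = [0.0, 1.0, 0.5]        # g: the weighting function
smoothed = convolve1d(measurements, weights)
```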

Question 4

Q

parts of convolution

Answer

A

main function: input, an n-dimensional array of data

weighting function: kernel, an n-dimensional array of parameters to adjust

output: feature map
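
To make the three parts concrete, here is a small 2D sketch. It assumes no padding and stride 1, and it does not flip the kernel (cross-correlation, which many deep learning libraries implement under the name convolution). The name `feature_map` and the numbers are made up for illustration.

```python
def feature_map(image, kernel):
    """Slide the kernel over the image (no padding, stride 1, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]                  # input: 2D array of data
kernel = [[1, 0],
          [0, -1]]                   # kernel: 2D array of adjustable parameters
fmap = feature_map(image, kernel)    # output: 2x2 feature map
```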

Question 5

Q

computational features of convolutional networks

Answer

A

sparse interactions - kernel is usually much smaller than the input
tied weights - the same set of weights is applied at every position of the input
equivariant to translation - if the input is translated, the output is translated by the same amount. An event detector on a time series will find the same event if it’s moved; the detection just moves with it.
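
A small sketch of these properties, assuming a stride-1 sliding window with no padding (the helper name `correlate_valid` and the signal are hypothetical). The same three weights are reused at every position, each output reads only three inputs, and moving the event one step later moves the response one step later.

```python
def correlate_valid(signal, kernel):
    """Apply the same small kernel at every position of the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = [1, 2, 1]
shifted_signal = [0] + signal[:-1]     # same event, one step later

out = correlate_valid(signal, kernel)
shifted_out = correlate_valid(shifted_signal, kernel)
```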

Question 6

Q

stacked convolutional layers

Answer

A

receptive field of a deep unit is larger (but also indirect) compared to the receptive field of a shallower unit

if layer 2 has a kernel width of 3, then each hidden unit receives input from 3 units.

if layer 3 also has a kernel width of 3, then each of its hidden units receives indirect input from 5 inputs when stride is 1 (neighboring receptive fields overlap), or from 9 inputs if layer 2 uses a stride of 3 so that they don't overlap
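
How many original inputs a deep unit can see depends on both the kernel widths and the strides of the layers below it. The following is a rough sketch of that bookkeeping for 1D layers; the function name `receptive_field` and its arguments are assumptions made for illustration.

```python
def receptive_field(kernel_widths, strides=None):
    """Number of original inputs seen by one unit after stacking 1D layers."""
    strides = strides or [1] * len(kernel_widths)
    size, jump = 1, 1          # jump: input distance between neighboring units
    for k, s in zip(kernel_widths, strides):
        size += (k - 1) * jump
        jump *= s
    return size


one_layer = receptive_field([3])                  # 3
two_layers = receptive_field([3, 3])              # 5 with stride 1
two_strided = receptive_field([3, 3], [3, 1])     # 9 when the first layer has stride 3
```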

Question 7

Q

stages of a convolutional layer

Answer

A

Convolution stage: Apply convolutions to produce a set of linear activations
Detector stage: Apply a nonlinear function (e.g., rectified linear) to the linear activations
Pooling stage: Replace output at some location with a summary statistic of nearby units
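
The three stages can be sketched on a 1D signal as follows. This assumes a rectified linear detector and non-overlapping max pooling, which are common choices but only one possibility; all names and numbers are illustrative.

```python
def conv_stage(signal, kernel):
    """Linear activations from sliding the kernel over the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def detector_stage(activations):
    """Rectified linear nonlinearity applied elementwise."""
    return [max(0, a) for a in activations]


def pooling_stage(values, width):
    """Replace each non-overlapping window with its maximum."""
    return [max(values[i:i + width]) for i in range(0, len(values), width)]


signal = [1, 3, 2, 0, 1, 4, 1]
linear = conv_stage(signal, [1, -1])
detected = detector_stage(linear)
pooled = pooling_stage(detected, 2)
```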

Question 8

Q

pooling and translation

Answer

A

small changes in location won’t make big changes to the summary statistics in the regions that are pooled together

pooling makes the network approximately invariant to small translations of the input
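
As a rough illustration, the sketch below max-pools detector outputs over a window of three neighbors (stride 1, clipped at the edges) and counts how many values change when the input pattern moves right by one unit. The function name and the numbers are made up for the example.

```python
def max_pool(values, width=3):
    """Max over a window centered on each position (stride 1, clipped at edges)."""
    half = width // 2
    return [max(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]


detector = [0.1, 1.0, 0.2, 0.1]
shifted = [0.3, 0.1, 1.0, 0.2]       # same pattern moved right by one unit

pooled = max_pool(detector)
pooled_shifted = max_pool(shifted)

# Every detector value changes, but only some of the pooled values do.
changed_before = sum(a != b for a, b in zip(detector, shifted))
changed_after = sum(a != b for a, b in zip(pooled, pooled_shifted))
```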

Question 9

Q

what convolution hard codes

Answer

A

the concept of a topology

(non convolutional models would have to discover the topology during learning)

Question 10

Q

local connection (as opposed to convolution)

Answer

A

like a convolution with a kernel width (patch size) of n, except with no parameter sharing.

each unit has a receptive field of n, but the incoming weights don’t have to be the same in every receptive field.
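
A minimal sketch of the difference, assuming a 1D input and a patch size of 2. The helper `locally_connected` is hypothetical: it takes one private kernel per output position, so passing identical kernels reproduces an ordinary convolution-style layer.

```python
def locally_connected(x, weights):
    """weights[i] is the private kernel used at output position i."""
    n = len(weights[0])
    return [sum(x[i + j] * w[j] for j in range(n))
            for i, w in enumerate(weights)]


x = [1, 2, 3, 4]
shared = [[1, 1], [1, 1], [1, 1]]      # identical kernels: same as convolution
unshared = [[1, 1], [2, 0], [0, 3]]    # a different kernel at each position

conv_like = locally_connected(x, shared)
local = locally_connected(x, unshared)
```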

Question 11

Q

iterated pixel labelling

Answer

A

suppose convolution step provides a label for a pixel. repeatedly applying the convolution on the labels creates a recurrent convolutional network.

repeated convolutional layers with shared weights across layers is a kind of recurrent network.

Question 12

Q

why convolutional networks can handle different input sizes

Answer

A

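the same kernel is applied a different number of times depending on the size of the input, so the size of the output scales with the size of the input. if a fixed-size output is needed, add a pooling layer whose pooling regions scale in proportion to the input size.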
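
One common way to get a fixed-size output from variable-size input is to pool over regions whose size grows with the input. The sketch below is one possible version for 1D inputs; the name `adaptive_max_pool` and the region boundaries are assumptions, and it expects a non-empty input.

```python
def adaptive_max_pool(values, out_size):
    """Max-pool a 1D list into exactly out_size values, whatever its length."""
    n = len(values)
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size
        end = ((i + 1) * n + out_size - 1) // out_size   # ceiling division
        pooled.append(max(values[start:end]))
    return pooled


short_out = adaptive_max_pool(list(range(10)), 4)
long_out = adaptive_max_pool(list(range(37)), 4)
```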

Question 13

Q

convolutions for 2D audio

Answer

A

convolutions over time: equivariant to shifts in time

convolutions over frequency: equivariant to shifts in frequency.

Question 14

Q

primary visual cortex

Answer

A

V1 has a 2D structure matching the 2D structure of retinal image
Simple cells inspired detectors in CNNs and respond to features in small localized receptive fields
Complex cells inspired pooling units. They also respond to features but are invariant to small changes in input position.
Inferotemporal cortex responds like last layer in CNN

Question 15

Q

differences between human vision and convolutional networks

Answer

A

Human vision is low resolution outside of the fovea. CNNs have full resolution over the whole image
Human vision integrates with other senses
Top-down processing (feedback from higher levels) happens in the human system
Human neurons likely have different activation and pooling functions

Question 16

Q

regularization

Answer

A

Modifications to training regime to prevent overfitting
Increasing training error for reduced testing error
Trading increased bias for reduced variance

Question 17

Q

dataset augmentation strategies

Answer

A

Adding transformations to training input (e.g., translating images a few pixels)
Adding random noise to input data
- Model needs to find regions insensitive to small perturbations
- Not just a local minimum but a local plateau
Adversarial training
- Create inputs that the network will probably misclassify
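
The first two strategies can be sketched in a few lines of pure Python. The helper names, the zero padding, and the noise scale are assumptions for illustration; adversarial training needs gradients of a trained model and is not shown.

```python
import random


def translate_right(image, dx):
    """Shift every row right by dx pixels (0 <= dx <= width), zero-padding the left."""
    return [[0] * dx + row[:len(row) - dx] for row in image]


def add_noise(image, scale, rng):
    """Add independent Gaussian noise to every pixel."""
    return [[p + rng.gauss(0.0, scale) for p in row] for row in image]


rng = random.Random(0)               # seeded so the example is repeatable
image = [[1, 2, 3],
         [4, 5, 6]]
shifted = translate_right(image, 1)
noisy = add_noise(image, 0.1, rng)
```

The label stays the same for each transformed copy, which is what pushes the model toward regions that are insensitive to small perturbations.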

Question 18

Q

noise robustness

Answer

A

Adding noise to hidden units or weights
- Noise on weights captures uncertainty about parameter estimates
Adding noise to output units
- Assume x% of labels are wrong, so model doesn’t overfit on bad training data
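
Noise on the output targets is often applied as label smoothing. The sketch below shows one common variant, assuming one-hot targets over k classes and a smoothing amount `eps`; both the function name and the value of `eps` are illustrative.

```python
def smooth_labels(one_hot, eps):
    """Replace hard 1 with 1 - eps and spread eps evenly over the other classes."""
    k = len(one_hot)
    return [(1 - eps) if y == 1 else eps / (k - 1) for y in one_hot]


targets = smooth_labels([0, 1, 0], eps=0.1)   # roughly [0.05, 0.9, 0.05]
```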

Question 19

Q

semi-supervised training

Answer

A

Use both labeled P(x,y) and unlabeled P(x) data to estimate P(y|x)
Want to learn a latent representation
Have the generative model share parameters (a common representation) with the discriminative model
Like having a prior that the structure of P(x) is connected to the structure of P(y|x)

Question 20

Q

multi-task learning

Answer

A

Have the model do different kinds of tasks
Assume there is a set of factors that account for variance in input and that these factors are shared by different tasks. (Each task uses a subset of these factors.)
Shared part of the model should learn good values because it is constrained to generalize across tasks
Common architecture
- Input layer
- Shared representation layers
- Task specific representation layers
- Output layers

Question 21

Q

early stopping

Answer

A

As you overfit a model, training error continues to decrease while error on held-out (validation) data starts to rise.
So just stop at the point where the held-out error is lowest
Think of the number of training steps as a hyperparameter
- Similar to L2 regularization, but instead of training several models to find the optimal L2 value, we find the optimal number of steps during a single training run.
Requires extra held-out data.
- Alternatively, remember the number of steps to the minimum and retrain on training+held-out data, stopping after that number of steps
Restricts the model to a smaller volume of parameter space
- If learning rate is R, can only explore about R*n_steps of space around the initial parameters
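
In practice the stopping point is usually found with a patience rule: keep training until the held-out error has failed to improve for a fixed number of checks. The following sketch assumes the held-out losses are already available as a list; the function name and the `patience` parameter are hypothetical.

```python
def best_step(heldout_losses, patience):
    """Return (step, loss) of the best held-out loss seen before patience runs out."""
    best_loss, best_index, waited = float("inf"), -1, 0
    for i, loss in enumerate(heldout_losses):
        if loss < best_loss:
            best_loss, best_index, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break              # stop training here
    return best_index, best_loss


losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.6, 2.0]
step, loss = best_step(losses, patience=3)
```

With a patience of 3 this run stops before it ever reaches the final value of 2.0, which shows that the rule only finds a local minimum of the held-out error.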

Question 22

Q

parameter tying and sharing

Answer

A

Sharing: Force groups of parameters within a model to be equal
Tying: Force parameters to be like parameters in another model

Question 23

Q

sparse representations (regularization)

Answer

A

Penalize activation of the hidden units
Many of the elements of the (hidden) representation are zero or close to zero.

Question 24

Q

bagging

Answer

A

Bootstrap aggregating
Bootstrap sample k new training datasets and train k new models
On average about two thirds (roughly 63%) of the original examples appear in each new data set; the rest of it is duplicates
Ensembles of 5-10 models are typical
Hard to train so many large networks
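
A quick way to check the fraction of distinct examples is to draw one bootstrap sample and count. This is a small sketch with a seeded generator; the dataset here is just a list of integers standing in for training examples.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)
data = list(range(10000))
sample = bootstrap_sample(data, rng)
unique_fraction = len(set(sample)) / len(data)   # close to 1 - 1/e, about 0.63
```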

Question 25

Q

Why can NNs benefit from averaging?

Answer

A

They reach many different solutions
Differences in
- Random initialization,
- Random train batch selection,
- Hyperparameters
- Outcomes of non-determinism

Question 26

Q

ensemble methods

Answer

A

Combining several models
Train several models separately and let them vote on the output

Question 27

Q

model averaging

Answer

A

Works because not all models will make the same errors at test time
If errors are correlated perfectly, no advantage to averaging
If errors are perfectly uncorrelated, expected squared error of the ensemble shrinks in proportion to 1/k for an ensemble of size k
On average, the ensemble will perform at least as well as any of its members, and if errors are independent, the ensemble will perform significantly better than its members.
Usually the key to winning ML competitions
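
The two extremes can be checked with the standard expression for the expected squared error of an average of k models. The sketch assumes each model's error has zero mean, variance `v`, and the same covariance `c` with every other model's error; the function name is illustrative.

```python
def ensemble_expected_sq_error(v, c, k):
    """Expected squared error of the average of k models: v/k + (k-1)/k * c."""
    return v / k + (k - 1) / k * c


perfectly_correlated = ensemble_expected_sq_error(v=1.0, c=1.0, k=10)   # no gain
uncorrelated = ensemble_expected_sq_error(v=1.0, c=0.0, k=10)           # v / k
```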

Question 28

Q

drop out

Answer

A

Stochastically disable/mask hidden units
Hidden units cannot co-adapt/conspire with each other
Hidden units must be more generally useful, encode better features.
Have to include a mask vector into backpropagation algorithm math
At test time, use a constant vector of expected values instead of the binary mask vector. If each unit is dropped with probability .5, instead just weight its output by .5
Analogy to sexual reproduction: Half of genes from each parent promotes genes that are robust.
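
The train/test asymmetry can be sketched as below. It follows the scheme on this card (mask during training, scale by the keep probability at test time); many libraries instead rescale by 1/keep_prob during training so that nothing changes at test time. Function names are illustrative.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Multiply each hidden activation by a random 0/1 mask."""
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def dropout_test(activations, keep_prob):
    """Replace the random mask by its expected value."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)
hidden = [2.0, 4.0, 6.0, 8.0]
train_out, mask = dropout_train(hidden, 0.5, rng)
test_out = dropout_test(hidden, 0.5)
```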

Question 29

Q

norm penalties

Answer

A

Add a penalty to the objective function based on the magnitude of the parameters
L1 regularization
- Penalty on sum of absolute values of parameters
- Induces sparse parameterization
L2 regularization
- Penalty on sum of squared parameters
- Large weights get extra punishment
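
Both penalties are one-liners. In this sketch `lam` is a hypothetical name for the penalty strength; the penalty is added to the data loss before computing gradients.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute values of the parameters."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of squared parameters."""
    return lam * sum(w * w for w in weights)


weights = [1.0, -2.0, 3.0]
l1 = l1_penalty(weights, lam=1.0)   # 1 + 2 + 3
l2 = l2_penalty(weights, lam=1.0)   # 1 + 4 + 9: the weight of 3 dominates
```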

Question 30

Q

autoencoders

Answer

A

Feedforward networks trained to reproduce their input.
Supervised training applied to unlabeled data: the input serves as its own target.
Encoder function: h = f(x), or stochastic p(h|x)
Reconstruction: r = g(h), or stochastic p(x|h)

Question 31

Q

undercomplete autoencoders

Answer

A

Smaller hidden layer than visible layer, so it learns salient features of input (in training distribution).
Lossy compression.

Question 32

Q

denoising autoencoders

Answer

A

Add noise layer (Input -> Noise Process p(x’|x) -> Hidden -> Reconstruction)
- Noise could be Gaussian additive noise or randomly zeroing input units
Minimize reconstruction error like other autoencoders, but here noise is added to the input. Goal is to reconstruct the uncorrupted version of the input.
Enlarges receptive fields of hidden units.
- Uses more information from elsewhere in input to reconstruct output.
Learn a good internal representation as a consequence of learning to denoise
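
The two noise processes and the choice of target can be sketched as follows. The autoencoder itself is omitted; the helper names, noise levels, and example vector are assumptions for illustration.

```python
import random


def mask_noise(x, drop_prob, rng):
    """Randomly zero input units."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def gaussian_noise(x, scale, rng):
    """Additive Gaussian noise on every input unit."""
    return [v + rng.gauss(0.0, scale) for v in x]


def squared_error(reconstruction, target):
    return sum((r - t) ** 2 for r, t in zip(reconstruction, target))


rng = random.Random(0)
x = [0.2, 0.9, 0.4, 0.7]
corrupted = mask_noise(x, 0.5, rng)   # this is what the network sees
target = x                            # the loss compares against the clean input
```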

Question 33

Q

contractive autoencoders

Answer

A

We want to extract features that reflect variations in the training input
Add a new term to the loss function that penalizes the norm of the Jacobian of the encoder
- One term keeps reconstructive info
- New term throws away all information
- Satisfying both means we keep just the good reconstructive features.
Minimizing the norm of the Jacobian shrinks the partial derivatives of the encoder. Smaller derivatives mean the encoder output changes less with changes in the input.
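
For a concrete case, assume a sigmoid encoder h = sigmoid(Wx + b), which the card does not specify. The Jacobian entry for hidden unit j and input i is then h_j(1 - h_j)W[j][i], so the squared Frobenius norm can be computed unit by unit. The function name is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def contractive_penalty(W, b, x):
    """Squared Frobenius norm of the Jacobian of h = sigmoid(W x + b) at x."""
    penalty = 0.0
    for row, bias in zip(W, b):
        h = sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
        penalty += (h * (1.0 - h)) ** 2 * sum(w * w for w in row)
    return penalty


penalty = contractive_penalty([[1.0, 0.0]], [0.0], [0.0, 0.0])   # (0.25)^2 * 1
```

The penalty shrinks either when weights get small or when hidden units saturate, since both make the encoder less sensitive to the input.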

Question 34

Q

Why are restricted Boltzmann machines “restricted”?

Answer

A

No lateral connections among visible units or among hidden units.

Question 35

Q

goal of restricted Boltzmann machine

Answer

A

Model distribution over visible units x in terms of hidden units h.

Question 36

Q

basics of energy function

Answer

A

Negated sum of products of weights and units
- E(x,h) = -h^T W x - c^T h - b^T x (c and b are the hidden and visible bias vectors)
Positive weights between co-active units decrease energy
High energy means low probability
- p(x,h) = exp(-E(x,h)) / Z, where Z is the sum of exp(-E) over all states
- exp(anything) is always positive, so there are no zero probability states
Network can settle into an equilibrium or stationary distribution
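
For a network small enough to enumerate, the energies and probabilities can be computed directly. The sketch below assumes binary units, a weight matrix indexed as W[j][i] (hidden j, visible i), and separate bias vectors `b` (visible) and `c` (hidden); the sizes and values are made up.

```python
import itertools
import math


def energy(x, h, W, b, c):
    """E(x, h) = -h^T W x - c^T h - b^T x, with W[j][i] linking h[j] and x[i]."""
    interaction = sum(h[j] * W[j][i] * x[i]
                      for j in range(len(h)) for i in range(len(x)))
    return (-interaction
            - sum(cj * hj for cj, hj in zip(c, h))
            - sum(bi * xi for bi, xi in zip(b, x)))


W = [[2.0, -1.0]]      # one hidden unit, two visible units
b = [0.0, 0.0]         # visible biases
c = [0.0]              # hidden bias

states = [(x, h)
          for x in itertools.product([0, 1], repeat=2)
          for h in itertools.product([0, 1], repeat=1)]
Z = sum(math.exp(-energy(x, h, W, b, c)) for x, h in states)
prob = {s: math.exp(-energy(s[0], s[1], W, b, c)) / Z for s in states}
```

Here the state with the positively connected pair both on has energy -2 and is the most probable; the pair joined by the negative weight has energy +1 and is the least probable.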

Question 37

Q

softplus

Answer

A

f(x) = log(1 + exp(x))

smoothed version of rectified linear unit
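
A direct translation of the formula overflows for large x, so the sketch below uses an algebraically equivalent form. It also includes max(0, x) for comparison, to show that the two agree away from zero.

```python
import math


def softplus(x):
    # log(1 + exp(x)) rewritten as max(x, 0) + log1p(exp(-|x|)) to avoid overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relu(x):
    return max(0.0, x)


at_zero = softplus(0.0)          # log(2), whereas relu(0) is 0
gap_far_right = softplus(20.0) - relu(20.0)
```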

Question 38

Q

rectified linear unit function

Answer

A

f(x) = max(0, x)

One-sided activation function

Question 39

Q

hidden units in restricted Boltzmann machines

Answer

A

Each hidden unit is conditionally independent of the others given an input.
p(h | x) factorizes into a product over hidden units: p(h | x) = product over i of p(h_i | x)
Learning rule works locally
- Only information about x_i and h_j needed to update Wij
- Biologically plausible
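
The factorization can be checked numerically. This sketch assumes binary hidden units with p(h_j = 1 | x) given by a sigmoid of the unit's total input, and a weight matrix indexed as W[j][i]; the names and values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def p_hidden_on(x, W, c):
    """p(h_j = 1 | x) for every hidden unit j; W[j][i] links h[j] and x[i]."""
    return [sigmoid(cj + sum(w * xi for w, xi in zip(row, x)))
            for row, cj in zip(W, c)]


def p_hidden_config(h, x, W, c):
    """p(h | x) as a product of independent per-unit terms."""
    prob = 1.0
    for hj, pj in zip(h, p_hidden_on(x, W, c)):
        prob *= pj if hj == 1 else 1.0 - pj
    return prob


W = [[1.0, -1.0], [0.5, 0.5]]
c = [0.0, 0.0]
x = [1, 0]
total = sum(p_hidden_config([h0, h1], x, W, c)
            for h0 in (0, 1) for h1 in (0, 1))
```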

Question 40

Q

stochastic nature of restricted Boltzmann machines

Answer

A

Unit active with probability related to sigmoid function of inputs
If units are stochastic, then repeated top-down passes reveal distribution of sensory inputs that the model believes in.
- Fantasies in network’s thermal equilibrium show inputs network can generate
There are many ways to generate the observed data, so need to learn a probability distribution over hidden variables.

Question 41

Q

sleep-wake algorithm

Answer

A

Alternating Gibbs sampling to approximate sampling from joint distribution p(v,h)

Start with some random input
Update hidden units in parallel
Update visible units
Repeat until equilibrium
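
The steps above can be sketched as a short sampling loop. It assumes binary units with sigmoid activation probabilities and runs for a fixed number of steps rather than testing for equilibrium; the weight layout W[j][i] (hidden j, visible i) and all names are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_layer(inputs, weights, biases, rng):
    """Sample every unit of one layer in parallel, given the other layer."""
    states = []
    for row, bias in zip(weights, biases):
        p_on = sigmoid(bias + sum(w * v for w, v in zip(row, inputs)))
        states.append(1 if rng.random() < p_on else 0)
    return states


def gibbs_chain(W, b, c, steps, rng):
    W_T = [list(col) for col in zip(*W)]     # weights seen from the visible side
    x = [rng.randint(0, 1) for _ in b]       # start with some random input
    h = []
    for _ in range(steps):
        h = sample_layer(x, W, c, rng)       # update hidden units in parallel
        x = sample_layer(h, W_T, b, rng)     # update visible units in parallel
    return x, h


W = [[2.0, -1.0, 0.5], [0.0, 1.0, -2.0]]     # 2 hidden units, 3 visible units
x_sample, h_sample = gibbs_chain(W, [0.0, 0.0, 0.0], [0.0, 0.0], 50, random.Random(0))
```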

Question 42

Q

weight learning procedure in restricted Boltzmann machines

Answer

A

Clamp input x (input)
Observe which h activate
Clamp activated h units
Observe x activated (reconstruction)
Update based on pairwise correlation of h and x units
- freq_diff: <x_i h_j> of data - <x_i h_j> of reconstruction
- <x_i h_j> is the frequency with which visible unit i and feature j are both on together
- Update weight wij by freq_diff(i,j) * learning rate.
- Hebbian style learning
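
A single-example version of this update might look like the sketch below. It assumes binary units, no biases, sampled (rather than averaged) states, and one extra hidden pass on the reconstruction to get its correlations; these are simplifications, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample(inputs, weights, rng):
    """Sample binary states for one layer given the other (biases omitted)."""
    return [1 if rng.random() < sigmoid(sum(w * v for w, v in zip(row, inputs))) else 0
            for row in weights]


def reconstruction_update(W, x, learning_rate, rng):
    """One weight update from a single input; W[j][i] links h[j] and x[i]."""
    W_T = [list(col) for col in zip(*W)]
    h_data = sample(x, W, rng)              # clamp x, observe which h activate
    x_recon = sample(h_data, W_T, rng)      # clamp h, observe reconstructed x
    h_recon = sample(x_recon, W, rng)       # hidden response to the reconstruction
    return [[W[j][i] + learning_rate * (x[i] * h_data[j] - x_recon[i] * h_recon[j])
             for i in range(len(x))]
            for j in range(len(W))]


W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
new_W = reconstruction_update(W, [1, 0, 1], 0.1, random.Random(0))
```

Averaging the two correlation terms over a batch of inputs gives the frequencies described above instead of a single noisy sample.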

Question 43

Q

sparse coding

Answer

A

Objective function has reconstruction error and L1 regularization term to induce sparseness
Reconstruction is product of dictionary matrix (weights) and hidden units
Great at feature extraction for other algorithms
Unsupervised learning
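
The objective can be written out directly. In this sketch `D` is the dictionary stored row by row, `h` is the code, and `lam` is a hypothetical name for the sparsity weight; the example compares two codes that reconstruct the same input equally well.

```python
def sparse_coding_objective(x, D, h, lam):
    """Squared reconstruction error of x from D h, plus lam times the L1 norm of h."""
    recon = [sum(d * hj for d, hj in zip(row, h)) for row in D]
    error = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    return error + lam * sum(abs(hj) for hj in h)


D = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]          # 2-dimensional inputs, 3 dictionary elements (columns)
x = [1.0, 1.0]
dense = sparse_coding_objective(x, D, [1.0, 1.0, 0.0], lam=0.5)
sparse = sparse_coding_objective(x, D, [0.0, 0.0, 1.0], lam=0.5)
```

Both codes reconstruct x exactly, so the L1 term alone decides between them and favors the code with a single active unit.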

Question 44

Q

relationship between V1 and sparse coding

Answer

A

Sparse coding algorithm trained on patches of images will extract features that are like V1 receptive fields
Edge detectors at different positions, orientations, and spatial frequency
Olshausen and Field, 1996
