Deep Learning Flashcards

1
Q

data for convolutional networks

A

grid-like topology (1D time series and 2D images)

2
Q

distinguishing feature of convolutional networks

A

CNNs use convolution (in place of general matrix multiplication) in at least one layer

3
Q

convolution function

A

integral of the product of two functions (after one is reversed and shifted)

(f * g)(t) = ∫ f(a)g(t-a) da

think of f as a measurement and g as a weighting function that favors the most recent measurements
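
A minimal discrete sketch of the integral above, assuming finite sequences stored as Python lists. The function name `convolve` and the sample values are illustrative, not part of the cards.

```python
def convolve(f, g):
    """Full discrete convolution: out[t] = sum over a of f[a] * g[t - a]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for a, fa in enumerate(f):
        for b, gb in enumerate(g):
            # b plays the role of (t - a), so the product lands at t = a + b
            out[a + b] += fa * gb
    return out


measurements = [1.0, 2.0, 3.0]
weights = [0.5, 0.5]  # weight the current and the previous measurement equally
smoothed = convolve(measurements, weights)
```

Here `smoothed` is a running average of neighboring measurements, which is the "weighted measurement" reading of the formula.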

4
Q

parts of convolution

A

main function: input, n-dimensional array of data

weighting function: kernel, n-dimensional array of parameters to adjust

output: feature map

5
Q

computational features of convolutional networks

A
  1. sparse interactions - kernel usually much smaller than input
  2. tied weights - same set of weights applied throughout the input
  3. equivariant to translation - if the input is translated, the output is translated by the same amount. An event detector on a time series will find the same event, at the correspondingly shifted time, if it’s moved.
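
A small sketch of the third point, assuming a 1D signal and a hypothetical two-tap kernel (all names and values here are made up for illustration). Shifting the input shifts the feature map by the same amount, so the event is still detected, just at its new position.

```python
def correlate_valid(x, k):
    """Slide kernel k over x with no flipping and no padding."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]


kernel = [-1, 1]                  # crude "step up / step down" detector
signal = [0, 0, 1, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 0]   # same event, two steps later

out = correlate_valid(signal, kernel)
out_shifted = correlate_valid(shifted, kernel)

# The response pattern is identical, only moved two positions to the right
same_pattern = out_shifted[2:] == out[:-2]
```

The same two kernel weights are reused at every position (tied weights), and each output depends on only two inputs (sparse interactions).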
6
Q

stacked convolutional layers

A

receptive fields of deep units are larger (but also indirect) compared to receptive fields of shallower units

if layer 2 has a kernel width of 3, then each hidden unit receives input from 3 units.

if layer 3 also has a kernel width of 3, then its hidden units receive indirect input from 5 inputs when the stride is 1, because neighboring receptive fields overlap (9 only if the three layer-2 receptive fields do not overlap)
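
The receptive-field growth described above can be checked with a short calculation. This is a sketch assuming 1D layers without dilation; `receptive_field` and its arguments are illustrative names. With stride 1 throughout, two stacked width-3 layers see 5 inputs, while 9 is reached only when the first layer's windows do not overlap (stride 3).

```python
def receptive_field(kernel_widths, strides=None):
    """Number of input units that can influence one unit after the given layers."""
    strides = strides or [1] * len(kernel_widths)
    field, jump = 1, 1
    for k, s in zip(kernel_widths, strides):
        field += (k - 1) * jump  # each extra kernel tap reaches `jump` inputs further
        jump *= s                # spacing, in input units, between neighboring units
    return field


one_layer = receptive_field([3])
two_layers = receptive_field([3, 3])
two_layers_no_overlap = receptive_field([3, 3], strides=[3, 1])
```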

7
Q

stages of a convolutional layer

A
  1. Convolution stage: convolution produces a set of linear activations
  2. Detector stage: Nonlinear function on linear activations
  3. Pooling stage: Replace output at some location with a summary statistic of nearby units
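
The three stages can be sketched on a 1D input as follows. This assumes a single kernel, ReLU as the detector nonlinearity, and non-overlapping max pooling; those choices and all names are illustrative, since the card does not fix them.

```python
def conv_stage(x, kernel, bias=0.0):
    """Stage 1: linear activations from sliding the kernel over the input."""
    n = len(x) - len(kernel) + 1
    return [sum(x[i + j] * kernel[j] for j in range(len(kernel))) + bias
            for i in range(n)]


def detector_stage(z):
    """Stage 2: elementwise nonlinearity (ReLU here)."""
    return [max(0.0, v) for v in z]


def pooling_stage(a, width=2):
    """Stage 3: summarize each non-overlapping window by its maximum."""
    return [max(a[i:i + width]) for i in range(0, len(a) - width + 1, width)]


x = [0.0, 1.0, 3.0, 2.0, 0.0, 1.0]
z = conv_stage(x, [-1.0, 1.0])
a = detector_stage(z)
p = pooling_stage(a)
```

Note that this pooling drops a trailing activation when the length is not a multiple of the window width.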
8
Q

pooling and translation

A

small changes in location won’t make big changes to the summary statistics in the regions that are pooled together

pooling makes the network approximately invariant to small translations
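
A minimal illustration, assuming max pooling over non-overlapping windows of width 3 and made-up detector outputs. The invariance is only approximate: a shift that pushes a peak across a window boundary would change the pooled value.

```python
def max_pool(a, width=3):
    """Maximum over each non-overlapping window of the given width."""
    return [max(a[i:i + width]) for i in range(0, len(a) - width + 1, width)]


detector = [0.0, 1.0, 0.0, 0.0, 0.2, 0.0]
nudged = [0.0, 0.0, 1.0, 0.0, 0.0, 0.2]   # every value moved one step right

pooled = max_pool(detector)
pooled_nudged = max_pool(nudged)          # same summary despite the shift
```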

9
Q

what convolution hard codes

A

the concept of a topology

(non-convolutional models would have to discover the topology during learning)

10
Q

local connection (as opposed to convolution)

A

like a convolution with a kernel width (patch size) of n, except with no parameter sharing.

each unit has a receptive field of n, but the incoming weights don’t have to be the same in every receptive field.
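
To make the contrast concrete, here is a sketch of both layer types on a 1D input. The function names and weight values are hypothetical; the only point is that the locally connected layer stores one weight patch per output unit.

```python
def conv1d(x, kernel):
    """Convolutional layer: one kernel shared by every output unit."""
    n = len(x) - len(kernel) + 1
    return [sum(x[i + j] * kernel[j] for j in range(len(kernel))) for i in range(n)]


def locally_connected1d(x, kernels):
    """Locally connected layer: kernels[i] is the private patch of output unit i."""
    width = len(kernels[0])
    assert len(kernels) == len(x) - width + 1
    return [sum(x[i + j] * kernels[i][j] for j in range(width))
            for i in range(len(kernels))]


x = [1, 2, 3, 4]
shared = [1, -1]                        # 2 parameters in total
private = [[1, -1], [2, 0], [0, 3]]     # 3 units * 2 weights = 6 parameters

conv_out = conv1d(x, shared)
local_out = locally_connected1d(x, private)
```

Both layers have the same connectivity (each output sees 2 inputs), so the difference is purely in parameter count and sharing.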

11
Q

iterated pixel labelling

A

suppose a convolution step provides a label for each pixel. repeatedly applying the convolution to the labels creates a recurrent convolutional network.

repeated convolutional layers with weights shared across layers are a kind of recurrent network.

12
Q

why convolutional networks can handle different input sizes

A

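the same kernel is applied a different number of times depending on the input size, so the output of each convolution step scales with the input. when a fixed-size output is needed, the size can be normalized, for example by a pooling layer whose regions scale with the input size.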

13
Q

convolutions for 2D audio

A

convolutions over time: equivariant to shifts in time

convolutions over frequency: equivariant to shifts in frequency.

14
Q

primary visual cortex

A
  • V1 has a 2D structure matching the 2D structure of retinal image
  • Simple cells inspired detectors in CNNs and respond to features in small localized receptive fields
  • Complex cells inspired pooling units. They also respond to features but are invariant to small changes in input position.
  • Inferotemporal cortex responds like last layer in CNN
15
Q

differences between human vision and convolutional networks

A
  • Human vision is low resolution outside of fovea. CNNs have full-resolution over whole image
  • Vision integrates with other senses
  • Top down processing happens in human system
  • Human neurons likely have different activation and pooling functions
16
Q

regularization

A
  • Modifications to training regime to prevent overfitting
  • Increasing training error for reduced testing error
  • Trading increased bias for reduced variance
17
Q

dataset augmentation strategies

A
  • Adding transformations to training input (e.g., translating images a few pixels)
  • Adding random noise to input data
    • Model needs to find regions insensitive to small perturbations
    • Not just a local minimum but a local plateau
  • Adversarial training
    • Create inputs that the network will probably misclassify
18
Q

noise robustness

A
  • Adding noise to hidden units or weights
    • Noise on weights captures uncertainty about parameter estimates
  • Adding noise to output units
    • Assume x% of labels are wrong, so model doesn’t overfit on bad training data
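
One common way to put noise on the output targets is usually called label smoothing. The sketch below assumes a k-class problem and an assumed error rate `eps`; the function name and values are illustrative.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft targets: 1 - eps on the labeled class, eps spread over the others."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if c == true_class else off for c in range(num_classes)]


targets = smooth_labels(true_class=2, num_classes=4, eps=0.1)
```

Because no target is exactly 0 or 1, the model is not pushed toward ever more extreme outputs to fit a label that may be wrong.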
19
Q

semi-supervised training

A
  • Use both labeled P(x,y) and unlabeled P(x) data to estimate P(y|x)
  • Want to learn a latent representation
  • Have the generative model share parameters (a representation) with the discriminative model
  • Like having a prior that the structure of P(x) is connected to the structure of P(y|x)
20
Q

multi-task learning

A
  • Have the model do different kinds of tasks
  • Assume there is a set of factors that account for variance in input and that these factors are shared by different tasks. (Each task uses a subset of these factors.)
  • Shared part of model should have good values because it must generalize across tasks
  • Common architecture
    • Input layer
    • Shared representation layers
    • Task specific representation layers
    • Output layers
21
Q

early stopping

A
  • As you overfit a model, training error continues to decrease and testing error starts to rise.
  • So just stop at the minimum of the test (held-out) error
  • Think of the number of training steps as a hyperparameter
    • Similar to L2 regularization, but instead of training several models to find optimal L2 value, we learn the optimal number of steps during training.
  • Requires extra data for testing.
    • Alternatively, remember number of steps to the minimum and retrain on training+test data but stop after the optimal number of steps
  • Restricts model space to a smaller volume of parameter space
    • If learning rate is R, can only explore R*n_steps of space
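
The stopping rule can be sketched as below. This assumes the held-out error is recorded after every step and adds a `patience` parameter (a common convention, not stated on the card) so that a single noisy evaluation does not end training.

```python
def early_stop(val_errors, patience=2):
    """Return (best_step, best_error) for a sequence of held-out errors.

    Training is treated as stopped once `patience` consecutive evaluations
    fail to improve on the best error seen so far.
    """
    best_step, best_err, bad = 0, float("inf"), 0
    for step, err in enumerate(val_errors, start=1):
        if err < best_err:
            best_step, best_err, bad = step, err, 0
        else:
            bad += 1
            if bad >= patience:
                break  # later steps are never evaluated
    return best_step, best_err


val_curve = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.39]
best_step, best_err = early_stop(val_curve, patience=2)
```

The returned step count is the learned "hyperparameter" that could be reused when retraining on all the data.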
22
Q

parameter tying and sharing

A
  • Sharing: Force groups of parameters within a model to be equal
  • Tying: Force parameters to be like parameters in another model
23
Q

sparse representations (regularization)

A
  • Penalize activation of the hidden units
  • Many of the elements of the (hidden) representation are zero-ish.
24
Q

bagging

A
  • Bootstrap aggregating
  • Bootstrap sample k new training datasets and train k new models
  • On average about two thirds (roughly 63%) of the original examples will appear in each new data set
  • Typically 5-10 models in an ensemble
  • Hard to train so many large networks
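
A sketch of the resampling step, assuming sampling with replacement to the original data set size. The helper names, seed, and sizes are arbitrary. For large n the expected fraction of distinct examples is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 63%, close to the two thirds quoted above).

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]


def expected_unique_fraction(n):
    """Expected share of the original examples present in one resample."""
    return 1.0 - (1.0 - 1.0 / n) ** n


rng = random.Random(0)
data = list(range(1000))
resamples = [bootstrap_sample(data, rng) for _ in range(5)]   # k = 5 data sets
fractions = [len(set(s)) / len(data) for s in resamples]
```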
25
Q

Why can NNs benefit from averaging?

A
  • They reach many different solutions
  • Differences in
    • Random initialization,
    • Random train batch selection,
    • Hyperparameters
    • Outcomes of non-determinism
26
Q

ensemble methods

A
  • Combining several models
  • Train several models separately and let them vote on the output
27
Q

model averaging

A
  • Works because not all models will make the same errors at test time
  • If errors are perfectly correlated, no advantage to averaging
  • If errors are perfectly uncorrelated, expected squared error of the ensemble shrinks in proportion to 1/(ensemble size)
  • On average, ensemble will perform at least as well as any of its members and if errors are independent, the ensemble will perform better than its members.
  • Usually the key to winning ML competitions
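
The two extreme cases can be checked numerically. This sketch assumes zero-mean errors where each model's error has variance `v` and every pair of models has error covariance `c`; both symbols are introduced here for illustration. Under that assumption the expected squared error of the averaged prediction is v/k + (k-1)/k * c.

```python
def ensemble_squared_error(v, c, k):
    """Expected squared error of the average of k models."""
    return v / k + (k - 1) / k * c


perfectly_correlated = ensemble_squared_error(v=1.0, c=1.0, k=10)  # c equals v
uncorrelated = ensemble_squared_error(v=1.0, c=0.0, k=10)
```

With perfectly correlated errors the result stays at `v`, and with uncorrelated errors it drops to `v / k`.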
28
Q

dropout

A
  • Stochastically disable/mask hidden units
  • Hidden units cannot co-adapt/conspire with each other
  • Hidden units must be more generally useful, encode better features.
  • Have to include a mask vector into backpropagation algorithm math
  • At test time, use a constant vector of expectations instead of a binary mask vector. If you have .5 probability of dropping a unit, instead just weight it by .5
  • Analogy to sexual reproduction: Half of genes from each parent promotes genes that are robust.
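
A sketch of the training-time mask and the test-time scaling described above, assuming a keep probability of 0.5 and a plain list of hidden activations (names, seed, and values are illustrative). Averaging many masked passes approaches the scaled test-time output.

```python
import random


def dropout_train(h, keep_prob, rng):
    """Training pass: multiply each hidden activation by a random 0/1 mask."""
    mask = [1 if rng.random() < keep_prob else 0 for _ in h]
    return [hi * m for hi, m in zip(h, mask)], mask


def dropout_test(h, keep_prob):
    """Test pass: replace the mask by its expectation."""
    return [hi * keep_prob for hi in h]


rng = random.Random(0)
h = [2.0, 4.0, 6.0, 8.0]
scaled = dropout_test(h, 0.5)

n = 5000
avg = [0.0] * len(h)
for _ in range(n):
    sample, _mask = dropout_train(h, 0.5, rng)
    avg = [a + s / n for a, s in zip(avg, sample)]
```

Many libraries instead divide by the keep probability during training so that nothing changes at test time; that variant is not shown here.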
29
Q

norm penalties

A
  • Add a penalty to objective function based on magnitude of parameters
  • L1 regularization
    • Penalty on sum of absolute values of parameters
    • Induces sparse parameterization
  • L2 regularization
    • Penalty on sum of squared parameters
    • Large weights get extra punishment
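
The two penalties can be written directly. The sketch assumes a flat list of weights and a regularization strength `lam`; both names and the sample values are illustrative.

```python
def l1_penalty(weights, lam):
    """lam * sum of absolute values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum of squares."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -0.5, 3.0]
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)

# Share of each penalty that comes from the single large weight
l1_share = abs(weights[2]) / sum(abs(w) for w in weights)
l2_share = weights[2] ** 2 / sum(w * w for w in weights)
```

The large weight accounts for 75% of the L1 penalty but about 95% of the L2 penalty, which is the "extra punishment" noted above.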
30
Q

autoencoders

A
  • Feedforward networks trained to reproduce their input.
  • Supervised approach to unlabeled data.
  • Encoder function: h = f(x), or stochastic p(h|x)
  • Reconstruction: r = g(h), or stochastic p(x|h)
31
Q

undercomplete autoencoders

A
  • Smaller hidden layer than visible layer, so it learns salient features of input (in training distribution).
  • Lossy compression.
32
Q

denoising autoencoders

A
  • Add noise layer (Input -> Noise Process p(x’|x) -> Hidden -> Reconstruction)
    • Noise could be Gaussian additive noise or randomly zeroing input units
  • Minimize reconstruction error like other autoencoders, but here noise is added to input. Goal is to reconstruct the uncorrupted version of the input.
  • Enlarges receptive fields of hidden units.
    • Uses more information from elsewhere in input to reconstruct output.
  • Learn a good internal representation as a consequence of learning to denoise
33
Q

contractive autoencoders

A
  • We want to extract features that reflect variations in the training input
  • Add new term to loss function penalizing the norm of the encoder's Jacobian
    • One term keeps reconstructive info
    • New term throws away all information
    • Satisfying both means we have just the good reconstructive features.
  • Minimizing the Jacobian penalty shrinks the partial derivatives of the encoder. Smaller derivatives mean the encoder output changes less with changes in input.
34
Q

Why are restricted Boltzmann machines “restricted”?

A

No lateral connections among visible units or among hidden units.

35
Q

goal of restricted Boltzmann machine

A

Model distribution over visible units x in terms of hidden units h.

36
Q

basics of energy function

A
  • Negated sum of products of weights and units
    • -hWx - bias_h*h - bias_x*x
  • Positive weights between active units decrease energy
  • High energy means low probability
    • p(x) = exp(-ENERGY(x)) / Z, where Z normalizes over all states
    • exp(anything) is always positive, so there are no zero probability states
  • Network can settle into an equilibrium or stationary distribution
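
A tiny worked example, assuming binary units, two visible units, one hidden unit, and made-up weights. Enumerating every state shows how the sign of a weight moves the energy and therefore the probability once the exponentials are normalized.

```python
import math
from itertools import product

W = [[2.0, -1.0]]   # W[j][i]: weight between hidden unit j and visible unit i
b = [0.0, 0.0]      # visible biases
c = [0.0]           # hidden biases


def energy(x, h):
    """E(x, h) = -sum_ji h_j W_ji x_i - sum_j c_j h_j - sum_i b_i x_i."""
    interaction = sum(h[j] * W[j][i] * x[i]
                      for j in range(len(h)) for i in range(len(x)))
    return (-interaction
            - sum(cj * hj for cj, hj in zip(c, h))
            - sum(bi * xi for bi, xi in zip(b, x)))


states = [(x, h) for x in product([0, 1], repeat=2) for h in product([0, 1], repeat=1)]
Z = sum(math.exp(-energy(x, h)) for x, h in states)          # normalizer
prob = {s: math.exp(-energy(*s)) / Z for s in states}
most_likely = max(prob, key=prob.get)
```

The state that switches on both ends of the positive weight has the lowest energy and the highest probability, and every state keeps a nonzero probability.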
37
Q

softplus

A

f(x) = log(1 + exp(x))

smoothed version of rectified linear unit

38
Q

rectified linear unit function

A

f(x) = max(0, x)

One-sided activation function
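
Both activation functions (this card and the softplus card before it) fit in a few lines. The softplus below is algebraically equal to log(1 + exp(x)) but rearranged, as an implementation choice of this sketch, so that `exp` never overflows for large inputs.

```python
import math


def relu(x):
    """max(0, x)"""
    return max(0.0, x)


def softplus(x):
    """log(1 + exp(x)), written as max(0, x) + log(1 + exp(-|x|))."""
    return max(0.0, x) + math.log1p(math.exp(-abs(x)))


gap_at_zero = softplus(0.0) - relu(0.0)     # largest gap between the two, log(2)
far_right = softplus(1000.0)                # no overflow, essentially equal to relu
far_left = softplus(-1000.0)                # essentially zero
```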

39
Q

hidden units in restricted Boltzmann machines

A
  • Hidden units are conditionally independent of one another given an input.
  • p(h | x) factorizes into product of each hidden unit activating, reduce(*, p(h_i | x))
  • Learning rule works locally
    • Only information about x_i and h_j needed to update Wij
    • Biologically plausible
40
Q

stochastic nature of restricted Boltzmann machines

A
  • Unit active with probability related to sigmoid function of inputs
  • If units are stochastic, then repeated top-down passes reveal distribution of sensory inputs that the model believes in.
    • Fantasies in network’s thermal equilibrium show inputs network can generate
  • There are many ways to generate the observed data, so need to learn a probability distribution of hidden variables.
41
Q

sleep-wake algorithm

A

Alternating Gibbs sampling to approximate sampling from joint distribution p(v,h)

  1. Start with some random input
  2. Update hidden units in parallel
  3. Update visible units
  4. Repeat until equilibrium
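
The four steps can be sketched for a small binary RBM. This assumes sigmoid conditionals, made-up weights, and a fixed number of sweeps standing in for "until equilibrium"; all names are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_hidden(x, W, c, rng):
    """Sample all h_j in parallel from p(h_j = 1 | x)."""
    probs = [sigmoid(c[j] + sum(W[j][i] * x[i] for i in range(len(x))))
             for j in range(len(c))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, b, rng):
    """Sample all x_i in parallel from p(x_i = 1 | h)."""
    probs = [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
             for i in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_chain(W, b, c, steps, rng):
    x = [rng.randint(0, 1) for _ in b]      # 1. start with some random input
    h = sample_hidden(x, W, c, rng)         # 2. update hidden units in parallel
    for _ in range(steps):                  # 4. repeat (fixed count, an assumption)
        x = sample_visible(h, W, b, rng)    # 3. update visible units
        h = sample_hidden(x, W, c, rng)     # 2. update hidden units again
    return x, h


W = [[4.0, 4.0, -4.0]]   # 1 hidden unit, 3 visible units
b = [0.0, 0.0, 0.0]
c = [0.0]
x_sample, h_sample = gibbs_chain(W, b, c, steps=10, rng=random.Random(0))
```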
42
Q

weight learning procedure in restricted Boltzmann machines

A
  • Clamp input x (input)
  • Observe which h activate
  • Clamp activated h units
  • Observe x activated (reconstruction)
  • Update based on pairwise correlation of h and x units
    • freq_diff: <x_i h_j> of data - <x_i h_j> of reconstruction
    • <x_i h_j> is frequency that feature j and visible unit i are both on together
    • Update weight wij by freq_diff(i,j) * learning rate.
    • Hebbian style learning
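
A sketch of the weight update for a single training example. It assumes the hidden activations are recomputed from the reconstruction (`h_recon`), as in the usual one-step contrastive divergence recipe, and it uses single-example co-activations in place of averaged frequencies. The function name and values are illustrative.

```python
def cd1_update(W, x_data, h_data, x_recon, h_recon, lr):
    """Update W[j][i] by lr * (x_i h_j on data - x_i h_j on reconstruction)."""
    for j in range(len(W)):
        for i in range(len(W[j])):
            freq_diff = x_data[i] * h_data[j] - x_recon[i] * h_recon[j]
            W[j][i] += lr * freq_diff
    return W


W = [[0.0, 0.0]]          # 1 hidden unit, 2 visible units
x_data, h_data = [1, 0], [1]      # clamped input and the hidden unit it activated
x_recon, h_recon = [0, 1], [1]    # reconstruction and the hidden unit it activates
W = cd1_update(W, x_data, h_data, x_recon, h_recon, lr=0.1)
```

The weight to the visible unit that was on in the data grows, and the weight to the unit that was only on in the reconstruction shrinks. Only `x_i` and `h_j` are needed for each `W[j][i]`, which is the local, Hebbian flavor of the rule.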
43
Q

sparse coding

A
  • Objective function has reconstruction error and L1 regularization term to induce sparseness
  • Reconstruction is product of dictionary matrix (weights) and hidden units
  • Great at feature extraction for other algorithms
  • Unsupervised learning
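
The objective can be evaluated for a single input as follows. This sketch assumes squared reconstruction error and a hypothetical overcomplete dictionary `D`; it only scores candidate codes and does not perform the optimization.

```python
def sparse_coding_objective(x, D, h, lam):
    """Squared error of x against D h, plus lam times the L1 norm of h."""
    recon = [sum(D[i][k] * h[k] for k in range(len(h))) for i in range(len(x))]
    err = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    return err + lam * sum(abs(hk) for hk in h)


D = [[1.0, 0.0, 1.0],      # 2 input dimensions, 3 dictionary elements
     [0.0, 1.0, 1.0]]
x = [1.0, 1.0]

dense_code = [1.0, 1.0, 0.0]    # reconstructs x exactly using two elements
sparse_code = [0.0, 0.0, 1.0]   # reconstructs x exactly using one element

dense_cost = sparse_coding_objective(x, D, dense_code, lam=0.1)
sparse_cost = sparse_coding_objective(x, D, sparse_code, lam=0.1)
```

Both codes reconstruct the input perfectly, so the L1 term alone makes the sparser code cheaper.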
44
Q

relationship between V1 and sparse coding

A
  • Sparse coding algorithm trained on patches of images will extract features that are like V1 receptive fields
  • Edge detectors at different positions, orientations, and spatial frequency
  • Olshausen and Field, 1996