Neural networks Flashcards
Explain what a linear sparse coding model is and write its loss function.
Sparse distributed coding: each entity is encoded by strong activation of a relatively small set of neurons.
Sparse coding: it imposes a penalty on the number of active units to encourage independence. We select a small number of basis functions from a large dictionary, so that the basis-function code is overcomplete. We define sparsity as having few non-zero components, or few components that are far from zero: given an input vector, we would like only a few coefficients to be far from zero. This method is optimized iteratively.
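A standard way to write the loss (the exact notation may differ from the original slides; a common textbook form uses a squared reconstruction error plus an L1 sparsity penalty) is:
\[ E(\mathbf{a}) = \Big\lVert \mathbf{x} - \sum_i a_i \boldsymbol{\phi}_i \Big\rVert_2^2 + \lambda \sum_i |a_i| \]
where x is the input, the phi_i are the basis functions of the overcomplete dictionary, the a_i are the coefficients, and lambda weights the sparsity penalty.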
Derive the delta rule.
The delta rule is the first step to improve the perceptron and consists in replacing the activation function with a linear one, which is continuous and differentiable. Then we can use gradient descent to minimize the loss function, obtaining the following formulas.
(see the formula)
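A standard derivation (assuming a single linear unit y = sum_i w_i x_i and a squared-error loss) is:
\[ E = \tfrac{1}{2}(t - y)^2, \qquad \frac{\partial E}{\partial w_i} = -(t - y)\, x_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta\,(t - y)\, x_i \]
where t is the target, y the output and eta the learning rate.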
Note that the perceptron changes the inclination of the separating line abruptly, whereas the delta rule changes it slightly, gradually modifying the weights until the optimal ones are found.
Weight initialization.
In general, we can initialize the synaptic weights by randomly sampling from a Gaussian N(0,1) or a uniform U(-1,1). This is done to break the symmetry: if two hidden units are connected to the same inputs, their initial parameters should be different, otherwise they will converge to the same value. With this method we could use larger values to break the symmetry more strongly, but because of activation function saturation we cannot initialize the weights with values that are too large. We can improve the weight initialization using a heuristic that improves SGD convergence.
(see the formula)
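A common heuristic of this kind is the Glorot (Xavier) uniform initialization (likely the formula referred to here, although the slides may use a variant):
\[ W_{ij} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\; +\sqrt{\frac{6}{m+n}}\right) \]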
where m is the number of inputs (called fan-in) and n is the number of outputs (called fan-out) of each layer. However, if many units are connected to a single neuron, this might lead to very small weight values. In that case we can use sparse initialization, where each unit has k non-zero weights (a parameter decided a priori) drawn from a given distribution. In this way we can control the total amount of activation as we increase the number of units (m and n in the previous formula) without shrinking the weights' values.
Regularization.
The model capacity is limited by imposing a norm penalty on the parameters:
(see formula)
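In a standard formulation the regularized objective is:
\[ \tilde{J}(\theta) = J(\theta) + \alpha\, \Omega(\theta) \]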
where the hyperparameter alpha defines the contribution of the penalty term to the overall loss function. Usually we regularize only the weights. With different types of regularization we obtain different effects on the network: with the L2 norm we obtain weight decay (ridge regression), while with the L1 norm we induce coefficient sparsity (Lasso regression).
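The two penalties mentioned above are usually written as:
\[ \Omega_{L2}(\mathbf{w}) = \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2, \qquad \Omega_{L1}(\mathbf{w}) = \lVert \mathbf{w} \rVert_1 = \sum_i |w_i| \]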
Dropout.
Dropout consists in removing input or hidden units while processing each pattern. The goal is to regularize each unit so that it learns not just one feature, but a feature that is useful in many contexts. The simplest implementation multiplies the output of the activation function by 0 according to a binary mask, obtained by randomly sampling the network structure with a probability p, which is a hyperparameter.
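A minimal sketch of this masking (inverted dropout in NumPy; the function and variable names are illustrative, and here p_drop is the probability of dropping a unit, so the survivors are rescaled by 1/(1 - p_drop) to keep the expected activation unchanged):

    import numpy as np

    def dropout_forward(activations, p_drop, training=True, rng=None):
        # At test time (or with p_drop = 0) the layer is left untouched.
        if not training or p_drop == 0.0:
            return activations
        rng = np.random.default_rng() if rng is None else rng
        # Binary mask: each unit is kept with probability (1 - p_drop).
        mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
        # Inverted dropout: rescale survivors so the expected output is unchanged.
        return activations * mask / (1.0 - p_drop)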
Momentum.
Momentum helps to speed up the convergence of gradient descent, especially if the Hessian matrix is poorly conditioned (i.e. the loss function has elongated ellipsoids as level curves). We introduce a velocity term v, which contains both the direction and the speed of the motion in the optimization space.
(see formulas)
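A standard form of the update (following the usual textbook notation) is:
\[ \mathbf{v} \leftarrow \alpha\, \mathbf{v} - \eta\, \nabla_{\theta} J(\theta), \qquad \theta \leftarrow \theta + \mathbf{v} \]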
The velocity is an exponentially decaying moving average of past gradients, which is added to the current update. The term v is multiplied by a hyperparameter alpha, which has to be tuned. The informal idea is to see the optimization space as a valley, where we go faster while we keep moving downhill.
How does Adam work? Write the equations.
Adam is an adaptive learning rate method that computes an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients.
(see formulas)
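The standard Adam equations (with gradient g_t, decay rates beta_1 and beta_2, and a small epsilon for numerical stability) are:
\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \]
\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \]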
where eta is the learning rate.
Why are second-order optimization methods challenging? What is the learning goal? Condition number of the Hessian matrix.
Second-order derivatives give more information about the cost function than first-order ones. In fact, if there is negative curvature, the cost function decreases faster than the gradient predicts; if there is positive curvature, the cost function decreases more slowly than predicted by the gradient and eventually begins to increase. If there is no curvature, the gradient predicts the decrease correctly. The challenge is that we have to compute the Hessian matrix, which is essentially the Jacobian of the gradient.
For the Hessian matrix we can define the condition number as the ratio between its largest and smallest eigenvalues. If this ratio is very large, gradient descent performs poorly, because the derivative increases rapidly in one direction and slowly in the others. The learning goal is then to guide SGD or GD using the information contained in the Hessian matrix.
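In symbols, for a Hessian H with eigenvalues lambda_i:
\[ \kappa(H) = \frac{\max_i |\lambda_i|}{\min_i |\lambda_i|} \]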
Kernels in CNN.
A kernel in a CNN encodes a specific feature through a local receptive field, which is defined for each hidden neuron. Each kernel is convolved with the entire image, as if we had multiple copies of the same neuron with the same weights, so different portions of the image are analysed using the same parameters. The output is therefore not a 1-dimensional element but a 2-dimensional one, because each layer produces a filtered version of its input. The kernel stride is the step by which the kernel moves. To analyse a portion of the image more accurately, or to avoid shrinking the output size, we can add a padding of zeros around the input. Different kernels generate different outputs for the same input. If the input has multiple channels, we can define a kernel for each channel and analyse them separately.
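As a concrete relation (a standard formula, assuming square inputs and kernels), the output size of a convolution with input size i, kernel size k, padding p and stride s is:
\[ o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1 \]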
Explain adversarial attacks and defence.
Adversarial attacks consist in injecting a small amount of noise into the input (in some cases indistinguishable to the human eye), in the direction of the gradient that maximizes the loss function. In contrast to gradient descent, we change the input in order to fool the network. During training the model traces the decision boundaries for the classification task; adversarial samples try to move towards these boundaries and, in some cases, to cross them, and adversarial training uses such samples to correct the boundaries with respect to the original data. To defend against this type of attack we can use adversarial training (optimization of a min-max objective while systematically attacking the model) or randomized smoothing (smoothing the classifier by convolving it with Gaussian noise).
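A common concrete instance of such an attack is the fast gradient sign method (not named in the original card, added here as an example), and adversarial training is usually written as a min-max problem:
\[ \tilde{\mathbf{x}} = \mathbf{x} + \epsilon\, \operatorname{sign}\!\big(\nabla_{\mathbf{x}} L(\theta, \mathbf{x}, y)\big), \qquad \min_{\theta}\; \mathbb{E}_{(\mathbf{x},y)} \Big[ \max_{\lVert \boldsymbol{\delta} \rVert \le \epsilon} L(\theta, \mathbf{x} + \boldsymbol{\delta}, y) \Big] \]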
Receptive fields.
The receptive field of a neuron is the size of the input region that is processed to compute its feature; it relates the output feature to the region of the input it depends on.
Explain backpropagation through time.
Exploiting time unfolding, a recurrent neural network can be seen as a deep network with parameter sharing. We can then break the sequence of layers into smaller pieces and compute the gradient on these pieces, rather than computing it on the whole unfolded network at the end. In practice, we process the first time step and create the first hidden activations, then we process the next time step and, using the previous context values, create the second hidden activations, and so on until we reach the last state, which encodes the entire sequence. Computing the gradient on the single parts also helps to avoid vanishing gradients.
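A standard way to write the unfolded recurrence (assuming a vanilla RNN) is:
\[ \mathbf{h}_t = f(W \mathbf{h}_{t-1} + U \mathbf{x}_t + \mathbf{b}), \qquad \mathbf{y}_t = g(V \mathbf{h}_t + \mathbf{c}) \]
where the same parameters W, U and V are shared across all time steps.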
LSTM diagram, how the gates work, their equations, and why you need both sigmoid and tanh.
(see figure)
The LSTM's goal is to selectively block, add or remove information in the memory of the network. To do that, it implements a system of gates that filter the information flow:
o The input gate, which decides whether the input will flow into the memory cell or be dropped; (see formulas)
o The forget gate, which looks at the input and the previous hidden state and decides whether the previous cell content will be remembered or dropped; (see formulas)
o The output gate, which decides whether the current value (the input combined with the previous state) will be dropped or will flow into the next state. (see formulas)
The cell state equation is:
(see formula)
and serves to propagate the information over time.
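A common formulation of the gates and state updates referenced above (the notation may differ from the original figure) is:
\[ \mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i), \quad \mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f), \quad \mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o) \]
\[ \tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c), \quad \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \]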
The sigmoid is used to decide whether information is dropped or allowed to flow (its output lies between 0 and 1, acting as a soft switch), while the tanh produces the candidate values in the range [-1, 1], so it decides whether information is added to or removed from the cell state.
Transformers and attention mechanism.
The attention mechanism works within an encoder-decoder framework: the encoder passes all its hidden states (one per time step) to the decoder, and the decoder gives each of them a score that amplifies the most relevant parts. At each decoding time step a context vector is generated: take the encoder hidden states, give a score to each one, apply a softmax to the scores, multiply each hidden state by its softmaxed score, and sum all the weighted states. The decoder then takes this representation as input. In this way the network can learn the sentence structure, giving importance to specific parts.
Transformers use the attention mechanism in a different way. The encoder takes the input, computes an embedding and combines it with a positional encoding (which tries to incorporate positional information into a finite-size vector), then applies multi-head attention (attention over multiple representation subspaces), then uses a feed-forward layer to process all input elements at the same time and, at the end, passes its hidden states to the decoder. The decoder takes as input the previous output, computes an embedding and combines it with a positional encoding, then applies multi-head attention to it, combines the result with the hidden states passed by the encoder through a further attention block, and processes everything with a feed-forward layer. The decoder produces the output sequentially and passes this output, "shifted right", to the next step, to embed the passage of time.
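The core operation used in both encoder and decoder is the scaled dot-product attention (standard transformer formulation):
\[ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]
where Q, K and V are the query, key and value matrices and d_k is the key dimension; multi-head attention applies this operation in parallel with different learned projections.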
Universal approximation properties of ANN: specifically, cases of RNN and RBM.
The RNN universal approximation properties are:
o For any computable function there exists a finite RNN that computes it.
o Any finite-time trajectory of a given n-dimensional dynamical system can be approximately realized by the states of the output units of a continuous-time RNN with n output units, some hidden units and appropriate initial conditions;
o Any open dynamical system can be approximated with arbitrary accuracy by an RNN defined in state-space model form.
The RBM universal approximation properties are:
o It can be proved that RBMs are universal approximators of probability mass functions over discrete variables;
o Temporal extensions of RBMs are universal approximators of stochastic processes with finite time dependence.
Overcomplete code in autoencoder.
Autoencoders are deterministic models trained with error backpropagation, which aim to reconstruct the input pattern. To implement an overcomplete code we set the number of hidden units equal to or greater than the number of visible ones. In this case the network learns how to copy the input, but given a new pattern not present in the training set it could perform badly. An undercomplete code, instead, is implemented by setting the number of hidden units lower than the number of visible ones; in this case the network can learn a set of features useful to reproduce the input pattern.
Gibbs sampling, Markov blanket.
A Markov network is an undirected graphical model where edges represent affinity relations between nodes, and each edge is associated with a factor that indicates the correlation strength between the connected units. In this model we can define the Markov blanket of a node as the set of its immediate neighbours; conditioning on the blanket makes the node independent from all nodes outside it. Using the Markov chain method we can start by generating an arbitrary sample and then iteratively resample all the unobserved variables. Rather than resampling the whole network at once, we can sample one neuron at a time, conditioning on all the others at each iteration. To speed up this process we can use Gibbs sampling, which ignores all variables outside the Markov blanket of the node and samples at the same time all the variables that are conditionally independent.
Energy of a Hopfield network.
A Hopfield network is an undirected fully connected graph, where all neurons are visible and each one corresponds to an input unit. In this architecture there are no self-loops. The energy function that drives the model is: (see formula)
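A standard form of the energy (for binary states s_i, symmetric weights w_ij and thresholds theta_i) is:
\[ E = -\tfrac{1}{2} \sum_{i \neq j} w_{ij}\, s_i s_j + \sum_i \theta_i\, s_i \]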
Talk about Hopfield networks (training).
The attractors correspond to the local minima of the energy; to minimize it we can either run inference on the neurons' activations, or change the weights during training so as to create attractors in correspondence with the training patterns. Each neuron's value is computed as a weighted sum of all the others, filtered through a hard threshold. The update phase can be done asynchronously or synchronously. If the weights have proper values, the activations will always converge to a stable state, where the energy is minimized.
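In symbols, a standard update rule and Hebbian storage rule (assuming bipolar states +-1, N units and P training patterns x^mu) are:
\[ s_i \leftarrow \operatorname{sign}\!\Big( \sum_{j} w_{ij}\, s_j - \theta_i \Big), \qquad w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} x_i^{\mu} x_j^{\mu} \]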
How can sampling be improved using temperature (simulated annealing)? Activation function in a Boltzmann machine, including the temperature parameter.
Boltzmann machines use, as activation function for their binary units, the stochastic function defined for the stochastic Hopfield model. In this case we can run sampling in a data-driven way (the visible units are clamped to an input pattern, the hidden units are initialized at random, and then Gibbs sampling is applied) or in a model-driven way (start with random activations for all units and then apply Gibbs sampling), and combine either of these with simulated annealing, which schedules the temperature T from high to low values to speed up convergence.
Simulated annealing (with a formula example, stochastic Hopfield with the meaning of the T parameter).
In Hopfield networks we can get stuck in local minima. To avoid that, we can replace the deterministic activation function with a stochastic one: (see formula)
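A standard form of this stochastic activation (where Delta E_i is the energy gap of unit i) is:
\[ p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}}, \qquad \Delta E_i = \sum_j w_{ij}\, s_j - \theta_i \]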
The hyperparameter T is called temperature and controls the curvature of the sigmoid. It implements the notion of how noisy the dynamics is: for low values of T the network behaves as in the deterministic case; for medium values the network will tend towards low-energy states but occasionally jump to states with higher energy; for high values the network will randomly explore the space. The simulated annealing method consists in starting the training with a high value of T and gradually decreasing it until the system becomes deterministic.
RBM/BM difference.
A Boltzmann machine is a stochastic variant of the Hopfield network that exploits hidden units to learn high-order correlations from the data distribution. We therefore have both visible and hidden units, and the model is fully connected. The learning phase is performed by maximum likelihood.
(see formulas)
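A standard form of the maximum-likelihood gradient (data-dependent minus model-dependent correlations) is:
\[ \frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} \]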
A Restricted Boltzmann Machine, instead, is a bipartite graph, i.e. a graph where units of the same layer are not connected to each other. In this case the model encodes conditional independencies, and we can infer the state of all units of the same layer in one single parallel step. The learning phase is performed by contrastive divergence.
(see formulas)
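Contrastive divergence (CD-k) approximates the model expectation with k steps of Gibbs sampling, giving the usual update:
\[ \Delta w_{ij} = \eta\, \big( \langle v_i h_j \rangle_{0} - \langle v_i h_j \rangle_{k} \big) \]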
Gibbs sampling and Markov blanket in RBM.
Since an RBM is a bipartite graph, the hidden units are connected only to the visible units and not to each other. The Markov blanket of a hidden node in an RBM is therefore composed only of visible units, so using Gibbs sampling we can infer the values of all hidden units in parallel, given the visible ones.
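A minimal sketch of this block Gibbs step in NumPy (illustrative names; assumes binary units with a logistic conditional, which is the standard RBM case):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b_vis, b_hid, rng):
        # The Markov blanket of every hidden unit is just the visible layer,
        # so all hidden units can be sampled together in one parallel step.
        p_h = sigmoid(v @ W + b_hid)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Symmetrically, the visible units are conditionally independent given h.
        p_v = sigmoid(h @ W.T + b_vis)
        v_new = (rng.random(p_v.shape) < p_v).astype(float)
        return v_new, h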
Define the Energy and the Probability of an RBM.
(see formulas)
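Standard definitions (for binary visible units v_i with biases a_i, hidden units h_j with biases b_j and weights w_ij) are:
\[ E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i\, w_{ij}\, h_j, \qquad p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, \qquad Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \]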