Neural networks Flashcards
Explain what a linear sparse coding model is and write its loss function.
Sparse distributed coding: each entity is encoded by strong activation of a relatively small set of neurons.
Sparse coding: it imposes a penalty on the number of active units to encourage independence. We select a small number of basis functions from a large dictionary, so that the basis-function code is overcomplete. We define sparsity as having few non-zero components, or few components that are far from zero: given an input vector, we would like only a few coefficients to be far from zero. This method is optimized iteratively.
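A standard way to write the loss (the exact notation may differ from the original slides; a common textbook form uses a squared reconstruction error plus an L1 sparsity penalty) is:
\[ E(\mathbf{a}) = \Big\lVert \mathbf{x} - \sum_i a_i \boldsymbol{\phi}_i \Big\rVert_2^2 + \lambda \sum_i |a_i| \]
where x is the input, the phi_i are the basis functions of the overcomplete dictionary, the a_i are the coefficients, and lambda weights the sparsity penalty.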
Derive the delta rule.
The delta rule is the first step to improve the perceptron and consists in replacing the activation function with a linear one, which is continuous and differentiable. Then we can use gradient descent to minimize the loss function, obtaining the following formulas.
(see the formula)
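A standard derivation (assuming a single linear unit y = sum_i w_i x_i and a squared-error loss) is:
\[ E = \tfrac{1}{2}(t - y)^2, \qquad \frac{\partial E}{\partial w_i} = -(t - y)\, x_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta\,(t - y)\, x_i \]
where t is the target, y the output and eta the learning rate.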
Note that the perceptron changes the inclination of the separating line abruptly, whereas the delta rule changes it slightly, gradually modifying the weights until the optimal ones are found.
Weight initialization.
In general, we can initialize the synaptic weights by randomly sampling from a Gaussian N(0,1) or a uniform U(-1,1). This is done to break the symmetry: if two hidden units are connected to the same inputs, their initial parameters should be different, otherwise they will converge to the same value. With this method we could use larger values to break the symmetry more strongly, but because of activation function saturation we cannot initialize the weights with values that are too large. We can improve the weight initialization using a heuristic that improves SGD convergence.
(see the formula)
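A common heuristic of this kind is the Glorot (Xavier) uniform initialization (likely the formula referred to here, although the slides may use a variant):
\[ W_{ij} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\; +\sqrt{\frac{6}{m+n}}\right) \]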
where m is the number of inputs (called fan-in) and n is the number of outputs (called fan-out) of each layer. However, if many units are connected to a single neuron, this might lead to very small weight values. In that case we can use sparse initialization, where each unit has k non-zero weights (a parameter decided a priori) drawn from a given distribution. In this way we can control the total amount of activation as we increase the number of units (m and n in the previous formula) without shrinking the weights' values.
Regularization.
The model capacity is limited by imposing a norm penalty on the parameters:
(see formula)
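In a standard formulation the regularized objective is:
\[ \tilde{J}(\theta) = J(\theta) + \alpha\, \Omega(\theta) \]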
where the hyperparameter alpha defines the contribution of the penalty term to the overall loss function. Usually we regularize only the weights. With different types of regularization we obtain different effects on the network: with the L2 norm we obtain weight decay (ridge regression), while with the L1 norm we induce coefficient sparsity (Lasso regression).
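The two penalties mentioned above are usually written as:
\[ \Omega_{L2}(\mathbf{w}) = \tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2, \qquad \Omega_{L1}(\mathbf{w}) = \lVert \mathbf{w} \rVert_1 = \sum_i |w_i| \]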
Dropout.
Dropout consists in removing input or hidden units while processing each pattern. The goal is to regularize each unit so that it learns not just one feature, but a feature that is useful in many contexts. The simplest implementation multiplies the output of the activation function by 0 according to a binary mask, obtained by randomly sampling the network structure with a probability p, which is a hyperparameter.
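A minimal sketch of this masking (inverted dropout in NumPy; the function and variable names are illustrative, and here p_drop is the probability of dropping a unit, so the survivors are rescaled by 1/(1 - p_drop) to keep the expected activation unchanged):

    import numpy as np

    def dropout_forward(activations, p_drop, training=True, rng=None):
        # At test time (or with p_drop = 0) the layer is left untouched.
        if not training or p_drop == 0.0:
            return activations
        rng = np.random.default_rng() if rng is None else rng
        # Binary mask: each unit is kept with probability (1 - p_drop).
        mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
        # Inverted dropout: rescale survivors so the expected output is unchanged.
        return activations * mask / (1.0 - p_drop)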
Momentum.
Momentum helps to speed up the convergence of gradient descent, especially if the Hessian matrix is poorly conditioned (i.e. the loss function has elongated ellipsoids as level curves). We introduce a velocity term v, which contains both the direction and the speed of the motion in the optimization space.
(see formulas)
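A standard form of the update (following the usual textbook notation) is:
\[ \mathbf{v} \leftarrow \alpha\, \mathbf{v} - \eta\, \nabla_{\theta} J(\theta), \qquad \theta \leftarrow \theta + \mathbf{v} \]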
The velocity is an exponentially decaying moving average of past gradients, which is added to the current update. The term v is multiplied by a hyperparameter alpha, which has to be tuned. The informal idea is to see the optimization space as a valley, where we go faster while we keep moving downhill.
How does Adam work? Write the equations.
Adam is an adaptive learning rate method that computes an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients.
(see formulas)
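The standard Adam equations (with gradient g_t, decay rates beta_1 and beta_2, and a small epsilon for numerical stability) are:
\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \]
\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \]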
where eta is the learning rate.
Why are second-order optimization methods challenging? What is the learning goal? Condition number of the Hessian matrix.
Second-order derivatives give more information about the cost function than first-order ones. In fact, if there is negative curvature, the cost function decreases faster than the gradient predicts; if there is positive curvature, the cost function decreases more slowly than predicted by the gradient and eventually begins to increase. If there is no curvature, the gradient predicts the decrease correctly. The challenge is that we have to compute the Hessian matrix, which is essentially the Jacobian of the gradient.
For the Hessian matrix we can define the condition number as the ratio between its largest and smallest eigenvalues. If this ratio is very large, gradient descent performs poorly, because the derivative increases rapidly in one direction and slowly in the others. The learning goal is then to guide SGD or GD using the information contained in the Hessian matrix.
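In symbols, for a Hessian H with eigenvalues lambda_i:
\[ \kappa(H) = \frac{\max_i |\lambda_i|}{\min_i |\lambda_i|} \]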
Kernels in CNN.
A kernel in a CNN encodes a specific feature through a local receptive field, which is defined for each hidden neuron. Each kernel is convolved with the entire image, as if we had multiple copies of the same neuron with the same weights, so different portions of the image are analysed using the same parameters. The output is therefore not a 1-dimensional element but a 2-dimensional one, because each layer produces a filtered version of its input. The kernel stride is the step by which the kernel moves. To analyse a portion of the image more accurately, or to avoid shrinking the output size, we can add a padding of zeros around the input. Different kernels generate different outputs for the same input. If the input has multiple channels, we can define a kernel for each channel and analyse them separately.
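As a concrete relation (a standard formula, assuming square inputs and kernels), the output size of a convolution with input size i, kernel size k, padding p and stride s is:
\[ o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1 \]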
Explain adversarial attacks and defence.
Adversarial attacks consist in injecting a small amount of noise into the input (in some cases indistinguishable to the human eye), in the direction of the gradient that maximizes the loss function. In contrast to gradient descent, we change the input in order to fool the network. During training the model traces the decision boundaries for the classification task; adversarial samples try to move towards these boundaries and, in some cases, to cross them, and adversarial training uses such samples to correct the boundaries with respect to the original data. To defend against this type of attack we can use adversarial training (optimization of a min-max objective while systematically attacking the model) or randomized smoothing (smoothing the classifier by convolving it with Gaussian noise).
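A common concrete instance of such an attack is the fast gradient sign method (not named in the original card, added here as an example), and adversarial training is usually written as a min-max problem:
\[ \tilde{\mathbf{x}} = \mathbf{x} + \epsilon\, \operatorname{sign}\!\big(\nabla_{\mathbf{x}} L(\theta, \mathbf{x}, y)\big), \qquad \min_{\theta}\; \mathbb{E}_{(\mathbf{x},y)} \Big[ \max_{\lVert \boldsymbol{\delta} \rVert \le \epsilon} L(\theta, \mathbf{x} + \boldsymbol{\delta}, y) \Big] \]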
Receptive fields.
The receptive field of a neuron is the size of the input region that is processed to compute its feature; it relates the output feature to the region of the input it depends on.
Explain backpropagation through time.
Exploiting time unfolding, a recurrent neural network can be seen as a deep network with parameter sharing. We can then break the sequence of layers into smaller pieces and compute the gradient on these pieces, rather than computing it on the whole unfolded network at the end. In practice, we process the first time step and create the first hidden activations, then we process the next time step and, using the previous context values, create the second hidden activations, and so on until we reach the last state, which encodes the entire sequence. Computing the gradient on the single parts also helps to avoid vanishing gradients.
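A standard way to write the unfolded recurrence (assuming a vanilla RNN) is:
\[ \mathbf{h}_t = f(W \mathbf{h}_{t-1} + U \mathbf{x}_t + \mathbf{b}), \qquad \mathbf{y}_t = g(V \mathbf{h}_t + \mathbf{c}) \]
where the same parameters W, U and V are shared across all time steps.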
LSTM diagram, how the gates work, their equations, and why you need both sigmoid and tanh.
(see figure)
The LSTM's goal is to selectively block, add or remove information in the memory of the network. To do that, it implements a system of gates that filter the information flow:
o The input gate, which decides whether the input will flow into the memory cell or be dropped; (see formulas)
o The forget gate, which looks at the input and the previous hidden state and decides whether the previous cell content will be remembered or dropped; (see formulas)
o The output gate, which decides whether the current value (the input combined with the previous state) will be dropped or will flow into the next state. (see formulas)
The cell state equation is:
(see formula)
and serves to propagate the information over time.
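A common formulation of the gates and state updates referenced above (the notation may differ from the original figure) is:
\[ \mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{h}_{t-1} + \mathbf{b}_i), \quad \mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{h}_{t-1} + \mathbf{b}_f), \quad \mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{h}_{t-1} + \mathbf{b}_o) \]
\[ \tilde{\mathbf{c}}_t = \tanh(W_c \mathbf{x}_t + U_c \mathbf{h}_{t-1} + \mathbf{b}_c), \quad \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \]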
The sigmoid is used to decide whether information is dropped or allowed to flow (its output lies between 0 and 1, acting as a soft switch), while the tanh produces the candidate values in the range [-1, 1], so it decides whether information is added to or removed from the cell state.
Transformers and attention mechanism.
The attention mechanism works within an encoder-decoder framework: the encoder passes all its hidden states (one per time step) to the decoder, and the decoder gives each of them a score that amplifies the most relevant parts. At each decoding time step a context vector is generated: take the encoder hidden states, give a score to each one, apply a softmax to the scores, multiply each hidden state by its softmaxed score, and sum all the weighted states. The decoder then takes this representation as input. In this way the network can learn the sentence structure, giving importance to specific parts.
Transformers use the attention mechanism in a different way. The encoder takes the input, computes an embedding and combines it with a positional encoding (which tries to incorporate positional information into a finite-size vector), then applies multi-head attention (attention over multiple representation subspaces), then uses a feed-forward layer to process all input elements at the same time and, at the end, passes its hidden states to the decoder. The decoder takes as input the previous output, computes an embedding and combines it with a positional encoding, then applies multi-head attention to it, combines the result with the hidden states passed by the encoder through a further attention block, and processes everything with a feed-forward layer. The decoder produces the output sequentially and passes this output, "shifted right", to the next step, to embed the passage of time.
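The core operation used in both encoder and decoder is the scaled dot-product attention (standard transformer formulation):
\[ \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \]
where Q, K and V are the query, key and value matrices and d_k is the key dimension; multi-head attention applies this operation in parallel with different learned projections.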
Universal approximation properties of ANN: specifically, cases of RNN and RBM.
The RNN universal approximation properties are:
o For any computable function there exists a finite RNN that computes it.
o Any finite-time trajectory of a given n-dimensional dynamical system can be approximately realized by the states of the output units of a continuous-time RNN with n output units, some hidden units and appropriate initial conditions;
o Any open dynamical system can be approximated with arbitrary accuracy by an RNN defined in state-space model form.
The RBM universal approximation properties are:
o It can be proved that RBMs are universal approximators of probability mass functions over discrete variables;
o Temporal extensions of RBMs are universal approximators of stochastic processes with finite time dependence.
Overcomplete code in autoencoder.
Autoencoders are deterministic models trained with error backpropagation, which aim to reconstruct the input pattern. To implement an overcomplete code we set the number of hidden units equal to or greater than the number of visible ones. In this case the network learns how to copy the input, but given a new pattern not present in the training set it could perform badly. An undercomplete code, instead, is implemented by setting the number of hidden units lower than the number of visible ones; in this case the network can learn a set of features useful to reproduce the input pattern.
Gibbs sampling, Markov blanket.
A Markov network is an undirected graphical model where edges represent affinity relations between nodes, and each edge is associated with a factor that indicates the correlation strength between the connected units. In this model we can define the Markov blanket of a node as the set of its immediate neighbours; conditioning on the blanket makes the node independent from all nodes outside it. Using the Markov chain method we can start by generating an arbitrary sample and then iteratively resample all the unobserved variables. Rather than resampling the whole network at once, we can sample one neuron at a time, conditioning on all the others at each iteration. To speed up this process we can use Gibbs sampling, which ignores all variables outside the Markov blanket of the node and samples at the same time all the variables that are conditionally independent.
Energy of a Hopfield network.
A Hopfield network is an undirected fully connected graph, where all neurons are visible and each one corresponds to an input unit. In this architecture there are no self-loops. The energy function that drives the model is: (see formula)
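A standard form of the energy (for binary states s_i, symmetric weights w_ij and thresholds theta_i) is:
\[ E = -\tfrac{1}{2} \sum_{i \neq j} w_{ij}\, s_i s_j + \sum_i \theta_i\, s_i \]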
Talk about Hopfield networks (training).
The attractors correspond to the local minima of the energy; to minimize it we can either run inference on the neurons' activations, or change the weights during training so as to create attractors in correspondence with the training patterns. Each neuron's value is computed as a weighted sum of all the others, filtered through a hard threshold. The update phase can be done asynchronously or synchronously. If the weights have proper values, the activations will always converge to a stable state, where the energy is minimized.
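In symbols, a standard update rule and Hebbian storage rule (assuming bipolar states +-1, N units and P training patterns x^mu) are:
\[ s_i \leftarrow \operatorname{sign}\!\Big( \sum_{j} w_{ij}\, s_j - \theta_i \Big), \qquad w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} x_i^{\mu} x_j^{\mu} \]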
How can sampling be improved using temperature (simulated annealing)? Activation function in a Boltzmann machine, including the temperature parameter.
Boltzmann machines use, as activation function for their binary units, the stochastic function defined for the stochastic Hopfield model. In this case we can run sampling in a data-driven way (the visible units are clamped to an input pattern, the hidden units are initialized at random, and then Gibbs sampling is applied) or in a model-driven way (start with random activations for all units and then apply Gibbs sampling), and combine either of these with simulated annealing, which schedules the temperature T from high to low values to speed up convergence.
Simulated annealing (with a formula example, stochastic Hopfield with the meaning of the T parameter).
In Hopfield networks we can get stuck in local minima. To avoid that, we can replace the deterministic activation function with a stochastic one: (see formula)
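A standard form of this stochastic activation (where Delta E_i is the energy gap of unit i) is:
\[ p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}}, \qquad \Delta E_i = \sum_j w_{ij}\, s_j - \theta_i \]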
The hyperparameter T is called temperature and controls the curvature of the sigmoid. It implements the notion of how noisy the dynamics is: for low values of T the network behaves as in the deterministic case; for medium values the network will tend towards low-energy states but occasionally jump to states with higher energy; for high values the network will randomly explore the space. The simulated annealing method consists in starting the training with a high value of T and gradually decreasing it until the system becomes deterministic.
RBM/BM difference.
A Boltzmann machine is a stochastic variant of the Hopfield network that exploits hidden units to learn high-order correlations from the data distribution. We therefore have both visible and hidden units, and the model is fully connected. The learning phase is performed by maximum likelihood.
(see formulas)
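A standard form of the maximum-likelihood gradient (data-dependent minus model-dependent correlations) is:
\[ \frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} \]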
A Restricted Boltzmann Machine, instead, is a bipartite graph, i.e. a graph where units of the same layer are not connected to each other. In this case the model encodes conditional independencies, and we can infer the state of all units of the same layer in one single parallel step. The learning phase is performed by contrastive divergence.
(see formulas)
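Contrastive divergence (CD-k) approximates the model expectation with k steps of Gibbs sampling, giving the usual update:
\[ \Delta w_{ij} = \eta\, \big( \langle v_i h_j \rangle_{0} - \langle v_i h_j \rangle_{k} \big) \]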
Gibbs sampling and Markov blanket in RBM.
Since an RBM is a bipartite graph, the hidden units are connected only to the visible units and not to each other. The Markov blanket of a hidden node in an RBM is therefore composed only of visible units, so using Gibbs sampling we can infer the values of all hidden units in parallel, given the visible ones.
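A minimal sketch of this block Gibbs step in NumPy (illustrative names; assumes binary units with a logistic conditional, which is the standard RBM case):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gibbs_step(v, W, b_vis, b_hid, rng):
        # The Markov blanket of every hidden unit is just the visible layer,
        # so all hidden units can be sampled together in one parallel step.
        p_h = sigmoid(v @ W + b_hid)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Symmetrically, the visible units are conditionally independent given h.
        p_v = sigmoid(h @ W.T + b_vis)
        v_new = (rng.random(p_v.shape) < p_v).astype(float)
        return v_new, h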
Define the Energy and the Probability of an RBM.
(see formulas)
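Standard definitions (for binary visible units v_i with biases a_i, hidden units h_j with biases b_j and weights w_ij) are:
\[ E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i\, w_{ij}\, h_j, \qquad p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, \qquad Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \]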