Topic 3: Deep Learning Basics & Optimisation & CNNs Flashcards
(27 cards)
What is a Multi-Layer Perceptron (MLP)?
A neural network consisting of multiple layers of neurons with nonlinear activation functions, allowing it to learn complex patterns and act as a universal function approximator.
Multi-layer perceptron:
- a recursive recombination of binary classifiers (neurons)
- with arbitrary depth and size it is a universal function approximator; the only requirement is that the network is differentiable
- (example figure: inputs x_1 and x_2 activate hidden units h_1 and h_2, each receiving a bias; I denotes the identity function)
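A minimal forward-pass sketch in NumPy (the layer sizes and random inputs are illustrative assumptions, not from the card):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer MLP: input -> hidden (nonlinear) -> output."""
    h = relu(x @ W1 + b1)   # hidden activations h_1, h_2, ...
    y = h @ W2 + b2         # output layer (linear/identity output)
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))                       # batch of 4 inputs with features x_1, x_2
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)     # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)     # output-layer weights and biases
print(mlp_forward(x, W1, b1, W2, b2).shape)       # (4, 1)
```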
Why are activation functions important in neural networks?
We use them to introduce some non-linearity into the neural network.
- they mimic the neural activation spike after membrane potential reaches threshold
- they’re decisive, in the sense that they can make a clear binary decision about whether a neuron should be activated
- they’re differentiable, meaning they can enable backpropagation
- they’re efficient to evaluate and to differentiate
What are common activation functions used in deep learning?
Sigmoid, Tanh, and ReLU. ReLU is preferred in deep networks due to reduced vanishing gradient problems.
Sigmoid squashes values between 0 and 1 but suffers from vanishing gradients.
Tanh outputs values between -1 and 1 and has stronger gradients than sigmoid but still vanishes at extremes.
ReLU (Rectified Linear Unit) outputs 0 for negative inputs and identity for positive inputs, enabling sparse activation and mitigating vanishing gradient issues. It is preferred in deep networks.
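A quick sketch of the three activations in NumPy (illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); gradients vanish for large |z|

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1); stronger gradients than sigmoid

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity for positive inputs

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```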
What is backpropagation?
You train an MLP iteratively by updating its parameters θ.
We start by taking a batch of training data, then do a forward pass in which we calculate the corresponding loss L. Then we do a backward pass in which we propagate gradients of the loss back through the network. We use those gradients to update the weights, and we continue like this until we reach convergence (or a sufficiently low loss).
In short: forward pass, backward pass, weight update, then repeat with the updated weights (a minimal sketch follows below).
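A minimal sketch of that loop in NumPy, assuming a one-hidden-layer regression MLP, a squared-error loss, and an illustrative learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(100, 1))   # toy regression target

W1, b1 = rng.normal(scale=0.5, size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
lr = 0.05

for epoch in range(200):
    # forward pass: compute predictions and the loss L
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # backward pass: propagate gradients of the loss back through the network
    d_yhat = 2.0 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T
    d_z1 = d_h * (1.0 - h ** 2)              # derivative of tanh
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)

    # update: step the weights against the gradient, then repeat
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", loss)
```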
What is a loss function?
The loss function depends on the output activation function and the kind of target you are trying to represent; it is typically derived from maximum likelihood estimation (MLE), and we optimise it using its gradients. The loss that is most often used for classification is the cross-entropy.
We update the weights in a loop: we scale the gradients g by the learning rate and take a step, then iterate until the loss is small enough and has converged as much as it can. An epoch is one iteration over ALL the data.
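A minimal cross-entropy sketch in NumPy; the class probabilities, one-hot targets, and the clipping constant are illustrative assumptions:

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy between predicted class probabilities and one-hot targets."""
    probs = np.clip(probs, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(targets * np.log(probs), axis=1))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0]])
print(cross_entropy(probs, targets))   # low loss: predictions match the targets
```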
What does the gradient represent in multivariable calculus?
The direction of steepest ascent (uphill). The gradient of a function f, written ∇f, is the vector of its partial derivatives, and it points in the direction in which f increases most rapidly. For descent, as in neural network training, you step along -∇f rather than ∇f.
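A small check in NumPy that the gradient points uphill; the function f(x, y) = x² + 3y² and the step size are assumptions for illustration:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])   # vector of partial derivatives ∂f/∂x, ∂f/∂y

p = np.array([1.0, 1.0])
step = 0.01
print(f(p + step * grad_f(p)) > f(p))   # True: moving along +∇f increases f
print(f(p - step * grad_f(p)) < f(p))   # True: moving along -∇f (descent) decreases f
```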
Which of the following correctly describes a partial derivative?
It’s the derivative of a function with respect to one variable, keeping the others constant
What is reverse-mode differentiation?
Reverse-mode differentiation is used in backpropagation because it efficiently computes gradients of a scalar output with respect to many inputs. It’s perfect for training neural networks.
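A tiny hand-worked reverse-mode example in Python; the function y = sin(w·x + b) and its values are assumptions chosen for illustration:

```python
import math

# forward pass through a tiny computational graph: z = w*x + b, y = sin(z)
w, x, b = 0.5, 2.0, 0.1
z = w * x + b
y = math.sin(z)

# reverse pass: start from dy/dy = 1 and apply the chain rule backwards
dy_dy = 1.0
dy_dz = dy_dy * math.cos(z)   # d sin(z)/dz = cos(z)
dy_dw = dy_dz * x             # dz/dw = x
dy_dx = dy_dz * w             # dz/dx = w
dy_db = dy_dz * 1.0           # dz/db = 1

# one backward sweep gives the gradient of the single scalar output
# with respect to every input (w, x, b) at once
print(dy_dw, dy_dx, dy_db)
```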
What is the vanishing- and exploding gradient problem?
Say we have a network whose weights are either too small or too big; we will run into the problem of gradients vanishing or exploding, respectively.
The matrix multiplication and nonlinearity applied at each step repeatedly rescale the signal: the deeper you go into the network, the more often the output of the previous layer is multiplied by a weight matrix and passed through a nonlinear activation such as ReLU:
- if the weights are small, the gradients shrink exponentially with depth
- if the weights are big, the gradients grow exponentially with depth
For a deep plain MLP the influence of early inputs therefore diminishes, roughly for depths greater than 10; hence the rule of thumb of about 10 layers (a small demonstration follows below).
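A small NumPy demonstration of this effect on the forward signal (the same repeated rescaling hits the gradients on the way back); the width, depth, and weight scales are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64
signal_norms = {}

for scale in (0.01, 1.0):            # "small" vs "big" weights
    x = rng.normal(size=width)
    norms = []
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(width, width))
        x = np.maximum(0.0, W @ x)   # matrix multiplication + ReLU at each layer
        norms.append(np.linalg.norm(x))
    signal_norms[scale] = norms

print(signal_norms[0.01][-1])   # shrinks towards 0 (vanishing)
print(signal_norms[1.0][-1])    # blows up (exploding)
```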
How does training in batches help in gradient-based optimisation?
It can prevent some of the problems of vanishing and exploding gradients by training on batches of the data within each epoch.
What are the different methods of training in batches?
stochastic gradient descent SGD (incremental/online training):
- its batch_size is 1. it only uses 1 training sample to update the weights
- the data is shuffled, before each epoch
- advantage: doesn’t get stuck in local minima
- disadvantage: the weight updates can become very noisy
batch gradient descent (batch training):
- uses all the training data at once
- very stable updates to the weights
- advantage: computationally efficient
- disadvantage: may get stuck in local minima
minibatch gradient descent (see the sketch after this list):
- uses a small subset (minibatch) of the training data for each update
- it is more computationally efficient than batch gradient descent
- advantage: it avoids getting stuck in local minima
- disadvantage: it introduces new hyper-parameters to optimise (e.g. the batch size)
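A minimal minibatch-loop sketch in NumPy on a toy linear-regression problem; the model, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32              # batch_size is the new hyper-parameter to tune

for epoch in range(20):
    idx = rng.permutation(len(X))     # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient on the minibatch only
        w -= lr * grad

print(w)   # close to true_w
# batch_size=1 would be SGD (online training); batch_size=len(X) would be batch gradient descent
```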
What are the advantages of using optimisers like Adam and Adagrad?
Heuristic stochastic gradient descent uses a diagonal preconditioner → Adagrad, Adam.
- Adagrad: adapts the learning rate per parameter, based on how frequently each parameter is updated
- good for sparse data; a drawback is that the accumulated history can shrink the learning rate towards zero
- Adam: one of the most popular optimisation algorithms in deep learning
- good for noisy gradients and sparse data
What are the heuristics in gradient-based optimisation?
We train an MLP by iteratively updating the weights θ; the goal is to minimise the loss function evaluated at the current weight estimate. There are several approaches to computing each update:
exact computation: it uses the Hessian matrix, which is very precise, but very expensive
standard stochastic gradient descent: uses only the gradient, which is cheap to compute, but very slow and unstable
heuristic stochastic gradient descent: uses a diagonal preconditioner → Adagrad, Adam (see the per-optimiser summary below)
SGD + momentum: accumulates a velocity term, so the effective step grows while the sign of the gradient does not change
Adagrad: maintains a per-parameter learning rate → good with sparse gradients
RMSprop: like Adagrad, but uses an exponentially decaying mean of the squared gradients → good with noisy gradients
Adam: combines both by tracking a decaying mean of the gradients as well as of their squares (an update-rule sketch follows below)
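A minimal sketch of the Adam update rule in NumPy on a toy quadratic; the objective is an assumption and the hyper-parameters are the commonly used defaults:

```python
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)   # gradient of the toy objective (theta - 3)^2

theta = np.array([0.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # decaying mean of gradients (momentum-like)
    v = beta2 * v + (1 - beta2) * g**2       # decaying mean of squared gradients (RMSprop-like)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step

print(theta)   # close to the minimum at 3.0
```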
What is the purpose of splitting data into training, validation, and test sets?
partition your available data into three independent sets:
training set - trains the model
validation set - used to tune hyperparameters of the model fitted on the training set; it also helps you detect overfitting to the training set
test set - the last thing you use, and never for tuning hyperparameters; it estimates the generalisation error of the final chosen model, so you don't end up overfitting the hyperparameters to the validation set
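A minimal split sketch in NumPy; the 70/15/15 proportions are an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
idx = rng.permutation(n)                       # shuffle all example indices once

n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = idx[:n_train]                      # used to fit the model parameters
val_idx = idx[n_train:n_train + n_val]         # used to tune hyper-parameters
test_idx = idx[n_train + n_val:]               # used once at the end for the generalisation error

print(len(train_idx), len(val_idx), len(test_idx))   # 700 150 150
```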
What is a Convolutional Neural Network (CNN)?
A type of deep neural network designed for processing data with a spatial structure, such as images.
Deep neural network: convolutional neural network (CNN)
- we can learn functions for input with a 2d spatial structure
- a hierarchy of filters performs template matching on image patches
- Early layers detect simple features (edges, corners).
- Deeper layers learn composite features (like object parts) using earlier ones.
What do convolutional filters do in a CNN?
A filter (kernel) is a small matrix that slides over the image and performs a convolution to extract certain features, such as edges: we convolve the input with the filter.
- we shift the filter/kernel over the input pattern (e.g. an image)
- the result is a feature map, a (usually smaller) projection of the input
- high values in the feature map mark the locations where the input patch looks similar to the filter (a minimal sketch follows below)
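A minimal sketch of sliding a kernel over an image in NumPy (a cross-correlation, as is conventional in CNNs); the image contents are an assumption:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the feature map (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = (ih - kh) // stride + 1, (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # high value = patch looks like the kernel
    return out

image = np.zeros((6, 6)); image[:, 3] = 1.0      # an image containing one vertical line
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)        # responds to vertical edges
print(conv2d(image, kernel))                     # strong (±) responses where the window crosses the line
```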
What kind of filters are there?
convolution kernels (filters)
edge detector (laplacian filter):
- Emphasizes edges in all directions.
- Result: Highlights where the image intensity changes rapidly.
Laplacian (alternative):
- Another version of an edge detector with slightly different emphasis.
Sobel horizontal:
- Detects horizontal edges.
Sobel vertical:
- Detects vertical edges.
These filters help CNNs learn to extract features (like lines, edges, textures) from raw images at lower layers. These features are then used in deeper layers to understand shapes, objects, and more abstract patterns.
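The classic kernels written out as NumPy arrays (standard textbook forms; note that naming conventions for the two Sobel orientations vary):

```python
import numpy as np

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])           # edge detector: responds to intensity changes in all directions

laplacian_alt = np.array([[1,  1, 1],
                          [1, -8, 1],
                          [1,  1, 1]])       # alternative Laplacian including diagonal neighbours

sobel_horizontal = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]])  # responds to horizontal edges (intensity change along y)

sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])      # responds to vertical edges (intensity change along x)

# e.g. apply one with the conv2d sketch above: conv2d(image, sobel_vertical)
```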
What is the function of pooling in CNNs?
Pooling: pools information towards a higher level; a kind of dimensionality reduction.
- useful when classifying an object that may appear anywhere in the image: we don't care where the object is located, we just want to know what is in the image
- the higher-level information becomes invariant to location, so after pooling the model focuses on which features are present, not exactly where they are
- the pooling operators are a fixed transformation → they have no weights, so there is nothing for backpropagation to update; pooling acts like a hard-coded rule
- global pooling: pools over the full array or tensor; instead of using small patches, it reduces the entire feature map to one value per channel
variants:
- average pooling → uses the average of the values, smooth features
- max pooling → uses the maximum of the values, very sharp features
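A minimal pooling sketch in NumPy; the window size, stride, and feature map are illustrative assumptions:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Fixed (weight-free) pooling over strided windows."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    op = np.max if mode == "max" else np.mean
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

fmap = np.arange(16.0).reshape(4, 4)
print(pool2d(fmap, mode="max"))       # sharp features: the max of each 2x2 window
print(pool2d(fmap, mode="average"))   # smooth features: the average of each window
print(fmap.max())                     # global (max) pooling: one value for the whole map
```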
What is padding and striding?
padding: an operation applied to the edges of an input image or feature map during convolution
- zero padding: adds a border of 0’s around the whole input
- other strategies exist
- Copy the edge pixels (e.g. “replicate” padding)
- this is used to keep the size of the feature map equal to the size of the input pattern
stride: defines how far the kernel jumps each time the window moves
- a stride > 1 is more efficient, since it skips some of the overlap between windows
- it gives a smaller feature map (see the size formula in the sketch below)
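A small sketch of how padding and stride change the feature-map size, using the standard formula out = floor((in + 2·padding − kernel) / stride) + 1; the example sizes are assumptions:

```python
import numpy as np

def output_size(in_size, kernel, stride=1, padding=0):
    return (in_size + 2 * padding - kernel) // stride + 1

print(output_size(32, kernel=3, stride=1, padding=0))   # 30: no padding shrinks the map
print(output_size(32, kernel=3, stride=1, padding=1))   # 32: zero padding keeps the size
print(output_size(32, kernel=3, stride=2, padding=1))   # 16: stride > 1 roughly halves the map

image = np.ones((3, 3))
print(np.pad(image, 1))   # zero padding: a border of 0's around the whole input
```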
What is a residual connection in a deep neural network?
- Shortcut connections (also called skip connections) that add the input directly to the output of a block.
In CNNs we have residual blocks:
- the block learns a shallow/small mapping on top of the feature transformation
- it still has the same number of parameters (weights) as a regular block, since the shortcut itself adds no weights
Residual connections can span arbitrary depth:
- shortcut connections can skip layers
- thereby we maintain a reference to the un-transformed information
- it is also more robust against perturbations in the inputs (a minimal sketch follows below)
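A minimal residual-block sketch in NumPy; the two-layer residual branch and the dimensions are assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """out = x + F(x): the shortcut adds the un-transformed input back in."""
    f = relu(x @ W1) @ W2   # residual branch F(x): same weights as a regular block
    return relu(x + f)      # identity shortcut: no extra weights, no extra parameters

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
W1, W2 = rng.normal(scale=0.1, size=(8, 8)), rng.normal(scale=0.1, size=(8, 8))
print(residual_block(x, W1, W2).shape)   # (1, 8): same shape, input information preserved
```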
What does identity mapping mean?
Identity mapping: the output is the same as the input; it passes through unchanged.
A transformation where the input is added to the output of a block without change, ensuring stable gradient flow.
- Shortcut connections perform an identity mapping, allowing the input x to bypass certain layers, so the network carries information that is both transformed and unchanged, depending on where the skip connection is placed
Why are residual connections useful?
residual connections can help:
- mitigate the vanishing gradient problem, as the gradients now flow through the network more effectively
- in a very deep network, the further we backpropagate the smaller the gradients become during training; this can be mitigated using skip connections
What role do computational graphs play in neural networks?
Backpropagation works by applying the chain rule backward through a computational graph, efficiently computing how changes in inputs affect the final output.
What is the log-sum-exp trick used for?
To improve numerical stability when computing log-likelihoods or softmax functions in neural networks.
Jacobian matrix: first-order partial derivatives (easier to compute; this is what is mostly used for neural networks). We can compute the full Jacobian matrix, but often we only need the vector-Jacobian product.
Hessian matrix: second-order partial derivatives
For classification tasks, the cross-entropy loss is preferred because it effectively compares probability distributions and yields more informative gradients than squared error, especially when paired with sigmoid or softmax activations. To ensure numerical stability during loss calculation, the log-sum-exp (lse) trick is applied to prevent overflow/underflow.
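A minimal sketch of the trick in NumPy; the large logits are chosen so that the naive computation overflows:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): shift by the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

logits = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(logits))))      # naive version overflows to inf (with a warning)
print(logsumexp(logits))                   # stable: ~1002.41

log_softmax = logits - logsumexp(logits)   # stable log-softmax for the cross-entropy loss
print(log_softmax)
```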