Topic 3: Deep Learning Basics & Optimisation & CNNs Flashcards
(27 cards)
What is a Multi-Layer Perceptron (MLP)?
A neural network consisting of multiple layers of neurons with nonlinear activation functions, allowing it to learn complex patterns and act as a universal function approximator.
Multi-layer perceptron:
- a recursive recombination of binary classifiers (neurons)
- with arbitrary depth and size it is a universal function approximator; the only requirement is that the network is differentiable
- (example figure: inputs x_1 and x_2 activate hidden units h_1 and h_2, each receiving a bias; I denotes the identity function)
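A minimal forward-pass sketch in NumPy (the layer sizes and random inputs are illustrative assumptions, not from the card):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer MLP: input -> hidden (nonlinear) -> output."""
    h = relu(x @ W1 + b1)   # hidden activations h_1, h_2, ...
    y = h @ W2 + b2         # output layer (linear/identity output)
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))                       # batch of 4 inputs with features x_1, x_2
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)     # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)     # output-layer weights and biases
print(mlp_forward(x, W1, b1, W2, b2).shape)       # (4, 1)
```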
Why are activation functions important in neural networks?
We use them to introduce some non-linearity into the neural network.
- they mimic the neural activation spike after membrane potential reaches threshold
- they’re decisive, in the sense that they can make a clear binary decision about whether a neuron should be activated
- they’re differentiable, meaning they can enable backpropagation
- they’re efficient to evaluate and to differentiate
What are common activation functions used in deep learning?
Sigmoid, Tanh, and ReLU. ReLU is preferred in deep networks due to reduced vanishing gradient problems.
Sigmoid squashes values between 0 and 1 but suffers from vanishing gradients.
Tanh outputs values between -1 and 1 and has stronger gradients than sigmoid but still vanishes at extremes.
ReLU (Rectified Linear Unit) outputs 0 for negative inputs and identity for positive inputs, enabling sparse activation and mitigating vanishing gradient issues. It is preferred in deep networks.
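A quick sketch of the three activations in NumPy (illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); gradients vanish for large |z|

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1); stronger gradients than sigmoid

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity for positive inputs

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```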
What is backpropagation?
You train an MLP iteratively by updating its parameters θ.
We start by taking a batch of training data, then do a forward pass in which we calculate the corresponding loss L. Then we do a backward pass in which we propagate gradients of the loss back through the network. We use those gradients to update the weights, and we continue like this until we reach convergence (or a sufficiently low loss).
In short: forward pass, backward pass, weight update, then repeat with the updated weights (a minimal sketch follows below).
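A minimal sketch of that loop in NumPy, assuming a one-hidden-layer regression MLP, a squared-error loss, and an illustrative learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(100, 1))   # toy regression target

W1, b1 = rng.normal(scale=0.5, size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
lr = 0.05

for epoch in range(200):
    # forward pass: compute predictions and the loss L
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # backward pass: propagate gradients of the loss back through the network
    d_yhat = 2.0 * (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T
    d_z1 = d_h * (1.0 - h ** 2)              # derivative of tanh
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0)

    # update: step the weights against the gradient, then repeat
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", loss)
```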
What is a loss function?
The loss function depends on the output activation function and the kind of target you are trying to represent; it is typically derived from maximum likelihood estimation (MLE), and we optimise it using its gradients. The loss that is most often used for classification is the cross-entropy.
We update the weights in a loop: we scale the gradients g by the learning rate and take a step, then iterate until the loss is small enough and has converged as much as it can. An epoch is one iteration over ALL the data.
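A minimal cross-entropy sketch in NumPy; the class probabilities, one-hot targets, and the clipping constant are illustrative assumptions:

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy between predicted class probabilities and one-hot targets."""
    probs = np.clip(probs, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(targets * np.log(probs), axis=1))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([[1, 0, 0],
                    [0, 1, 0]])
print(cross_entropy(probs, targets))   # low loss: predictions match the targets
```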
What does the gradient represent in multivariable calculus?
The direction of steepest ascent (uphill). The gradient of a function f, written ∇f, is the vector of its partial derivatives, and it points in the direction in which f increases most rapidly. For descent, as in neural network training, you step along -∇f rather than ∇f.
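A small check in NumPy that the gradient points uphill; the function f(x, y) = x² + 3y² and the step size are assumptions for illustration:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])   # vector of partial derivatives ∂f/∂x, ∂f/∂y

p = np.array([1.0, 1.0])
step = 0.01
print(f(p + step * grad_f(p)) > f(p))   # True: moving along +∇f increases f
print(f(p - step * grad_f(p)) < f(p))   # True: moving along -∇f (descent) decreases f
```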
Which of the following correctly describes a partial derivative?
It’s the derivative of a function with respect to one variable, keeping the others constant
What is reverse-mode differentiation?
Reverse-mode differentiation is used in backpropagation because it efficiently computes gradients of a scalar output with respect to many inputs. It’s perfect for training neural networks.
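A tiny hand-worked reverse-mode example in Python; the function y = sin(w·x + b) and its values are assumptions chosen for illustration:

```python
import math

# forward pass through a tiny computational graph: z = w*x + b, y = sin(z)
w, x, b = 0.5, 2.0, 0.1
z = w * x + b
y = math.sin(z)

# reverse pass: start from dy/dy = 1 and apply the chain rule backwards
dy_dy = 1.0
dy_dz = dy_dy * math.cos(z)   # d sin(z)/dz = cos(z)
dy_dw = dy_dz * x             # dz/dw = x
dy_dx = dy_dz * w             # dz/dx = w
dy_db = dy_dz * 1.0           # dz/db = 1

# one backward sweep gives the gradient of the single scalar output
# with respect to every input (w, x, b) at once
print(dy_dw, dy_dx, dy_db)
```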
What is the vanishing- and exploding gradient problem?
Say we have a network whose weights are either too small or too big; we will run into the problem of gradients vanishing or exploding, respectively.
The matrix multiplication and nonlinearity applied at each step repeatedly rescale the signal: the deeper you go into the network, the more often the output of the previous layer is multiplied by a weight matrix and passed through a nonlinear activation such as ReLU:
- if the weights are small, the gradients shrink exponentially with depth
- if the weights are big, the gradients grow exponentially with depth
For a deep plain MLP the influence of early inputs therefore diminishes, roughly for depths greater than 10; hence the rule of thumb of about 10 layers (a small demonstration follows below).
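A small NumPy demonstration of this effect on the forward signal (the same repeated rescaling hits the gradients on the way back); the width, depth, and weight scales are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64
signal_norms = {}

for scale in (0.01, 1.0):            # "small" vs "big" weights
    x = rng.normal(size=width)
    norms = []
    for _ in range(depth):
        W = rng.normal(scale=scale, size=(width, width))
        x = np.maximum(0.0, W @ x)   # matrix multiplication + ReLU at each layer
        norms.append(np.linalg.norm(x))
    signal_norms[scale] = norms

print(signal_norms[0.01][-1])   # shrinks towards 0 (vanishing)
print(signal_norms[1.0][-1])    # blows up (exploding)
```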
How does training in batches help in gradient-based optimisation?
It can prevent some of the problems of vanishing and exploding gradients by training on batches of the data within each epoch.
What are the different methods of training in batches?
stochastic gradient descent SGD (incremental/online training):
- its batch_size is 1. it only uses 1 training sample to update the weights
- the data is shuffled, before each epoch
- advantage: doesn’t get stuck in local minima
- disadvantage: the weight updates can become very noisy
batch gradient descent (batch training):
- uses all the training data at once
- very stable updates to the weights
- advantage: computationally efficient
- disadvantage: may get stuck in local minima
minibatch gradient descent (see the sketch after this list):
- uses a small subset (minibatch) of the training data for each update
- it is more computationally efficient than batch gradient descent
- advantage: it avoids getting stuck in local minima
- disadvantage: it introduces new hyper-parameters to optimise (e.g. the batch size)
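A minimal minibatch-loop sketch in NumPy on a toy linear-regression problem; the model, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32              # batch_size is the new hyper-parameter to tune

for epoch in range(20):
    idx = rng.permutation(len(X))     # shuffle before each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)   # gradient on the minibatch only
        w -= lr * grad

print(w)   # close to true_w
# batch_size=1 would be SGD (online training); batch_size=len(X) would be batch gradient descent
```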
What are the advantages of using optimisers like Adam and Adagrad?
Heuristic stochastic gradient descent uses a diagonal preconditioner → Adagrad, Adam.
- Adagrad: adapts the learning rate per parameter, based on how frequently each parameter is updated
- good for sparse data; a drawback is that the accumulated history can shrink the learning rate towards zero
- Adam: one of the most popular optimisation algorithms in deep learning
- good for noisy gradients and sparse data
What are the heuristics in gradient-based optimisation?
We train an MLP by iteratively updating the weights θ; the goal is to minimise the loss function evaluated at the current weight estimate. There are several approaches to computing each update:
exact computation: it uses the Hessian matrix, which is very precise, but very expensive
standard stochastic gradient descent: uses only the gradient, which is cheap to compute, but very slow and unstable
heuristic stochastic gradient descent: uses a diagonal preconditioner → Adagrad, Adam (see the per-optimiser summary below)
SGD + momentum: accumulates a velocity term, so the effective step grows while the sign of the gradient does not change
Adagrad: maintains a per-parameter learning rate → good with sparse gradients
RMSprop: like Adagrad, but uses an exponentially decaying mean of the squared gradients → good with noisy gradients
Adam: combines both by tracking a decaying mean of the gradients as well as of their squares (an update-rule sketch follows below)
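A minimal sketch of the Adam update rule in NumPy on a toy quadratic; the objective is an assumption and the hyper-parameters are the commonly used defaults:

```python
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)   # gradient of the toy objective (theta - 3)^2

theta = np.array([0.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # decaying mean of gradients (momentum-like)
    v = beta2 * v + (1 - beta2) * g**2       # decaying mean of squared gradients (RMSprop-like)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step

print(theta)   # close to the minimum at 3.0
```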
What is the purpose of splitting data into training, validation, and test sets?
partition your available data into three independent sets:
training set - trains the model
validation set - used to tune hyperparameters of the model fitted on the training set; it also helps you detect overfitting to the training set
test set - the last thing you use, and never for tuning hyperparameters; it estimates the generalisation error of the final chosen model, so you don't end up overfitting the hyperparameters to the validation set
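A minimal split sketch in NumPy; the 70/15/15 proportions are an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
idx = rng.permutation(n)                       # shuffle all example indices once

n_train, n_val = int(0.70 * n), int(0.15 * n)
train_idx = idx[:n_train]                      # used to fit the model parameters
val_idx = idx[n_train:n_train + n_val]         # used to tune hyper-parameters
test_idx = idx[n_train + n_val:]               # used once at the end for the generalisation error

print(len(train_idx), len(val_idx), len(test_idx))   # 700 150 150
```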
What is a Convolutional Neural Network (CNN)?
A type of deep neural network designed for processing data with a spatial structure, such as images.
Deep neural network: convolutional neural network (CNN)
- we can learn functions for input with a 2d spatial structure
- a hierarchy of filters performs template matching on image patches
- Early layers detect simple features (edges, corners).
- Deeper layers learn composite features (like object parts) using earlier ones.
What do convolutional filters do in a CNN?
A filter (kernel) is a small matrix that slides over the image and performs a convolution to extract certain features, such as edges: we convolve the input with the filter.
- we shift the filter/kernel over the input pattern (e.g. an image)
- the result is a feature map, a (usually smaller) projection of the input
- high values in the feature map mark the locations where the input patch looks similar to the filter (a minimal sketch follows below)
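A minimal sketch of sliding a kernel over an image in NumPy (a cross-correlation, as is conventional in CNNs); the image contents are an assumption:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the feature map (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = (ih - kh) // stride + 1, (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # high value = patch looks like the kernel
    return out

image = np.zeros((6, 6)); image[:, 3] = 1.0      # an image containing one vertical line
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)        # responds to vertical edges
print(conv2d(image, kernel))                     # strong (±) responses where the window crosses the line
```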
What kind of filters are there?
convolution kernels (filters)
edge detector (laplacian filter):
- Emphasizes edges in all directions.
- Result: Highlights where the image intensity changes rapidly.
Laplacian (alternative):
- Another version of an edge detector with slightly different emphasis.
Sobel horizontal:
- Detects horizontal edges.
Sobel vertical:
- Detects vertical edges.
These filters help CNNs learn to extract features (like lines, edges, textures) from raw images at lower layers. These features are then used in deeper layers to understand shapes, objects, and more abstract patterns.
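The classic kernels written out as NumPy arrays (standard textbook forms; note that naming conventions for the two Sobel orientations vary):

```python
import numpy as np

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])           # edge detector: responds to intensity changes in all directions

laplacian_alt = np.array([[1,  1, 1],
                          [1, -8, 1],
                          [1,  1, 1]])       # alternative Laplacian including diagonal neighbours

sobel_horizontal = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]])  # responds to horizontal edges (intensity change along y)

sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])      # responds to vertical edges (intensity change along x)

# e.g. apply one with the conv2d sketch above: conv2d(image, sobel_vertical)
```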
What is the function of pooling in CNNs?
Pooling: pools information towards a higher level; a kind of dimensionality reduction.
- useful when classifying an object that may appear anywhere in the image: we don't care where the object is located, we just want to know what is in the image
- the higher-level information becomes invariant to location, so after pooling the model focuses on which features are present, not exactly where they are
- the pooling operators are a fixed transformation → they have no weights, so there is nothing for backpropagation to update; pooling acts like a hard-coded rule
- global pooling: pools over the full array or tensor; instead of using small patches, it reduces the entire feature map to one value per channel
variants:
- average pooling → uses the average of the values, smooth features
- max pooling → uses the maximum of the values, very sharp features
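A minimal pooling sketch in NumPy; the window size, stride, and feature map are illustrative assumptions:

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Fixed (weight-free) pooling over strided windows."""
    h, w = feature_map.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    op = np.max if mode == "max" else np.mean
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = op(window)
    return out

fmap = np.arange(16.0).reshape(4, 4)
print(pool2d(fmap, mode="max"))       # sharp features: the max of each 2x2 window
print(pool2d(fmap, mode="average"))   # smooth features: the average of each window
print(fmap.max())                     # global (max) pooling: one value for the whole map
```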
What is padding and striding?
padding: an operation applied to the edges of an input image or feature map during convolution
- zero padding: adds a border of 0’s around the whole input
- other strategies exist
- Copy the edge pixels (e.g. “replicate” padding)
- this is used to keep the size of the feature map equal to the size of the input pattern
stride: defines how far the kernel jumps each time the window moves
- a stride > 1 is more efficient, since it skips some of the overlap between windows
- it gives a smaller feature map (see the size formula in the sketch below)
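A small sketch of how padding and stride change the feature-map size, using the standard formula out = floor((in + 2·padding − kernel) / stride) + 1; the example sizes are assumptions:

```python
import numpy as np

def output_size(in_size, kernel, stride=1, padding=0):
    return (in_size + 2 * padding - kernel) // stride + 1

print(output_size(32, kernel=3, stride=1, padding=0))   # 30: no padding shrinks the map
print(output_size(32, kernel=3, stride=1, padding=1))   # 32: zero padding keeps the size
print(output_size(32, kernel=3, stride=2, padding=1))   # 16: stride > 1 roughly halves the map

image = np.ones((3, 3))
print(np.pad(image, 1))   # zero padding: a border of 0's around the whole input
```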
What is a residual connection in a deep neural network?
- Shortcut connections (also called skip connections) that add the input directly to the output of a block.
In CNNs we have residual blocks:
- the block learns a shallow/small mapping on top of the feature transformation
- it still has the same number of parameters (weights) as a regular block, since the shortcut itself adds no weights
Residual connections can span arbitrary depth:
- shortcut connections can skip layers
- thereby we maintain a reference to the un-transformed information
- it is also more robust against perturbations in the inputs (a minimal sketch follows below)
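A minimal residual-block sketch in NumPy; the two-layer residual branch and the dimensions are assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """out = x + F(x): the shortcut adds the un-transformed input back in."""
    f = relu(x @ W1) @ W2   # residual branch F(x): same weights as a regular block
    return relu(x + f)      # identity shortcut: no extra weights, no extra parameters

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
W1, W2 = rng.normal(scale=0.1, size=(8, 8)), rng.normal(scale=0.1, size=(8, 8))
print(residual_block(x, W1, W2).shape)   # (1, 8): same shape, input information preserved
```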
What does identity mapping mean?
Identity mapping: the output is the same as the input; it passes through unchanged.
A transformation where the input is added to the output of a block without change, ensuring stable gradient flow.
- Shortcut connections perform an identity mapping, allowing the input x to bypass certain layers, so the network carries information that is both transformed and unchanged, depending on where the skip connection is placed
Why are residual connections useful?
residual connections can help:
- mitigate the vanishing gradient problem, as the gradients now flow through the network more effectively
- in a very deep network, the further we backpropagate the smaller the gradients become during training; this can be mitigated using skip connections
What role do computational graphs play in neural networks?
Backpropagation works by applying the chain rule backward through a computational graph, efficiently computing how changes in inputs affect the final output.
What is the log-sum-exp trick used for?
To improve numerical stability when computing log-likelihoods or softmax functions in neural networks.
Jacobian matrix: first-order partial derivatives (easier to compute; this is what is mostly used for neural networks). We can compute the full Jacobian matrix, but often we only need the vector-Jacobian product.
Hessian matrix: second-order partial derivatives
For classification tasks, the cross-entropy loss is preferred because it effectively compares probability distributions and yields more informative gradients than squared error, especially when paired with sigmoid or softmax activations. To ensure numerical stability during loss calculation, the log-sum-exp (lse) trick is applied to prevent overflow/underflow.
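A minimal sketch of the trick in NumPy; the large logits are chosen so that the naive computation overflows:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): shift by the max before exponentiating."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

logits = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(logits))))      # naive version overflows to inf (with a warning)
print(logsumexp(logits))                   # stable: ~1002.41

log_softmax = logits - logsumexp(logits)   # stable log-softmax for the cross-entropy loss
print(log_softmax)
```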