[2] Neural Networks Flashcards

1
Q

What determines the output of a neuron?

A

It is the biased, weighted sum of its inputs, passed through an activation function

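A minimal Python/NumPy sketch of this (sigmoid is just one possible activation function, assumed here for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron_output(x, w, b):
        z = np.dot(w, x) + b   # biased, weighted sum of the inputs
        return sigmoid(z)      # passed through the activation function

    print(neuron_output(np.array([0.5, -1.0]), np.array([0.8, 0.2]), 0.1))
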
2
Q

Why are activation functions important?

A

They allow the network to learn non-linearities

3
Q

What is a perceptron?

A

A special type of ANN with:

  • Real-valued inputs
  • Binary output
  • Threshold activation function
4
Q

How are perceptrons trained?

A

Adjust the weights (and the threshold, which acts as a bias weight) in the direction of the error: if the true class is higher than the perceptron's output, increase the weights on the active inputs; if it is lower, decrease them

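A small sketch of this rule in NumPy (function and variable names are illustrative), assuming binary 0/1 targets and a threshold activation:

    import numpy as np

    def train_perceptron(X, y, lr=0.1, epochs=10):
        w = np.zeros(X.shape[1])
        b = 0.0  # the bias plays the role of the (negative) threshold
        for _ in range(epochs):
            for x, target in zip(X, y):
                pred = 1 if np.dot(w, x) + b > 0 else 0
                # if the true class is higher than the prediction, weights go up;
                # if it is lower, they go down
                w += lr * (target - pred) * x
                b += lr * (target - pred)
        return w, b
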
5
Q

What idea limits the generalisability of perceptrons?

A

The Perceptron Convergence Theorem states that perceptron training will converge if and only if the problem is linearly separable

Hence, they can’t learn XOR

6
Q

What are the general approaches to updating weights?

A

Online learning updates weights after every instance; offline learning does it after every epoch.

Batch learning updates weights after every batch of instances

7
Q

What algorithm is used to train neural networks?

A

Backpropagation:
[1] Calculate the predicted output using the current weights
[2] Calculate the error
[3] Update each weight in proportion to the gradient of the error with respect to that weight, i.e. how much changing that weight affects the error

Note: the weights are updated backwards, i.e. starting from the weights out of the last hidden layer and working back towards the input

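A toy NumPy sketch of these three steps for a single hidden layer with sigmoid activations and squared error (biases omitted for brevity; this is illustrative, not an optimised implementation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, t, W1, W2, lr=0.1):
        # [1] forward pass: predicted output with the current weights
        h = sigmoid(W1 @ x)
        o = sigmoid(W2 @ h)
        # [2] error at the output
        err = o - t
        # [3] update each weight in proportion to its gradient, working
        #     backwards from the output layer
        delta_o = err * o * (1 - o)
        delta_h = (W2.T @ delta_o) * h * (1 - h)
        W2 -= lr * np.outer(delta_o, h)
        W1 -= lr * np.outer(delta_h, x)
        return W1, W2
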
8
Q

What are some potential issues when using backpropagation?

A

An improper learning rate leads to divergence or slow convergence

Overfitting, if training for too long, using too many weights, or using too few instances

Getting stuck in local minima

9
Q

How should variables be represented in an ANN?

A

Use a binary representation (e.g. one-hot encoding) for nominal variables

For numeric variables, consider scaling or standardisation

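A small sketch of one-hot encoding a nominal variable (illustrative only):

    import numpy as np

    def one_hot(index, n_categories):
        # e.g. one_hot(2, 4) -> [0., 0., 1., 0.]
        v = np.zeros(n_categories)
        v[index] = 1.0
        return v
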
10
Q

What is scaling and standardization? When should each be used?

A

Scaling - scale the numbers to [0,1]; use this if they are on a similar range

Standardisation - assume a normal distribution and standardise to N(0,1); use this if the values are more varied

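A sketch of both transformations in NumPy, assuming a 1-D array of values:

    import numpy as np

    def min_max_scale(x):
        # scaling: map the values into [0, 1]
        return (x - x.min()) / (x.max() - x.min())

    def standardise(x):
        # standardisation: zero mean and unit variance, i.e. roughly N(0, 1)
        return (x - x.mean()) / x.std()
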
11
Q

What can happen if ANN weights aren’t set appropriately?

A

If they are all set to 0, the network will be symmetric, i.e. all the weights will change together, and so it won't train

If the weights are too large, the activations will lie in the parts of the sigmoid with a shallow gradient, and so training will be slow

12
Q

How should ANN weights be set?

A

Using the fan-in, i.e. drawing each weight from a uniform random generator between -1/sqrt(d) and 1/sqrt(d), where d is the number of inputs

This ensures the variance of the weighted sum is approximately 1/3

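A sketch of this initialisation; since Var(Uniform(-a, a)) = a²/3, choosing a = 1/sqrt(d) gives each weight variance 1/(3d), so the weighted sum of d unit-variance inputs has variance of roughly 1/3:

    import numpy as np

    def fan_in_init(n_neurons, d):
        # each weight drawn uniformly from [-1/sqrt(d), 1/sqrt(d)],
        # where d is the number of inputs (the fan-in)
        a = 1.0 / np.sqrt(d)
        return np.random.uniform(-a, a, size=(n_neurons, d))
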
13
Q

How can backpropagation be sped up?

A

With momentum, in which gradients from previous steps are used in addition to the current gradient

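A sketch of a momentum update (beta, often around 0.9, controls how much of the previous step is carried forward; the names are illustrative):

    def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
        # blend the previous step's direction with the current gradient
        velocity = beta * velocity - lr * grad
        return w + velocity, velocity
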
14
Q

How can weight matrices be visualised?

A

With Hinton diagrams, in which the size of each square reflects the weight's magnitude; the square is white if the weight is positive and black if it is negative

15
Q

What are the key principles of CNNs?

A

They automatically extract features to produce a feature map

They are not fully connected - convolutions with shared weights are used instead

16
Q

What are the dimensions of a feature map?

A

In each direction, it is (image_size - filter_size) / shift + 1

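For example, a 28x28 image convolved with a 5x5 filter and a shift (stride) of 1 gives (28 - 5)/1 + 1 = 24 in each direction, i.e. a 24x24 feature map.
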
17
Q

What techniques can be applied to optimise CNNs?

A
  • Subsampling aggregates based on the maximal value; this reduces the data while retaining/emphasizing the important information
  • Weight smoothing is used when domain-specific knowledge suggests adjacent inputs are related
  • Centered weight initialization starts with higher weights in the center, as this is often where objects are found
18
Q

What is a weight agnostic network?

A

It has a single weight shared by the whole network; training is done by network topology search.

19
Q

What operations occur while training a weight agnostic network?

A
  • Insert node - add a node by splitting an existing connection
  • Add connection - connect two previously unconnected nodes
  • Change activation - change the activation function of a node
20
Q

What are HONNs?

A

Higher order neural networks connect each input to multiple nodes in the first hidden layer.

The order is the number of nodes that each input connects to.

CNNs are a special type of HONN

21
Q

Why are HONNs useful?

A

Instead of just taking the weighted sum of the inputs, they also take a weighted sum of products over combinations of inputs.

This allows them to explore higher-order relationships, i.e. products; for example, they can solve XOR

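An illustrative sketch of a second-order unit, which adds weighted products of input pairs to the usual weighted sum (the exact weighting scheme here is an assumption for illustration):

    import numpy as np

    def second_order_unit(x, w_linear, w_pairs):
        # w_pairs maps an index pair (i, j) to the weight on the product x[i] * x[j]
        linear = np.dot(w_linear, x)
        products = sum(w * x[i] * x[j] for (i, j), w in w_pairs.items())
        # e.g. w_linear=np.array([1.0, 1.0]), w_pairs={(0, 1): -2.0}
        # computes XOR on binary inputs
        return linear + products
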
22
Q

What are self-organizing maps?

A

They represent high-dimensional data in lower dimensions by mapping inputs to neurons via their weights

The weights are trained by competitive learning, in which the node whose weights are closest to the input value is chosen to fire. It updates its weights to reinforce those that made it win. A neighborhood function, which also updates nearby nodes, preserves the topology

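A toy sketch of one competitive-learning step; the Gaussian neighbourhood used here is one common choice, assumed for illustration:

    import numpy as np

    def som_step(weights, grid_coords, x, lr=0.5, sigma=1.0):
        # weights: (n_nodes, d) weight vectors; grid_coords: (n_nodes, 2) map positions
        # the node whose weights are closest to the input wins (fires)
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        # the neighborhood function also pulls nearby nodes towards the input,
        # which is what preserves the topology
        dist = np.linalg.norm(grid_coords - grid_coords[winner], axis=1)
        h = np.exp(-dist ** 2 / (2 * sigma ** 2))
        return weights + lr * h[:, None] * (x - weights)
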
23
Q

What are residual neural networks?

A

They have shortcut connections between layers.

This makes training more effective as it reduces the vanishing gradient effect

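A one-line sketch of the idea, assuming `transform` stands for whatever the layers inside the block compute:

    def residual_block(x, transform):
        # the shortcut connection adds the input straight onto the block's output,
        # giving gradients a direct path back through the network
        return transform(x) + x
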
24
Q

What is EvoCNN?

A

A GA (genetic algorithm) that automatically evolves network structures. It uses a two-level encoding to describe the layers and then their connections

Each mutation performs one of three actions:

  • Add a new unit (convolutional, pooling or full)
  • Modify an existing unit’s encoded information
  • Delete an existing unit
25
Q

What are auto-encoders?

A

Neural networks that have been trained to copy their input to their output

They use an intermediate layer called the latent representation

26
Q

How is the loss of an auto-encoder calculated?

A

The difference between the input and the output (or a domain-specific variation of it)

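A sketch of the plain version, the mean squared difference between input and reconstruction (domain-specific variants, e.g. cross-entropy for binary pixels, follow the same pattern):

    import numpy as np

    def reconstruction_loss(x, x_reconstructed):
        # mean squared difference between the input and the auto-encoder's output
        return np.mean((x - x_reconstructed) ** 2)
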
27
Q

What are the main configurations of auto encoder?

A

Under-complete auto encoders have latent representations with smaller dimensions than the input and output.

Otherwise, they are over-complete

28
Q

What do under-complete auto-encoders do?

A

They learn the most salient features of the data

29
Q

How do over-complete auto-encoders work?

A

They use regularisation to avoid simply copying the data

Sparse auto-encoders try to push as many output values of the latent representation to 0 as possible

Contractive auto-encoders regularise with a derivative penalty, meaning each node's output changes only slightly if the input changes slightly. This makes them robust to small fluctuations

30
Q

What are some particular applications of auto encoders?

A

De-noising auto-encoders remove noise from the image

Variational auto-encoders modify an image in a desired way. However, the latent space might not be continuous, and so instead of a point in the latent space, a distribution is used.

31
Q

Why is cross-entropy used?

A

Its gradients are more pronounced at extreme values, leading to faster convergence

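A sketch of binary cross-entropy for a target y in {0, 1} and predicted probability p; the loss (and its gradient) grows rapidly as p approaches the wrong extreme, which is what drives the faster convergence:

    import numpy as np

    def binary_cross_entropy(y, p, eps=1e-12):
        p = np.clip(p, eps, 1 - eps)  # avoid log(0)
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))
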
32
Q

Why is ReLU often used?

A

It is fast to compute, minimises the impact of vanishing gradient, and encourages sparsity

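A sketch of ReLU itself:

    import numpy as np

    def relu(z):
        # cheap to compute; exact zeros for negative inputs encourage sparsity,
        # and the gradient is 1 for positive inputs, limiting vanishing gradients
        return np.maximum(0.0, z)
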
33
Q

What is the purpose of regularisation?

A

It prevents weights from getting too large, and pushes as many to zero as possible (allowing them to be ignored)

34
Q

What is a particular type of regularisation?

A

Lasso regularisation uses L1 to remove irrelevant variables from a linear model

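A sketch of adding an L1 (lasso) penalty to a loss; `lam` is the regularisation strength (an assumed name):

    import numpy as np

    def l1_penalised_loss(base_loss, weights, lam=0.01):
        # the absolute-value penalty pushes irrelevant weights to exactly zero
        return base_loss + lam * np.sum(np.abs(weights))
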
35
Q

How does dropout work?

A

A random percentage of neurons is removed on each mini-batch

Note: for inference, the weights must be multiplied by the retain probability (1 - p)

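A sketch of the classic formulation described above, where p is the dropout probability ("inverted dropout", which rescales during training instead, is a common equivalent variant):

    import numpy as np

    def dropout_train(activations, p=0.5):
        # training: zero out a random fraction p of the neurons
        mask = np.random.rand(*activations.shape) >= p
        return activations * mask

    def dropout_inference(activations, p=0.5):
        # inference: keep every neuron but scale by the retain probability (1 - p)
        return activations * (1 - p)
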
36
Q

What are the general strategies for transfer learning?

A

Learn shared hidden representations (e.g. DLID). This is useful if the classes are the same, but the way they are captured differs (e.g. different camera types)

Shared features - use this when the head layers do the same general task, but the tail does a particular task.