Connectionist Prep Flashcards

(53 cards)

1
Q

What is learning by gradient descent? Explain the general idea behind it, and the role the error E has in it.

A
  • an optimisation algorithm that aims to minimise the error (E) of the NN by adjusting the model’s parameters
  • involves computing the gradients of E with respect to these parameters
  • the parameters are adjusted in the opposite direction of the gradient to minimise E (see the sketch below)
  • iteratively reduces E, in pursuit of a global minimum
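A minimal sketch of the update rule in Python/NumPy, assuming a caller-supplied grad_E function that returns the gradient of E with respect to the parameters (all names here are illustrative):

```python
import numpy as np

def gradient_descent(theta, grad_E, lr=0.1, n_steps=100):
    """Iteratively step the parameters against the gradient of E."""
    for _ in range(n_steps):
        g = grad_E(theta)        # gradient of the error w.r.t. the parameters
        theta = theta - lr * g   # move opposite to the gradient to reduce E
    return theta

# Example: E(theta) = theta^2 has gradient 2*theta and its minimum at 0
theta = gradient_descent(np.array([3.0]), lambda t: 2 * t)
print(theta)  # close to [0.]
```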
2
Q

Briefly describe what the backpropagation algorithm is, and in which way it relates to gradient descent.

A
  • algorithm used to train artificial NNs by minimising the error between the predicted output and the target values
  • two main phases: Forward Pass and Backward Pass (backpropagation)
  • Forward Pass: input data is fed into the NN, and layer-by-layer computations yield the predicted output
  • Backpropagation: works backwards through the layers, using gradient descent to minimise the error of the NN by adjusting the model’s parameters at each layer (see the sketch below)
  • uses the chain rule to calculate the partial derivatives of the loss function with respect to each parameter (weights and biases)
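A compact sketch of the two phases for a one-hidden-layer network with sigmoid units and squared error, in NumPy; the shapes and names are illustrative assumptions, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, W2, lr=0.5):
    # Forward pass: layer-by-layer computation of the predicted output
    h = sigmoid(W1 @ x)                           # hidden activations
    y = sigmoid(W2 @ h)                           # predicted output
    # Backward pass: chain rule gives dE/dW at each layer (E = squared error)
    delta_out = (y - t) * y * (1 - y)             # error signal at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # propagated back one layer
    W2 -= lr * np.outer(delta_out, h)             # gradient-descent updates
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2
```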
3
Q

What are the common problems of gradient descent that may limit its effectiveness?

A
  • local minima
  • slow convergence
  • sensitivity to learning rate
  • dependence on initial weight selection
4
Q

Explain the role of activation functions in a NN

A

They play a crucial role by introducing non-linearities to the model, which are essential for enabling NN to learn complex patterns in the data

5
Q

What is the purpose of the cost function in a NN?

A

Also known as the loss function, it quantifies the inconsistency between predicted values and the corresponding correct values

6
Q

Explain the role of bias terms in a NN

A
  • bias terms add a level of flexibility and adaptability to the model.
  • they “shift” the activation function, providing every neuron with a trainable constant value, in addition to the inputs
7
Q

What is a perceptron?

A

an artificial neuron which takes in many input signals and produces a single binary output signal (0 or 1)
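A minimal sketch of the perceptron’s decision rule (the AND example is illustrative):

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the input signals, thresholded to a binary output."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: weights and bias chosen so the perceptron computes logical AND
print(perceptron([1, 1], w=[1, 1], b=-1.5))  # 1
print(perceptron([1, 0], w=[1, 1], b=-1.5))  # 0
```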

8
Q

Explain the differences between Batch Gradient Descent and Stochastic Gradient Descent

A

In BGD, the model parameters are updated in one go, based on the average gradient over the entire training dataset. In SGD, updates occur for each individual training example (or, in mini-batch GD, for each small batch). The sketch below contrasts the two.
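Assuming a caller-supplied per-example gradient function grad_i (illustrative names):

```python
import numpy as np

def batch_gd_epoch(theta, data, grad_i, lr):
    # BGD: one update per epoch, from the average gradient over all examples
    g = np.mean([grad_i(theta, x, t) for x, t in data], axis=0)
    return theta - lr * g

def sgd_epoch(theta, data, grad_i, lr):
    # SGD: one update per training example (noisier, but cheaper per step)
    for x, t in data:
        theta = theta - lr * grad_i(theta, x, t)
    return theta
```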

9
Q

Which gradient descent variant is preferred for large datasets, and why?

A

Stochastic GD is preferred over Batch GD.

  • Although BGD usually converges to a more accurate minimum, it is extremely computationally expensive
  • SGD converges faster and requires less memory. However, updates can be noisy, and it may converge to a local minimum rather than the global minimum
10
Q

Define generalisation

A

The ability of a trained model to perform well on unseen data

11
Q

How can you measure the generalisation ability of an MLP?

A
  • cross validation
  • hold-out strategy (train/test sets)
  • consider choice of evaluation measure
12
Q

How can you decide on an optimal number of hidden units?

A
  • apply domain knowledge to estimate a range
  • test the model across that range to fine-tune the selection
  • this may be infeasible for complex models
13
Q

Explain the difference between two common activation functions of your choice

A

Sigmoid vs tanh
1. Output Range:
- Sigmoid: (0, 1), used for binary classification
- tanh: (-1, 1), suitable for zero-centred data
2. Symmetry:
- Sigmoid is asymmetric, biased towards positive values
- tanh is symmetric around the origin (0, 0)
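A quick numeric check of the two output ranges and symmetries (illustrative values):

```python
import numpy as np

z = np.array([-2.0, 0.0, 2.0])
sigmoid = 1 / (1 + np.exp(-z))  # range (0, 1); sigmoid(0) = 0.5, not zero-centred
tanh = np.tanh(z)               # range (-1, 1); tanh(0) = 0, symmetric about origin
print(sigmoid)  # [0.119 0.5   0.881]
print(tanh)     # [-0.964  0.     0.964]
```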

14
Q

What are the problems with squared error as the loss function? Give two alternatives.

A

There are tricky problems with squared error:

  • if the desired output is 1 and the actual (sigmoid) output is very close to 0, there is almost no gradient, so learning stalls (see the sketch below)
  • alternatives: softmax (with a cross-entropy cost), relative entropy
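A small numeric illustration of the vanishing gradient, assuming a sigmoid output unit y and E = 0.5*(y - t)^2 (the cross-entropy comparison is a standard result, not from the card):

```python
# Squared error on a sigmoid output: dE/dz = (y - t) * y * (1 - y),
# which vanishes as y -> 0 even when the target t = 1
t, y = 1.0, 1e-8
print((y - t) * y * (1 - y))  # ~ -1e-08: almost no gradient, learning stalls

# Cross-entropy on the same sigmoid output: dE/dz = y - t
print(y - t)                  # ~ -1.0: a strong, useful gradient
```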
15
Q

Define what a Deep Neural Network is

A
  • consists of multiple layers that transform the input in a hierarchical fashion
  • typically a feed-forward NN with multiple hidden layers, allowing it to model complex non-linear relationships
16
Q

Formal definition of overfitting in practice

A

during learning, the error on the training examples keeps decreasing, but the generalisation error reaches a minimum and then starts growing again.

17
Q

Training data contains information about the regularities in the mapping from input to output.
But it also contains noise, explain how.

A
  • the target values may be unreliable
  • there will be accidental regularities just because of the particular training cases that were chosen
18
Q

When we fit a model, it cannot tell which regularities are real and which are caused by sampling error. Which regularities does it fit, and what is the worst-case scenario?

A
  • Both
  • worst case: If the model is very flexible it can model the sampling error really well
19
Q

What does a model having the “right capacity” entail?

A
  • enough to model the true regularities
  • not enough to also model the spurious regularities
20
Q

How to prevent overfitting in NNs

A
  • limiting number of weights
  • weight decay
  • early stopping
  • combining diverse networks
21
Q

Standard ways to limit the capacity of a neural net

A
  • Limit the number of hidden units.
  • Limit the size of the weights.
  • Stop the learning before it has time to overfit.
22
Q

How to limit the size of the model by using fewer hidden units in practice

A

trial and error

23
Q

What is weight decay?

A
  • a method for limiting the size of a model
  • involves adding an extra term to the cost function that penalises the squared weights, i.e. keeps weights small unless they have large error derivatives (see the sketch below)
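A minimal sketch of the penalised cost and the resulting update, with an illustrative decay coefficient lam:

```python
import numpy as np

def cost_with_weight_decay(data_error, w, lam=0.01):
    # E_total = E_data + (lam / 2) * sum of squared weights
    return data_error + 0.5 * lam * np.sum(w ** 2)

def weight_decay_update(w, grad_data_error, lr=0.1, lam=0.01):
    # The penalty contributes lam * w to the gradient, shrinking each weight
    # towards zero unless it has a large error derivative to justify it
    return w - lr * (grad_data_error + lam * w)
```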
24
Q

What does weight decay prevent, and what does it improve and how?

A
  • It prevents the network from using weights that it does not need.
  • It tends to keep the network in the linear region, where its capacity is lower.
  • This helps to stop it from fitting the sampling error. It makes a smoother model in which the output changes more slowly as the input changes.
  • It can often improve generalisation a lot.
25

Q

What is the idea behind early stopping for preventing overfitting?

A
  • it is expensive to train a big model with lots of data
  • it is cheaper to stop adjusting the weights once generalisation starts getting worse (see the sketch below)
  • the capacity is limited because the weights have not had time to grow big
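A sketch of the early-stopping loop; the training and validation callables are assumptions supplied by the caller, not part of the card:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              get_weights, set_weights,
                              patience=5, max_epochs=1000):
    """Stop adjusting weights once generalisation starts getting worse."""
    best_err, best_weights, waited = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch()              # one pass of weight adjustment
        err = validation_error()       # error on held-out data
        if err < best_err:             # generalisation still improving
            best_err, best_weights, waited = err, get_weights(), 0
        else:                          # getting worse: wait, then stop
            waited += 1
            if waited >= patience:
                break                  # weights never get time to grow big
    set_weights(best_weights)          # roll back to the best model seen
    return best_err
```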
26

Q

What hold-out strategy is recommended for model selection of an MLP?

A

Strategies which include a **validation set**
27

Q

What is gradient descent?

A
  • optimisation technique used to find optimal parameters of a model by iteratively updating them in the direction of steepest descent of the loss function
  • aims to minimise the error of the model
28

Q

What is the recommended method for the three-way hold-out strategy when training a NN?

A
  • **training data:** used for learning the parameters of the model
  • **validation data:** used for deciding what type of model and what amount of regularisation works best **(fine tuning)**
  • **test data:** used to get a final, unbiased estimate of how well the network works
29

Q

Why NN ensembles?

A

The error of the group's averaged prediction is never larger than the average error of the individual predictors (and is strictly smaller unless the predictors are identical)
30

Q

Briefly explain the steps of k-Fold Cross Validation

A
  • divide the data into k disjoint subsets - “folds”
  • for each of k experiments, use k-1 folds for training and the remaining fold for testing
  • repeat for all k folds and average the accuracy/error rates (see the sketch below)
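A minimal NumPy sketch of the loop, assuming caller-supplied train and evaluate functions (illustrative):

```python
import numpy as np

def k_fold_cv(data, k, train, evaluate):
    """Average the test error/accuracy over k disjoint train/test splits."""
    folds = np.array_split(np.random.permutation(len(data)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]  # the one held-out fold
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = train([data[j] for j in train_idx])      # train on k-1 folds
        scores.append(evaluate(model, [data[j] for j in test_idx]))
    return np.mean(scores)
```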
31

Q

How to achieve Network Ensembling with just one training run

A

Use the **Dropout** method (see the sketch below):
  • during training, at each step knock out some randomly chosen connections
  • when predicting, **use all connections**; you will need to introduce a **normalising constant** for this to work
  • **equivalent to having a very large ensemble of networks**
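A sketch of dropout applied to a layer's activations (whole units rather than individual connections, the common variant); the rescaling here is the usual “inverted dropout” way of introducing the normalising constant:

```python
import numpy as np

def dropout_forward(h, p=0.5, training=True):
    """Knock out a random subset of units in training; use all at test time."""
    if training:
        mask = np.random.rand(*h.shape) > p  # each unit dropped with prob p
        return h * mask / (1 - p)            # normalising constant keeps the scale
    return h                                 # prediction: use everything
```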
32

Q

Precautions for applying dropout in practice

A
  • it doesn't always work: **preconditions required**
  • must begin with an oversized net capacity to avoid underfitting
33

Q

What is online learning?

A

Weight updates occur for **each example** during Gradient Descent
34

Q

Discuss the value of the gradient at different slopes of the error surface. What does gradient descent do at these values, and are we satisfied with this?

A
  • the gradient is large where the error surface is steep, and small where it is flat
  • sometimes we would like to run where it's flat and slow down when it gets too steep; GD does **precisely the contrary**
35

Q

Briefly discuss some of the fixes for the issues of Gradient Descent

A

Use an **adaptive learning rate**:
  • increase the rate slowly if it is not diverging
  • decrease the rate quickly when it starts diverging

Use **Momentum**: instead of using the gradient to change the position of the weight, use it to change the velocity of the change (see the sketch below)

Use a **fixed step**: GD decides where to go, but always moves at the same pace

**Normalise the gradient** based on some combination of previous gradients
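A sketch of the momentum fix described above: the gradient changes a velocity, and the velocity changes the weights (names and coefficients are illustrative):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Gradient updates the velocity; the velocity updates the weights."""
    v = beta * v - lr * grad  # accumulate speed along persistent directions
    w = w + v                 # move by the velocity, not the raw gradient
    return w, v
```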
36

Q

Explain what RL is, and what we want to learn from it

A
  • learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
  • we want to learn how to act to accomplish goals
  • given an environment that contains rewards, we want to learn a policy for acting
37

Q

Define a simple RL setup and goal

A

Setup: we have an **agent** interacting with an environment, which it can affect through **actions**. The agent may be able to sense the environment partially or fully.

Goal: the agent tries to maximise the long-term reward conveyed by a **reward signal**
38

Q

Explain the differences between Supervised and Reinforcement Learning

A
  • in SL, there is an external "supervisor" with knowledge of the environment, which it **shares with the agent** to complete the task
  • both strategies use **mappings between inputs and outputs**, but in RL a reward function acts as **feedback** to the agent
  • supervised learning relies on **labelled training data**
39

Q

Explain the differences between Unsupervised and Reinforcement Learning

A
  • in UL, there is **no feedback from the environment**
  • in UL, the task is to find the **underlying patterns** rather than the mapping from input to output
40

Q

Characteristics of RL

A
  • no supervisor, only a reward signal
  • feedback is delayed, not instantaneous
  • time really matters
  • the agent's actions affect the data it subsequently receives
41

Q

Why is Deep Learning hard?

A
  • when networks get deep, the gradient **vanishes**
  • when a network is untrained, the deeper down a hidden unit is, the more **subtle** its effect on the outputs
  • this means that changing it does little to the error
42

Q

Briefly discuss the two main categories of Deep Learning solutions

A

Pre-training:
  • stack deep networks layer-by-layer
  • make sure each layer represents the previous layer meaningfully before adding another layer

Use Artificial Targets: the real problem is that inner layers don't get a gradient, so every now and then use **hard targets**:
  • generate some random targets for the layer
  • evaluate them all
  • use the best one
43

Q

Advantages and disadvantages of pre-training by auto-association

A

Advantage: you can use unlabelled data

Disadvantage (potentially disastrous):
  • you aren't considering at all the **property you want to predict**
  • **you compress regardless of the property**; if the compression is lossy, the loss can be in the wrong place
44

Q

Advantages and disadvantages of pre-training without auto-association

A

Advantage:
  • you **compress based on the property you are trying to predict**; if the compression is lossy, the loss is probably in the right place

Disadvantages:
  • can't use unlabelled training data
  • less training data available as a result
45

Q

What is clustering?

A
  • unsupervised learning: grouping unlabelled data
  • find underlying patterns in the data
  • large choice of distance functions
  • partitioning and hierarchical methods
46

Q

Possible clustering implementations for connectionist models

A

k-means can be achieved by using backpropagation in a **non-linear self-associating network**:
  • there is one hidden layer, with each node representing a cluster centre
  • the hidden layer is **hardmax**, so only one of the neurons is activated by a given input
  • a neuron's weights are adjusted when it is activated **(recomputes the cluster centre)** (see the sketch below)
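A sketch of the hardmax update described above: each hidden node holds a cluster centre, only the nearest one activates for a given input, and its weights move towards that input (illustrative, not a full self-associating network):

```python
import numpy as np

def hardmax_update(centres, x, lr=0.1):
    """Activate the single nearest node and move its centre towards x."""
    winner = np.argmin(np.linalg.norm(centres - x, axis=1))  # only one unit fires
    centres[winner] += lr * (x - centres[winner])            # recompute the centre
    return centres
```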
47

Q

What are the strengths and weaknesses of Clustering compared to Principal Component Analysis (in connectionist models)?

A

PCA: linear self-associating networks. Clustering: non-linear (hardmax) self-associating networks.
  • PCA builds **global features** (strength) while clustering builds **local features** (weakness)
  • PCA only considers **linear** combinations of the inputs (weakness)
  • clustering builds much **stronger features** (strength)
48

Q

State the difference between Feedforward and Feedback networks

A

**Information Flow:**
  • in FFNs, information flows in one direction
  • FBNs have recurrent connections, which allow them to maintain and propagate information over time
  • consequently, FBNs can model sequences and time-dependent data
49

Q

Which network architecture (FFN, FBN) do you think is easier to deal with? Justify your choice.

A
  • FFNs are easier to train and more stable because there are no feedback loops
  • FBNs can be more challenging to train due to vanishing gradients
50

Q

Describe Hopfield Networks and Boltzmann Machines

A
  • both are types of Recurrent NN (RNN)
  • HNs consist of **binary threshold units with symmetric connections**
  • BMs use **binary stochastic units** and incorporate a **probabilistic aspect in the update rule**
51

Q

Discuss the learning process in Hopfield Networks and Boltzmann Machines

A

Hopfield Networks:
  • learning involves adjusting the weights to store certain memories or patterns, essentially capturing **second-order interactions**

Boltzmann Machines:
  • learn to generate configurations according to a probability distribution
  • this involves adjusting the weights based on the **correlation differences between the training and the generated data**
52

Q

Do Hopfield Networks and Boltzmann Machines tackle similar problems?

A
  • HNs are deterministic, while BMs are probabilistic
  • consequently, BMs can represent **higher-order interactions**
53

Q

Discuss similarities and differences between MLPs and Boltzmann Machines

A
  • both can have hidden layers
  • feedforward vs recurrent
  • the roles of the hidden units are somewhat similar in both models, aiming to learn complex patterns/structures
  • the manner in which the hidden units operate differs: deterministic in an MLP, probabilistic in a BM