Model Training, Tuning and Evaluation Flashcards

(63 cards)

1
Q

What is an activation function?

A

The function inside a neuron that takes the weighted sum of the inputs and uses that to determine the neuron's output.

2
Q

What is a rectified linear unit activation function (ReLU)?

A

An activation function that is linear above 0 (it outputs the input directly) and outputs 0 for any input below 0.
This is easy and fast to compute.

3
Q

What is a parametric ReLU activation function?

A

The same as ReLU, except instead of outputting 0 on the negative half, the slope of the negative half is learned through back-propagation.
More complicated and computationally expensive.

4
Q

What is a Swish activation function?

A

Developed by Google; defined as f(x) = x · sigmoid(x). Works very well for very deep neural networks.
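
A minimal numpy sketch of the three activation functions above (alpha stands in for PReLU's learned slope; in a real network it would be updated by back-propagation):

```python
import numpy as np

def relu(x):
    # Linear (identity) above 0, outputs 0 below 0
    return np.maximum(0, x)

def prelu(x, alpha):
    # Like ReLU, but the negative-side slope alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def swish(x):
    # x * sigmoid(x) - smooth, works well in very deep networks
    return x / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), prelu(x, alpha=0.1), swish(x), sep="\n")
```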

5
Q

What activation functions do RNNs tend to use?

A

Non-linear activation functions, typically TanH (hyperbolic tangent).

6
Q

What are convolutional neural networks mainly used for?

A

Image analysis

7
Q

What does feature location invariant mean? And which type of neural network is it?

A

Means it doesn’t matter where within an image the key object is located. Convolutional neural networks.

8
Q

How does a convolutional neural network work?

A

Takes a source image and breaks it into overlapping chunks called convolutions. These convolutions are then layered, with each layer processing features of increasing complexity.
For example, start with lines, then shapes, then recognising and assembling those shapes, then recognising whole objects, etc.

9
Q

What is something important to note about the source data for CNNs?

A

It must be of the appropriate dimensions - width x height x colour channels.
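
A minimal Keras sketch of the layering described in the previous card, assuming hypothetical 28x28 RGB inputs - note the input shape is width x height x colour channels:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Input must match width x height x colour channels
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 3)),
    # Early convolutions pick up simple features such as lines and edges
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Deeper convolutions assemble those into shapes and objects
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```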

10
Q

How are neural networks trained? What is the name of the method?

A

Gradient descent

11
Q

What is an epoch in gradient descent?

A

One full pass through the training data - a single iteration of training.

12
Q

What is the downside of the learning rate being too high or too low?

A

Too high: might overshoot the optimal solution
Too low: will take too long to find the optimal solution
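
A toy sketch tying together the last three cards - gradient descent, epochs and the learning rate. The quadratic loss and the numbers are made up for illustration:

```python
# Minimise loss(w) = (w - 3)^2 by gradient descent; the gradient is 2*(w - 3)
def train(learning_rate, epochs=20):
    w = 0.0
    for _ in range(epochs):            # each loop is one epoch (training iteration)
        gradient = 2 * (w - 3)
        w -= learning_rate * gradient  # step downhill along the gradient
    return w

print(train(0.1))    # converges towards the optimum w = 3
print(train(0.001))  # too low: barely moves in 20 epochs
print(train(1.1))    # too high: overshoots the optimum and diverges
```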

13
Q

What is the purpose of regularisation?

A

The practice of preventing overfitting

14
Q

What are 3 high-level ways of doing regularisation?

A
  • Use a simpler model
  • Dropout - remove randomly selected neurons during training to reduce over-reliance on specific neurons
  • Early stopping - automatically stop the training earlier than the dictated epochs to prevent overfitting (dropout and early stopping are both sketched below)
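
A hedged Keras sketch of the dropout and early-stopping items above (the layer sizes, dropout rate and patience value are arbitrary choices for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    # Dropout: randomly zero out 50% of this layer's neurons during training
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt before the dictated epochs once validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stop])
```
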
15
Q

What is L1 regularisation?

A

A regularisation method where a regularisation term is added as your weights are learned. The L1 term is the sum of the absolute values of the weights.

16
Q

What is L2 regularisation?

A

A regularisation method where a regularisation term is added as your weights are learned. The L2 term is the sum of the squares of the weights.

17
Q

What is the difference between L1 and L2 regularisation?

A

With L1 the regularisation term is the sum of the absolute values of the weights; with L2 it is the sum of the squares of the weights. L2 means that none of the features are lost, just down-weighted, whereas with L1 entire features can disappear (their weights are driven to exactly 0), which performs a kind of feature selection.
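
A small numpy sketch of the two penalty terms (the weights and lam, the regularisation strength, are made-up values):

```python
import numpy as np

weights = np.array([0.5, -1.2, 0.0, 2.0])
lam = 0.01  # regularisation strength (a hyperparameter)

l1_penalty = lam * np.sum(np.abs(weights))  # sum of absolute weights - drives some to exactly 0
l2_penalty = lam * np.sum(weights ** 2)     # sum of squared weights - shrinks all, loses none

# Either term is added to the loss: total_loss = base_loss + penalty
print(l1_penalty, l2_penalty)
```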

18
Q

What are 4 ways to counteract the vanishing gradient problem?

A

  • Use a multi-level hierarchy (break the levels into their own sub-networks that are individually trained)
  • Use long short-term memory (LSTM) architectures (for RNNs)
  • Use residual networks, e.g. ResNet (an ensemble of shorter networks)
  • Use a different activation function (e.g. ReLU)

19
Q

What is the vanishing gradient problem?

A

When the slope of the learning curve approaches 0, the weight updates propagated back through the network become vanishingly small. This can cause numerical issues and means that the NN learns very slowly.

20
Q

Why can accuracy be a problematic metric?

A

Because it can be misleading. For instance, a test for a very rare disease could be right 99.9% of the time simply by always guessing that the person doesn’t have it.

21
Q

What is recall?

A

The number of correct positive predictions out of the total number of actual positives. Recall = TP / (TP + FN).

22
Q

What is precision?

A

The number of correct positive predictions out of the total number of positive predictions made. Precision = TP / (TP + FP).

23
Q

What is specificity?

A

The number of correct negative predictions out of the total number of actual negatives. Specificity = TN / (TN + FP). (Like recall, but for negatives.)

24
Q

In layman’s terms, what is F1 score?

A

The harmonic mean of precision and recall - a single score that balances both.
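
A plain-Python sketch computing the four metrics from the last four cards, given confusion-matrix counts (the counts are made up):

```python
# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 80, 10, 95, 15

recall      = tp / (tp + fn)  # correct positives out of actual positives
precision   = tp / (tp + fp)  # correct positives out of predicted positives
specificity = tn / (tn + fp)  # correct negatives out of actual negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(recall, precision, specificity, f1)
```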

25
Q

RMSE, MAE and R-Squared cannot be used for _____ ML problems.

A

Classification

26
Q

What is the perfect area under the Receiver Operating Characteristic (ROC) Curve?

A

1 is perfect

27
Q

What is a ROC curve?

A

Receiver Operating Characteristic curve. The plot of the true positive rate vs the false positive rate at various thresholds

28
Q

What is R^2?

A

The square of the correlation coefficient between observed outcomes and predicted outcomes

29
Q

What is RMSE?

A

Root mean squared error. The square root of the average squared error between the predicted values and the actual values

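A numpy sketch of RMSE and R^2 as defined in the two cards above (the values are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.4])

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # root of the mean squared error
r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2      # square of the correlation coefficient

print(rmse, r2)
```
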
30
Q

What is an ensemble method?

A

A method that combines several models so they all work together on a prediction

31
Q

What is bagging in ML?

A

Generating new training sets by random sampling with replacement. A model is then trained on each resampled set, and these models can be trained in parallel

32
Q

What is boosting in ML?

A

Weights are assigned to observations and adjusted as the models are trained sequentially, with each model focusing on the observations the previous ones got wrong. Generally yields good accuracy but is prone to overfitting

33
Q

Which ensemble method can be parallelised and thus can happen faster?

A

Bagging

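A hedged scikit-learn sketch contrasting the two ensemble methods (the estimator counts are arbitrary; note only the bagging model takes n_jobs, since its resampled models are independent and can train in parallel):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: independent models on bootstrap resamples - trainable in parallel
bagging = BaggingClassifier(n_estimators=50, n_jobs=-1).fit(X, y)

# Boosting: models trained sequentially, re-weighting hard observations
boosting = AdaBoostClassifier(n_estimators=50).fit(X, y)
```
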
34
Q

What is Automatic Model Tuning (AMT) in SageMaker?

A

Automatic tuning of your SageMaker model's hyperparameters

35
Q

What has to be specified for SM AMT?

A

The hyperparameters you care about, the ranges you want to try, and the metric you are optimising for

36
Q

What is warm starting in SM AMT?

A

The act of using one or more previous tuning jobs as a starting point to inform automatic model tuning which hyperparameter combinations to try next

37
Q

What are the 4 tuning approaches in AMT?

A

  • Random search
  • Grid search
  • Bayesian optimisation
  • Hyperband

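A hedged sketch using the SageMaker Python SDK, showing the three things card 35 says you must specify plus the tuning strategy. The image URI, role, metric name and hyperparameter names are hypothetical placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

estimator = Estimator(
    image_uri="my-training-image",  # hypothetical training container
    role="my-sagemaker-role",       # hypothetical IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",  # the metric you are optimising for
    hyperparameter_ranges={                  # the hyperparameters you care about + ranges to try
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",  # one of the four approaches from card 37
    max_jobs=20,
    max_parallel_jobs=4,
)
# tuner.fit({"train": "s3://my-bucket/train"})  # hypothetical S3 input
```
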
38
Q

What type of algorithms is hyperband tuning in AMT good for? What is a benefit of using hyperband tuning?

A

It works well with iterative algorithms. It is faster than random search or Bayesian optimisation

39
Q

What is a downside of using grid search in SM AMT?

A

Computationally expensive

40
Q

What is an upside and a downside of using random search in SM AMT?

A

Upside: no dependence on prior runs, so jobs can run completely in parallel
Downside: might not yield an optimal solution

41
Q

What is SageMaker Autopilot? What is the 'auto' part and what do you have to do?

A

A wrapper on a different product called AutoML. You load the data from S3 and say which column you would like to predict; the data pre-processing, algorithm selection, hyperparameter tuning and infrastructure provisioning are all handled completely for you.

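A hedged sketch of kicking off an Autopilot job from the SageMaker Python SDK's AutoML class (the role, target column and S3 path are hypothetical):

```python
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role="my-sagemaker-role",        # hypothetical IAM role
    target_attribute_name="churn",   # the column you would like to predict (hypothetical)
    max_candidates=10,               # cap on candidate pipelines to try
)
# automl.fit(inputs="s3://my-bucket/training-data.csv")  # tabular CSV in S3 (hypothetical path)
```
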
42
Q

What two data formats are applicable for SageMaker Autopilot?

A

Tabular CSV or Parquet

43
Q

What service would you integrate with SageMaker Autopilot to give some explainability to results?

A

SageMaker Clarify

44
Q

What are the 3 different SageMaker Autopilot modes?

A

Hyperparameter Optimisation, Ensembling, Auto

45
Q

What is SageMaker Notebooks and what is it useful for?

A

Jupyter Notebooks in SageMaker Studio. Allows you to share notebooks between users.

46
Q

Can SageMaker Debugger trigger notifications?

A

Yes, sent through SNS

47
Q

What is SageMaker Model Registry?

A

A catalog of your models where you can associate metadata with each of them.

48
Q

In terms of governance, why is SageMaker Model Registry useful?

A

Can be used to manage the approval status of a model and share models across your organisation for different uses

49
Q

What are SageMaker Model Cards?

A

Web pages that say what your models are meant to do and what data they were trained on

50
Q

What is TensorBoard?

A

A third-party tool for visualising metrics associated with your model as it is being trained, such as loss and accuracy. Includes graphs for visualisation

51
Q

How does TensorBoard integrate with SageMaker?

A

Through the console or through a URL. It does require modifications to your training script, however

52
Q

What is SageMaker Training Compiler?

A

A feature that compiles and optimises training jobs on GPU instances for you automatically

53
Q

What is SageMaker Warm Pools? Why is this useful?

A

A feature that allows you to retain and re-use provisioned infrastructure. Means that you don't have to keep spinning up the same infrastructure over and over again

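A hedged sketch: on a SageMaker Estimator, warm pools are enabled via keep_alive_period_in_seconds (the other arguments here are hypothetical placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="my-training-image",      # hypothetical container image
    role="my-sagemaker-role",           # hypothetical IAM role
    instance_count=1,
    instance_type="ml.g5.xlarge",
    keep_alive_period_in_seconds=1800,  # retain the provisioned infrastructure for reuse
)
```
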
54
Q

Is SageMaker Training Compiler compatible with distributed training libraries?

A

No

55
Q

What is SageMaker Checkpointing?

A

Creates snapshots during your model training, allowing you to restart from these if something goes wrong

56
Q

What are SageMaker Cluster Health Checks?

A

A health-checking feature that checks the health of the instances in your cluster and performs actions as required, e.g. replacing faulty instances

57
Q

Where are SageMaker Checkpoint snapshots stored?

A

S3

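A hedged sketch of enabling checkpointing on an Estimator - snapshots sync to the given S3 URI (the image, role and bucket are hypothetical):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="my-training-image",                   # hypothetical container image
    role="my-sagemaker-role",                        # hypothetical IAM role
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    checkpoint_s3_uri="s3://my-bucket/checkpoints",  # snapshots are stored in S3
    checkpoint_local_path="/opt/ml/checkpoints",     # where the container writes them
)
```
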
58
Q

What is distributed model parallelism?

A

For very large models where even a maxed-out single instance is not enough. You split shards of the model and optimiser states across multiple GPUs, e.g. each GPU might handle 2 layers' optimiser states

59
Q

What is distributed data parallelism?

A

You perform distributed model parallelism (splitting shards of the model and optimiser states across GPUs) AND shard the trainable parameters and gradients across GPUs as well

60
Q

What is an optimiser state?

A

The data the optimiser tracks during training, including the current weights of the model

61
Q

At what scale of parameters is optimiser state sharding useful?

A

Generally for models with more than 1 billion parameters

62
Q

What does an Elastic Fabric Adapter do?

A

Accelerates your training by making better use of network bandwidth between nodes - the idea is to make it equivalent to having an on-premises HPC cluster

63
Q

What is a downside of using an Elastic Fabric Adapter?

A

It only works with the Nvidia Collective Communications Library (NCCL). In short, you are locked in to Nvidia GPUs