Model Training, Tuning and Evaluation Flashcards

(63 cards)

1
Q

What is an activation function?

A

The function inside a neuron that takes the weighted sum of the inputs and uses that to determine the neuron's output.

2
Q

What is a rectified linear unit activation function (ReLU)?

A

An activation function that is linear above 0 (it outputs the input directly) and outputs 0 for any input below 0.
This is easy and fast to compute.

3
Q

What is a parametric ReLU activation function?

A

The same as ReLU, except instead of outputting 0 on the negative half, the slope of the negative half is learned through back-propagation.
More complicated and computationally expensive.

4
Q

What is a Swish activation function?

A

Developed by Google; defined as f(x) = x · sigmoid(x). Works very well for very deep neural networks.
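
A minimal numpy sketch of the three activation functions above (alpha stands in for PReLU's learned slope; in a real network it would be updated by back-propagation):

```python
import numpy as np

def relu(x):
    # Linear (identity) above 0, outputs 0 below 0
    return np.maximum(0, x)

def prelu(x, alpha):
    # Like ReLU, but the negative-side slope alpha is a learned parameter
    return np.where(x > 0, x, alpha * x)

def swish(x):
    # x * sigmoid(x) - smooth, works well in very deep networks
    return x / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), prelu(x, alpha=0.1), swish(x), sep="\n")
```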

5
Q

What activation functions do RNNs tend to use?

A

Non-linear activation functions, typically TanH (hyperbolic tangent).

6
Q

What are convolutional neural networks mainly used for?

A

Image analysis

7
Q

What does feature location invariant mean? And which type of neural network is it?

A

Means it doesn’t matter where within an image the key object is located. Convolutional neural networks.

8
Q

How does a convolutional neural network work?

A

Takes a source image and breaks it into overlapping chunks called convolutions. These convolutions are then layered, with each layer processing features of increasing complexity.
For example, start with lines, then shapes, then recognising and assembling those shapes, then recognising whole objects, etc.

9
Q

What is something important to note about the source data for CNNs?

A

It must be of the appropriate dimensions - width x height x colour channels.
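
A minimal Keras sketch of the layering described in the previous card, assuming hypothetical 28x28 RGB inputs - note the input shape is width x height x colour channels:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Input must match width x height x colour channels
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 3)),
    # Early convolutions pick up simple features such as lines and edges
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Deeper convolutions assemble those into shapes and objects
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```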

10
Q

How are neural networks trained? What is the name of the method?

A

Gradient descent

11
Q

What is an epoch in gradient descent?

A

One full pass through the training data - a single iteration of training.

12
Q

What is the downside of the learning rate being too high or too low?

A

Too high: might overshoot the optimal solution
Too low: will take too long to find the optimal solution
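
A toy sketch tying together the last three cards - gradient descent, epochs and the learning rate. The quadratic loss and the numbers are made up for illustration:

```python
# Minimise loss(w) = (w - 3)^2 by gradient descent; the gradient is 2*(w - 3)
def train(learning_rate, epochs=20):
    w = 0.0
    for _ in range(epochs):            # each loop is one epoch (training iteration)
        gradient = 2 * (w - 3)
        w -= learning_rate * gradient  # step downhill along the gradient
    return w

print(train(0.1))    # converges towards the optimum w = 3
print(train(0.001))  # too low: barely moves in 20 epochs
print(train(1.1))    # too high: overshoots the optimum and diverges
```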

13
Q

What is the purpose of regularisation?

A

The practice of preventing overfitting

14
Q

What are 3 high-level ways of doing regularisation?

A
  • Use a simpler model
  • Dropout - remove randomly selected neurons during training to reduce over-reliance on specific neurons
  • Early stopping - automatically stop the training earlier than the dictated epochs to prevent overfitting (dropout and early stopping are both sketched below)
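
A hedged Keras sketch of the dropout and early-stopping items above (the layer sizes, dropout rate and patience value are arbitrary choices for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    # Dropout: randomly zero out 50% of this layer's neurons during training
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: halt before the dictated epochs once validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stop])
```
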
15
Q

What is L1 regularisation?

A

A regularisation method where a regularisation term is added as your weights are learned. The L1 term is the sum of the absolute values of the weights.

16
Q

What is L2 regularisation?

A

A regularisation method where a regularisation term is added as your weights are learned. The L2 term is the sum of the squares of the weights.

17
Q

What is the difference between L1 and L2 regularisation?

A

With L1 the regularisation term is the sum of the absolute values of the weights; with L2 it is the sum of the squares of the weights. L2 means that none of the features are lost, just down-weighted, whereas with L1 entire features can disappear (their weights are driven to exactly 0), which performs a kind of feature selection.
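
A small numpy sketch of the two penalty terms (the weights and lam, the regularisation strength, are made-up values):

```python
import numpy as np

weights = np.array([0.5, -1.2, 0.0, 2.0])
lam = 0.01  # regularisation strength (a hyperparameter)

l1_penalty = lam * np.sum(np.abs(weights))  # sum of absolute weights - drives some to exactly 0
l2_penalty = lam * np.sum(weights ** 2)     # sum of squared weights - shrinks all, loses none

# Either term is added to the loss: total_loss = base_loss + penalty
print(l1_penalty, l2_penalty)
```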

18
Q

What are 4 ways to counteract the vanishing gradient problem?

A

  • Use a multi-level hierarchy (break the levels into their own sub-networks that are individually trained)
  • Use long short-term memory (LSTM) architectures (for RNNs)
  • Use residual networks, e.g. ResNet (an ensemble of shorter networks)
  • Use a different activation function (e.g. ReLU)

19
Q

What is the vanishing gradient problem?

A

When the slope of the learning curve approaches 0, the weight updates propagated back through the network become vanishingly small. This can cause numerical issues and means that the NN learns very slowly.

20
Q

Why can accuracy be a problematic metric?

A

Because it can be misleading. For instance, a test for a very rare disease could be right 99.9% of the time simply by always guessing that the person doesn’t have it.

21
Q

What is recall?

A

The number of correct positive predictions out of the total number of actual positives. Recall = TP / (TP + FN).

22
Q

What is precision?

A

The number of correct positive predictions out of the total number of positive predictions made. Precision = TP / (TP + FP).

23
Q

What is specificity?

A

The number of correct negative predictions out of the total number of actual negatives. Specificity = TN / (TN + FP). (Like recall, but for negatives.)

24
Q

In layman’s terms, what is F1 score?

A

The harmonic mean of precision and recall - a single score that balances both.
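
A plain-Python sketch computing the four metrics from the last four cards, given confusion-matrix counts (the counts are made up):

```python
# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 80, 10, 95, 15

recall      = tp / (tp + fn)  # correct positives out of actual positives
precision   = tp / (tp + fp)  # correct positives out of predicted positives
specificity = tn / (tn + fp)  # correct negatives out of actual negatives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(recall, precision, specificity, f1)
```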

25
Q

RMSE, MAE and R-Squared cannot be used for _____ ML problems.

A

Classification

26
Q

What is the perfect area under the Receiver Operating Characteristic (ROC) Curve?

A

1 is perfect

27
Q

What is a ROC curve?

A

Receiver Operating Characteristic curve. The plot of the true positive rate vs the false positive rate at various thresholds

28
Q

What is R^2?

A

The square of the correlation coefficient between observed outcomes and predicted outcomes

29
Q

What is RMSE?

A

Root mean squared error. The square root of the average squared error between the predicted values and the actual values

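A numpy sketch of RMSE and R^2 as defined in the two cards above (the values are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.4])

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))  # root of the mean squared error
r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2      # square of the correlation coefficient

print(rmse, r2)
```
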
30
Q

What is an ensemble method?

A

A method that combines several models so they all work together on a prediction

31
Q

What is bagging in ML?

A

Generating new training sets by random sampling with replacement. A model is then trained on each resampled set, and these models can be trained in parallel

32
Q

What is boosting in ML?

A

Weights are assigned to observations and adjusted as the models are trained sequentially, with each model focusing on the observations the previous ones got wrong. Generally yields good accuracy but is prone to overfitting

33
Q

Which ensemble method can be parallelised and thus can happen faster?

A

Bagging

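A hedged scikit-learn sketch contrasting the two ensemble methods (the estimator counts are arbitrary; note only the bagging model takes n_jobs, since its resampled models are independent and can train in parallel):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: independent models on bootstrap resamples - trainable in parallel
bagging = BaggingClassifier(n_estimators=50, n_jobs=-1).fit(X, y)

# Boosting: models trained sequentially, re-weighting hard observations
boosting = AdaBoostClassifier(n_estimators=50).fit(X, y)
```
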
34
Q

What is Automatic Model Tuning (AMT) in SageMaker?

A

Automatic tuning of your SageMaker model's hyperparameters

35
Q

What has to be specified for SM AMT?

A

The hyperparameters you care about, the ranges you want to try, and the metric you are optimising for

36
Q

What is warm starting in SM AMT?

A

The act of using one or more previous tuning jobs as a starting point to inform automatic model tuning which hyperparameter combinations to try next

37
Q

What are the 4 tuning approaches in AMT?

A

  • Random search
  • Grid search
  • Bayesian optimisation
  • Hyperband

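A hedged sketch using the SageMaker Python SDK, showing the three things card 35 says you must specify plus the tuning strategy. The image URI, role, metric name and hyperparameter names are hypothetical placeholders:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

estimator = Estimator(
    image_uri="my-training-image",  # hypothetical training container
    role="my-sagemaker-role",       # hypothetical IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",  # the metric you are optimising for
    hyperparameter_ranges={                  # the hyperparameters you care about + ranges to try
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",  # one of the four approaches from card 37
    max_jobs=20,
    max_parallel_jobs=4,
)
# tuner.fit({"train": "s3://my-bucket/train"})  # hypothetical S3 input
```
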
38
Q

What type of algorithms is hyperband tuning in AMT good for? What is a benefit of using hyperband tuning?

A

It works well with iterative algorithms. It is faster than random search or Bayesian optimisation

39
Q

What is a downside of using grid search in SM AMT?

A

Computationally expensive

40
Q

What is an upside and a downside of using random search in SM AMT?

A

Upside: no dependence on prior runs, so jobs can run completely in parallel
Downside: might not yield an optimal solution

41
Q

What is SageMaker Autopilot? What is the 'auto' part and what do you have to do?

A

A wrapper on a different product called AutoML. You load the data from S3 and say which column you would like to predict; the data pre-processing, algorithm selection, hyperparameter tuning and infrastructure provisioning are all handled completely for you.

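A hedged sketch of kicking off an Autopilot job from the SageMaker Python SDK's AutoML class (the role, target column and S3 path are hypothetical):

```python
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role="my-sagemaker-role",        # hypothetical IAM role
    target_attribute_name="churn",   # the column you would like to predict (hypothetical)
    max_candidates=10,               # cap on candidate pipelines to try
)
# automl.fit(inputs="s3://my-bucket/training-data.csv")  # tabular CSV in S3 (hypothetical path)
```
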
42
Q

What two data formats are applicable for SageMaker Autopilot?

A

Tabular CSV or Parquet

43
Q

What service would you integrate with SageMaker Autopilot to give some explainability to results?

A

SageMaker Clarify

44
Q

What are the 3 different SageMaker Autopilot modes?

A

Hyperparameter Optimisation, Ensembling, Auto

45
Q

What is SageMaker Notebooks and what is it useful for?

A

Jupyter Notebooks in SageMaker Studio. Allows you to share notebooks between users.

46
Q

Can SageMaker Debugger trigger notifications?

A

Yes, sent through SNS

47
Q

What is SageMaker Model Registry?

A

A catalog of your models where you can associate metadata with each of them.

48
Q

In terms of governance, why is SageMaker Model Registry useful?

A

Can be used to manage the approval status of a model and share models across your organisation for different uses

49
Q

What are SageMaker Model Cards?

A

Web pages that say what your models are meant to do and what data they were trained on

50
Q

What is TensorBoard?

A

A third-party tool for visualising metrics associated with your model as it is being trained, such as loss and accuracy. Includes graphs for visualisation

51
Q

How does TensorBoard integrate with SageMaker?

A

Through the console or through a URL. It does require modifications to your training script, however

52
Q

What is SageMaker Training Compiler?

A

A feature that compiles and optimises training jobs on GPU instances for you automatically

53
Q

What is SageMaker Warm Pools? Why is this useful?

A

A feature that allows you to retain and re-use provisioned infrastructure. Means that you don't have to keep spinning up the same infrastructure over and over again

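A hedged sketch: on a SageMaker Estimator, warm pools are enabled via keep_alive_period_in_seconds (the other arguments here are hypothetical placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="my-training-image",      # hypothetical container image
    role="my-sagemaker-role",           # hypothetical IAM role
    instance_count=1,
    instance_type="ml.g5.xlarge",
    keep_alive_period_in_seconds=1800,  # retain the provisioned infrastructure for reuse
)
```
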
54
Q

Is SageMaker Training Compiler compatible with distributed training libraries?

A

No

55
Q

What is SageMaker Checkpointing?

A

Creates snapshots during your model training, allowing you to restart from these if something goes wrong

56
Q

What are SageMaker Cluster Health Checks?

A

A health-checking feature that checks the health of the instances in your cluster and performs actions as required, e.g. replacing faulty instances

57
Q

Where are SageMaker Checkpoint snapshots stored?

A

S3

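A hedged sketch of enabling checkpointing on an Estimator - snapshots sync to the given S3 URI (the image, role and bucket are hypothetical):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="my-training-image",                   # hypothetical container image
    role="my-sagemaker-role",                        # hypothetical IAM role
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    checkpoint_s3_uri="s3://my-bucket/checkpoints",  # snapshots are stored in S3
    checkpoint_local_path="/opt/ml/checkpoints",     # where the container writes them
)
```
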
58
Q

What is distributed model parallelism?

A

For very large models where even a maxed-out single instance is not enough. You split shards of the model and optimiser states across multiple GPUs, e.g. each GPU might handle 2 layers' optimiser states

59
Q

What is distributed data parallelism?

A

You perform distributed model parallelism (splitting shards of the model and optimiser states across GPUs) AND shard the trainable parameters and gradients across GPUs as well

60
Q

What is an optimiser state?

A

The data the optimiser tracks during training, including the current weights of the model

61
Q

At what scale of parameters is optimiser state sharding useful?

A

Generally for models with more than 1 billion parameters

62
Q

What does an Elastic Fabric Adapter do?

A

Accelerates your training by making better use of network bandwidth between nodes - the idea is to make it equivalent to having an on-premises HPC cluster

63
Q

What is a downside of using an Elastic Fabric Adapter?

A

It only works with the Nvidia Collective Communications Library (NCCL). In short, you are locked in to Nvidia GPUs