Models Flashcards

(86 cards)

1
Q

Linear Learner - Instance types

A

Single or multi-machine CPU or GPU
Multi-GPU does not help

2
Q

Linear Learner - Hyperparams

A

Balance_multiclass_weights → gives each class equal importance in the loss function
Learning rate
Mini_batch_size
L1 regularisation
Wd = weight decay = L2 regularisation

3
Q

Linear Learner - Model types

A

Can handle both regression (numeric) predictions and classification problems
For classification, a linear threshold function is used
Can do binary or multi-class problems

4
Q

Linear Learner - Input format

A

RecordIO-wrapped protobuf → float32 data
CSV → first column is the label
File or Pipe mode both supported

5
Q

Linear Learner - Pre-processing

A

Training data should be normalised (so all features are weighted the same)
Linear learner can do this for you
Input data should be shuffled
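
A minimal sketch of the two pre-processing steps above, assuming you do them yourself rather than letting Linear Learner normalise for you. The function name and the toy data are illustrative only; in practice each label must stay attached to its row when shuffling.

```python
import random
import statistics

def normalise_columns(rows):
    """Scale each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Two features on very different scales.
features = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalised = normalise_columns(features)

# Shuffle a copy of the rows (seeded so the result is repeatable).
shuffled = normalised[:]
random.Random(0).shuffle(shuffled)
```

After normalisation both columns hold the same values, so neither feature dominates just because of its units.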

6
Q

Linear Learner - Training

A

Uses stochastic gradient descent (SGD)
Choose an optimisation algo (Adam, AdaGrad, SGD, etc.)
Multiple models are optimised in parallel, and the best one is chosen in the validation step
Tune L1, L2 regularisation

7
Q

Linear Learner - Validation

A

The best-performing model is selected

8
Q

XGBoost - Model Type

A

eXtreme Gradient Boosting
Boosted group of decision trees
New trees made to correct the errors of previous trees
Uses gradient descent to minimise loss as new trees are added
Can be used for:
Classification
Regression (uses regression trees)

Can use it:
Within notebook as sagemaker.xgboost
Or use sagemaker container

9
Q

XGBoost - Input

A

CSV
Libsvm
recordIO-protobuf
Parquet format

10
Q

XGBoost - Hyperparameters

A

Subsample → prevents overfitting
Eta → step size shrinkage → prevents overfitting
Gamma → minimum loss reduction to create a partition, larger = more conservative
Alpha = L1 regularisation term; larger = more conservative model
Lambda = L2 regularisation term; larger = more conservative model
Eval_metric = metric to optimise on (AUC, error, rmse, etc.). For example, if you care more about false positives than about accuracy, you might set this to AUC
Scale_pos_weight:
-Adjusts balance of positive and negative weights
-Helps for unbalanced classes
-Might set to sum(negative cases)/sum(positive cases)
Max_depth = max depth of tree → too high and you might overfit
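
A small sketch of the scale_pos_weight rule of thumb above, assuming binary labels encoded as 0 (negative) and 1 (positive). The helper name is made up for illustration.

```python
def scale_pos_weight(labels):
    """Return count(negative) / count(positive) for binary 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

# Unbalanced toy dataset: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
weight = scale_pos_weight(labels)  # 95 / 5
```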

11
Q

XGBoost - instances

A

Uses CPUs for multiple-instance training
Memory-bound → not compute-bound
→ So M5 is a good choice for multiple instances

If using 1 instance:
As of XGBoost 1.2, single-instance GPU training is available
E.g. P2 or P3 instance types
→ Must set tree_method hyperparameter to gpu_hist
→ Trains more quickly → can be more cost effective

XGBoost 1.2-2 supports GPU training on P2, P3, G4dn, G5
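
A hedged sketch of how the GPU switch might look in a hyperparameter dict. Only tree_method = gpu_hist comes from the card; the other values are placeholder choices for illustration, and the function name is hypothetical.

```python
def xgboost_hyperparameters(use_gpu):
    """Build a hyperparameter dict; values other than tree_method are placeholders."""
    params = {
        "objective": "binary:logistic",
        "num_round": "100",
        "max_depth": "5",
        "eta": "0.2",
    }
    if use_gpu:
        # Needed for single-instance GPU training.
        params["tree_method"] = "gpu_hist"
    return params

gpu_params = xgboost_hyperparameters(use_gpu=True)
cpu_params = xgboost_hyperparameters(use_gpu=False)
```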

12
Q

Seq2Seq - Model type

A

Input is a sequence of tokens, output is a sequence of tokens

Uses:
Machine translation
Text summarisation
Speech to text

Implemented with RNNs and CNNs with attention

13
Q

Seq2Seq - Inputs

A

recordIO-protobuf → tokens must be integers (this is unusual, since most algorithms want floating point data)

Start with tokenised text files

Convert to protobuf using sample code
- Packs into integer tensor with vocab files
- A lot like TF-IDF
Must provide training data, validation data, and vocabulary files
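
A rough sketch of the "tokens must be integers" step, assuming whitespace tokenisation. The reserved ids for padding, unknown, and end-of-sentence tokens are an assumed layout for illustration, not the layout of the official sample code.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer id; ids 0-2 are reserved (assumed)."""
    vocab = {"<pad>": 0, "<unk>": 1, "</s>": 2}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Replace each token by its id, using <unk> for out-of-vocabulary words."""
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.split()]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)
encoded = encode("the bird sat", vocab)  # "bird" is not in the vocabulary
```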

14
Q

Seq2Seq - training

A

Can take days to train
Pre-trained models are available → see example notebook
Public training datasets are available for specific translation tasks

15
Q

Seq2Seq - Hyperparameters

A

Batch_size
Optimizer_type (adam, sgd, rmsprop)
Learning_rate

Num_layers_encoder, num_layers_decoder

Can optimise on:

  • Accuracy
    – Vs provided validation dataset
  • BLEU score
    – Good for machine translation
    – Compares against multiple reference translations
  • Perplexity
    – Good for machine translation
    – Cross entropy
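
Perplexity and cross entropy are linked: perplexity is the exponential of the mean cross entropy. A minimal sketch, assuming we already have the probability the model gave to each correct token (the numbers are invented):

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """exp(cross entropy): lower means the model is less 'surprised'."""
    return math.exp(cross_entropy(token_probs))

# A model that always gives the right token probability 0.25
# behaves like a uniform guess among 4 options.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity(probs)
```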
16
Q

Seq2Seq - Instances

A

Only GPU e.g. P3
Can only use a single machine for training → can use multiple GPUs on that machine, but training can't be parallelised across machines

17
Q

DeepAR - Model Type

A

Forecasting one-dimensional time-series data
- Allows you to train the same model over several related time series
- Finds frequencies and seasonality

Uses RNNs

18
Q

DeepAR - Input

A

JSON Lines format → can be GZIP-compressed, or use Parquet, for better performance

Each record must contain:
Start: the starting timestamp
Target: the time series values

Each record can contain:
Dynamic_feat: dynamic features, e.g. whether a promotion was applied to a product in a time series of product purchases
Cat: categorical features
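
A sketch of what two records might look like in JSON Lines, using the start, target, cat, and dynamic_feat fields. The timestamps and values are invented; each dynamic_feat series is assumed to have the same length as target.

```python
import json

series = [
    {"start": "2024-01-01 00:00:00", "target": [12.0, 15.0, 9.0],
     "cat": [0], "dynamic_feat": [[0, 1, 0]]},   # promotion on day 2
    {"start": "2024-01-01 00:00:00", "target": [3.0, 4.0, 6.0],
     "cat": [1], "dynamic_feat": [[1, 0, 0]]},   # promotion on day 1
]

# JSON Lines: one JSON object per line, no enclosing list.
json_lines = "\n".join(json.dumps(record) for record in series)
```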

19
Q

DeepAR - How is it used?

A

Always include the entire time series for training, testing, and inference
Use the entire dataset as the test set and remove the last time points for training → evaluate on the withheld values

Don’t use large values for prediction length (>400) → can’t forecast too far into the future

Train on many time series and not just one when possible
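
The train/test convention above can be sketched like this, assuming a plain list of values per series. The function name is illustrative.

```python
def split_series(target, prediction_length):
    """Training copy drops the last prediction_length points; test keeps them all."""
    if not 0 < prediction_length < len(target):
        raise ValueError("prediction_length must be between 1 and len(target) - 1")
    train = target[:-prediction_length]
    test = target  # the full series is the test set
    return train, test

full = [5, 6, 7, 8, 9, 10]
train, test = split_series(full, prediction_length=2)

# The model is evaluated on the points it never saw in training.
held_out = test[len(train):]
```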

20
Q

DeepAR - Hyperparameters

A

Context_length = number of time points the model sees before making a prediction
Can be smaller than seasonalities → the model will lag one year anyhow

Epochs

Mini_batch_size

Learning_rate

Num_cells = number of neurons

21
Q

DeepAR - Instances

A

CPU or GPU
Single or multi machine
Recommendation: start with CPU (ml.c4.2xlarge then ml.c4.4xlarge)
Move up to GPU if necessary → only helps with large mini-batch sizes or larger models
May need larger instances for tuning

22
Q

BlazingText - Model Type

A

Only for sentences → not entire documents

Text classification:
Predicts labels for a sentence
Useful in web searches, information retrieval
Supervised

Word2Vec:
Creates a vector representation of words
Semantically similar words are represented by vectors close to each other
This is called word embedding
It is useful in NLP, but is not an NLP algo itself
Used in machine translation, sentiment analysis

23
Q

BlazingText - Input

A

For supervised mode (text classification)
One sentence per line
First “word” in the sentence is the string __label__ followed by the label e.g. “__label__4 hello there this is a sentence”

Also “augmented manifest text format” –> json string
Source and label field

Word2Vec just wants a text file with one training sentence per line
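
A small sketch of building one supervised-mode training line. Lower-casing and whitespace tokenisation are assumptions made here for illustration; the helper name is hypothetical.

```python
def to_blazingtext_line(label, sentence):
    """Prefix a lower-cased, space-tokenised sentence with __label__<label>."""
    tokens = sentence.lower().split()
    return "__label__" + str(label) + " " + " ".join(tokens)

line = to_blazingtext_line(4, "Hello there this is a sentence")
```

For Word2Vec mode the same file would simply omit the label prefix.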

24
Q

BlazingText - modes of Word2Vec?

A

Word2vec has multiple modes:
CBow (continuous bag of words → order of words is thrown out, just the words themselves matter)
Skip-gram
Batch skip-gram → distributed computation over many CPU nodes
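
To make the difference between the modes concrete, here is a toy sketch of the training examples each one would build from a sentence. This only illustrates the idea (predict context from centre vs centre from an unordered context); it is not BlazingText's implementation, and the function names are made up.

```python
def skipgram_pairs(tokens, window):
    """(centre, context) pairs: predict each context word from the centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window):
    """(context bag, centre) examples: word order inside the context is discarded."""
    examples = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Sorting stands in for "order is thrown out".
        context = sorted(tokens[j] for j in range(lo, hi) if j != i)
        examples.append((context, centre))
    return examples

tokens = "the quick brown fox".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```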

25
BlazingText - Hyperparameters
Word2vec:
Mode (batch_skipgram, skipgram, cbow)
Learning_rate
Window_size
Vector_dim
Negative_samples

Text classification:
Epochs
Learning_rate
Word_ngrams
Vector_dim
26
BlazingText - Instance types
Word2Vec:
For cbow and skipgram, a single ml.p3.2xlarge is recommended → any single CPU or GPU instance will work
For batch_skipgram, can use single or multiple CPU instances

Text classification:
C5 is recommended for less than 2 GB of training data
For larger datasets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)
27
Object2Vec - Model type
Word2vec → finds relationships between words in a sentence
Object2Vec → can work on entire documents, or other objects
Creates low-dimensional dense embeddings of high-dimensional objects:
- Compute nearest neighbours of objects
- Visualise clusters

Use cases:
Genre prediction
Recommendation (similar items or users)
28
Object2Vec - Input
Data must be tokenised into integers
Training data consists of pairs of tokens and/or sequences of tokens:
Sentence-sentence
Labels-sentence (genre to description?)
Customer-customer
Product-product
User-user
29
Object2Vec - How is it used?
Process data into JSON Lines and shuffle it
Train with two input channels, two encoders, and a comparator
Encoder choices:
Average-pooled embeddings
CNN
Bidirectional LSTM
Comparator is followed by a feed-forward neural network
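
A sketch of the "JSON Lines, then shuffle" step, assuming each record holds a label plus two integer token sequences in fields named in0 and in1 (one per input channel). The token ids and labels are invented.

```python
import json
import random

pairs = [
    {"label": 1, "in0": [12, 7, 93], "in1": [12, 8, 40, 5]},
    {"label": 0, "in0": [3, 3, 51], "in1": [77, 2]},
    {"label": 1, "in0": [60], "in1": [61, 9]},
]

# Shuffle before writing so training order carries no pattern (seeded here).
random.Random(0).shuffle(pairs)
o2v_lines = [json.dumps(p) for p in pairs]
```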
30
Object2Vec - Hyperparameters
Usual deep learning:
Dropout
Early stopping
Epochs
Learning rate
Batch size
Layers
Activation fns
Optimiser
Weight decay

Enc1_network and enc2_network:
Choose cnn, bilstm, pooled_embedding → choose encoder type for each channel
31
Object2Vec - Instances
Can only train on a single instance (CPU, GPU, or multi-GPU):
Start with ml.m5.2xlarge (CPU) or ml.p2.xlarge (GPU)
If needed, go up to ml.m5.4xlarge or ml.m5.12xlarge
GPU options: P2, P3, G4dn, G5

Inference:
Use ml.p3.2xlarge
Use the INFERENCE_PREFERRED_MODE env var to optimise for encoder embeddings rather than classification or regression
32
Object Detection - Model Type
Identify all objects in an image with bounding boxes
Detects and classifies objects with a single deep neural network
Classes are accompanied by confidence scores
Can train from scratch or use pre-trained models based on MXNet
33
Object Detection - How is it used and types?
Two variants: MXNet and TensorFlow
Takes an image as input, outputs all instances of objects in the image, with categories and confidence scores

MXNet:
Uses a CNN with the Single Shot multibox Detector (SSD) algo
Transfer learning mode / incremental training
Uses flip, rescale, and jitter internally to avoid overfitting

TensorFlow:
Uses ResNet, EfficientNet, MobileNet models
34
Object Detection - Input
MXNet:
- RecordIO or image format (jpg or png)
- With image format, supply a JSON file with annotation data for each image
35
Object Detection - Hyperparameters
Batch size
Learning rate
Optimiser → sgd, adam, rmsprop
36
Object Detection - Instance
Use GPU for training → multi-GPU and multi-machine both work
ml.p2, ml.p3, G4dn, and G5
Inference on CPU or GPU: M5, P2, P3, G4dn
37
Image Classification - Model type
Object detection tells you where an object is
Image classification tells you what is in the image
Assigns one or more labels to an image
Doesn’t tell you where objects are, just what they are
38
Image Classification - How is it used? Different Types?
Separate algos for MXNet and TensorFlow

MXNet:
1. Full training mode
→ Network initialised with random weights
2. Transfer learning mode
→ Pre-trained weights
→ The top fully connected layer is initialised with random weights
→ Network is fine-tuned with new training data
Default image size is 3-channel 224x224 (RGB)

TensorFlow:
→ Uses various TensorFlow Hub models (MobileNet, Inception, ResNet, EfficientNet)
→ Top classification layer is available for fine-tuning and further training
39
Image Classification - Hyperparameters
Usual deep learning: batch size, learning rate, optimiser
Optimiser-specific: weight decay, beta 1, beta 2, eps, gamma
Slight differences between MXNet and TensorFlow
40
Image Classification - Instances
GPU for training (multi-GPU and multi-instance both work): P2, P3, G4dn, G5
CPU or GPU for inference: M5, P2, P3, G4dn, G5
41
Semantic Segmentation - Model type
Pixel-level object classification:
Rather than just a bounding box, shows you EXACTLY where the object is
Useful for self-driving vehicles, medical diagnostics, robot sensing
Produces a segmentation mask
42
Semantic Segmentation - How is it used?
Built on MXNet and GluonCV

Choice of 3 algos (decoders → construct the segmentation mask):
Fully Convolutional Network (FCN)
Pyramid Scene Parsing (PSP)
DeepLabV3

Choice of backbones (encoders → produce activation maps of features):
ResNet50 or ResNet101, both trained on ImageNet

Incremental training and training from scratch are both supported
43
Semantic Segmentation - Training input
JPG images and PNG annotations
For both training and validation
Label maps to describe annotations
Augmented manifest image format supported for Pipe mode
44
Semantic Segmentation - Inference Input
JPEG image
45
Semantic Segmentation - Hyperparameters
Epochs, learning rate, batch size, optimiser, etc.
Algorithm
Backbone
46
Semantic Segmentation - Instance
Only GPU for training: P2, P3, G4dn, G5
Only a single instance
Inference: CPU (C5 or M5) or GPU (P3 or G4dn)
47
Random cut forest - Model type
Anomaly detection:
Unsupervised
Detects unexpected spikes in time series
Breaks in periodicity
Unclassifiable data points
Assigns an anomaly score to each data point
Based on an algo AWS made
48
Random cut forest - training input
RecordIO-protobuf or CSV
Can use File or Pipe mode
Optional test channel for computing accuracy, precision, recall, etc. → on labelled data where you know where the anomalies are
49
Random cut forest - how is it used?
Creates a forest of trees where each tree is a partition of the training data → looks at the expected change in complexity of the tree as a result of adding a point to it
Data is sampled randomly, then trained
RCF shows up in Kinesis Analytics as well → anomaly detection on streaming data
50
Random cut forest - Hyperparams
Num_trees:
Increasing it reduces noise

Num_samples_per_tree:
Should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
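
The num_samples_per_tree rule of thumb can be turned around: given an estimate of how rare anomalies are, take the reciprocal. A minimal sketch with a hypothetical helper name and an invented anomaly ratio:

```python
def suggested_samples_per_tree(anomaly_ratio):
    """Pick num_samples_per_tree so that 1 / num_samples_per_tree ~ anomaly_ratio."""
    if not 0 < anomaly_ratio <= 1:
        raise ValueError("anomaly_ratio must be in (0, 1]")
    return round(1 / anomaly_ratio)

# Roughly 1 anomaly for every 250 normal points.
samples = suggested_samples_per_tree(0.004)
```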
51
Random cut forest - Instances
No GPUs
Use M4, C4, or C5 for training
C5 for inference
52
Neural Topic Model - Model type
What is a document about?
Unsupervised
Uses neural variational inference
Organises documents into topics
Classify or summarise documents based on topics
→ Not just TF-IDF
→ Won’t return topic names, but will group docs
53
Neural Topic Model - input
Four data channels:
“Train” is required
Validation, test, and auxiliary are optional

RecordIO-protobuf or CSV
Words must be tokenised into integers
Every doc must contain a count for every word in the vocabulary in CSV
The auxiliary channel is for the vocabulary, mapping tokens to words
File or Pipe mode
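
A sketch of the "count for every word in the vocabulary" CSV row, assuming a tiny fixed vocabulary where the list index is the integer token id. The vocabulary and document are invented.

```python
from collections import Counter

vocabulary = ["cat", "dog", "fish", "sat"]  # list index = integer token id
token_id = {word: i for i, word in enumerate(vocabulary)}

def count_vector(document):
    """One count per vocabulary word, zeros included, in vocabulary order."""
    counts = Counter(token_id[w] for w in document.split() if w in token_id)
    return [counts.get(i, 0) for i in range(len(vocabulary))]

row = count_vector("cat sat cat")
csv_row = ",".join(str(c) for c in row)
```

Words outside the vocabulary are simply dropped in this sketch.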
54
Neural Topic Model - How is it used?
You define how many topics you want
These topics are a latent representation based on top-ranking words
Topics will not have human-readable names
One of 2 topic modelling algos in SageMaker
55
Neural Topic Model - Hyperparams
Mini_batch_size and learning_rate: lowering them can reduce validation loss, at the expense of training time
Num_topics
56
Neural Topic Model - Instances
GPU or CPU
GPU recommended for training
CPU is adequate for inference
57
LDA - how is it used?
Unsupervised: generates however many topics you specify
Optional test channel can be used for scoring results
Per-word log likelihood shows how well it works
Functionality similar to NTM, but CPU-based
Therefore much cheaper / more efficient
58
LDA - Model type
SageMaker’s other topic modelling algo
Latent Dirichlet Allocation
Unsupervised
Topics themselves are unlabelled → just groupings of documents with a shared subset of words
NTM is the other unsupervised topic modelling algo
Not deep learning
Can be used for things other than words:
Cluster customers based on purchases
Harmonic analysis in music
59
LDA - Inputs
Train channel, optional test channel
RecordIO-protobuf or CSV
Each document has counts for every word in the vocab
Pipe mode only supported with RecordIO
60
LDA - Hyperparams
Num_topics

Alpha0:
Initial guess for the concentration parameter
Smaller values generate sparse topic mixtures
Larger values (>1.0) produce uniform mixtures
61
LDA - Instances
Single CPU instance
62
K-Nearest Neighbours KNN - Model Type
Simple classification or regression algo
Technically supervised → needs labelled data
Classification: find the k closest points to a sample and return the most frequent label
Regression: find the k nearest neighbours and return the average value
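
Both modes can be sketched in a few lines of plain Python, assuming Euclidean distance and a brute-force search. This is the textbook idea only, without SageMaker's sampling, dimensionality reduction, or indexing; all names and data are illustrative.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the labels/values of the k training points closest to query."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    return [target for _, target in ranked[:k]]

def knn_classify(train, query, k):
    """Most frequent label among the k nearest neighbours."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Average value of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(neighbours) / len(neighbours)

labelled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
valued = [((0,), 1.0), ((1,), 3.0), ((10,), 50.0)]

predicted_label = knn_classify(labelled, (1, 0), k=3)
predicted_value = knn_regress(valued, (0.4,), k=2)
```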
63
K-Nearest Neighbours KNN - How is it used
Data is first sampled
SageMaker includes a dimensionality reduction stage:
Avoids sparse data (curse of dimensionality)
At the cost of noise/accuracy
Options: sign or fjlt methods
Builds an index for looking up neighbours
Serialises the model
Query the model for a given k
64
K-Nearest Neighbours KNN - Inputs
Train channel contains your data
Test channel emits accuracy or MSE
RecordIO-protobuf or CSV
CSV: first column contains the label
Pipe or File mode
65
K-Nearest Neighbours KNN - Hyperparams
K
Sample_size
66
K-Nearest Neighbours KNN - Instances
Training on CPU or GPU: M5 or P2

Inference:
CPU for lower latency
GPU for higher throughput on large batches
67
K-means - Model type
Unsupervised clustering
Divides data into k groups, where members of a group are as similar to each other as possible
You define what “similar” means
Measured by Euclidean distance
Web-scale k-means clustering
68
K-means - Input
Train channel, optional test channel:
Train ShardedByS3Key, test FullyReplicated
RecordIO-protobuf or CSV
File or Pipe mode
69
K-means - How is it used?
Every observation is mapped to n-dimensional space (n = number of features)
Works to optimise the centres of K clusters
→ “Extra cluster centres” may be specified to improve accuracy (these end up getting reduced to k)
→ K = k*x → K is the initial number of clusters, which is reduced down to k

Algorithm:
Determine initial cluster centres
- Random or k-means++ approach
- K-means++ tries to make initial clusters far apart
Iterate over training data and calculate cluster centres
Reduce clusters from K to k
- Using Lloyd’s method with k-means++
70
K-means - Hyperparams
K:
Choosing k is tricky
Plot within-cluster sum of squares as a function of k
Use the “elbow method”
Basically optimise for tightness of clusters

Batch size
Extra centre factor
Init method
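
A toy sketch of the quantity behind the elbow method: within-cluster sum of squares (WCSS) for different numbers of centres. To keep it short the data is one-dimensional and the centres are picked by hand rather than fitted, so this shows the metric, not the clustering itself.

```python
def wcss(points, centres):
    """Within-cluster sum of squares: each point contributes its squared
    distance to the nearest centre."""
    return sum(min((p - c) ** 2 for c in centres) for p in points)

points = [1.0, 2.0, 10.0, 11.0]          # two obvious groups

wcss_k1 = wcss(points, [6.0])            # one centre at the overall mean
wcss_k2 = wcss(points, [1.5, 10.5])      # one centre per visible group
wcss_k4 = wcss(points, points)           # every point is its own centre
```

The big drop from k=1 to k=2, followed by a small one up to k=4, is the "elbow" that suggests k=2 here.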
71
K-means - instances
CPU or GPU, but CPU recommended
Only one GPU per instance is used → use g4dn if going to GPU
P2, P3, G4dn, and G4 supported
72
PCA - Model type
Principal component analysis
Dimensionality reduction:
- Projects higher-dimensional data (lots of features) into a lower-dimensional space while minimising the loss of information
- The reduced dimensions are called components
- First component has the largest possible variability
- Second component has the next largest
- Unsupervised
73
PCA - How is it used?
Covariance matrix is created, then singular value decomposition (SVD)
Two modes:
Regular → for sparse data and a moderate number of observations and features
Randomised → for a large number of observations and features; uses an approximation algorithm
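
A stdlib-only sketch of finding the first principal component. Note the swap: instead of SVD it uses power iteration on the covariance matrix, which is a simpler stand-in that converges to the same leading direction on this toy data. Function names are illustrative.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of mean-centred data (rows = observations)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

def first_component(cov, iterations=100):
    """Power iteration: converges to the direction of largest variance."""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Points lying on the line y = x: all variance is along (1, 1).
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
component = first_component(covariance_matrix(data))
```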
74
PCA - Input
RecordIO-protobuf or CSV
File or Pipe mode
75
PCA - Hyperparams
Algorithm_mode
Subtract_mean → unbiases the data
76
PCA - Instance
GPU or CPU
It depends on the specifics of the input data → need to experiment
77
Factorisation machines - model type
Dealing with sparse data:
Click prediction (an individual user does not interact with the majority of pages on a website, only with a few)
Item recommendations
Since an individual user doesn’t interact with most pages / products, the data is sparse

Supervised
Classification or regression
Limited to pair-wise interactions: user → item, for example
78
Factorisation machines - Input
RecordIO-protobuf format with float32
Sparse data means CSV isn’t practical → loads of commas
79
Factorisation machines - How is it used?
Finds factors we can use to predict a classification (click or not? purchase or not?) or a value (predicted rating?) given a matrix representing some pair of things
Usually used in the context of recommender systems
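
A sketch of the standard second-order factorisation machine prediction, a bias plus linear terms plus pair-wise interactions scored by the dot product of latent factor vectors. The weights and factors below are made-up numbers, not a trained model.

```python
def fm_predict(x, w0, w, v):
    """w0 + sum_i w[i]*x[i] + sum_{i<j} <v[i], v[j]> * x[i]*x[j]."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

# Sparse one-hot features: [user_A, user_B, item_X, item_Y]
x = [1.0, 0.0, 0.0, 1.0]           # user_A interacting with item_Y
w0 = 0.1                            # global bias
w = [0.2, 0.0, 0.0, 0.3]            # linear weights
v = [[1.0, 0.5], [0.0, 0.0], [0.0, 0.0], [2.0, 1.0]]  # 2-d latent factors

score = fm_predict(x, w0, w, v)     # 0.1 + 0.5 + 2.5
```

Only the user_A/item_Y pair contributes to the interaction term, which is why sparse inputs suit this model.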
80
Factorisation machines - hyperparams
Initialisation methods for bias, factors, and linear terms
- Uniform, normal, or constant
- Can tune properties of each method
81
Factorisation machines - Instances
CPU or GPU
CPU recommended
GPU only works with dense data
82
IP Insights - Model type
Unsupervised learning of IP address usage patterns
Identifies suspicious behaviour from IP addresses
Identify logins from suspicious IP addresses
Identify accounts creating resources from anomalous IPs
83
IP Insights - Input
User names and account IDs can be fed in directly, no need to pre-process
Training channel, optional validation channel (computes AUC score)
CSV only → entity, IP
84
IP Insights - How is it used?
Uses a neural network to learn latent vector representations of entities and IP addresses
Entities are hashed and embedded:
Need a sufficiently large hash size
Automatically generates negative samples during training by randomly pairing entities and IPs
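
A toy sketch of negative sampling by random pairing. Skipping pairings that really occurred is an extra assumption made here for clarity; the algorithm's internal sampling may differ. Names and addresses are invented.

```python
import random

def negative_samples(pairs, count, seed=0):
    """Randomly re-pair entities with IPs seen elsewhere in the data."""
    rng = random.Random(seed)
    entities = [entity for entity, _ in pairs]
    ips = [ip for _, ip in pairs]
    observed = set(pairs)
    negatives = []
    while len(negatives) < count:
        candidate = (rng.choice(entities), rng.choice(ips))
        if candidate not in observed:  # skip pairings that really occurred
            negatives.append(candidate)
    return negatives

logins = [("alice", "10.0.0.1"), ("bob", "10.0.0.2"), ("carol", "10.0.0.3")]
fake = negative_samples(logins, count=4)
```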
85
IP Insights - Hyperparams
Num_entity_vectors:
Hash size
Set to twice the number of unique entity identifiers

Vector_dim:
Size of embedding vectors
Scales model size
Too large results in overfitting

Epochs, learning rate, batch size, etc.
86
IP Insights - Instances
CPU or GPU
GPU recommended, e.g. P3 or higher
Can use multiple GPUs
Size of CPU instance depends on vector_dim and num_entity_vectors