Models Flashcards
Linear Learner - Instance types
Single or multi-machine CPU or GPU
Multi-GPU does not help
Linear Learner - Hyperparams
Balance_multiclass_weights → gives each class equal importance in the loss function
Learning rate
Mini_batch_size
L1 regularisation
Wd = weight decay = L2 regularisation
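A minimal sketch of setting these hyperparameters on the built-in Linear Learner container via the SageMaker Python SDK; the role ARN, S3 bucket, instance type, and all values are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
# Built-in Linear Learner container for the current region
image = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",   # placeholder bucket
    sagemaker_session=session,
)

linear.set_hyperparameters(
    predictor_type="multiclass_classifier",
    num_classes=3,
    balance_multiclass_weights=True,  # equal class importance in the loss function
    learning_rate=0.01,
    mini_batch_size=1000,
    l1=0.0,                           # L1 regularisation
    wd=0.0001,                        # weight decay = L2 regularisation
)
```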
Linear Learner - Model types
Can handle both regression (numeric) predictions and classification problems
For classification, a linear threshold function is used
Can do binary or multi-class problems
Linear Learner - Input format
RecordIO-wrapped protobuf → Float32 data
CSV → first column is the label
File or Pipe mode both supported
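A minimal sketch of producing the recordIO-wrapped protobuf input from Float32 NumPy arrays with the SDK's write_numpy_to_dense_tensor helper; the bucket and key are placeholders and the data is random.

```python
import io
import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Linear Learner wants Float32 features; the label vector is passed separately
features = np.random.rand(1000, 20).astype("float32")
labels = np.random.randint(0, 2, size=1000).astype("float32")

buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, features, labels)  # recordIO-wrapped protobuf
buf.seek(0)

# Upload so it can be used as a training channel (placeholder bucket/key)
boto3.resource("s3").Object("my-bucket", "linear-learner/train/data.rec").upload_fileobj(buf)
```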
Linear Learner - Pre-processing
Training data should be normalised (so all features are weighted the same)
Linear learner can do this for you
Input data should be shuffled
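A minimal sketch of this pre-processing step with scikit-learn and NumPy, using toy data in place of a real dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for your real feature matrix and labels
X_raw = np.random.rand(1000, 20)
y_raw = np.random.randint(0, 2, size=1000)

# Normalise so all features are weighted the same
X = StandardScaler().fit_transform(X_raw).astype("float32")

# Shuffle so mini-batches are not ordered by class or time
perm = np.random.permutation(len(X))
X, y = X[perm], y_raw[perm].astype("float32")
```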
Linear Learner - Training
Uses SGD
Choose an optimisation algo (Adam, Adagrad, SGD, etc)
Multiple models are optimised in parallel and the most optimal one is chosen in the validation step
Tune L1, L2 regularisation
Linear Learner - Validation
Most optimal model is selected
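A minimal sketch of launching training with both a train and a validation channel (the channel the parallel-trained models are compared on); the S3 URIs are placeholders and `linear` is the estimator from the hyperparameter sketch above.

```python
from sagemaker.inputs import TrainingInput

train = TrainingInput(
    "s3://my-bucket/linear-learner/train/",        # placeholder
    content_type="application/x-recordio-protobuf",
)
validation = TrainingInput(
    "s3://my-bucket/linear-learner/validation/",   # placeholder
    content_type="application/x-recordio-protobuf",
)

# The best of the models optimised in parallel is selected on the validation channel
linear.fit({"train": train, "validation": validation})
```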
XGBoost - Model Type
eXtreme Gradient Boosting
Boosted group of decision trees
New trees made to correct the errors of previous trees
Uses gradient descent to minimise loss as new trees are added
Can be used for:
Classification
Regression (uses regression trees)
Can use it:
Within notebook as sagemaker.xgboost
Or use sagemaker container
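A minimal sketch of the two routes; the version string, entry-point script, and role are placeholders.

```python
import sagemaker

# Route 1: the managed XGBoost container as a built-in algorithm
image = sagemaker.image_uris.retrieve(
    "xgboost", sagemaker.Session().boto_region_name, version="1.5-1"
)

# Route 2: framework ("script") mode, running your own training script
from sagemaker.xgboost.estimator import XGBoost

xgb_script = XGBoost(
    entry_point="train.py",   # placeholder training script
    framework_version="1.5-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
```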
XGBoost - Input
CSV
Libsvm
recordIO-protobuf
Parquet format
XGBoost - Hyperparameters
Subsample → prevents overfitting
Eta → step size shrinkage → prevents overfitting
Gamma → minimum loss reduction to create a partition, larger = more conservative
Alpha = L1 regularisation term; larger = more conservative model
Lambda = L2 regularisation term; larger = more conservative model
Eval_metric = the metric to optimise on (AUC, error, rmse, ...) e.g. use error or rmse if you’re optimising for accuracy, but if false positives are your main concern, you might set this to AUC
Scale_pos_weight:
-Adjusts balance of positive and negative weights
-Helps for unbalanced classes
-Might set to sum(negative cases)/sum(positive cases)
Max_depth = max depth of tree → too high and you might overfit
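A minimal sketch of setting these hyperparameters on the built-in XGBoost container; all values are illustrative ("lambda" is passed via a dict because it is a Python keyword).

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path="s3://my-bucket/xgboost/output",          # placeholder bucket
    sagemaker_session=session,
)

xgb.set_hyperparameters(**{
    "objective": "binary:logistic",
    "num_round": 200,
    "subsample": 0.8,        # row subsampling helps prevent overfitting
    "eta": 0.1,              # step size shrinkage
    "gamma": 1.0,            # min loss reduction to make a split; larger = more conservative
    "alpha": 0.0,            # L1 regularisation term
    "lambda": 1.0,           # L2 regularisation term
    "max_depth": 6,          # too deep and you may overfit
    "scale_pos_weight": 10,  # ≈ sum(negative cases) / sum(positive cases)
    "eval_metric": "auc",    # e.g. AUC when false positives are the main concern
})
```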
XGBoost - instances
Uses CPUs for multi-instance training
Memory-bound → not compute-bound
→ So M5 is a good choice for multiple instances
If using a single instance:
As of XGBoost 1.2, single-instance GPU training is available
E.g. P2 or P3 instance types
→ Must set the tree_method hyperparameter to gpu_hist
→ Trains more quickly → can be more cost-effective
As of XGBoost 1.2.2: P2, P3, G4dn, and G5 instance types are supported
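Continuing the XGBoost sketch above, a single-GPU-instance setup only differs in the instance type and the tree_method hyperparameter (values illustrative):

```python
# Same container image and role as above, but one GPU instance (XGBoost >= 1.2)
xgb_gpu = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
xgb_gpu.set_hyperparameters(
    objective="binary:logistic",
    num_round=200,
    tree_method="gpu_hist",  # required for GPU training
)
```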
Seq2Seq - Model type
Input is a sequence of tokens, output is a sequence of tokens
Uses:
Machine translation
Text summarisation
Speech to text
Implemented with RNNs and CNNs with attention
Seq2Seq - Inputs
recordIO-protobuf → tokens must be integers (this is unusual, since most algorithms want floating point data)
Start with tokenised text files
Convert to protobuf using sample code
- Packs into integer tensor with vocab files
- A lot like TF/IDF
Must provide training data, validation data, and vocabulary files
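The real conversion to recordIO-protobuf is done with the AWS sample code; the sketch below only illustrates the tokenise-and-integer-encode step against a toy vocabulary (all names here are hypothetical):

```python
# Toy vocabulary: in practice this comes from the vocabulary files you must provide
vocab = {"<unk>": 0, "hello": 1, "world": 2, "how": 3, "are": 4, "you": 5}

def encode(sentence, vocab):
    """Map a tokenised sentence to the integer token IDs Seq2Seq expects."""
    return [vocab.get(token, vocab["<unk>"]) for token in sentence.lower().split()]

print(encode("Hello world how are you", vocab))  # [1, 2, 3, 4, 5]
```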
Seq2Seq - training
Can take days to train
Pre-trained models are available → see example notebook
Public training datasets are available for specific translation tasks
Seq2Seq - Hyperparameters
Batch_size
Optimizer_type (adam, sgd, rmsprop)
Learning_rate
Num_layers_encoder, num_layers_decoder
Can optimise on:
- Accuracy → vs. the provided validation dataset
- BLEU score → good for machine translation; compares against multiple reference translations
- Perplexity → a cross-entropy metric; good for machine translation
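An illustrative hyperparameter dictionary for the built-in seq2seq container; the values are placeholders.

```python
seq2seq_hyperparameters = {
    "batch_size": 64,
    "optimizer_type": "adam",     # adam, sgd or rmsprop
    "learning_rate": 0.0003,
    "num_layers_encoder": 2,
    "num_layers_decoder": 2,
    "optimized_metric": "bleu",   # e.g. bleu or perplexity for machine translation
}
```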
Seq2Seq - Instances
Only GPU e.g. P3
Can only use a single machine for training → but can use multiple GPUs on a single machine → but can’t be parallelized across multiple machines
DeepAR - Model Type
Forecasting one-dimensional time-series data
- Allows you to train the same model over several related time series
- Finds frequencies and seasonality
Uses RNNs
DeepAR - Input
JSON lines format → in GZIP or Parquet for better performance
Each record must contain:
Start: the starting timestamp
Target: the time series values
Each record can contain:
Dynamic features, e.g. was a promotion applied to the product during the time series of product purchases
Categorical features
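A minimal sketch of writing DeepAR's JSON Lines input (a gzipped or Parquet version of the same records performs better); the values are toy data.

```python
import json

series = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [12.0, 15.0, 14.0, 19.0],  # the time-series values
        "dynamic_feat": [[0, 1, 1, 0]],      # e.g. whether a promotion was applied at each point
        "cat": [0],                          # categorical features, e.g. product category
    },
]

with open("train.json", "w") as f:
    for record in series:
        f.write(json.dumps(record) + "\n")
```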
DeepAR - How is it used?
Always include the entire series for training, testing, and inference:
Use the entire dataset as the test set; remove the last time points for training → evaluate on the withheld values
Don’t use large values for prediction_length (> 400) → the model can’t predict too far into the future
Train on many time series and not just one when possible
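A minimal sketch of that split: the test channel keeps the full series, while the training channel drops the last prediction_length points so forecasts can be evaluated on the withheld values (toy data):

```python
prediction_length = 2

# Test channel: the entire series
full_record = {"start": "2024-01-01 00:00:00", "target": [12.0, 15.0, 14.0, 19.0]}

# Training channel: the same series minus the last prediction_length points
train_record = dict(full_record, target=full_record["target"][:-prediction_length])
```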
DeepAR - Hyperparameters
Context_length = number of time points the model sees before making a prediction
Can be smaller than seasonalities → the model will lag one year anyhow
Epochs
Mini_batch_size
Learning_rate
Num_cells = number of neurons
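An illustrative hyperparameter dictionary for the built-in DeepAR container (time_freq and prediction_length are also set here; the values are placeholders):

```python
deepar_hyperparameters = {
    "time_freq": "D",        # daily data
    "context_length": 28,    # time points the model sees before making a prediction
    "prediction_length": 28,
    "epochs": 100,
    "mini_batch_size": 64,
    "learning_rate": 0.001,
    "num_cells": 40,         # number of neurons per RNN layer
}
```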
DeepAR - Instances
CPU or GPU
Single or multi machine
Recommendation: start with CPU (ml.c4.2xlarge, then ml.c4.4xlarge)
Move up to GPU if necessary → only helps with larger models or a large mini_batch_size
May need larger instances for tuning
BlazingText - Model Type
Only for sentences → not entire documents
Text classification:
Predicts labels for a sentence
Useful in web searches, information retrieval
Supervised
Word2Vec:
Creates a vector representation of words
Semantically similar words are represented by vectors close to each other
This is called word embedding
It is useful in NLP, but is not an NLP algo itself
Used in machine translation, sentiment analysis
BlazingText - Input
For supervised mode (text classification)
One sentence per line
First “word” in the sentence is the string __label__ followed by the label e.g. “__label__4 hello there this is a sentence”
Also “augmented manifest text format” → JSON lines, each with source and label fields
Word2Vec just wants a text file with one training sentence per line
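A minimal sketch of writing both input files from a toy labelled dataset (file names and data are placeholders):

```python
labelled = [
    (4, "hello there this is a sentence"),
    (1, "this product was a disappointment"),
]

# Supervised (text classification) format: __label__<label> then the sentence
with open("classification.train", "w") as f:
    for label, sentence in labelled:
        f.write(f"__label__{label} {sentence}\n")

# Word2Vec format: just one training sentence per line
with open("word2vec.train", "w") as f:
    for _, sentence in labelled:
        f.write(sentence + "\n")
```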
BlazingText - modes of Word2Vec?
Word2vec has multiple modes:
CBOW (Continuous Bag of Words → the order of words is thrown out, just the words themselves matter)
Skip-gram
Batch skip-gram → distributed computation over many CPU nodes
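A minimal sketch of selecting the Word2Vec mode on the built-in BlazingText container ("supervised" would select text classification instead); the role, bucket, and values are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = sagemaker.image_uris.retrieve("blazingtext", session.boto_region_name)

bt = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.c5.4xlarge",
    output_path="s3://my-bucket/blazingtext/output",      # placeholder bucket
    sagemaker_session=session,
)

# mode can be "cbow", "skipgram" or "batch_skipgram" (or "supervised" for classification)
bt.set_hyperparameters(mode="batch_skipgram", vector_dim=100, epochs=10)
```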