SageMaker Built-in Algorithms Flashcards

1
Q

Linear Learner

A

linear regression
can handle both regression and classification
for classification, a linear threshold is used

2
Q

Linear Learner Input Format

A

recordIO/protobuf, csv

file or pipe mode supported

3
Q

Linear Learner Usage

A
preprocessing:
- data must be normalized and shuffled
training:
- choose an optimization algorithm
- multiple models are optimized in parallel
- tune L1 and L2 regularization
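
For reference, a minimal sketch of launching the built-in Linear Learner with the SageMaker Python SDK; the role ARN and S3 paths are placeholders and the hyperparameter values are illustrative only:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
# look up the built-in Linear Learner container for the current region
container = image_uris.retrieve("linear-learner", region=session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",   # placeholder bucket
    sagemaker_session=session,
)
# l1 controls L1 regularization, wd (weight decay) controls L2
linear.set_hyperparameters(predictor_type="binary_classifier", l1=0.0, wd=0.01)
linear.fit({"train": "s3://my-bucket/linear-learner/train/"})
```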
4
Q

XGBoost

A

eXtreme Gradient Boosting
boosted group of decision trees
gradient descent to minimize loss
can be used for classification and regression

5
Q

XGBoost Input

A

CSV, libsvm

more recently, also recordIO/protobuf and Parquet

6
Q

XGBoost Usage

A

Models are serialized/deserialized with Pickle
can be used within a notebook or as a built-in SageMaker algorithm

HPs: subsample, eta, gamma, alpha, lambda

uses CPUs only; it is memory-bound rather than compute-bound
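
A minimal sketch of the "within a notebook" route, using the open-source xgboost library and pickling the trained booster; the toy data, hyperparameter values, and file name are illustrative only:

```python
import pickle
import numpy as np
import xgboost as xgb

# toy training data; in practice load your real features and labels
X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eta": 0.2,        # learning rate / step-size shrinkage
    "gamma": 1.0,      # minimum loss reduction required to make a split
    "alpha": 0.0,      # L1 regularization on weights
    "lambda": 1.0,     # L2 regularization on weights
    "subsample": 0.8,  # fraction of rows sampled per tree, helps prevent overfitting
}
booster = xgb.train(params, dtrain, num_boost_round=50)

# pickle the booster, as the card notes SageMaker does for its model artifacts
with open("xgboost-model", "wb") as f:
    pickle.dump(booster, f)
```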

7
Q

Seq2Seq

A

Input is a sequence of tokens, output is a sequence of tokens
good for machine translation, text summarization, speech to text

8
Q

Seq2Seq Input

A

recordIO/protobuf - tokens must be integers
start with tokenized text files
NEED TO PROVIDE TRAINING DATA, VALIDATION DATA, AND VOCAB FILES

9
Q

Seq2Seq Usage

A

Training can take days
Pretrained models available
Public training datasets available for specific translation tasks

HPs: batch size, optimizer type, # of layers
can optimize on accuracy, BLEU score, or perplexity

training can only use a single machine (GPU)

10
Q

DeepAR

A

forecasting one-dimensional time-series data
uses RNNs
allows you to train the same model on several related time series
finds frequency and seasonality

11
Q

DeepAR Input

A

JSON Lines format (can be gzipped or Parquet)
each record must contain: start (the starting timestamp) and target (the time-series values)
records can also contain dynamic_feat (dynamic features) and cat (categorical features)
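
A minimal sketch of what those JSON Lines records can look like (field names per the DeepAR docs; the timestamps and values are made up):

```python
import json

records = [
    # "start" is the first timestamp, "target" is the series itself;
    # "cat" (categorical features) and "dynamic_feat" (dynamic features) are optional
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.0, 9.0, 6.0],
     "cat": [0], "dynamic_feat": [[1.0, 0.0, 1.0, 1.0]]},
    {"start": "2024-01-01 00:00:00", "target": [2.0, 3.0, 4.0, 3.0],
     "cat": [1]},
]

with open("train.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
```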

12
Q

DeepAR Usage

A
  • always include the entire time series for training, testing, and inference
  • use the entire dataset as the test set; remove the last time points for training and evaluate on those withheld values
  • don’t use very large values for prediction length
  • train on many time series when possible, not just one

HPs: epochs, batch size, learning rate, # cells, context length

GPU or CPU for training, CPU only for inference

13
Q

BlazingText

A
  1. Text Classification
    - predict labels for a sentence (NOT whole documents)
    - supervised
    - ex. web search, information retrieval
  2. Word2Vec
    - creates a vector representation of words
    - semantically similar words are represented by vectors close to each other (this is a word embedding)
    - useful for NLP, but is not an NLP algorithm itself
    - only works on INDIVIDUAL words
14
Q

BlazingText Input

A
  1. Text Classification (supervised mode)
    - one sentence per line
    - the first word in the sentence is the label, prefixed with the string "__label__"
    - augmented manifest text format is also accepted
  2. Word2Vec
    - a text file with one sentence per line
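
A minimal sketch of writing a supervised-mode training file in that format (the labels and sentences are made up):

```python
# one sentence per line, lowercased and space-tokenized,
# with the label as the first token using the __label__ prefix
samples = [
    ("positive", "the movie was great and the acting was superb"),
    ("negative", "the plot made no sense at all"),
]

with open("blazingtext.train", "w") as f:
    for label, sentence in samples:
        f.write(f"__label__{label} {sentence}\n")
```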
15
Q

BlazingText Usage

A

Word2Vec has multiple modes:

- cbow > continuous bag of words (order doesn't matter)
- skip-gram (order matters)
- batch skip-gram (distributed over CPU nodes)

HPs:

  • Word2Vec: mode, learning rate, window size, vector dim, negative samples
  • Text Classification: epochs, learning rate, word n-grams (how many words we look at together), vector dim

cbow and skip-gram can use GPUs (CPU also works)
batch skip-gram can use single or multiple CPU instances
text classification: CPU for smaller datasets, GPU for larger ones

16
Q

Object2Vec

A
  • like Word2Vec but with arbitrary objects
  • boils the data down to a lower-dimensional embedding
    • compute nearest neighbors, visualize clusters, genre prediction, recommendations
  • UNSUPERVISED
17
Q

Object2Vec Input

A
  • tokenized into integers
  • pairs or sequences of tokens
    • sentence-sentence, labels-sequence, customer-customer, product-product, user-item
18
Q

Object2Vec Usage

A
  • process data into JSON Lines and shuffle it
  • train with 2 input channels, 2 encoders, 1 comparator
  • encoder choices:
    • average pooled embeddings, CNN, bidirectional LSTM
  • comparator is followed by a feed-forward neural network

HPs: usual deep learning ones:

- dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
- also: encoder1 network and encoder2 network (the encoder type used for each input channel)

trains on a single machine (multi-GPU is OK)
use the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression

19
Q

Object Detection

A
  • identify all objects in an image with bounding boxes
  • detect and classify with a single deep neural network
    • provides confidence scores
  • can train from scratch, or use pre-trained models based on ImageNet
20
Q

Object Detection Input

A
  • recordIO or image format (jpg/png); image format requires a JSON file per image for annotation data
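
A minimal sketch of generating one of those per-image annotation JSON files (field names follow the SageMaker object detection image-format docs; the file name, sizes, and boxes are made up):

```python
import json

annotation = {
    "file": "dog_001.jpg",                                     # hypothetical image name
    "image_size": [{"width": 640, "height": 480, "depth": 3}],
    "annotations": [
        # bounding box for one object, in pixels
        {"class_id": 0, "left": 100, "top": 120, "width": 200, "height": 180}
    ],
    "categories": [{"class_id": 0, "name": "dog"}],
}

with open("dog_001.json", "w") as f:
    json.dump(annotation, f)
```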
21
Q

Object Detection Usage

A
  • takes an image as input, outputs all instances of objects in the image with categories and confidence scores
  • CNN with the SSD (Single Shot MultiBox Detector) algorithm
    • base network: VGG-16 or ResNet-50
  • transfer learning mode / incremental training
    • use a pre-trained model for the base network weights instead of random initial weights
  • uses flip, rescale, and jitter internally to avoid overfitting

HPs: batch size, learning rate, optimizer

GPU for training, CPU for inference

22
Q

Image Classification

A
  • assign one or more labels to an image
  • doesn’t tell you where the objects are (no bounding boxes)

23
Q

Image Classification Input

A
  • Apache MXNet RecordIO (not protobuf!)
  • raw images (jpg or png)
    • requires a .lst file to associate the image index, class label, and path to each image
  • augmented manifest image format enables pipe mode
24
Q

Image Classification Usage

A
  • ResNet CNN
  • full training > initialized with random weights
  • transfer learning mode:
    • initialized with pretrained weights
    • top layer is initialized with random weights
    • network is fine-tuned with new training data
  • default image size is 3-channel 224x224 (the ImageNet standard)

HPs: batch size, learning rate, optimizer-specific parameters (weight decay, beta1, beta2, eps, gamma)

GPU for training, GPU or CPU for inference

25
Q

Semantic Segmentation

A
- pixel-level object classification
- useful for self-driving cars
- produces a segmentation mask
26
Q

Semantic Segmentation Input

A
- JPG images and PNG annotations
- label maps for describing the annotations
- augmented manifest image format supported for pipe mode
- JPG images accepted for inference
27
Q

Semantic Segmentation Usage

A
- built on MXNet Gluon and GluonCV
- choice of 3 algorithms:
  - fully-convolutional network (FCN)
  - pyramid scene parsing (PSP)
  - DeepLabV3
- backbone: ResNet-50 or ResNet-101, both trained on ImageNet

HPs: epochs, learning rate, batch size, optimizer, algorithm used, backbone used

single-machine GPU only for training; CPU or GPU for inference
28
Q

Random Cut Forest

A
- unsupervised anomaly detection
- detects:
  - spikes in time-series data
  - breaks in periodicity
  - unclassifiable data points
- assigns an anomaly score to each data point
- Amazon is very proud of this one!
29
Q

Random Cut Forest Inputs

A
- CSV or recordIO/protobuf
- file or pipe mode
- optional test channel for computing AUC, recall, precision, and F1 score
30
Q

Random Cut Forest Usage

A
- creates a forest of trees where each tree is a partition of the training data
- looks at the expected change in the complexity of a tree as a result of adding a new point
- data is sampled randomly, then trained
- can be used on time series

HPs: number of trees (increasing it reduces noise), number of samples per tree

no GPU support
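
A sketch using the RandomCutForest estimator class from the SageMaker Python SDK and its record_set helper; the role ARN is a placeholder and the sine wave stands in for real data:

```python
import numpy as np
from sagemaker import RandomCutForest

rcf = RandomCutForest(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_trees=100,              # more trees tends to reduce noise in the anomaly score
    num_samples_per_tree=256,   # random sample drawn for each tree
)

# toy 1-D series reshaped to (num_records, num_features)
series = np.sin(np.linspace(0, 50, 1000)).astype("float32").reshape(-1, 1)
rcf.fit(rcf.record_set(series))
```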
31
Q

Neural Topic Model

A
- organize documents into topics
- classify or summarize documents based on those topics
- not just TF/IDF; NTM groups things at a higher conceptual level
- unsupervised
- uses a neural variational inference algorithm
32
Q

Neural Topic Model Input

A
- four data channels: the train channel is required; validation, test, and auxiliary are optional
- recordIO/protobuf or CSV
- words must be tokenized into integers using a vocabulary file
- file or pipe mode
33
Q

Neural Topic Model Usage

A
- you define how many topics to generate
- topics are a latent representation based on their top-ranking words
- one of two topic modeling algorithms in SageMaker (the other being LDA)

HPs: batch size and learning rate (smaller values can reduce validation loss but increase training time), # of topics

CPU or GPU
34
Q

Latent Dirichlet Allocation (LDA)

A
- topic modeling, not deep-learning based
- unsupervised
- groups documents that share a subset of words
- can be used for things other than words: customer clustering, harmonic analysis
35
Q

LDA Input

A
- train channel, optional test channel
- recordIO/protobuf or CSV
- data must be tokenized
- with CSV, each document has counts for every word in the vocabulary
- pipe mode only supported with recordIO
36
Q

LDA Usage

A
- unsupervised: you pick the number of topics
- the test channel can be used to score results
- functionally similar to the Neural Topic Model, but CPU-based

HPs: # of topics, alpha0 (initial guess for the concentration values)

single-instance CPU only
37
Q

KNN (k-Nearest Neighbors)

A
- supervised
- simple classification or regression algorithm
- classification: find the K closest points to a sample and return the most frequent label
- regression: find the K closest points to a sample and return the average value
38
Q

KNN Input

A
- train channel, optional test channel
- recordIO/protobuf or CSV
- file or pipe mode
39
Q

KNN Usage

A
- data is sampled
- dimensionality reduction is performed
  - avoids sparse data, at the cost of noise/accuracy
  - "sign" or "fjlt" methods
- an index is built for looking up neighbors, the model is serialized, then queried for a given K

HPs: K, sample size

CPU or GPU for inference: CPU for lower latency, GPU for higher throughput
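
To make the classification behavior concrete, a tiny scikit-learn illustration (not the SageMaker built-in; the data is random toy data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: label is 1 when the first two features sum above 1.0
X = np.random.rand(200, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5)  # K = 5
knn.fit(X, y)

# prediction = most frequent label among the 5 closest training points
print(knn.predict(X[:3]))
```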
40
Q

K-Means

A
- unsupervised clustering
- divides data into K groups where members of a group are as similar as possible to each other
- you define "similar"; the algorithm measures it with Euclidean distance
- SageMaker offers web-scale k-means clustering
41
Q

K-Means Inputs

A
- train channel (use the ShardedByS3Key distribution), optional test channel (use FullyReplicated)
- recordIO/protobuf or CSV
- file or pipe mode
42
Q

K-Means Usage

A
- every observation is mapped to n-dimensional space
- works to optimize the centers of K clusters
- extra cluster centers may be specified to improve accuracy: K = k*x, where k = the clusters we want and x = the extra cluster factor
- algorithm:
  - determine initial cluster centers, using random or k-means++ initialization (k-means++ tries to make the initial clusters far apart)
  - iterate over the data and calculate cluster centers
  - reduce the clusters from K down to k (using Lloyd's method with k-means++)

HPs: batch size, extra center factor (x), init method (random or k-means++), K
- choosing K is tricky: use the elbow method, i.e. optimize for tightness of clusters (sketched below)

CPU or GPU (CPU recommended)
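
A small sketch of the elbow method, using scikit-learn's KMeans on toy data (not the SageMaker built-in): compute the within-cluster sum of squares for a range of k and pick the value where the curve bends.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 5)  # toy data

inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squares ("tightness")

# look for the "elbow": the k after which the inertia stops dropping sharply
for k, inertia in zip(range(1, 11), inertias):
    print(k, round(inertia, 2))
```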
43
Q

Principal Component Analysis (PCA)

A
- dimensionality reduction
- projects higher-dimensional data into a lower-dimensional space (like a 2D plot) while minimizing loss of information
- the reduced dimensions are called components
  - the first component has the largest possible variability, the second component has the next largest, and so on
- unsupervised
44
Q

PCA Inputs

A
- recordIO/protobuf
- file or pipe mode
45
Q

PCA Usage

A
- a covariance matrix is created, then singular value decomposition (SVD) is applied
- 2 modes:
  - regular: for sparse data and a moderate number of features
  - randomized: for a large number of features; uses an approximation algorithm

HPs: algorithm mode, subtract mean (unbiases the data)

CPU or GPU; it depends on the specifics of the input data
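
A small numpy sketch of the idea behind regular mode: build the covariance matrix, run SVD, and project onto the top components (toy data, illustrative only):

```python
import numpy as np

X = np.random.rand(200, 10)                 # toy data: 200 observations, 10 features
X_centered = X - X.mean(axis=0)             # "subtract mean" = unbias the data

cov = np.cov(X_centered, rowvar=False)      # covariance matrix of the features
U, S, Vt = np.linalg.svd(cov)               # singular value decomposition

n_components = 2
components = Vt[:n_components]              # directions of largest variance first
X_reduced = X_centered @ components.T       # project onto the top components
print(X_reduced.shape)                      # (200, 2)
```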
46
Q

Factorization Machines

A
- classification or regression with SPARSE DATA
- good for recommendations:
  - click prediction
  - item recommendations
  - since a user doesn't interact with most pages/products, the data is sparse
- supervised (classification or regression)
- limited to pair-wise interactions, e.g. user-item
47
Q

Factorization Machines Inputs

A
- recordIO/protobuf with Float32
- sparse data means CSV isn't practical
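
A sketch of converting a sparse interaction matrix into that format, assuming the write_spmatrix_to_sparse_tensor helper from the SageMaker Python SDK (sagemaker.amazon.common); the matrix sizes, entries, and labels are made up:

```python
import io
import numpy as np
import scipy.sparse as sp
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

# hypothetical user x item click matrix: mostly zeros, hence sparse
interactions = sp.lil_matrix((1000, 500), dtype=np.float32)
interactions[0, 42] = 1.0
interactions[3, 7] = 1.0

# one label per row (e.g. clicked or not)
labels = np.zeros(1000, dtype=np.float32)
labels[[0, 3]] = 1.0

buf = io.BytesIO()
write_spmatrix_to_sparse_tensor(buf, interactions.tocsr(), labels)
buf.seek(0)  # upload buf to S3 as the training channel
```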
48
Q

Factorization Machines Usage

A
- essentially builds a big (sparse) matrix
- finds the factors we can use to predict a classification (click or not?) or value (predicted rating) given a matrix representing some pair of things (users and items)

HPs: initialization methods for the bias, factor, and linear terms (uniform, normal, or constant)

CPU or GPU; CPU is recommended, GPU only works with dense data
49
Q

IP Insights

A
- unsupervised learning of IP address usage patterns
- identifies suspicious activity
- a security tool
50
Q

IP Insights Inputs

A
- user names and account IDs can be fed in directly
- training channel, optional validation channel (computes AUC)
- CSV only: entity, IP address
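
A minimal sketch of that CSV layout, entity first and then IP (the names and addresses are made up):

```python
import csv

rows = [
    ("user_alice", "192.0.2.10"),   # entity, IP address
    ("user_bob", "198.51.100.7"),
]

with open("ipinsights_train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```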
51
Q

IP Insights Usage

A
- uses a neural network to learn latent vector representations of entities and IP addresses
- entities are hashed and embedded (the hash size needs to be big enough)
- automatically generates negative samples by randomly pairing entities and IPs

HPs: # of entity vectors (hash size), vector dimension, epochs, learning rate, batch size

CPU or GPU (GPU recommended)