Modeling Flashcards

(91 cards)

1
Q

Types of Neural Network

A

Feedforward
Convolutional Neural Network
Recurrent Neural Network

2
Q

Convolutional Neural Network (CNN)

A

Image Classification

3
Q

Recurrent Neural Network

A

for sequences
e.g. Stock Prices, Words in a sentence…

  • LSTM, GRU
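A minimal sketch of this idea, assuming TensorFlow/Keras and made-up shapes (30 time steps of 1 feature, one predicted value):

```
import tensorflow as tf

# Sequence model sketch: an LSTM reads the sequence, a Dense layer predicts.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(30, 1)),  # 30 time steps, 1 feature each
    tf.keras.layers.Dense(1),                       # e.g. the next value in the series
])
model.compile(optimizer="adam", loss="mse")

# Swap the LSTM for tf.keras.layers.GRU(64) to use the simplified gated unit.
```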
4
Q

LSTM full form

A

Long Short Term Memory

5
Q

GRU full form

A

Gated Recurrent Unit

6
Q

There might be a need for feature-location invariance; what should you do?

A

e.g. when you're not sure where the sign is in the image, use a CNN, since convolution can find features wherever they appear

7
Q

adversarial example

A

An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction.

Another example domain where this occurs: sentiment analysis

8
Q

MaxPooling1D
MaxPooling2D
MaxPooling3D

A

Distill the input down to the bare essence of what you need to analyse

9
Q

Conv1D
Conv2D
Conv3D

A

These layer types do the actual convolution:

1D: e.g. text
2D: e.g. images
3D: e.g. 3D volumetric data

10
Q

Typical image processing with a CNN: what's the layer sequence?

A

Conv2D:
- does the convolution

MaxPooling2D:
- distills down and shrinks the image

Dropout:
- prevents overfitting

Flatten:
- flattens the data to feed it into a perceptron

Dense:
- a hidden layer of neurons (the perceptron)

Dropout:
- more overfitting prevention

Softmax:
- chooses the final classification that comes out of the neural network
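As a hedged illustration, here is that layer sequence as a Keras model, assuming made-up 28x28 grayscale inputs and 10 output classes:

```
import tensorflow as tf
from tensorflow.keras import layers

# The layer sequence above, sketched in Keras (hypothetical input/output sizes).
model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution
    layers.MaxPooling2D((2, 2)),             # distill / shrink the image
    layers.Dropout(0.25),                    # prevent overfitting
    layers.Flatten(),                        # flatten for the dense layers
    layers.Dense(128, activation="relu"),    # hidden layer (perceptron)
    layers.Dropout(0.5),                     # more overfitting prevention
    layers.Dense(10, activation="softmax"),  # choose the final classification
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```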

11
Q

Name some Specialised Architectures of CNN

A

LeNet-5:
- handwriting recognition

AlexNet:
- Image Classification, Deeper than LeNet

GoogLeNet:
- even deeper, but with better performance
- uses Inception Modules

ResNet:
- Residual Network, even deeper but maintains performance using Skip Connections

12
Q

Recurrent Neural Network (RNN) topologies

A

Where sequence matters

  • Sequence to Sequence
  • Sequence to Vector
  • Vector to sequence
  • Encoder -> Decoder
13
Q

Sequence to Sequence NN

A

Input is a time series; output is also a time series

e.g. predicting future stock prices from historical ones

14
Q

Sequence to Vector NN

A

e.g. Words in a sentence to sentiments

15
Q

Vector to Sequence NN

A

e.g. produce a caption from an image

16
Q

Encoder Decoder

A

e.g.
Sequence to Vector to Sequence

Capture the words of a French sentence into a vector, then translate it to English

17
Q

Training RNN

A

Backpropagation through both the neural network and through time

Really hard
Sensitive to hyperparameters
Resource intensive

18
Q

LSTM

A

maintains both long term and short term states

19
Q

GRU

A

Gated Recurrent Unit

Simplified LSTM

20
Q

What if you make the wrong choices when training an RNN?

A

It might lead to an RNN that doesn't converge at all

21
Q

What does AWS offer for training a neural network?

A

Apache MXNet on EMR

P2, P3, G Instance types

Deep Learning AMI

22
Q

Major Components of Tuning a Neural Network? (hyperparameters)

A

Some knobs and dials:

  • Learning Rate
  • Batch size
  • epochs
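A small sketch of where these knobs appear in Keras; the model and data here are made up purely for illustration:

```
import numpy as np
import tensorflow as tf

# Tiny made-up model and data just to show where the knobs go.
x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # learning rate
    loss="binary_crossentropy",
)
model.fit(x_train, y_train,
          batch_size=32,  # batch size: samples per gradient update
          epochs=10)      # epochs: full passes over the training set
```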
23
Q

Learning Rate

A

Applies to gradient descent (or other optimization methods); the learning rate controls how big a step you take at each iteration.

Too high a learning rate:
- you overshoot the optimal solution

Too low a learning rate:
- it takes too long to find the optimal solution

24
Q

Batch Size

A

Small batch sizes can work their way out of local minima more easily

Large Batch sizes can end up getting stuck in the wrong solution

Combined with random shuffling at each epoch, this can show up as very inconsistent results from run to run

25
Q

Learning Rate and Training

A

A small learning rate will increase training time
A large learning rate can overshoot the correct solution

26
Q

Regularization Techniques. What do they do?

A

They prevent overfitting

27
Q

If you are overfitting?

A

Try a simpler model
Try fewer neurons
Try fewer layers

Dropout:
- remove some neurons at random at each training step, to force the model to spread its learning around

Early Stopping:
- stop at the point where training accuracy keeps rising but validation accuracy does not
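A sketch of the Dropout and Early Stopping ideas in Keras, with made-up data:

```
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Made-up data; the point is Dropout in the model and EarlyStopping on validation.
x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),                    # randomly drop neurons while training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",   # stop once validation accuracy stops improving
    patience=3,
    restore_best_weights=True,
)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```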
28
Q

Vanishing Gradient Problem

A

The opposite of exploding gradients
The vanishing gradient problem is when the slope of the learning curve approaches zero

29
Q

Addressing the Vanishing Gradient Problem

A

Multi-level hierarchy
- train sub-networks individually instead of the whole network
LSTM
Residual Networks
- ResNet, for object recognition
Better choice of activation function
- ReLU

30
Q

Gradient Checking

A

A debugging technique
Numerically check the derivatives computed during training
Useful for validating the code of a neural network
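A toy illustration of gradient checking with NumPy, comparing an analytic gradient to a central-difference estimate for a simple function (not any particular framework's API):

```
import numpy as np

# f(w) = sum(w^2), so the analytic gradient is 2w.
def f(w):
    return np.sum(w ** 2)

def analytic_grad(w):
    return 2 * w

w = np.random.randn(5)
eps = 1e-6
eye = np.eye(len(w))
numeric = np.array([(f(w + eps * eye[i]) - f(w - eps * eye[i])) / (2 * eps)
                    for i in range(len(w))])

# A tiny relative error suggests the gradient code is correct.
rel_error = (np.linalg.norm(numeric - analytic_grad(w))
             / np.linalg.norm(numeric + analytic_grad(w)))
print(rel_error)
```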
31
Q

L1 and L2 Regularization

A

L1 is the sum of the weights
L2 is the sum of the squares of the weights
Both are used to prevent overfitting

32
Q

L1 and L2 differences?

A

L1: sum of weights
- performs feature selection
- computationally inefficient
- sparse output

L2: sum of squares of weights
- all features remain considered, just weighted
- computationally efficient
- dense output

33
Q

Why choose L1 over L2?

A

Feature selection reduces dimensionality
- out of 100 features, maybe only 10 end up with non-zero coefficients
- the resulting sparsity can make up for L1's computational inefficiency
On the other hand, if you think all of the features are important, go for L2
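A minimal Keras sketch showing how an L1 or L2 penalty is attached to a layer's weights (layer sizes are arbitrary):

```
from tensorflow.keras import layers, regularizers

# Same layer, two different weight penalties.
l1_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))  # sparse weights, feature selection
l2_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))  # all weights kept, just shrunk
```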
34
Q

Confusion Matrix: TP / TN / FP / FN

A

Predicted Yes, Actual Yes - True Positive
Predicted Yes, Actual No - False Positive
Predicted No, Actual Yes - False Negative
Predicted No, Actual No - True Negative

35
Q

Multi-class confusion matrix

A

Often drawn with a heat map
Useful for multi-class classification
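A quick sketch with scikit-learn and matplotlib, using made-up labels, that builds a multi-class confusion matrix and renders it as a heat map:

```
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Made-up labels; rows = actual class, columns = predicted class.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Render as a heat map, which is handy for multi-class problems.
plt.imshow(cm, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.colorbar()
plt.show()
```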
36
Q

Precision

A

TP / (TP + FP)
Correct positives over everything nominated as positive
AKA:
- percent of relevant results
- correct positives
Use when false positives matter
- e.g. medical screening, drug testing

37
Q

Recall

A

TP / (TP + FN)
Correct positives over everything that was actually positive
AKA:
- Sensitivity, True Positive rate, Completeness
- % of actual positives correctly identified
Good for when false negatives are costly
- e.g. fraud detection

38
Q

F1

A

2TP / (2TP + FP + FN)
= 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean of precision and sensitivity (recall)
Use when you care about both precision and recall
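These three metrics computed with scikit-learn on made-up binary labels:

```
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up binary labels.
y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```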
39
Q

Specificity

A

TN / (TN + FP)
True Negative rate

40
Q

RMSE

A

Root mean squared error
An accuracy measurement
Only cares about right and wrong answers

41
Q

ROC

A

Receiver Operating Characteristic curve
Plots the true positive rate vs. the false positive rate at various threshold settings
Points above the diagonal represent classification better than random
The more the curve bends toward the upper-left, the better
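A scikit-learn sketch with made-up scores; roc_curve gives the curve points and roc_auc_score summarizes it (0.5 is random, 1.0 is perfect):

```
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))      # area under that curve
```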
43
Q

Ensemble Learning

A

An ensemble model takes multiple models (which might just be variations of the same model) and lets them all vote on the final result
Bagging and Boosting

44
Q

Are Decision Trees prone to overfitting?

A

Yes, they are!

45
Q

Bagging

A

Generate multiple training sets by random sampling with replacement
Each resampled model can be trained in parallel
The ensemble ends up more robust than a single model

46
Q

Boosting

A

Works in a serial manner (vs. the parallel approach of bagging)
Assigns a weight to each observation in the dataset
Training is sequential, starting with equal weights for each observation
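A scikit-learn sketch contrasting the two on toy data (GradientBoostingClassifier stands in for boosting here; XGBoost works the same way conceptually):

```
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Toy data just to show the two ensemble styles side by side.
X, y = make_classification(n_samples=500, random_state=0)

# Bagging: resampled training sets, models trained in parallel, majority vote
# (the default base estimator is a decision tree).
bagged = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: trees are added sequentially, each one correcting the errors of the last.
boosted = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
```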
47
Q

Bagging vs Boosting

A

XGBoost (boosting) is the hot algorithm today

48
Q

XGBoost's strength

A

Accuracy

49
Q

What is Bagging good for?

A

Avoiding overfitting (it has a regularization effect)
Bagging is also easier to parallelize
50
Q

An ideal file format for SageMaker to fetch data from S3?

A

RecordIO / Protobuf

51
Q

Can you use SageMaker within Spark?

A

Yes, you can

52
Q

SageMaker Neo

A

For deploying models to edge devices

53
Q

SageMaker | Linear Learner

A

Linear regression
- Numeric prediction
- Classification (binary / multi-class)
54
Q

Linear Learner input format

A

Performant option:
- RecordIO-wrapped protobuf (float32 only)
CSV
- first column assumed to be the label
File or Pipe mode both supported

55
Q

File vs Pipe mode

A

File mode copies the data to the training fleet
Pipe mode streams only the data that's required, which is why it's more efficient

56
Q

Linear Learner Preprocessing

A

Data should be normalized
- so all features are weighted the same
- Linear Learner can do this for you optionally
Shuffle the input data
57
Q

Linear Learner Training

A

Uses SGD
- choice of optimization algorithms: Adam, AdaGrad, SGD, etc.
Multiple models are optimized in parallel
Tune L1, L2 regularization

58
Q

Linear Learner Validation

A

The most optimal model is selected

59
Q

Linear Learner Hyperparameters

A

balance_multiclass_weights
- gives each class equal importance in the loss function
learning_rate, mini_batch_size
l1
- L1 regularization
wd
- weight decay (L2 regularization)
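A sketch of training Linear Learner with the SageMaker Python SDK; the role ARN, bucket paths, and instance choice below are placeholders, not values from this deck:

```
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Sketch only: assumes AWS credentials and a SageMaker execution role exist.
session = sagemaker.Session()
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",     # placeholder
    sagemaker_session=session,
)
linear.set_hyperparameters(
    predictor_type="binary_classifier",
    learning_rate=0.01,
    mini_batch_size=200,
    l1=0.0,    # L1 regularization
    wd=0.01,   # weight decay (L2 regularization)
)
linear.fit({"train": "s3://my-bucket/linear-learner/train"})  # placeholder channel
```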
60
Q

Linear Learner Instance types

A

Single or multi-machine, CPU or GPU
Multi-GPU does not help in this case

61
Q

XGBoost

A

eXtreme Gradient Boosting
- a boosted group of decision trees
- new trees are made to correct the errors of previous trees
- uses gradient descent to minimize loss as new trees are added

62
Q

XGBoost industry trend

A

It's the talk of the town on Kaggle
It is also fast (not resource intensive)

63
Q

XGBoost is for Classification or Regression?

A

Both
It does regression as well, using regression trees
64
Q

XGBoost input format

A

CSV or libsvm
No Protobuf here

65
Q

XGBoost, how is it used?

A

Models are serialized/deserialized with pickle
Can be used as a framework within notebooks
- sagemaker.xgboost
Or as a built-in SageMaker algorithm

66
Q

XGBoost Hyperparameters

A

Subsample
- prevents overfitting
Eta
- step size shrinkage, prevents overfitting
Gamma
- minimum loss reduction to create a partition; larger value = more conservative
Alpha
- L1 regularization term; larger value = more conservative
Lambda
- L2 regularization term; larger value = more conservative
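The same hyperparameters in a sketch using the open-source xgboost library on toy data (parameter names match the card; the values are arbitrary):

```
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy data; the point is the hyperparameters named on the card.
X, y = make_classification(n_samples=500, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "subsample": 0.8,  # row subsampling, prevents overfitting
    "eta": 0.1,        # step size shrinkage
    "gamma": 1.0,      # minimum loss reduction to split; larger = more conservative
    "alpha": 0.5,      # L1 regularization term
    "lambda": 1.0,     # L2 regularization term
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```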
67
Q

XGBoost instance types

A

CPU only
It is memory-bound, not compute-bound
M4 is a good choice

68
Q

Seq2Seq use cases

A

Input is a sequence of tokens, output is a sequence of tokens
Machine Translation
Text Summarization
Speech to Text
Implemented with RNNs and CNNs with attention

69
Q

Seq2Seq training input

A

RecordIO-Protobuf
- tokens must be integers (unusual, since most algorithms want floating-point data)
Start with tokenized text files
Convert to Protobuf using sample code
- packs into integer tensors with vocabulary files
- a lot like the TF-IDF lab
Must provide:
- training data
- validation data
- vocabulary files
70
Q

Are there any pre-trained models for SageMaker seq2seq?

A

Yes, there are many
Public training datasets are also available for specific translation tasks

71
Q

Seq2Seq Hyperparameters

A

batch_size
optimizer_type
- adam
- sgd
- rmsprop
learning_rate
num_layers_encoder / num_layers_decoder
Can optimize on:
- Accuracy: vs. a provided validation dataset
- BLEU score: compares against multiple reference translations
- Perplexity: cross-entropy

72
Q

Seq2Seq instance type

A

GPU only (e.g. P3)
Only a single machine, but it can have multiple GPUs
73
Q

DeepAR

A

Forecasting one-dimensional time-series data
Uses RNNs
Allows training the same model over several related time series
Finds frequencies and seasonality

74
Q

DeepAR training input

A

JSON Lines format
- GZIP or Parquet
Each record must contain:
- Start: the starting timestamp
- Target: the time-series values
Each record can contain:
- Dynamic_feat: dynamic features (such as whether a promotion was applied to a product in a time series)
- Cat: categorical features
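A sketch of writing such records as gzipped JSON Lines in Python; the timestamps, values, and file name are made up:

```
import gzip
import json

# Made-up records in the JSON Lines layout described above:
# "start" and "target" are required; "dynamic_feat" and "cat" are optional.
records = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [5.0, 7.0, 6.5, 8.2],
        "dynamic_feat": [[0, 1, 1, 0]],  # e.g. promotion applied at each time step
        "cat": [0],                      # e.g. product category
    },
]

with gzip.open("train.json.gz", "wt") as f:  # placeholder file name
    for record in records:
        f.write(json.dumps(record) + "\n")
```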
75
Q

DeepAR | Which part of the data to use for input?

A

Always include the entire time series for training, testing, and inference
Use the entire dataset as the test set; remove the last time points for training, and evaluate on the withheld values
Don't use very large values for prediction length (> 400)
Train on many time series, not just one, when possible

76
Q

Maximum recommended prediction length in DeepAR

A

400 data points into the future

77
Q

DeepAR Hyperparameters
context_length? Others:
- epochs
- mini_batch_size
- learning_rate
- num_cells

A

context_length is the number of time points the model sees before making a prediction
- it can be smaller than the seasonalities; the model will lag one year anyhow
78
Q

DeepAR instance types

A

CPU or GPU
Single or multi-machine
Start with CPU (C4.2xlarge, C4.4xlarge), which is more economical
Move to GPU only if necessary
CPU-only for inference
May need larger instances for hyperparameter tuning

79
Q

BlazingText | good for?

A

Text classification
- predicts labels for sentences (not entire documents)
- supervised
- useful in web search and information retrieval
Word2vec
- word embeddings
- useful for NLP, but is not an NLP algorithm itself
- machine translation, sentiment analysis
- works on individual words, not sentences or documents

80
Q

BlazingText input

A

Text classification:
- one sentence per line
- __label__{label} {sentence}
- an augmented manifest text format is also possible: {"source":"", "label":""}
Word2vec:
- a text file with one training sentence per line
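A sketch of producing both input styles in Python; the file names and sentences are made up:

```
# Text classification (supervised): one sentence per line, label prefixed.
with open("classification.txt", "w") as f:   # placeholder file name
    f.write("__label__positive i loved this movie\n")
    f.write("__label__negative the plot made no sense\n")

# word2vec (unsupervised): one tokenized training sentence per line.
with open("word2vec.txt", "w") as f:         # placeholder file name
    f.write("the quick brown fox jumps over the lazy dog\n")
```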
81
Q

Word2vec in BlazingText: what modes of operation are available?

A

CBOW (continuous bag of words)
- the order of the words is thrown out; just the words themselves matter
Skip-gram
- n-grams; the order of the words does matter
Batch skip-gram
- distributes computation over many CPU nodes

82
Q

BlazingText hyperparameters

A

Word2vec:
- mode (batch_skipgram, skipgram, cbow)
- learning_rate
- window_size
- vector_dim
- negative_samples

Text classification:
- epochs
- learning_rate
- word_ngrams
- vector_dim

83
Q

BlazingText instance types

A

cbow, skipgram:
- ml.p3.2xlarge; any single CPU or single GPU instance will work
batch_skipgram:
- single or multiple CPU instances (scales horizontally)
Text classification:
- C5 if training data is under 2 GB
- for larger datasets, a single GPU instance (ml.p2/3.xlarge)
84
Q

Object2Vec

A

Like word2vec, but for arbitrary objects
Creates low-dimensional dense embeddings of high-dimensional objects
Compute nearest neighbors of objects
Visualize clusters
Genre prediction
Recommendations (similar items to users)
Unsupervised

85
Q

Does Object2Vec perform unsupervised or supervised learning?

A

Unsupervised, so you don't need to train it with labels
It can automatically figure out what is similar based on the inherent structure of the data within its features

86
Q

Object2Vec inputs

A

Data must be tokenized into integers
Consists of pairs of tokens and/or sequences of tokens:
- sentence - sentence
- labels - sequence (genre to description)
- product - product
- user - item
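A sketch of Object2Vec-style training records as JSON Lines, assuming the in0/in1 field convention; the token IDs, labels, and file name are made up:

```
import json

# Each record pairs two tokenized (integer-encoded) inputs with an optional label.
pairs = [
    {"label": 1, "in0": [6, 17, 606], "in1": [16, 21, 13]},  # e.g. sentence - sentence
    {"label": 0, "in0": [22, 1016],   "in1": [32, 9]},       # e.g. user - item
]

with open("object2vec_train.jsonl", "w") as f:  # placeholder file name
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```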
87
Q

Object2Vec encoder choices?

A

Average-pooled embeddings
CNN
Bidirectional LSTM

88
Q

Object2Vec, how to use?

A

Process the data into JSON Lines and shuffle it
Train with two input channels, two encoders, and a comparator
Choose an encoder for each channel
The comparator is followed by a feed-forward neural network

89
Q

Object2Vec hyperparameters

A

The usual deep-learning knobs:
- dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
Encoders (hcnn, bilstm, pooled_embedding):
- enc1_network
- enc2_network

90
Q

Object2Vec instance type

A

Single machine, CPU or GPU
- ml.m5.2xlarge
- ml.m5.4xlarge
- ml.m5.12xlarge
- GPU: ml.p2.xlarge
Multi-GPU is OK

91
Q

Object2Vec inference instance type

A

Use ml.p3.2xlarge
Use the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression