Modeling Flashcards

(91 cards)

1
Q

Types of Neural Network

A

Feedforward
Convolutional Neural Network
Recurrent Neural Network

2
Q

Convolutional Neural Network (CNN)

A

Image Classification

3
Q

Recurrent Neural Network

A

for sequences
e.g. Stock Prices, Words in a sentence…

  • LSTM, GRU
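A minimal sketch of this idea, assuming TensorFlow/Keras and made-up shapes (30 time steps of 1 feature, one predicted value):

```
import tensorflow as tf

# Sequence model sketch: an LSTM reads the sequence, a Dense layer predicts.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(30, 1)),  # 30 time steps, 1 feature each
    tf.keras.layers.Dense(1),                       # e.g. the next value in the series
])
model.compile(optimizer="adam", loss="mse")

# Swap the LSTM for tf.keras.layers.GRU(64) to use the simplified gated unit.
```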
4
Q

LSTM full form

A

Long Short Term Memory

5
Q

GRU full form

A

Gated Recurrent Unit

6
Q

There might be a need for feature-location invariance; what should you do?

A

e.g. when you're not sure where the sign is in the image, use a CNN, since convolution can find features wherever they appear

7
Q

adversarial example

A

An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction.

Another example domain where this occurs: sentiment analysis

8
Q

MaxPooling1D
MaxPooling2D
MaxPooling3D

A

Distill the input down to the bare essence of what you need to analyse

9
Q

Conv1D
Conv2D
Conv3D

A

These layer types do the actual convolution:

1D: e.g. text
2D: e.g. images
3D: e.g. 3D volumetric data

10
Q

Typical image processing with a CNN: what's the layer sequence?

A

Conv2D:
- does the convolution

MaxPooling2D:
- distills down and shrinks the image

Dropout:
- prevents overfitting

Flatten:
- flattens the data to feed it into a perceptron

Dense:
- a hidden layer of neurons (the perceptron)

Dropout:
- more overfitting prevention

Softmax:
- chooses the final classification that comes out of the neural network
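As a hedged illustration, here is that layer sequence as a Keras model, assuming made-up 28x28 grayscale inputs and 10 output classes:

```
import tensorflow as tf
from tensorflow.keras import layers

# The layer sequence above, sketched in Keras (hypothetical input/output sizes).
model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # convolution
    layers.MaxPooling2D((2, 2)),             # distill / shrink the image
    layers.Dropout(0.25),                    # prevent overfitting
    layers.Flatten(),                        # flatten for the dense layers
    layers.Dense(128, activation="relu"),    # hidden layer (perceptron)
    layers.Dropout(0.5),                     # more overfitting prevention
    layers.Dense(10, activation="softmax"),  # choose the final classification
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```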

11
Q

Name some Specialised Architectures of CNN

A

LeNet-5:
- handwriting recognition

AlexNet:
- Image Classification, Deeper than LeNet

GoogLeNet:
- even deeper, but with better performance
- uses Inception Modules

ResNet:
- Residual Network, even deeper but maintains performance using Skip Connections

12
Q

Recurrent Neural Network (RNN) topologies

A

Where sequence matters

  • Sequence to Sequence
  • Sequence to Vector
  • Vector to sequence
  • Encoder -> Decoder
13
Q

Sequence to Sequence NN

A

Input is a time series; output is also a time series

e.g. predicting future stock prices from historical ones

14
Q

Sequence to Vector NN

A

e.g. Words in a sentence to sentiments

15
Q

Vector to Sequence NN

A

e.g. produce a caption from an image

16
Q

Encoder Decoder

A

e.g.
Sequence to Vector to Sequence

Capture the words of a French sentence into a vector, then translate it to English

17
Q

Training RNN

A

Backpropagation through both the neural network and through time

Really hard
Sensitive to hyperparameters
Resource intensive

18
Q

LSTM

A

maintains both long term and short term states

19
Q

GRU

A

Gated Recurrent Unit

Simplified LSTM

20
Q

What if you make the wrong choices when training an RNN?

A

It might lead to an RNN that doesn't converge at all

21
Q

What does AWS offer for training a neural network?

A

Apache MXNet on EMR

P2, P3, G Instance types

Deep Learning AMI

22
Q

Major Components of Tuning a Neural Network? (hyperparameters)

A

Some knobs and dials:

  • Learning Rate
  • Batch size
  • epochs
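A small sketch of where these knobs appear in Keras; the model and data here are made up purely for illustration:

```
import numpy as np
import tensorflow as tf

# Tiny made-up model and data just to show where the knobs go.
x_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # learning rate
    loss="binary_crossentropy",
)
model.fit(x_train, y_train,
          batch_size=32,  # batch size: samples per gradient update
          epochs=10)      # epochs: full passes over the training set
```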
23
Q

Learning Rate

A

Applies to gradient descent (or other optimization methods); the learning rate controls how big a step you take at each iteration.

Too high a learning rate:
- you overshoot the optimal solution

Too low a learning rate:
- it takes too long to find the optimal solution

24
Q

Batch Size

A

Small batch sizes can work their way out of local minima more easily

Large Batch sizes can end up getting stuck in the wrong solution

Combined with random shuffling at each epoch, this can show up as very inconsistent results from run to run

25
Q

Learning Rate and Training

A

A small learning rate will increase training time
A large learning rate can overshoot the correct solution

26
Q

Regularization Techniques. What do they do?

A

They prevent overfitting

27
Q

If you are overfitting?

A

Try a simpler model
Try fewer neurons
Try fewer layers

Dropout:
- remove some neurons at random at each training step, to force the model to spread its learning around

Early Stopping:
- stop at the point where training accuracy keeps rising but validation accuracy does not
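A sketch of the Dropout and Early Stopping ideas in Keras, with made-up data:

```
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Made-up data; the point is Dropout in the model and EarlyStopping on validation.
x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),                    # randomly drop neurons while training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",   # stop once validation accuracy stops improving
    patience=3,
    restore_best_weights=True,
)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```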
28
Q

Vanishing Gradient Problem

A

The opposite of exploding gradients
The vanishing gradient problem is when the slope of the learning curve approaches zero

29
Q

Addressing the Vanishing Gradient Problem

A

Multi-level hierarchy
- train sub-networks individually instead of the whole network
LSTM
Residual Networks
- ResNet, for object recognition
Better choice of activation function
- ReLU

30
Q

Gradient Checking

A

A debugging technique
Numerically check the derivatives computed during training
Useful for validating the code of a neural network
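A toy illustration of gradient checking with NumPy, comparing an analytic gradient to a central-difference estimate for a simple function (not any particular framework's API):

```
import numpy as np

# f(w) = sum(w^2), so the analytic gradient is 2w.
def f(w):
    return np.sum(w ** 2)

def analytic_grad(w):
    return 2 * w

w = np.random.randn(5)
eps = 1e-6
eye = np.eye(len(w))
numeric = np.array([(f(w + eps * eye[i]) - f(w - eps * eye[i])) / (2 * eps)
                    for i in range(len(w))])

# A tiny relative error suggests the gradient code is correct.
rel_error = (np.linalg.norm(numeric - analytic_grad(w))
             / np.linalg.norm(numeric + analytic_grad(w)))
print(rel_error)
```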
31
Q

L1 and L2 Regularization

A

L1 is the sum of the weights
L2 is the sum of the squares of the weights
Both are used to prevent overfitting

32
Q

L1 and L2 differences?

A

L1: sum of weights
- performs feature selection
- computationally inefficient
- sparse output

L2: sum of squares of weights
- all features remain considered, just weighted
- computationally efficient
- dense output

33
Q

Why choose L1 over L2?

A

Feature selection reduces dimensionality
- out of 100 features, maybe only 10 end up with non-zero coefficients
- the resulting sparsity can make up for L1's computational inefficiency
On the other hand, if you think all of the features are important, go for L2
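A minimal Keras sketch showing how an L1 or L2 penalty is attached to a layer's weights (layer sizes are arbitrary):

```
from tensorflow.keras import layers, regularizers

# Same layer, two different weight penalties.
l1_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.01))  # sparse weights, feature selection
l2_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.01))  # all weights kept, just shrunk
```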
34
Q

Confusion Matrix: TP / TN / FP / FN

A

Predicted Yes, Actual Yes - True Positive
Predicted Yes, Actual No - False Positive
Predicted No, Actual Yes - False Negative
Predicted No, Actual No - True Negative

35
Q

Multi-class confusion matrix

A

Often drawn with a heat map
Useful for multi-class classification
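A quick sketch with scikit-learn and matplotlib, using made-up labels, that builds a multi-class confusion matrix and renders it as a heat map:

```
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Made-up labels; rows = actual class, columns = predicted class.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Render as a heat map, which is handy for multi-class problems.
plt.imshow(cm, cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.colorbar()
plt.show()
```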
36
Q

Precision

A

TP / (TP + FP)
Correct positives over everything nominated as positive
AKA:
- percent of relevant results
- correct positives
Use when false positives matter
- e.g. medical screening, drug testing

37
Q

Recall

A

TP / (TP + FN)
Correct positives over everything that was actually positive
AKA:
- Sensitivity, True Positive rate, Completeness
- % of actual positives correctly identified
Good for when false negatives are costly
- e.g. fraud detection

38
Q

F1

A

2TP / (2TP + FP + FN)
= 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean of precision and sensitivity (recall)
Use when you care about both precision and recall
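These three metrics computed with scikit-learn on made-up binary labels:

```
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up binary labels.
y_true = [1, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```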
39
Q

Specificity

A

TN / (TN + FP)
True Negative rate

40
Q

RMSE

A

Root mean squared error
An accuracy measurement
Only cares about right and wrong answers

41
Q

ROC

A

Receiver Operating Characteristic curve
Plots the true positive rate vs. the false positive rate at various threshold settings
Points above the diagonal represent classification better than random
The more the curve bends toward the upper-left, the better
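A scikit-learn sketch with made-up scores; roc_curve gives the curve points and roc_auc_score summarizes it (0.5 is random, 1.0 is perfect):

```
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))      # area under that curve
```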
43
Q

Ensemble Learning

A

An ensemble model takes multiple models (which might just be variations of the same model) and lets them all vote on the final result
Bagging and Boosting

44
Q

Are Decision Trees prone to overfitting?

A

Yes, they are!

45
Q

Bagging

A

Generate multiple training sets by random sampling with replacement
Each resampled model can be trained in parallel
The ensemble ends up more robust than a single model

46
Q

Boosting

A

Works in a serial manner (vs. the parallel approach of bagging)
Assigns a weight to each observation in the dataset
Training is sequential, starting with equal weights for each observation
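A scikit-learn sketch contrasting the two on toy data (GradientBoostingClassifier stands in for boosting here; XGBoost works the same way conceptually):

```
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier

# Toy data just to show the two ensemble styles side by side.
X, y = make_classification(n_samples=500, random_state=0)

# Bagging: resampled training sets, models trained in parallel, majority vote
# (the default base estimator is a decision tree).
bagged = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Boosting: trees are added sequentially, each one correcting the errors of the last.
boosted = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
```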
47
Q

Bagging vs Boosting

A

XGBoost (boosting) is the hot algorithm today

48
Q

XGBoost's strength

A

Accuracy

49
Q

What is Bagging good for?

A

Avoiding overfitting (it has a regularization effect)
Bagging is also easier to parallelize
50
Q

An ideal file format for SageMaker to fetch data from S3?

A

RecordIO / Protobuf

51
Q

Can you use SageMaker within Spark?

A

Yes, you can

52
Q

SageMaker Neo

A

For deploying models to edge devices

53
Q

SageMaker | Linear Learner

A

Linear regression
- Numeric prediction
- Classification (binary / multi-class)
54
Q

Linear Learner input format

A

Performant option:
- RecordIO-wrapped protobuf (float32 only)
CSV
- first column assumed to be the label
File or Pipe mode both supported

55
Q

File vs Pipe mode

A

File mode copies the data to the training fleet
Pipe mode streams only the data that's required, which is why it's more efficient

56
Q

Linear Learner Preprocessing

A

Data should be normalized
- so all features are weighted the same
- Linear Learner can do this for you optionally
Shuffle the input data
57
Q

Linear Learner Training

A

Uses SGD
- choice of optimization algorithms: Adam, AdaGrad, SGD, etc.
Multiple models are optimized in parallel
Tune L1, L2 regularization

58
Q

Linear Learner Validation

A

The most optimal model is selected

59
Q

Linear Learner Hyperparameters

A

balance_multiclass_weights
- gives each class equal importance in the loss function
learning_rate, mini_batch_size
l1
- L1 regularization
wd
- weight decay (L2 regularization)
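A sketch of training Linear Learner with the SageMaker Python SDK; the role ARN, bucket paths, and instance choice below are placeholders, not values from this deck:

```
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Sketch only: assumes AWS credentials and a SageMaker execution role exist.
session = sagemaker.Session()
container = image_uris.retrieve("linear-learner", session.boto_region_name)

linear = Estimator(
    image_uri=container,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/linear-learner/output",     # placeholder
    sagemaker_session=session,
)
linear.set_hyperparameters(
    predictor_type="binary_classifier",
    learning_rate=0.01,
    mini_batch_size=200,
    l1=0.0,    # L1 regularization
    wd=0.01,   # weight decay (L2 regularization)
)
linear.fit({"train": "s3://my-bucket/linear-learner/train"})  # placeholder channel
```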
60
Q

Linear Learner Instance types

A

Single or multi-machine, CPU or GPU
Multi-GPU does not help in this case

61
Q

XGBoost

A

eXtreme Gradient Boosting
- a boosted group of decision trees
- new trees are made to correct the errors of previous trees
- uses gradient descent to minimize loss as new trees are added

62
Q

XGBoost industry trend

A

It's the talk of the town on Kaggle
It is also fast (not resource intensive)

63
Q

XGBoost is for Classification or Regression?

A

Both
It does regression as well, using regression trees
64
Q

XGBoost input format

A

CSV or libsvm
No Protobuf here

65
Q

XGBoost, how is it used?

A

Models are serialized/deserialized with pickle
Can be used as a framework within notebooks
- sagemaker.xgboost
Or as a built-in SageMaker algorithm

66
Q

XGBoost Hyperparameters

A

Subsample
- prevents overfitting
Eta
- step size shrinkage, prevents overfitting
Gamma
- minimum loss reduction to create a partition; larger value = more conservative
Alpha
- L1 regularization term; larger value = more conservative
Lambda
- L2 regularization term; larger value = more conservative
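The same hyperparameters in a sketch using the open-source xgboost library on toy data (parameter names match the card; the values are arbitrary):

```
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy data; the point is the hyperparameters named on the card.
X, y = make_classification(n_samples=500, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "subsample": 0.8,  # row subsampling, prevents overfitting
    "eta": 0.1,        # step size shrinkage
    "gamma": 1.0,      # minimum loss reduction to split; larger = more conservative
    "alpha": 0.5,      # L1 regularization term
    "lambda": 1.0,     # L2 regularization term
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```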
67
Q

XGBoost instance types

A

CPU only
It is memory-bound, not compute-bound
M4 is a good choice

68
Q

Seq2Seq use cases

A

Input is a sequence of tokens, output is a sequence of tokens
Machine Translation
Text Summarization
Speech to Text
Implemented with RNNs and CNNs with attention

69
Q

Seq2Seq training input

A

RecordIO-Protobuf
- tokens must be integers (unusual, since most algorithms want floating-point data)
Start with tokenized text files
Convert to Protobuf using sample code
- packs into integer tensors with vocabulary files
- a lot like the TF-IDF lab
Must provide:
- training data
- validation data
- vocabulary files
70
Q

Are there any pre-trained models for SageMaker seq2seq?

A

Yes, there are many
Public training datasets are also available for specific translation tasks

71
Q

Seq2Seq Hyperparameters

A

batch_size
optimizer_type
- adam
- sgd
- rmsprop
learning_rate
num_layers_encoder / num_layers_decoder
Can optimize on:
- Accuracy: vs. a provided validation dataset
- BLEU score: compares against multiple reference translations
- Perplexity: cross-entropy

72
Q

Seq2Seq instance type

A

GPU only (e.g. P3)
Only a single machine, but it can have multiple GPUs
73
Q

DeepAR

A

Forecasting one-dimensional time-series data
Uses RNNs
Allows training the same model over several related time series
Finds frequencies and seasonality

74
Q

DeepAR training input

A

JSON Lines format
- GZIP or Parquet
Each record must contain:
- Start: the starting timestamp
- Target: the time-series values
Each record can contain:
- Dynamic_feat: dynamic features (such as whether a promotion was applied to a product in a time series)
- Cat: categorical features
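A sketch of writing such records as gzipped JSON Lines in Python; the timestamps, values, and file name are made up:

```
import gzip
import json

# Made-up records in the JSON Lines layout described above:
# "start" and "target" are required; "dynamic_feat" and "cat" are optional.
records = [
    {
        "start": "2024-01-01 00:00:00",
        "target": [5.0, 7.0, 6.5, 8.2],
        "dynamic_feat": [[0, 1, 1, 0]],  # e.g. promotion applied at each time step
        "cat": [0],                      # e.g. product category
    },
]

with gzip.open("train.json.gz", "wt") as f:  # placeholder file name
    for record in records:
        f.write(json.dumps(record) + "\n")
```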
75
Q

DeepAR | Which part of the data to use for input?

A

Always include the entire time series for training, testing, and inference
Use the entire dataset as the test set; remove the last time points for training, and evaluate on the withheld values
Don't use very large values for prediction length (> 400)
Train on many time series, not just one, when possible

76
Q

Maximum recommended prediction length in DeepAR

A

400 data points into the future

77
Q

DeepAR Hyperparameters
context_length? Others:
- epochs
- mini_batch_size
- learning_rate
- num_cells

A

context_length is the number of time points the model sees before making a prediction
- it can be smaller than the seasonalities; the model will lag one year anyhow
78
Q

DeepAR instance types

A

CPU or GPU
Single or multi-machine
Start with CPU (C4.2xlarge, C4.4xlarge), which is more economical
Move to GPU only if necessary
CPU-only for inference
May need larger instances for hyperparameter tuning

79
Q

BlazingText | good for?

A

Text classification
- predicts labels for sentences (not entire documents)
- supervised
- useful in web search and information retrieval
Word2vec
- word embeddings
- useful for NLP, but is not an NLP algorithm itself
- machine translation, sentiment analysis
- works on individual words, not sentences or documents

80
Q

BlazingText input

A

Text classification:
- one sentence per line
- __label__{label} {sentence}
- an augmented manifest text format is also possible: {"source":"", "label":""}
Word2vec:
- a text file with one training sentence per line
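A sketch of producing both input styles in Python; the file names and sentences are made up:

```
# Text classification (supervised): one sentence per line, label prefixed.
with open("classification.txt", "w") as f:   # placeholder file name
    f.write("__label__positive i loved this movie\n")
    f.write("__label__negative the plot made no sense\n")

# word2vec (unsupervised): one tokenized training sentence per line.
with open("word2vec.txt", "w") as f:         # placeholder file name
    f.write("the quick brown fox jumps over the lazy dog\n")
```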
81
Q

Word2vec in BlazingText: what modes of operation are available?

A

CBOW (continuous bag of words)
- the order of the words is thrown out; just the words themselves matter
Skip-gram
- n-grams; the order of the words does matter
Batch skip-gram
- distributes computation over many CPU nodes

82
Q

BlazingText hyperparameters

A

Word2vec:
- mode (batch_skipgram, skipgram, cbow)
- learning_rate
- window_size
- vector_dim
- negative_samples

Text classification:
- epochs
- learning_rate
- word_ngrams
- vector_dim

83
Q

BlazingText instance types

A

cbow, skipgram:
- ml.p3.2xlarge; any single CPU or single GPU instance will work
batch_skipgram:
- single or multiple CPU instances (scales horizontally)
Text classification:
- C5 if training data is under 2 GB
- for larger datasets, a single GPU instance (ml.p2/3.xlarge)
84
Q

Object2Vec

A

Like word2vec, but for arbitrary objects
Creates low-dimensional dense embeddings of high-dimensional objects
Compute nearest neighbors of objects
Visualize clusters
Genre prediction
Recommendations (similar items to users)
Unsupervised

85
Q

Does Object2Vec perform unsupervised or supervised learning?

A

Unsupervised, so you don't need to train it with labels
It can automatically figure out what is similar based on the inherent structure of the data within its features

86
Q

Object2Vec inputs

A

Data must be tokenized into integers
Consists of pairs of tokens and/or sequences of tokens:
- sentence - sentence
- labels - sequence (genre to description)
- product - product
- user - item
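A sketch of Object2Vec-style training records as JSON Lines, assuming the in0/in1 field convention; the token IDs, labels, and file name are made up:

```
import json

# Each record pairs two tokenized (integer-encoded) inputs with an optional label.
pairs = [
    {"label": 1, "in0": [6, 17, 606], "in1": [16, 21, 13]},  # e.g. sentence - sentence
    {"label": 0, "in0": [22, 1016],   "in1": [32, 9]},       # e.g. user - item
]

with open("object2vec_train.jsonl", "w") as f:  # placeholder file name
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```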
87
Q

Object2Vec encoder choices?

A

Average-pooled embeddings
CNN
Bidirectional LSTM

88
Q

Object2Vec, how to use?

A

Process the data into JSON Lines and shuffle it
Train with two input channels, two encoders, and a comparator
Choose an encoder for each channel
The comparator is followed by a feed-forward neural network

89
Q

Object2Vec hyperparameters

A

The usual deep-learning knobs:
- dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
Encoders (hcnn, bilstm, pooled_embedding):
- enc1_network
- enc2_network

90
Q

Object2Vec instance type

A

Single machine, CPU or GPU
- ml.m5.2xlarge
- ml.m5.4xlarge
- ml.m5.12xlarge
- GPU: ml.p2.xlarge
Multi-GPU is OK

91
Q

Object2Vec inference instance type

A

Use ml.p3.2xlarge
Use the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression