Modeling Flashcards

1
Q

Types of Neural Networks

A

Feedforward
Convolutional Neural Network
Recurrent Neural Network

2
Q

Convolutional Neural Network (CNN)

A

Image Classification

3
Q

Recurrent Neural Network

A

for sequences
e.g. Stock Prices, Words in a sentence…

  • LSTM, GRU
4
Q

LSTM: full form

A

Long Short-Term Memory

5
Q

GRU: full form

A

Gated Recurrent Unit

6
Q

If your problem needs feature-location invariance, what should you use?

A

e.g. when you're not sure where the sign will appear in the image, use a CNN; convolution and pooling make feature detection location-invariant

7
Q

adversarial example

A

An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction.

Adversarial examples also exist for text models, e.g. sentiment analysis

8
Q

MaxPooling1D
MaxPooling2D
MaxPooling3D

A

distill the input down to the bare essence of what you need to analyse

9
Q

Conv1D
Conv2D
Conv3D

A

These layer types do the actual convolution

1D: e.g. text or time-series data
2D: e.g. images
3D: e.g. volumetric data

10
Q

Typical image processing with a CNN: what's the layer sequence?

A

Conv2D:
- does the actual convolution

MaxPooling2D:
- distills down and shrinks the image

Dropout:
- prevents overfitting

Flatten:
- flattens the data to feed it into a perceptron

Dense:
- hidden layer of neurons (the perceptron)

Dropout:
- again, to prevent overfitting

Softmax:
- chooses the final classification that comes out of the neural network

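As a rough Keras sketch of this stack (not from the card; the input shape and layer sizes are arbitrary illustration values):

# Hypothetical sketch of the CNN stack above; shapes and sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # does the convolution
    layers.MaxPooling2D((2, 2)),            # distills down / shrinks the image
    layers.Dropout(0.25),                   # prevents overfitting
    layers.Flatten(),                       # flatten to feed into the dense layers
    layers.Dense(128, activation='relu'),   # hidden layer of neurons
    layers.Dropout(0.5),                    # again, prevents overfitting
    layers.Dense(10, activation='softmax')  # softmax chooses the final classification
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
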
11
Q

Name some Specialised Architectures of CNN

A

LeNet-5:
- handwriting recognition

AlexNet:
- Image Classification, Deeper than LeNet

GoogLeNet:
- even deeper, but with better performance;
it uses Inception Modules

ResNet:
- Residual Network, even deeper but maintains performance using Skip Connections

12
Q

Recurrent Neural Network (RNN) topologies

A

Where sequence matters

  • Sequence to Sequence
  • Sequence to Vector
  • Vector to sequence
  • Encoder -> Decoder
13
Q

Sequence to Sequence NN

A

Input: a time series
Output: a time series

e.g. predicting future stock prices from historical prices

14
Q

Sequence to Vector NN

A

e.g. Words in a sentence to sentiments

15
Q

Vector to Sequence NN

A

e.g. produce a caption from an image

16
Q

Encoder Decoder

A

e.g.
Sequence to Vector to Sequence

Capture the words of a French sentence into a vector, then decode that vector into an English sentence (machine translation)

17
Q

Training RNN

A

Backpropagation runs both through the neural network and through time (backpropagation through time), so training is:

Really hard
Sensitive to hyperparameters
Resource intensive

18
Q

LSTM

A

maintains both long term and short term states

19
Q

GRU

A

Gated Recurrent Unit

Simplified LSTM

20
Q

What if you pick the wrong choices when training an RNN?

A

It might lead to an RNN that doesn't converge at all

21
Q

What does AWS offer for training a neural network?

A

Apache MXNet on EMR

P2, P3, G Instance types

Deep Learning AMI

22
Q

Major Components of Tuning a Neural Network? (hyperparameters)

A

Some knobs and dials:

  • Learning Rate
  • Batch size
  • epochs
23
Q

Learning Rate

A

The step size used by Gradient Descent (or other optimizers)

Too high LR:
- you overshoot the optimal solution

Too low LR:
- it takes too long to find the optimal solution

24
Q

Batch Size

A

Small batch sizes can work their way out of local minima more easily

Large Batch sizes can end up getting stuck in the wrong solution

With random shuffling at each epoch, this can show up as very inconsistent results from run to run

25
Q

Learning Rate and Training

A

Small LR will increase the training time

Large LR can overshoot the correct solution

26
Q

Regularization techniques: what do they do?

A

They prevent overfitting

27
Q

What can you do if you are overfitting?

A

try a simpler model
try fewer neurons
try fewer layers

Dropout:
- remove some neurons at random at each training step to force the model to spread its learning across the network

Early Stopping:
- stop training at the point where training accuracy keeps improving but validation accuracy does not

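A minimal Keras sketch of the early-stopping idea (not from the card; the monitored metric and patience value are assumptions):

# Hypothetical sketch: stop training once validation accuracy stops improving.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',       # watch validation accuracy, not training accuracy
    patience=3,                   # stop after 3 epochs with no improvement (assumed value)
    restore_best_weights=True)

# assuming a compiled model with Dropout layers, as in the CNN sketch earlier:
# model.fit(x_train, y_train, validation_split=0.2, epochs=50, callbacks=[early_stop])
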
28
Q

Vanishing Gradient Problem

A

The opposite of exploding gradients

Vanishing gradient is when the slope of the learning curve approaches zero: gradients become tiny as they propagate back through the layers, so training stalls

29
Q

Addressing Vanishing Gradient Problem

A

Multi-level hierarchy
- train sub-networks instead of the whole network

LSTM

Residual Network
- ResNet, for object recognition

Better choice of activation function
- e.g. ReLU

30
Q

Gradient Checking

A

a debugging technique

Numerically check the derivatives computed during training

Useful for validating code of neural network

31
Q

L1 and L2 Regularization

A

L1 is the sum of the (absolute) weights
L2 is the sum of the squares of the weights

Both are penalty terms added to the loss to prevent overfitting

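A quick Keras sketch of attaching these penalties to a layer (not from the card; the 0.01 strengths are arbitrary):

# Hypothetical sketch: L1 or L2 weight penalties on a Dense layer.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l1_layer = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l1(0.01))  # penalty on sum of |weights|
l2_layer = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l2(0.01))  # penalty on sum of weights squared
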
32
Q

L1 and L2 differences?

A

L1: Sum of weights

  • performs feature selection
  • Computationally inefficient
  • sparse output

L2: Sum of square of weights

  • All features remain considered. just weighted
  • computationally efficient
  • Dense output
33
Q

Why choose L1 over L2, then?

A

Feature selection reduces the dimensionality

  • out of 100 features, maybe only 10 end up with non-zero coefficients
  • the resulting sparsity can make up for its computational inefficiency

On the other hand, if you think all the features are important, go for L2

34
Q

Confusion Matrix
(TP, FP, FN, TN)

A

Predicted Yes, Actual Yes
- True Positive

Predicted Yes, Actual No
- False Positive

Predicted No, Actual Yes
- False Negative

Predicted No, Actual No
- True Negative

35
Q

Multi-class confusion matrix

A

Often rendered as a heat map; useful for multi-class classification problems

36
Q

Precision

A

TP / (TP + FP)
(what you captured over everything you nominated as positive)

AKA:

  • Percent of relevant results
  • Correct Positives

Use when false positives are costly
e.g. medical screening, drug testing

37
Q

Recall

A

TP / (TP + FN)
(what you captured over everything that actually was positive)

AKA:

  • Sensitivity, TP rate, Completeness
  • % of actual positives correctly predicted

Good for when false negatives are costly
- e.g. Fraud Detection

38
Q

F1

A

2TP / (2TP + FP + FN)

2 * (Precision * Recall) / (Precision + Recall)

Harmonic mean of precision and sensitivity (recall)

Use when you care about both precision and recall

39
Q

Specificity

A

TN / (TN + FP)

True Negative rate

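Putting the metric cards together, a small Python sketch computing them from made-up confusion-matrix counts:

# Illustrative metric calculations from confusion-matrix counts (made-up numbers).
TP, FP, FN, TN = 80, 10, 20, 90

precision   = TP / (TP + FP)   # fraction of nominated positives that were correct
recall      = TP / (TP + FN)   # fraction of actual positives that were captured
specificity = TN / (TN + FP)   # true negative rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(precision, recall, specificity, f1)
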
40
Q

RMSE

A

Root mean squared error
An accuracy measurement
It only cares about right and wrong answers overall, not the type of error (FP vs FN)

41
Q

ROC

A

Receiver Operating Characteristic curve

Plots true positive rate vs. false positive rate at various threshold settings

Points above the diagonal represent good classification
(better than random)

The more the curve bends toward the upper-left, the better

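A hedged scikit-learn sketch of producing a ROC curve and its AUC from placeholder labels and scores:

# Hypothetical ROC sketch with scikit-learn (placeholder labels and scores).
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TP rate vs FP rate at each threshold
auc = roc_auc_score(y_true, y_score)               # closer to 1.0 = more bent toward upper-left
print(auc)
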
43
Q

Ensemble Learning

A

An ensemble model takes multiple models (they might be just variations of the same model) and lets them all vote on the final result

Bagging and Boosting

44
Q

Are Decision Trees prone to overfitting?

A

yes they are!

45
Q

Bagging

A

Generate multiple training sets by random sampling with replacement

Each resampled model can be trained in parallel

The combined result ends up being more robust than a single model

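A minimal scikit-learn bagging sketch on toy data (parameters are illustrative; the default base estimator is a decision tree):

# Hypothetical bagging sketch: many trees trained on bootstrap resamples, voting on the result.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data
bag = BaggingClassifier(
    n_estimators=50,      # 50 resampled models, trainable in parallel
    bootstrap=True,       # sample with replacement
    n_jobs=-1,            # parallel training
    random_state=0)
bag.fit(X, y)
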
46
Q

Boosting

A

Works in a serial manner (vs. the parallel bagging)

It assigns a weight to each observation in the dataset

Training is sequential, starting with equal weights for each observation; misclassified observations get larger weights in later rounds, so subsequent models focus on them

47
Q

Bagging vs Boosting

A

Boosting generally gives better accuracy; bagging avoids overfitting and is easier to parallelize. XGBoost (boosting) is one hot algorithm today

48
Q

XGBoost’s strength

A

Accuracy

49
Q

What is Bagging good for?

A

avoid overfitting
having a regularization effect

Bagging is easier to parallelize

50
Q

What's an ideal file format for SageMaker to fetch training data from S3?

A

RecordIO-wrapped Protobuf

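A hedged sketch of producing recordIO-wrapped protobuf with a SageMaker Python SDK helper (the data, bucket, and key names are placeholders):

# Hypothetical sketch: convert numpy training data to recordIO-protobuf before uploading to S3.
import io
import numpy as np
import sagemaker.amazon.common as smac

X = np.random.rand(100, 5).astype('float32')         # toy features
y = np.random.randint(0, 2, 100).astype('float32')   # toy labels

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, X, y)  # recordIO-protobuf encoding
buf.seek(0)
# then upload, e.g.:
# boto3.resource('s3').Bucket('my-bucket').Object('train/data.rec').upload_fileobj(buf)
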
51
Q

Can you use SageMaker within Spark?

A

yes you can

52
Q

SageMaker Neo

A

to deploy to Edge Devices

53
Q

SageMaker

- Linear Learner

A

Linear Regression

  • Numeric Prediction
  • Classification (binary/multi-class)
54
Q

Linear Learner input format

A

Performant options:
- RecordIO-wrapped protobuf (float32 only)

CSV
- First column assumed to be the label

File or Pipe mode both supported

55
Q

File vs Pipe mode

A

File mode copies all the data to the training fleet up front

Pipe mode streams only the required data as needed, which is why it's more efficient

56
Q

Linear Learner Preprocessing

A

Normalize the data

  • so all features are weighted the same
  • Linear Learner can do this for you, optionally

Shuffle the data

57
Q

Linear Learner Training

A

Uses SGD
- optimization algorithms: Adam, AdaGrad, SGD, etc

Multiple models are optimized in parallel

Tune L1, L2 regularization

58
Q

Linear Learner Validation

A

The most optimal of the parallel-trained models is selected using validation data

59
Q

Linear Learner Hyperparameters

A
Balance_multiclass_weights
- gives each class equal importance in loss functions 

Learning_rate, mini_batch_size

L1
- Regularization

Wd
- Weight decay (L2 regularization)

60
Q

Linear Learner Instance types

A

Single or multi-machine CPU/GPU

Multi-GPU does not help in this case

61
Q

XGBoost

A

eXtreme Gradient Boosting

  • Boosted group of decision trees
  • New trees made to correct the errors of previous trees
  • Uses gradient descent to minimize loss as new trees are added
62
Q

XGBoost industry trend

A

It's the talk of the town on Kaggle (it wins a lot of competitions)

and it is also fast (not very resource intensive)

63
Q

XGBoost is for Classification or Regression?

A

Both

it does regression as well using regression trees

64
Q

XGBoost input format

A

CSV or Libsvm

no Protobuf here

65
Q

XGBoost, how is it used?

A

models are serialized/deserialized with pickle

can use a framework within notebooks
- sagemaker.xgboost

or as a built-in algorithm

66
Q

XGBoost Hyperparameters

A

Subsample
- Prevent overfitting

Eta
- Step size shrinkage, prevents overfitting

Gamma

  • Minimum loss reduction to create a partition
  • larger value = more conservative

Alpha

  • L1 regularization term
  • Larger value= more conservative

Lambda

  • L2 regularization term
  • Larger = more conservative
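
A rough sketch of these hyperparameters as a parameter dict for the open-source xgboost library (not from the card; the values are arbitrary):

# Hypothetical sketch: the hyperparameters above, passed to xgboost on toy data.
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'subsample': 0.8,   # row subsampling, helps prevent overfitting
    'eta': 0.1,         # step-size shrinkage
    'gamma': 1.0,       # minimum loss reduction to make a split; larger = more conservative
    'alpha': 0.5,       # L1 regularization term
    'lambda': 1.0,      # L2 regularization term
}
booster = xgb.train(params, dtrain, num_boost_round=50)
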
67
Q

XGBoost instance types

A

CPU only

It is memory-bound, not compute-bound

M4 is a good choice

68
Q

Seq2Seq use cases

A

input sequence of tokens
output sequence of tokens

Machine Translation
Text Summarization
Speech to text

Implemented with RNNs and CNNs with attention

69
Q

Seq2Seq training input

A

RecordIO-Protobuf
- Tokens must be integers (unusual, since most algorithms want floating-point data)

Start with Tokenized text files

Convert to Protobuf using sample code

  • Packs into integer tensors with vocabulary files
  • a lot like the TF-IDF lab

Must provide:

  • Training Data
  • Validation Data
  • Vocabulary files
70
Q

Are there any pre-trained models for SageMaker Seq2Seq?

A

Yes, there are many

Public training datasets are also available for specific translation tasks

71
Q

Seq2Seq Hyperparameters

A

Batch_size

Optimizer_type:

  • adam
  • sgd
  • rmsprop

Learning_rate

Num_layers_encoder/decoder

Can optimize on:

  • Accuracy: vs. provided validation dataset
  • BLEU score: compares against multiple reference translations
  • Perplexity: cross-entropy
72
Q

Seq2Seq instance type

A

GPU only (e.g. P3)

Only a single machine, but it can have multiple GPUs

73
Q

DeepAR

A

Forecasting one-dimensional time series data

uses RNN’s

Allows training the same model over several related time series

Finds frequencies and seasonality

74
Q

DeepAR training input

A

JSON lines format
- GZIP or Parquet

Each record must contain:

  • Start timestamp
  • Target

Each record can contain:

  • Dynamic_feat: dynamic features (such as, was a promotion applied to a product in a time series)
  • Cat: Categorical features
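
A hedged example of one training record as a JSON line (all values are made up):

# Hypothetical DeepAR training record written as a JSON line (values are made up).
import json

record = {
    "start": "2024-01-01 00:00:00",   # start timestamp
    "target": [5.0, 7.2, 6.1, 8.4],   # the time series values
    "dynamic_feat": [[0, 1, 0, 0]],   # e.g. whether a promotion was applied at each step
    "cat": [0],                       # categorical features
}
print(json.dumps(record))             # one record per line in the training file
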
75
Q

DeepAR

Which part of data for input?

A

Always include the entire time series for training, testing, and inference

Use the entire dataset as the test set; remove the last time points for training and evaluate on those withheld values

Don't use very large values for prediction length (> 400)

Train on many time series, not just one, when possible

76
Q

maximum recommended prediction length in DeepAR

A

400 data points in future

77
Q

DeepAR Hyperparameters

  • context_length ?

others

  • epoch
  • mini_batch_size
  • learning_rate
  • num_cells
A

Number of time points the model sees before making prediction

can be smaller than seasonalities, the model will lag one year anyhow

78
Q

DeepAR instance types

A

CPU / GPU
Single / Multi machine

Start with CPU for economy (C4.2xlarge / C4.4xlarge)
move to GPU only if necessary

CPU-only for inference
may need larger instances for tuning

79
Q

BlazingText

good for?

A
  • text classification
    predict labels for sentences (not documents)
    supervised
    useful in web searches, information retrieval
  • word2vec
    word embedding
    useful for NLP but is not an NLP algorithm
    Machine translation, Sentiment Analysis
    works on individual words, not sentences or documents
80
Q

BlazingText input

A
  • Text Classification
    one sentence per line
    __label__{label} {sentence}

also possible to use augmented manifest text format
{"source":"", "label":""}

  • word2vec
    a text file with one training sentence/line
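
A small sketch of writing supervised-mode (text classification) input lines in this format (the labels and sentences are made up):

# Hypothetical BlazingText text-classification input: one "__label__X sentence" per line.
samples = [
    ("positive", "the movie was great and i loved it"),
    ("negative", "the plot made no sense at all"),
]
with open("blazingtext.train", "w") as f:
    for label, sentence in samples:
        f.write(f"__label__{label} {sentence}\n")
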
81
Q

word2vec of BlazingText. what modes of operation are available?

A
  • CBOW (continuous bag of words)
    the order of the words is thrown out; only the words themselves matter
  • skip-gram
    uses n-grams; the order of the words does matter
  • batch skip-gram
    distributes computation over many CPU nodes
82
Q

BlazingText hyperparameters:

A
- Word2vec:
mode (batch_skipgram, skipgram, cbow)
learning_rate
window_size
vector_dim
negative_samples
- Text Classification: 
epoch
learning_rate
word_ngrams
vector_dim
83
Q

Blazing Text instance types

A
  • cbow, skipgram
    ml.p3.2xlarge
    any single CPU or single GPU instance works
  • batch_skipgram
    single or multiple cpu instances (scale horizontally)
  • text classification
    C5 for < 2GB training data
    larger dataset, single GPU instance (ml.p2/3.xlarge)
84
Q

Object2vec

A

Like word2vec but for arbitrary objects

low dimensional dense embedding of high-dimensional objects

compute nearest neighbors of objects

Visualize clusters

Genre predictions

Recommendations (similar items to users)

unsupervised

85
Q

Does Object2Vec perform unsupervised or supervised learning?

A

Unsupervised
so you don't need labels; it can automatically figure out similarities based on the inherent structure of the data and its features

86
Q

Object2vec inputs

A

data must be tokenized into integers

consist of pairs of tokens and/or sequences of tokens

  • sentence - sentence
  • labels - sequence ( genre to description)
  • product - product
  • user - item
87
Q

object2vec encoder choices?

A

Average-pooled embeddings
CNN
Bidirectional LSTM

88
Q

object2vec, how to use?

A

process data into JSON lines and shuffle it

train with two input channels, two encoders and a comparator

choose an encoder

Comparator is followed by a feed-forward neural network

89
Q

object2vec hyperparameters

A
dropout 
early stopping
epochs
learning rate
batch size
layers
activation function
optimizer
weight decay

Encoders: (hcnn, bilstm, pooled_embedding)
enc1_network
enc2_network

90
Q

object2vec instance type

A

Single machine CPU / GPU
Multi GPU is ok

ml.m5.2xlarge
ml.m5.4xlarge
ml.m5.12xlarge
ml.p2.xlarge

91
Q

object2vec inference instance type

A

Use ml.p3.2xlarge

Use the INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression