7. Model Building Flashcards

1
Q

What is data parallelism?

A

Data parallelism is when the dataset is split into parts and then assigned to parallel computational machines or graphics processing units (GPUs). A small batch of data is sent to every node, and the gradient is computed normally and sent back to the main node.

2
Q

What are the two strategies in data parallelism?

A

Synchronous training: each accelerator or GPU receives a different slice of the input data. Every GPU holds a complete copy of the model, trains only on its slice, and the gradients are aggregated across replicas at each step.
Asynchronous training: workers don’t have to wait for each other; each worker trains independently over the input data and updates the variables asynchronously.

3
Q

What is “all-reduce sync” strategy good for?

A

It works well for Tensor Processing Units (TPUs) and for multiple GPUs on one machine.

4
Q

What is tf.distribute.Strategy used for?

A

It is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs.
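
A minimal sketch of how tf.distribute.Strategy is used with Keras, assuming a single machine with one or more GPUs; the layer sizes and dummy data are made up for illustration.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # synchronous, single machine, multi-GPU
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables must be created inside the strategy scope so they are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data; model.fit splits each batch across the replicas.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=64, epochs=2)
```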

5
Q

What are the TensorFlow distributed training strategies?

A

MirroredStrategy: Synchronous distributed training on multiple GPUs on one machine.
CentralStorageStrategy: Synchronous training without mirroring; model variables are kept on the CPU and operations are replicated across the GPUs.
MultiWorkerMirroredStrategy: Synchronous distributed training across multiple workers (machines), each with potentially multiple GPUs.
TPUStrategy: Synchronous distributed training on multiple TPU cores.
ParameterServerStrategy: Some machines are designated as workers and some as parameter servers.

Hints: Monkeys Climb More Than Pandas.
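
A sketch of how these strategies are constructed. MultiWorkerMirroredStrategy, TPUStrategy, and ParameterServerStrategy also need cluster or TPU configuration (e.g., TF_CONFIG or a resolver) that is omitted here, and exact module paths can vary slightly between TensorFlow versions.

```python
import tensorflow as tf

mirrored = tf.distribute.MirroredStrategy()                     # one machine, multiple GPUs
central = tf.distribute.experimental.CentralStorageStrategy()   # variables on CPU, compute on GPUs
multi = tf.distribute.MultiWorkerMirroredStrategy()             # multiple workers, synchronous
# tpu = tf.distribute.TPUStrategy(tpu_cluster_resolver)         # requires a TPU and a resolver
# ps = tf.distribute.ParameterServerStrategy(cluster_resolver)  # workers + parameter servers

# Each is used the same way: create variables and models inside its scope.
with mirrored.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
```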

6
Q

What is model parallelism?

A

In model parallelism, the model itself is partitioned into parts (just as the data is split in data parallelism), and each part is placed on a different GPU. It is used to train a model that does not fit on a single GPU.

7
Q

What is the best architecture for synchronous distributed training in TensorFlow?

A

All-reduce architecture

8
Q

What is the best architecture for asynchronous distributed training in TensorFlow?

A

Parameter server architecture

9
Q

What are the tools for deploying TensorFlow models?

A

TensorFlow Serving (TF Serving) for serving models in production
TFLite for mobile and edge devices
TensorFlow.js for browsers
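
A sketch of preparing one trained Keras model for these three targets; the paths and the tiny model are illustrative assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# 1. TensorFlow Serving: export a SavedModel and point tensorflow_model_server at it.
tf.saved_model.save(model, "/tmp/my_model/1")

# 2. TFLite: convert the Keras model for mobile / edge devices.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open("/tmp/model.tflite", "wb") as f:
    f.write(converter.convert())

# 3. TensorFlow.js: converted outside Python with the tensorflowjs_converter CLI, e.g.
#    tensorflowjs_converter --input_format=tf_saved_model /tmp/my_model/1 /tmp/tfjs_model
```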

10
Q

What are convolutional neural networks (CNNs) usually used for?

A

Image classification

11
Q

What are recurrent neural networks (RNNs) usually used for?

A

They are designed to operate on sequences of data and can be used for text classification or for predicting values in a sequence, e.g., with a long short-term memory (LSTM) network. They are also used for time series and speech recognition.

12
Q

What do you use to train a neural network?

A

Stochastic gradient descent

13
Q

What is the goal of training a neural network?

A

To find a set of weights and biases that have low loss

14
Q

What is the loss in a neural network used for?

A

The loss is used to calculate the gradients.
Gradients are used to update the weights of the neural network.
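
A minimal sketch of this loss-to-gradients-to-weight-update step using stochastic gradient descent and tf.GradientTape; the model, batch of data, and learning rate are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

x = tf.random.normal((32, 3))  # one batch of examples
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_fn(y, predictions)                          # the loss ...
grads = tape.gradient(loss, model.trainable_variables)      # ... gives the gradients ...
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # ... which update the weights
```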

15
Q

What are the outputs of regression, binary classification and multiclass classification?

A

Regression: a numerical value
Binary classification: a binary value
Multiclass classification: a single label from multiple classes (single-label multiclass)

16
Q

What are the activation functions of regression, binary classification and multiclass classification?

A

Regression: one node with a linear activation unit
Binary classification: sigmoid activation unit
Multiclass classification: softmax activation function

17
Q

What are the loss functions of regression, binary classification and multiclass classification?

A

Regression: MSE
Binary classification: binary cross-entropy, categorical hinge loss, and squared hinge loss (Keras)
Multiclass classification: categorical cross-entropy and sparse categorical cross-entropy
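
A sketch mapping each task type to its typical output layer and loss in Keras; the layer sizes, input shape, and 10-class example are illustrative assumptions.

```python
import tensorflow as tf

# Regression: one linear output node + MSE.
reg = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="linear", input_shape=(8,))])
reg.compile(optimizer="adam", loss="mse")

# Binary classification: one sigmoid node + binary cross-entropy.
binary = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(8,))])
binary.compile(optimizer="adam", loss="binary_crossentropy")

# Multiclass classification: one softmax node per class + (sparse) categorical cross-entropy.
multi = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax", input_shape=(8,))])
multi.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```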

18
Q

When do you use sparse categorical cross-entropy and categorical cross-entropy?

A

Both apply when classes are mutually exclusive (each sample has exactly one class). Use sparse categorical cross-entropy when the labels are integer class indices, and categorical cross-entropy when the labels are one-hot encoded.
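
A minimal sketch of the two label formats; the three-class probabilities and labels are made up for illustration, and both calls print the same per-example loss values.

```python
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1], [0.1, 1.0, 3.0]])
probs = tf.nn.softmax(logits)

# sparse_categorical_crossentropy: labels are integer class indices.
int_labels = tf.constant([0, 2])
print(tf.keras.losses.sparse_categorical_crossentropy(int_labels, probs))

# categorical_crossentropy: labels are one-hot encoded.
one_hot_labels = tf.one_hot(int_labels, depth=3)
print(tf.keras.losses.categorical_crossentropy(one_hot_labels, probs))
```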

19
Q

What is gradient descent?

A

The gradient descent algorithm calculates the gradient of the loss curve at the starting point.
The gradient of the loss is equal to the derivative (slope) of the curve.
The gradient has both magnitude and direction (vector) and always points in the direction of the steepest increase in the loss function.
The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss.

20
Q

What is the learning rate?

A

Also known as the step size.
next point = current point - (learning rate x gradient)
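
A one-variable sketch of this update rule, using the illustrative loss L(w) = (w - 3)^2, whose gradient is 2(w - 3).

```python
w = 0.0             # starting point
learning_rate = 0.1

for step in range(5):
    gradient = 2 * (w - 3)            # slope of the loss at the current point
    w = w - learning_rate * gradient  # step in the direction of the negative gradient
    print(step, round(w, 4))          # w moves toward the minimum at w = 3
```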

21
Q

What is a batch?

A

In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration.

22
Q

What is batch size?

A

Batch size is the number of examples in a batch.

23
Q

What is an epoch?

A

An epoch is one complete pass through all of the training data. The number of iterations per epoch equals the number of training examples divided by the batch size; for example, 2,000 examples with a batch size of 100 means 20 iterations per epoch. A forward pass and a backward pass together count as one pass.

24
Q

What are examples of hyperparameters in a neural network?

A

The choice of loss function, the learning rate, the batch size, and the number of epochs are hyperparameters.

25
Q

What are the characteristics of different batch sizes?

A

Larger batch sizes take less time to train but can be less accurate (generalize worse).
If the batch size is too small, training will bounce around; if it is too large, training will take a very long time.
A very large batch size can also cause out-of-memory errors while training neural networks.

26
Q

What do small and large learning rate mean?

A

If the learning rate is too small, training will take ages; if it’s too large, training will bounce around and ultimately diverge.

27
Q

What is transfer learning?

A

Transfer learning is an optimization to save training time or get better performance. An available pretrained model can be used as a starting point for training your own model.
Transfer learning can enable you to develop models even for problems where you do not have very much data.
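
A common transfer-learning sketch in Keras: load a pretrained base, freeze it, and train a new head. MobileNetV2, the image size, and the five-class head are illustrative assumptions.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights="imagenet")
base.trainable = False  # freeze the pretrained weights; only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # new head for a 5-class problem
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```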

28
Q

When do you use semi-supervised learning?

A

When you don’t have enough labeled data to produce an accurate model and you don’t have the resources to get more data, you can use semi‐supervised techniques to increase the size of your training data.
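
One simple semi-supervised technique is pseudo-labeling: train on the labeled data, use the model to label the unlabeled data, keep only confident predictions, and retrain. This is a sketch; `model`, `x_labeled`, `y_labeled` (integer labels), and `x_unlabeled` are assumed to exist and the 0.95 threshold is illustrative.

```python
import numpy as np

model.fit(x_labeled, y_labeled, epochs=10)    # 1. train on the small labeled set

probs = model.predict(x_unlabeled)            # 2. predict on unlabeled examples
confident = probs.max(axis=1) > 0.95          # 3. keep only confident predictions
pseudo_labels = probs.argmax(axis=1)[confident]

x_combined = np.concatenate([x_labeled, x_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model.fit(x_combined, y_combined, epochs=10)  # 4. retrain on the enlarged training set
```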

29
Q

What is the limitation of semi-supervised training?

A

If the portion of labeled data isn’t representative of the entire distribution, the approach may fall short.

30
Q

What is data augmentation?

A

To get more data to train the neural network, you make minor alterations to your existing dataset, such as flips, translations, or rotations.

31
Q

What are the two types of data augmentation?

A

Offline: all augmentations are done before training.
Online: augmentations are done on the fly; preferred for large datasets.
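
A sketch of online (on-the-fly) augmentation with Keras preprocessing layers; the specific flips, rotations, translations, image size, and toy model are illustrative choices, and these layers only run during training.

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomTranslation(0.1, 0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    augment,  # augmentation happens per batch, during training only
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```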

32
Q

What is bias?

A

Bias is the difference between the average prediction of the model and the correct value we are trying to predict. In practice, it shows up as the error rate on the training data.

33
Q

What is variance?

A

The error rate on the test data is called variance.
A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before.

34
Q

What is the bias-variance trade-off?

A

You need to find the right balance without overfitting or underfitting the data. If your model is too simple and has very few parameters, it may have high bias and low variance. If your model has a large number of parameters, it is going to have high variance and low bias.

35
Q

What is underfitting?

A

An underfit model fails to sufficiently learn the problem: it performs poorly on the training dataset and also does not perform well on a test or validation dataset.
An underfit model has high bias and low variance.

36
Q

What are the reasons for model underfitting?

A

Data used for training is not cleaned.
The model has a high bias.

37
Q

How to reduce underfitting?

A

Increase model complexity.
Increase the number of features by performing feature engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of training to get better results.

38
Q

What is overfitting?

A

The model learns the training data too well and performance varies widely with new unseen examples or even statistical noise added to examples in the training dataset. An overfit model has low bias and high variance.

39
Q

How to reduce overfitting?

A

Reduce overfitting by training the network on more examples.
Reduce overfitting by changing the complexity of network structure and parameters.

40
Q

How to avoid overfitting?

A

Regularization techniques:
Dropout: Probabilistically remove inputs during training.
Noise: Add statistical noise to inputs during training.
Early stopping: Monitor model performance on a validation set and stop training when performance degrades.
Data augmentation.
Cross‐validation.

Hints: Dolphins Never Eat Apple Cores.
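
A sketch combining two of the techniques above, dropout and early stopping, in Keras; the architecture, dropout rate, and patience value are illustrative assumptions, and the training data in the commented fit call is assumed to exist.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.3),  # randomly drop 30% of activations during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])
```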

41
Q

What is regularization?

A

It tunes the loss function by adding a penalty term that prevents excessive fluctuation of the coefficients, thereby reducing the chances of overfitting.

42
Q

What is L1 regularization?

A

L1 (lasso) regularization shrinks the parameters toward 0 and can drive some weights to exactly 0.
Penalizes the sum of the absolute values of the weights.
Built-in feature selection.
Robust to outliers.
Reduces model size.

43
Q

What is L2 regularization?

A

L2 (ridge) regularization forces weights to be small but not exactly 0.
Penalizes the sum of the squares of the weights.
Does not perform feature selection.
Not robust to outliers.
Improves generalization in linear models.
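
A sketch of applying L1 and L2 regularization to Keras layers via kernel_regularizer; the 0.01 penalty strengths and layer sizes are illustrative.

```python
import tensorflow as tf

l1_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l1(0.01))  # penalizes sum of |w|; drives weights to 0

l2_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01))  # penalizes sum of w^2; keeps weights small

# The penalty terms are added to the model's loss automatically during training.
```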

44
Q

What are the common issues for neural network training?

A

Exploding gradients: gradients become too large for training to converge. Use batch normalization and/or a lower learning rate.
Dead ReLU units: once the weighted sum for a ReLU unit falls below 0, the unit can get stuck. Lowering the learning rate can help keep ReLU units from dying.
Vanishing gradients: the gradients for the lower layers (closer to the input) can become very small. When the gradients vanish toward 0 for the lower layers, these layers train very slowly or do not train at all. The ReLU activation function can help prevent vanishing gradients.
Dropout regularization to prevent overfitting: this type of regularization is useful for neural networks. It works by randomly “dropping out” unit activations in a network for a single gradient step. The more you drop out, the stronger the regularization: 0.0 means no dropout regularization, 1.0 means everything is dropped out and the model learns nothing, and values between 0.0 and 1.0 are the useful range.

45
Q

How to reduce training loss?

A

If the features don’t add information relative to existing features, try a different feature.
Decrease the learning rate.
Increase the depth and width of your layers
If you have lots of data, use held‐out test data.
If you have little data, use cross‐validation or bootstrapping.

46
Q

What are the possible reasons for model training not converging and bouncing around?

A

Features might not have predictive power.
Raw data might not comply with the defined schema.
The learning rate may be too high; decrease it.
As a debugging step, reduce your training set to a few examples and confirm you can obtain a very low loss.
Start with one or two features (and a simple model) that you know have predictive power and check whether the model outperforms your baseline.