ML_Projects (Coursera) Flashcards
(29 cards)
Chain of assumptions - orthogonalization
Concept analogous to independent (orthogonalized) TV-set adjustment knobs: one knob, one problem to fix
- fit the training set well on the COST function; knobs: bigger network, better optimization algorithm
- fit the dev set well on the COST function; knobs: regularization, bigger training set
- fit the test set well on the COST function; knob: bigger dev set
- perform well in the real world (live settings); knobs: change the DEV set, change the COST function
Early-stopping cons
It is a ‘knob’ with 2 effects at once, which contradicts orthogonalization:
- it limits how well the training set is fit (training stops earlier)
- it simultaneously acts as regularization (improves DEV-set performance)
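A minimal sketch of early stopping in Keras, illustrating the two-effects point; the tiny model and random data are placeholders, not the course's example.

```python
import numpy as np
import tensorflow as tf

# Placeholder data, assumed for illustration only.
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# One knob, two effects: stops training early (weaker training-set fit)
# while acting as a regularizer (better dev-set fit).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```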
Using metric goals
- use a single metric for project evaluation; combined with a well-defined DEV set it speeds up iteration
e.g. F1 instead of separate precision and recall; averaging the error across geographic regions
- use one optimizing metric subject to one or more satisficing metrics (metrics that only need to beat a threshold)
- change the evaluation metric (and the DEV set) when it no longer ranks estimator performance correctly
e.g. add per-sample weights that increase the error on critical samples, sharpening the classifier’s discriminative power
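A minimal sketch of these two metric ideas; the function names and weights are hypothetical, not a course API.

```python
import numpy as np

def f1(precision: float, recall: float) -> float:
    # Harmonic mean: one optimizing metric instead of two competing ones.
    return 2 * precision * recall / (precision + recall)

def weighted_error(y_true, y_pred, weights) -> float:
    # weights > 1 on critical samples make their misclassification
    # count more, changing how the metric ranks classifiers.
    w = np.asarray(weights, dtype=float)
    mistakes = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return float((w * mistakes).sum() / w.sum())

print(f1(0.9, 0.6))                                      # single number to rank by
print(weighted_error([1, 0, 1], [1, 1, 0], [1, 10, 1]))  # heavy 2nd sample
```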
Using TRAIN/DEV/TEST sets
- the DEV set (used to optimize the estimator) and the TEST set (used to estimate the generalization error) must come from the SAME distribution (the ‘aiming at a target’ paradigm)
- for big data (e.g. 1M samples) split: 98% train, 1% dev, 1% test
- for normal data (e.g. thousands of samples) split: 60%-20%-20%
- pick the size of the TEST set large enough to give high confidence in the final evaluation
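A minimal sketch of the 98/1/1 split with numpy; the dataset is a random placeholder. Slicing one shuffled pool also keeps DEV and TEST on the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000_000, 10))            # placeholder big-data features
y = rng.integers(0, 2, size=len(X))

idx = rng.permutation(len(X))              # shuffle once, then slice
n = len(X) // 100                          # 1% each for dev and test
dev_i, test_i, train_i = idx[:n], idx[n:2 * n], idx[2 * n:]
X_train, y_train = X[train_i], y[train_i]  # ~98%
X_dev,   y_dev   = X[dev_i],   y[dev_i]    # 1%
X_test,  y_test  = X[test_i],  y[test_i]   # 1%
```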
Orthogonalization for metric
- define a metric that correctly captures what matters for the estimator (place the target)
- separately, figure out how to do well on that metric, i.e. optimize the estimator (aim at the target)
- if doing well on DEV set + metric does not translate into doing well in the application, change the metric and/or the DEV set
Human-level performance
- an algorithm may surpass human-level performance, but progress then slows as it approaches (without ever reaching) the Bayes optimal error
- in many natural-data tasks (vision, speech) human-level performance is very close to the Bayes optimal error
Improving algo to human-level performance
- get human-labeled data
- do manual error analysis: find out why a human does better and incorporate that insight into the algo
- better bias/variance analysis
Bias/variance
- apply bias-reduction tactics (different estimator, larger DL network) when the TRAIN error is far from the human-level error, used as a proxy for the Bayes error
- apply variance-reduction tactics (regularization, larger training set) when the DEV-set error is far from the TRAIN error
- avoidable bias = TRAIN err - HL err
- variance = DEV err - TRAIN err
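A minimal sketch of this diagnosis; the error values are made up for illustration.

```python
human_level_err = 0.01   # proxy for the Bayes error (assumed value)
train_err       = 0.08
dev_err         = 0.10

avoidable_bias = train_err - human_level_err   # 0.07
variance       = dev_err - train_err           # 0.02
if avoidable_bias > variance:
    print("prioritize bias reduction: bigger network, train longer")
else:
    print("prioritize variance reduction: regularization, more data")
```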
Human-level performance
The importance of HL performance lies in its use as a proxy for the Bayes error on human-perception tasks; in papers, the state-of-the-art result can serve as the same kind of proxy.
Once HL performance is surpassed, it becomes much more difficult to improve the ML algorithm further.
Techniques for supervised learning
- doing well on the training set (how good are the assumptions): small avoidable bias; if not, then:
  - train another model
  - train longer / use a better optimization algorithm (add momentum, RMSprop, Adam)
  - different NN architecture, hyperparameter search
- doing well on the DEV/test sets (generalizes well): small variance; if not, then:
  - more data
  - regularization: L2, dropout, data augmentation
  - different NN architecture, hyperparameter search
Simple error analysis
- examine ~100 randomly sampled misclassified examples (FP and FN) and count the different categories of errors, e.g. 5 dog images
- the relative percentages give a “ceiling” on how much performance could improve: fixing 5 of 100 takes the error from 10% to 9.5%; fixing 50 of 100 takes it from 10% to 5% (see the sketch below)
- the results suggest which improvement options are worth pursuing
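A minimal sketch of the tally and ceiling computation; the categories, counts and error rate are hypothetical.

```python
from collections import Counter

# One list of category tags per misclassified dev example (hand-assigned).
mistakes = [["dog"], ["blurry"], ["dog", "blurry"], ["great cat"], ["dog"]]
overall_err = 0.10                       # current dev error, assumed

counts = Counter(tag for tags in mistakes for tag in tags)
n = len(mistakes)
for tag, c in counts.most_common():
    ceiling = overall_err * (1 - c / n)  # best error if this category is fixed
    print(f"{tag}: {c}/{n} -> error could drop to ~{ceiling:.1%}")
```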
Incorrectly labeled samples - training set
- DL algos are robust to (near-)random label errors in the training set
- DL algos are not robust to systematic label errors in the training set, e.g. consistently mislabelling the same type of image (white dogs labelled as cats)
Incorrectly labeled samples - dev set
- fix labels when incorrect-label errors are a significant fraction of the overall DEV-set error, e.g. 0.6% out of a 2% total error, i.e. ~30% of all DEV errors
- a significant fraction of incorrect labels in the DEV set undermines its purpose of choosing between 2 models, as one can no longer trust the comparison
review corrected dev and test sets
- !!! make sure that they are still from the same distribution
- review examples the algo got right as well as ones it got wrong: 100 or 200 examples
- it’s ok if the TRAIN set distribution ends up slightly different (the train set is much larger and the algo is robust to this)
different training and testing data distributions - bad option
- e.g. sharp web images (large set, 200k) vs blurred mobile images (small set, 10k)
- bad option: mix and shuffle the two sets, then split randomly; in expectation the splits preserve the original ratio, so ~95% of the DEV set would still come from the large web set rather than the target distribution one actually wants to optimize against
different training and testing data distributions - good option
- e.g. sharp web images (large set, 200k) vs blurred mobile images (small set, 10k)
- good option: add half of the small set (5k) to the large training set (200k + 5k), then split the remaining 5k into DEV and TEST sets (2.5k and 2.5k)
- this ensures the model is optimized against the ‘target’ (the real-life app) and that DEV & TEST come from the same distribution!!!
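A minimal sketch of this split; the arrays are random placeholders standing in for the two image sets.

```python
import numpy as np

rng = np.random.default_rng(0)
sharp  = rng.random((200_000, 8))   # placeholder web (sharp) images
blurry = rng.random((10_000, 8))    # placeholder mobile (blurred) images

blurry = blurry[rng.permutation(len(blurry))]    # shuffle the small set
train = np.concatenate([sharp, blurry[:5000]])   # 205k, mostly web data
dev   = blurry[5000:7500]                        # 2.5k, target distribution
test  = blurry[7500:]                            # 2.5k, same distribution as dev
```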
handling mismatched data distribution for training and dev sets
- disentangle 2 effects on the results: generalization (variance) and data mismatch
- generalization/variance: carve a training-dev set out of the training set (same distribution, never trained on); the TRAIN vs training-dev gap then measures variance alone
- data mismatch: compare the training-dev results with the DEV results to spot this problem
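A minimal sketch of carving out a training-dev set; the array and the ~5% holdout size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((205_000, 8))     # placeholder training set

idx = rng.permutation(len(X_train))
n_td = len(X_train) // 20              # hold out ~5%, size is a judgment call
X_train_dev = X_train[idx[:n_td]]      # same distribution, never trained on
X_train_fit = X_train[idx[n_td:]]      # what the model actually sees
# err(train-dev) - err(train)  -> variance
# err(dev) - err(train-dev)    -> data mismatch
```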
performance/error levels - mismatched data distributions
- human level (or state-of-the-art) error: HLE
- training set error: TRE
- training-dev set error: TRDE
- dev set error: DE
- test set error: TSE
bias/variance - mismatched data distributions
avoidable bias = TRE - HLE
variance = TRDE - TRE
data mismatch = DE - TRDE
overfitting to dev set = TSE - DE
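A minimal sketch of the full decomposition; the five error values are made up.

```python
HLE, TRE, TRDE, DE, TSE = 0.01, 0.03, 0.08, 0.12, 0.12  # assumed numbers

gaps = {
    "avoidable bias":         TRE - HLE,
    "variance":               TRDE - TRE,
    "data mismatch":          DE - TRDE,
    "overfitting to dev set": TSE - DE,
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.0%}")   # the biggest gap is the one to attack
```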
table representation - bias/variance analysis data-mismatch
                             | TRAIN distribution data | DEV/TEST distribution data
human-level err              | HLE                     | (optional) HL err on dev/test
err on data trained on       | TRE                     |
err on data not trained on   | TRDE                    | DE or TSE
addressing data mismatch training/test sets
- do a manual error analysis to understand the differences between the distributions
- make the training data more similar, e.g. via artificial data synthesis; caveat: the model can overfit to a small synthesized component, e.g. 1 hour of car noise looped across all examples is too small a sample
- collect more training data in conditions similar to the DEV/TEST sets
building the first system
- set up a ‘target’: DEV + TEST set and a metric
- build the first model/system quick and dirty
- use bias/variance and error analysis to decide on next steps
- iterate, prioritizing and improving the system
- if the problem is fairly new, DON’T overthink!! just get it going
- if there is an existing body of knowledge, it is ok to start from that, but still DON’T overthink
transfer learning
- use a pre-trained DL network and re-train only the last layer (or last couple of layers) when the new dataset is small, keeping all the other layers’ weights fixed
- transfers from a problem with a lot of data to a problem with much less data
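A minimal sketch of transfer learning in Keras; MobileNetV2 stands in for any pre-trained base, and the single-unit head is a placeholder for the new task.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False                   # freeze all pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new last layer only
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(x_small, y_small, ...)       # small dataset for the new task
```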
multi-task learning
- used much less often than transfer learning
- uses one DL network to solve several tasks via multiple units in the output layer, e.g. detecting cars, pedestrians and traffic signs in an image
- it is a multi-label problem: use a per-label logistic-regression (sigmoid) loss instead of a softmax over a single label
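A minimal sketch of a multi-task head in Keras; the four tasks and the toy conv net are assumptions, not the course's network.

```python
import tensorflow as tf

n_tasks = 4  # e.g. car, pedestrian, stop sign, traffic light
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    # One sigmoid per task (multi-label), NOT a softmax over one label.
    tf.keras.layers.Dense(n_tasks, activation="sigmoid"),
])
# Binary cross-entropy applies per output unit, summing the per-task
# logistic losses.
model.compile(optimizer="adam", loss="binary_crossentropy")
```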