ML_Projects (Coursera) Flashcards
(29 cards)
Chain of assumptions - orthogonalization
Concept analogous to independent (orthogonalized) TV-set adjustment knobs: one knob, one problem to fix
- fit the training set well on the COST function; knobs: bigger network, better optimization algorithm
- fit the dev set well on the COST function; knobs: regularization, bigger training set
- fit the test set well on the COST function; knob: bigger dev set
- perform well in the real world (live settings); knobs: change the DEV set, change the COST function
Early-stopping cons
It is a ‘knob’ with 2 effects at once, which contradicts orthogonalization:
- it limits how well the training set is fit (training stops earlier)
- it simultaneously acts as regularization (improves DEV-set performance)
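A minimal sketch of early stopping in Keras, illustrating the two-effects point; the tiny model and random data are placeholders, not the course's example.

```python
import numpy as np
import tensorflow as tf

# Placeholder data, assumed for illustration only.
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# One knob, two effects: stops training early (weaker training-set fit)
# while acting as a regularizer (better dev-set fit).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```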
Using metric goals
- use a single metric for project evaluation; combined with a well-defined DEV set it speeds up iteration
e.g. F1 instead of separate precision and recall; averaging the error across geographic regions
- use one optimizing metric subject to one or more satisficing metrics (metrics that only need to beat a threshold)
- change the evaluation metric (and the DEV set) when it no longer ranks estimator performance correctly
e.g. add per-sample weights that increase the error on critical samples, sharpening the classifier’s discriminative power
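A minimal sketch of these two metric ideas; the function names and weights are hypothetical, not a course API.

```python
import numpy as np

def f1(precision: float, recall: float) -> float:
    # Harmonic mean: one optimizing metric instead of two competing ones.
    return 2 * precision * recall / (precision + recall)

def weighted_error(y_true, y_pred, weights) -> float:
    # weights > 1 on critical samples make their misclassification
    # count more, changing how the metric ranks classifiers.
    w = np.asarray(weights, dtype=float)
    mistakes = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return float((w * mistakes).sum() / w.sum())

print(f1(0.9, 0.6))                                      # single number to rank by
print(weighted_error([1, 0, 1], [1, 1, 0], [1, 10, 1]))  # heavy 2nd sample
```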
Using TRAIN/DEV/TEST sets
- the DEV set (used to optimize the estimator) and the TEST set (used to estimate the generalization error) must come from the SAME distribution (the ‘aiming at a target’ paradigm)
- for big data (e.g. 1M samples) split: 98% train, 1% dev, 1% test
- for normal data (e.g. thousands of samples) split: 60%-20%-20%
- pick the size of the TEST set large enough to give high confidence in the final evaluation
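A minimal sketch of the 98/1/1 split with numpy; the dataset is a random placeholder. Slicing one shuffled pool also keeps DEV and TEST on the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000_000, 10))            # placeholder big-data features
y = rng.integers(0, 2, size=len(X))

idx = rng.permutation(len(X))              # shuffle once, then slice
n = len(X) // 100                          # 1% each for dev and test
dev_i, test_i, train_i = idx[:n], idx[n:2 * n], idx[2 * n:]
X_train, y_train = X[train_i], y[train_i]  # ~98%
X_dev,   y_dev   = X[dev_i],   y[dev_i]    # 1%
X_test,  y_test  = X[test_i],  y[test_i]   # 1%
```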
Orthogonalization for metric
- define a metric that correctly captures what matters for the estimator (place the target)
- separately, figure out how to do well on that metric, i.e. optimize the estimator (aim at the target)
- if doing well on DEV set + metric does not translate into doing well in the application, change the metric and/or the DEV set
Human-level performance
- an algorithm may surpass human-level performance, but progress then slows as it approaches (without ever reaching) the Bayes optimal error
- in many natural-data tasks (vision, speech) human-level performance is very close to the Bayes optimal error
Improving algo to human-level performance
- get human-labeled data
- do manual error analysis: find out why a human does better and incorporate that insight into the algo
- better bias/variance analysis
Bias/variance
- apply bias-reduction tactics (different estimator, larger DL network) when the TRAIN error is far from the human-level error, used as a proxy for the Bayes error
- apply variance-reduction tactics (regularization, larger training set) when the DEV-set error is far from the TRAIN error
- avoidable bias = TRAIN err - HL err
- variance = DEV err - TRAIN err
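A minimal sketch of this diagnosis; the error values are made up for illustration.

```python
human_level_err = 0.01   # proxy for the Bayes error (assumed value)
train_err       = 0.08
dev_err         = 0.10

avoidable_bias = train_err - human_level_err   # 0.07
variance       = dev_err - train_err           # 0.02
if avoidable_bias > variance:
    print("prioritize bias reduction: bigger network, train longer")
else:
    print("prioritize variance reduction: regularization, more data")
```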
Human-level performance
The importance of HL performance lies in its use as a proxy for the Bayes error on human-perception tasks; in papers, the state-of-the-art result can serve as the same kind of proxy.
Once HL performance is surpassed, it becomes much more difficult to improve the ML algorithm further.
Techniques for supervised learning
- doing well on the training set (how good are the assumptions): small avoidable bias; if not, then:
  - train another model
  - train longer / use a better optimization algorithm (add momentum, RMSprop, Adam)
  - different NN architecture, hyperparameter search
- doing well on the DEV/test sets (generalizes well): small variance; if not, then:
  - more data
  - regularization: L2, dropout, data augmentation
  - different NN architecture, hyperparameter search
Simple error analysis
- examine ~100 randomly sampled misclassified examples (FP and FN) and count the different categories of errors, e.g. 5 dog images
- the relative percentages give a “ceiling” on how much performance could improve: fixing 5 of 100 takes the error from 10% to 9.5%; fixing 50 of 100 takes it from 10% to 5% (see the sketch below)
- the results suggest which improvement options are worth pursuing
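A minimal sketch of the tally and ceiling computation; the categories, counts and error rate are hypothetical.

```python
from collections import Counter

# One list of category tags per misclassified dev example (hand-assigned).
mistakes = [["dog"], ["blurry"], ["dog", "blurry"], ["great cat"], ["dog"]]
overall_err = 0.10                       # current dev error, assumed

counts = Counter(tag for tags in mistakes for tag in tags)
n = len(mistakes)
for tag, c in counts.most_common():
    ceiling = overall_err * (1 - c / n)  # best error if this category is fixed
    print(f"{tag}: {c}/{n} -> error could drop to ~{ceiling:.1%}")
```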
Incorrectly labeled samples - training set
- DL algos are robust to (near-)random label errors in the training set
- DL algos are not robust to systematic label errors in the training set, e.g. consistently mislabelling the same type of image (white dogs labelled as cats)
Incorrectly labeled samples - dev set
- fix labels when incorrect-label errors are a significant fraction of the overall DEV-set error, e.g. 0.6% out of a 2% total error, i.e. ~30% of all DEV errors
- a significant fraction of incorrect labels in the DEV set undermines its purpose of choosing between 2 models, as one can no longer trust the comparison
review corrected dev and test sets
- !!! make sure that they are still from the same distribution
- review examples the algo got right as well as ones it got wrong: 100 or 200 examples
- it’s ok if the TRAIN set distribution ends up slightly different (the train set is much larger and the algo is robust to this)
different training and testing data distributions - bad option
- e.g. sharp web images (large set, 200k) vs blurred mobile images (small set, 10k)
- bad option: mix and shuffle the two sets, then split randomly; in expectation the splits preserve the original ratio, so ~95% of the DEV set would still come from the large web set rather than the target distribution one actually wants to optimize against
different training and testing data distributions - good option
- e.g. sharp web images (large set, 200k) vs blurred mobile images (small set, 10k)
- good option: add half of the small set (5k) to the large training set (200k + 5k), then split the remaining 5k into DEV and TEST sets (2.5k and 2.5k)
- this ensures the model is optimized against the ‘target’ (the real-life app) and that DEV & TEST come from the same distribution!!!
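A minimal sketch of this split; the arrays are random placeholders standing in for the two image sets.

```python
import numpy as np

rng = np.random.default_rng(0)
sharp  = rng.random((200_000, 8))   # placeholder web (sharp) images
blurry = rng.random((10_000, 8))    # placeholder mobile (blurred) images

blurry = blurry[rng.permutation(len(blurry))]    # shuffle the small set
train = np.concatenate([sharp, blurry[:5000]])   # 205k, mostly web data
dev   = blurry[5000:7500]                        # 2.5k, target distribution
test  = blurry[7500:]                            # 2.5k, same distribution as dev
```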
handling mismatched data distribution for training and dev sets
- disentangle 2 effects on the results: generalization (variance) and data mismatch
- generalization/variance: carve a training-dev set out of the training set (same distribution, never trained on); the TRAIN vs training-dev gap then measures variance alone
- data mismatch: compare the training-dev results with the DEV results to spot this problem
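A minimal sketch of carving out a training-dev set; the array and the ~5% holdout size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((205_000, 8))     # placeholder training set

idx = rng.permutation(len(X_train))
n_td = len(X_train) // 20              # hold out ~5%, size is a judgment call
X_train_dev = X_train[idx[:n_td]]      # same distribution, never trained on
X_train_fit = X_train[idx[n_td:]]      # what the model actually sees
# err(train-dev) - err(train)  -> variance
# err(dev) - err(train-dev)    -> data mismatch
```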
performance/error levels - mismatched data distributions
- human level (or state-of-the-art) error: HLE
- training set error: TRE
- training-dev set error: TRDE
- dev set error: DE
- test set error: TSE
bias/variance - mismatched data distributions
avoidable bias = TRE - HLE
variance = TRDE - TRE
data mismatch = DE - TRDE
overfitting to dev set = TSE - DE
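A minimal sketch of the full decomposition; the five error values are made up.

```python
HLE, TRE, TRDE, DE, TSE = 0.01, 0.03, 0.08, 0.12, 0.12  # assumed numbers

gaps = {
    "avoidable bias":         TRE - HLE,
    "variance":               TRDE - TRE,
    "data mismatch":          DE - TRDE,
    "overfitting to dev set": TSE - DE,
}
for name, gap in gaps.items():
    print(f"{name}: {gap:.0%}")   # the biggest gap is the one to attack
```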
table representation - bias/variance analysis data-mismatch
                             | TRAIN distribution data | DEV/TEST distribution data
human-level err              | HLE                     | (optional) HL err on dev/test
err on data trained on       | TRE                     |
err on data not trained on   | TRDE                    | DE or TSE
addressing data mismatch training/test sets
- do a manual error analysis to understand the differences between the distributions
- make the training data more similar, e.g. via artificial data synthesis; caveat: the model can overfit to a small synthesized component, e.g. 1 hour of car noise looped across all examples is too small a sample
- collect more training data in conditions similar to the DEV/TEST sets
building the first system
- set up a ‘target’: DEV + TEST set and a metric
- build the first model/system quick and dirty
- use bias/variance and error analysis to decide on next steps
- iterate, prioritizing and improving the system
- if the problem is fairly new, DON’T overthink!! just get it going
- if there is an existing body of knowledge, it is ok to start from that, but still DON’T overthink
transfer learning
- use a pre-trained DL network and re-train only the last layer (or last couple of layers) when the new dataset is small, keeping all the other layers’ weights fixed
- transfers from a problem with a lot of data to a problem with much less data
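A minimal sketch of transfer learning in Keras; MobileNetV2 stands in for any pre-trained base, and the single-unit head is a placeholder for the new task.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", pooling="avg")
base.trainable = False                   # freeze all pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # new last layer only
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(x_small, y_small, ...)       # small dataset for the new task
```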
multi-task learning
- used much less often than transfer learning
- uses one DL network to solve several tasks via multiple units in the output layer, e.g. detecting cars, pedestrians and traffic signs in an image
- it is a multi-label problem: use a per-label logistic-regression (sigmoid) loss instead of a softmax over a single label
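A minimal sketch of a multi-task head in Keras; the four tasks and the toy conv net are assumptions, not the course's network.

```python
import tensorflow as tf

n_tasks = 4  # e.g. car, pedestrian, stop sign, traffic light
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    # One sigmoid per task (multi-label), NOT a softmax over one label.
    tf.keras.layers.Dense(n_tasks, activation="sigmoid"),
])
# Binary cross-entropy applies per output unit, summing the per-task
# logistic losses.
model.compile(optimizer="adam", loss="binary_crossentropy")
```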