When you shift the input image, the features also shift. The feature detections in the output maps move in the same way as you move the input.
Seen at the final output. Even if you move or translate the digit across the image, the network will still output that it is a four. There is some rotation invariance, but if you rotate the digit too much, the network will stop predicting that it is a four. There is also scale invariance, which means that if you increase or decrease the size of the digit, the network will still be able to predict the digit.
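A minimal sketch of this behaviour in one dimension, using plain Python lists. The signal, the kernel, and the helper names are made up for illustration; the point is only that shifting the input shifts the feature map by the same amount.

```python
def correlate1d(signal, kernel):
    """Valid cross-correlation, the operation a conv layer applies."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(values, steps):
    """Shift a list right by `steps`, padding with zeros on the left."""
    return [0] * steps + values[:len(values) - steps]


signal = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]  # toy input with one "feature"
kernel = [1, 2, 1]                       # hypothetical feature detector

features = correlate1d(signal, kernel)
features_of_shifted = correlate1d(shift_right(signal, 2), kernel)

# Moving the input by 2 positions moves the detected feature by 2 positions.
print(features_of_shifted == shift_right(features, 2))  # True
```

The equality holds exactly here because the feature stays away from the borders; near the edges of a real image, padding breaks it slightly.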
Conv layer 11x11 with stride 4 and ReLU activation -> max pooling layer 3x3 with stride 2 -> conv layer 5x5 with padding 2 -> max pooling layer 3x3 with stride 2 -> 3 conv layers 3x3 with padding 1 and ReLU activation -> pooling layer 3x3 -> fully connected 4096 -> fully connected 4096 -> fully connected 1000
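To see how these layers shrink the feature map, the spatial size after each conv or pooling layer can be worked out with floor((n + 2*pad - kernel) / stride) + 1. The sketch below assumes a 227x227 input and a stride of 2 for the last pooling layer; neither number is given in the notes.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer on an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1


# (description, kernel, stride, pad) for the conv and pooling layers
layers = [
    ("conv 11x11, stride 4", 11, 4, 0),
    ("max pool 3x3, stride 2", 3, 2, 0),
    ("conv 5x5, pad 2", 5, 1, 2),
    ("max pool 3x3, stride 2", 3, 2, 0),
    ("conv 3x3, pad 1", 3, 1, 1),
    ("conv 3x3, pad 1", 3, 1, 1),
    ("conv 3x3, pad 1", 3, 1, 1),
    ("pool 3x3, stride 2 (assumed)", 3, 2, 0),
]

size = 227  # assumed input width and height
sizes = []
for name, kernel, stride, pad in layers:
    size = out_size(size, kernel, stride, pad)
    sizes.append(size)
    print(f"{name:30s} -> {size}x{size}")
```

Under these assumptions the map goes 55, 27, 27, 13, 13, 13, 13, 6, so the first fully connected layer sees a 6x6 grid per channel.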
Used repeated 3x3 conv layers and 2x2 max pooling layers. The end result had a lot of parameters.
Used parallel filters of different sizes to get features at multiple scales. Used filter concatenation to combine the outputs of smaller 1x1 filters, 3x3 filters, 5x5 filters, and max pooling.
Even if a NN can perfectly model the world, we may not be able to find good weights that model the function.
Even if we find the best hypothesis, the one that minimizes training error, it doesn't mean we will be able to generalize well on the test set.
Given a NN architecture, the actual model that represents the real world may not be in that space. There may be no set of weights that model the real world.
Cases where transfer learning doesn't work well
If the source dataset you train on is very different from the target dataset, transfer learning is not as effective. If you have enough data for the target domain, transfer learning won’t perform better; it will just result in faster convergence.
Shows the sensitivity of the loss to individual pixel changes. Large sensitivity implies important pixels. In practice, instead of using the loss, we find the gradient of the classifier scores (pre-softmax) with respect to the input pixels. We then take the absolute value of the gradient. We can also sum across all channels.
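A rough sketch of the recipe: gradient of a pre-softmax score with respect to each pixel, absolute value, then a sum over channels. The toy score function and its weights are hypothetical, and the gradient is estimated with finite differences so the example needs no autograd library; a real network would use backprop instead.

```python
import math


def class_score(image, weights):
    """Toy pre-softmax score: tanh of a weighted sum over all pixels."""
    total = sum(
        image[c][y][x] * weights[c][y][x]
        for c in range(len(image))
        for y in range(len(image[0]))
        for x in range(len(image[0][0]))
    )
    return math.tanh(total)


def saliency_map(image, weights, eps=1e-5):
    """Sum over channels of |d score / d pixel|, via central differences."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    saliency = [[0.0] * width for _ in range(height)]
    for c in range(channels):
        for y in range(height):
            for x in range(width):
                original = image[c][y][x]
                image[c][y][x] = original + eps
                up = class_score(image, weights)
                image[c][y][x] = original - eps
                down = class_score(image, weights)
                image[c][y][x] = original  # restore the pixel
                saliency[y][x] += abs((up - down) / (2 * eps))
    return saliency


# 2 channels, 2x2 pixels; all values are made up for illustration.
image = [[[0.1, 0.2], [0.3, 0.4]], [[0.0, 0.1], [0.2, 0.3]]]
weights = [[[2.0, 0.5], [0.0, 0.0]], [[-1.0, 0.5], [0.1, 0.0]]]

saliency = saliency_map(image, weights)
for row in saliency:
    print([round(v, 3) for v in row])
```

The top-left pixel has the largest weights in both channels, so it comes out as the most important pixel; the bottom-right pixel has zero weight and gets zero saliency.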
Backprop for visualization
Normal backprop isn’t the best choice for visualization. We may pick up parts of the image that decrease the feature activation, and there are probably lots of such input pixels. Guided backprop can be used to improve visualizations: take the gradient and only pass back positive gradients. Since this changes how we do backprop, it is called guided backprop - we zero out zero and negative gradients.
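The difference shows up in the backward pass through a ReLU. A minimal sketch on plain lists, with made-up input and gradient values:

```python
def relu_backward(inputs, upstream):
    """Normal backprop: pass the gradient where the forward input was positive."""
    return [g if v > 0 else 0.0 for v, g in zip(inputs, upstream)]


def relu_guided_backward(inputs, upstream):
    """Guided backprop: also zero out zero and negative incoming gradients."""
    return [g if (v > 0 and g > 0) else 0.0 for v, g in zip(inputs, upstream)]


inputs = [1.5, -2.0, 0.7, 3.0]    # values that entered the ReLU (forward pass)
upstream = [0.4, 0.9, -0.6, 0.0]  # gradient arriving from the layer above

print(relu_backward(inputs, upstream))         # [0.4, 0.0, -0.6, 0.0]
print(relu_guided_backward(inputs, upstream))  # [0.4, 0.0, 0.0, 0.0]
```

The third entry is the one that changes: normal backprop lets the negative gradient through, guided backprop drops it.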
Inputs formed by applying small but intentionally worst-case perturbations to examples, such that the perturbed input results in the model outputting an incorrect answer.
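One common way to build such a perturbation is the fast gradient sign method (FGSM), which these notes do not name; it is used here only as an illustration. The sketch below applies it to a tiny logistic regression model whose weights and input are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(x, y, w, b, eps):
    """x_adv = x + eps * sign(dLoss/dx), with binary cross-entropy loss."""
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]  # gradient of the loss w.r.t. the input
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


w, b = [2.0, -3.0, 1.0], 0.0  # hypothetical trained weights
x, y = [0.3, 0.1, 0.2], 1     # an example with true label 1

x_adv = fgsm(x, y, w, b, eps=0.2)

print(round(predict(x, w, b), 3))      # above 0.5: classified as 1 (correct)
print(round(predict(x_adv, w, b), 3))  # below 0.5: classified as 0 (wrong)
```

Each feature moves by at most eps, yet the prediction flips, because every feature is pushed in the direction that increases the loss the most.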
Cross-entropy loss (CEL), mean squared error (MSE), L1, L2
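For reference, a small sketch of these four losses in plain Python. Conventions vary between libraries: here L1 and L2 are taken as sums over elements (absolute and squared differences) and MSE as the mean of the squared differences, which is an assumption rather than something the notes specify.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[target_index])


def mse(pred, target):
    """Mean of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    """Sum of absolute differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def l2(pred, target):
    """Sum of squared differences (some texts take the square root instead)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


pred, target = [2.0, 0.0, 1.0], [1.0, 0.0, 3.0]
print(l1(pred, target))   # 3.0
print(l2(pred, target))   # 5.0
print(mse(pred, target))  # 5.0 / 3
print(cross_entropy([0.25, 0.5, 0.25], 1))  # -ln(0.5)
```

Cross-entropy is the usual choice for classification, while MSE, L1, and L2 are typically used for regression targets such as bounding box coordinates.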
Object detection architectures