Boracchi Flashcards
(11 cards)
What is image classification?
Classification task whose input consists of images of dimensions H×W×C
H = height (in pixels)
W = width (in pixels)
C = channels, usually the number of (color) levels composing the image (RGB => C=3, hyperspectral => C>3)
What is a local (spatial) transformation?
How is it computed if linear?
Operation performed on each pixel (r,c) of the input image:
OUT(r,c) = T_U[IN](r,c)
The output at pixel (r,c) is given by a transformation T (linear or non-linear) applied to the pixels of IN in a neighborhood U of (r,c), where U is the set of displacements (u,v) defining the region centered at (r,c)
NB: the position of the output with respect to the other pixels doesn't change
If linear: OUT(r,c) = Σ_(u,v)∈U W(u,v)·IN(r+u, c+v)
where the weights W(u,v) can be interpreted as a filter applied over the original pixels
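A minimal NumPy sketch of the linear case (hypothetical example values; boundary pixels are skipped rather than padded):

```python
import numpy as np

def linear_transform(IN, W):
    """OUT(r,c) = sum over (u,v) of W(u,v) * IN(r+u, c+v), with U centered at (r,c)."""
    Hf, Wf = W.shape
    du, dv = Hf // 2, Wf // 2
    H, Wd = IN.shape
    OUT = np.zeros_like(IN, dtype=float)
    for r in range(du, H - du):                          # boundary pixels left at 0
        for c in range(dv, Wd - dv):
            U = IN[r - du:r + du + 1, c - dv:c + dv + 1]  # neighborhood U of (r,c)
            OUT[r, c] = np.sum(W * U)
    return OUT

img = np.arange(25, dtype=float).reshape(5, 5)           # toy 5x5 image
box = np.ones((3, 3)) / 9.0                              # 3x3 averaging filter
print(linear_transform(img, box))
```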
What is correlation among a filter W and an image IN and how is it used?
<IN, W>(r,c) = ΣΣ W(u,v)·IN(r+u, c+v), summed over a square neighborhood centered at (r,c)
Used for template matching: the location where the correlation with the template W is highest identifies the best match
NB: normalization is needed to avoid spurious maxima from saturation to max/min values (eg: uniformly white/black regions); see the sketch below
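A minimal NumPy sketch of normalized cross-correlation for template matching (hypothetical sizes; the template is cut from the image itself so the true location is known):

```python
import numpy as np

def ncc_map(IN, T):
    """Normalized cross-correlation of template T over image IN (valid positions only)."""
    Ht, Wt = T.shape
    Tn = T - T.mean()
    out = np.zeros((IN.shape[0] - Ht + 1, IN.shape[1] - Wt + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            P = IN[r:r + Ht, c:c + Wt]
            Pn = P - P.mean()                          # normalization step
            denom = np.linalg.norm(Tn) * np.linalg.norm(Pn)
            out[r, c] = (Tn * Pn).sum() / denom if denom > 0 else 0.0
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8))
tmpl = img[2:5, 3:6]                                   # template taken at (2, 3)
score = ncc_map(img, tmpl)
print(np.unravel_index(score.argmax(), score.shape))   # -> (2, 3): best match found
```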
How does an image classification task work using an ANN?
The input image is unfolded (flattened) into an H·W·C array
Output score of each class = weighted sum of the pixel values
The learned weights identify, for each class, a separating hyperplane (half-space) in feature space
NB: each class's weights can be reshaped into an image and interpreted as a template (see the sketch below)
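A minimal NumPy sketch (hypothetical, untrained weights) of scoring an unfolded image with a linear classifier:

```python
import numpy as np

H, W, C, n_classes = 32, 32, 3, 10
rng = np.random.default_rng(0)

img = rng.random((H, W, C))
x = img.reshape(-1)                                  # unfold into an H*W*C array
Wmat = rng.standard_normal((n_classes, H * W * C)) * 0.01  # hypothetical weights
b = np.zeros(n_classes)

scores = Wmat @ x + b                                # weighted sum of pixels per class
print(scores.argmax())                               # predicted class

templates = Wmat.reshape(n_classes, H, W, C)         # each class's weights as an image
```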
What are the main difficulties faced during image classification?
- High dimensionality of the dataset (eg: 32×32×3 = 3072 values per image)
- Label ambiguity: a single label may not fully describe an image (=> multi-label classification)
- (Small) transformations may DRASTICALLY change image values without changing the class of belonging (eg: saturation, contrast, deformations, viewpoint changes, occlusion, background clutter, scale) (=> training must be robust to transformations)
- Intra-class variability: totally different images belonging to same class
- Perceptual similarity: images at a small pixel-wise distance can be perceptually (very) different, and vice versa (=> K-NN helps if K»1, but is impractical because the whole training set must be stored and scanned at test time); see the sketch below
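A tiny NumPy illustration (random textured images, hypothetical sizes): a one-pixel shift leaves the content perceptually unchanged, yet its pixel-wise distance is about as large as the distance to an unrelated image:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))            # fine-textured image
shifted = np.roll(img, 1, axis=1)     # same content, shifted by 1 pixel
other = rng.random((32, 32))          # unrelated image

print(np.linalg.norm(img - shifted))  # large, despite perceptual similarity...
print(np.linalg.norm(img - other))    # ...comparable to an unrelated image
```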
What are features and how can they be found?
High-level patterns representing meaningful information, useful to reduce the dimensionality of the data
1. Hand-crafted: only for very simple and specific problems
PROS: allow embedding of prior knowledge, easy interpretation, require small dataset
CONS: difficult (or impossible) to design for even slightly complex tasks, not general/portable (=> overfitting risk)
2. Data-driven: state of art for feature extraction
PROS: high generality/portability, can capture deeply hidden patterns
CONS: usually not interpretable, difficult to embed prior knowledge
What is a convolutional NN (CNN) and how is it composed?
ANN designed to perform feature extraction on images
Convolution = linear transformation (same as correlation, but with the signs of the displacements switched: IN(r−u, c−v)) used to reduce the volume (=> #parameters) of the input as CNN depth increases
Composed by:
1. Convolutional layers: provide a linear combination of pixel values by applying filters over the whole input image
NB: #parameters = #filters·(Hᶠ·Wᶠ·C+1)
2. Activation layers: introduce non-linearities with scalar functions
Don’t change volume size
3. Pooling layers: reduce volume operating on each channel independently
4. Dense layers: flatten the input (spatial structure lost) and compose an MLP
NB: #parameters = #OUT·(#IN+1)
Usually CNN = feature extraction network (FEN) + FC = (convolutional–>activation–>pooling)⁺–>(dense)⁺ (see the sketch below)
Sub-sampling (which reduces the dimension of the volume) is performed by the activation (thresholding) and pooling (down-sampling) layers => the convolutional part (FEN) needs far fewer parameters than a pure MLP
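A minimal sketch in plain Python (hypothetical layer sizes; 'same' padding and 2×2 pooling assumed, so a 32×32 input shrinks to 8×8 after two poolings) applying the two parameter-count formulas above:

```python
def conv_params(n_filters, Hf, Wf, C):
    # #parameters = #filters * (Hf*Wf*C + 1), the +1 being each filter's bias
    return n_filters * (Hf * Wf * C + 1)

def dense_params(n_in, n_out):
    # #parameters = #OUT * (#IN + 1)
    return n_out * (n_in + 1)

# (conv -> activation -> pooling) x2 -> dense, on a 32x32x3 input
p1 = conv_params(8, 3, 3, 3)          # 224   (activation/pooling layers: 0 parameters)
p2 = conv_params(16, 3, 3, 8)         # 1168
p3 = dense_params(8 * 8 * 16, 10)     # 10250 (volume flattened to 8*8*16 = 1024)
print(p1 + p2 + p3)                   # 11642 parameters in total

# One dense layer straight from the unfolded 32x32x3 input:
print(dense_params(32 * 32 * 3, 10))  # 30730 -- already more than the whole CNN
```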
How does a convolutional layer work?
Padding is used to deal with image boundaries: valid (no padding), same (half padding = frame of ⌊Hᶠ/2⌋ pixels, eg 1 pixel for a 3×3 filter), full (full padding = frame of Hᶠ−1 = Wᶠ−1 pixels); output sizes are shown in the sketch below
Each filter has as many channels as its input (Cᶠ = C) and generates 1 single-channel output map (=> #output maps = #filters)
MEMO: #parameters = #filters·(Hᶠ·Wᶠ·C+1)
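A short sketch (plain Python; stride 1 and odd filter sizes assumed for 'same') of how each padding mode affects the output spatial size, via the standard formula H_out = (H + 2P − Hᶠ)/stride + 1:

```python
def conv_output_size(H, Hf, padding="valid", stride=1):
    # P = width of the frame added on each side, depending on the padding mode
    P = {"valid": 0, "same": (Hf - 1) // 2, "full": Hf - 1}[padding]
    return (H + 2 * P - Hf) // stride + 1

for mode in ("valid", "same", "full"):
    print(mode, conv_output_size(32, 3, mode))   # valid 30, same 32, full 34
```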
How does a pooling layer work?
2 types:
1. Max pooling: takes maximum value of pixels under filter area
2. Avg pooling: takes the average value of pixels under filter area
Filters are applied without overlapping (Hᶠ = Wᶠ = stride), so each spatial dimension shrinks by a factor equal to the stride while the number of channels is unchanged (see the sketch below)
NB: pooling layers have no parameters (nothing is learned)
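A minimal NumPy sketch of non-overlapping pooling on one channel (H and W assumed divisible by the pool size k):

```python
import numpy as np

def pool2d(x, k=2, mode="max"):
    """Non-overlapping k x k pooling on a single channel (stride = k)."""
    H, W = x.shape
    blocks = x.reshape(H // k, k, W // k, k)          # split into k x k tiles
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, "avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```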
What is the receptive field of an Hᵐ×Wᵐ output map?
Region of the input that affects the output map through the applied filters; it gets wider as the network gets deeper
For each layer, going backwards from the output map towards the input (see the sketch below):
1. Convolutional filter: Hʳᶠ = Hᵐ + Hᶠ − 1
2. Pooling layer: Hʳᶠ = Hᵐ·Hᵖ
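A small sketch applying the two recurrences backwards through a hypothetical stack of layers, to obtain the receptive field of a single output pixel (Hᵐ = 1):

```python
def receptive_field(layers, Hm=1):
    """Propagate the region size backwards; layers are ('conv', Hf) or ('pool', Hp)."""
    for kind, size in reversed(layers):
        if kind == "conv":
            Hm = Hm + size - 1        # Hrf = Hm + Hf - 1
        else:                         # "pool"
            Hm = Hm * size            # Hrf = Hm * Hp
    return Hm

# conv 3x3 -> pool 2x2 -> conv 3x3 -> pool 2x2
print(receptive_field([("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2)]))  # 10
```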
What are the main differences between a CNN and an MLP?
- Sparse connectivity: each output pixel is connected only to the input pixels in its receptive field
- Weight sharing: the weights don't change between applications of the same filter (=> all neurons in the slice generated by that filter share the same weights and bias => feature extraction is insensitive to the feature's location); see the sketch below
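A rough illustration of the savings from sparse connectivity and weight sharing: mapping a 32×32×3 input to a 32×32×8 volume with a fully-connected layer vs. an 8-filter 3×3 convolutional layer (hypothetical sizes):

```python
# Dense: every output neuron connects to every input pixel with its own weights
dense = (32 * 32 * 8) * (32 * 32 * 3 + 1)   # 25,174,016 parameters

# Conv: each output pixel sees only its 3x3xC receptive field, and all pixels
# of the same output map share one filter + bias (weight sharing)
conv = 8 * (3 * 3 * 3 + 1)                  # 224 parameters

print(dense, conv)
```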