Boracchi Flashcards
(11 cards)
What is image classification?
Classification task whose input consists of images of dimensions H×W×C
H = height (in pixels)
W = width (in pixels)
C = channels, usually the number of (color) levels composing the image (RGB => C=3, hyperspectral => C>3)
What is a local (spatial) transformation?
How is it computed if linear?
Operation performed on each pixel (r,c) of the input image:
OUT(r,c) = T_U[IN](r,c)
The output at pixel (r,c) is given by a transformation T (linear or non-linear) applied to the pixels of IN in a neighborhood U of (r,c), where U is the set of displacements (u,v) defining the region centered at (r,c)
NB: the position of the output with respect to the other pixels doesn't change
If linear: OUT(r,c) = Σ_(u,v)∈U W(u,v)·IN(r+u, c+v)
where the weights W(u,v) can be interpreted as a filter applied over the original pixels
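A minimal NumPy sketch of the linear case (hypothetical example values; boundary pixels are skipped rather than padded):

```python
import numpy as np

def linear_transform(IN, W):
    """OUT(r,c) = sum over (u,v) of W(u,v) * IN(r+u, c+v), with U centered at (r,c)."""
    Hf, Wf = W.shape
    du, dv = Hf // 2, Wf // 2
    H, Wd = IN.shape
    OUT = np.zeros_like(IN, dtype=float)
    for r in range(du, H - du):                          # boundary pixels left at 0
        for c in range(dv, Wd - dv):
            U = IN[r - du:r + du + 1, c - dv:c + dv + 1]  # neighborhood U of (r,c)
            OUT[r, c] = np.sum(W * U)
    return OUT

img = np.arange(25, dtype=float).reshape(5, 5)           # toy 5x5 image
box = np.ones((3, 3)) / 9.0                              # 3x3 averaging filter
print(linear_transform(img, box))
```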
What is correlation among a filter W and an image IN and how is it used?
<IN, W>(r,c) = ΣΣ W(u,v)·IN(r+u, c+v), summed over a square neighborhood centered at (r,c)
Used for template matching: the location where the correlation with the template W is highest identifies the best match
NB: normalization is needed to avoid spurious maxima from saturation to max/min values (eg: uniformly white/black regions); see the sketch below
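A minimal NumPy sketch of normalized cross-correlation for template matching (hypothetical sizes; the template is cut from the image itself so the true location is known):

```python
import numpy as np

def ncc_map(IN, T):
    """Normalized cross-correlation of template T over image IN (valid positions only)."""
    Ht, Wt = T.shape
    Tn = T - T.mean()
    out = np.zeros((IN.shape[0] - Ht + 1, IN.shape[1] - Wt + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            P = IN[r:r + Ht, c:c + Wt]
            Pn = P - P.mean()                          # normalization step
            denom = np.linalg.norm(Tn) * np.linalg.norm(Pn)
            out[r, c] = (Tn * Pn).sum() / denom if denom > 0 else 0.0
    return out

rng = np.random.default_rng(0)
img = rng.random((8, 8))
tmpl = img[2:5, 3:6]                                   # template taken at (2, 3)
score = ncc_map(img, tmpl)
print(np.unravel_index(score.argmax(), score.shape))   # -> (2, 3): best match found
```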
How does an image classification task work using an ANN?
The input image is unfolded (flattened) into an H·W·C array
Output score of each class = weighted sum of the pixel values
The learned weights identify, for each class, a separating hyperplane (half-space) in feature space
NB: each class's weights can be reshaped into an image and interpreted as a template (see the sketch below)
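A minimal NumPy sketch (hypothetical, untrained weights) of scoring an unfolded image with a linear classifier:

```python
import numpy as np

H, W, C, n_classes = 32, 32, 3, 10
rng = np.random.default_rng(0)

img = rng.random((H, W, C))
x = img.reshape(-1)                                  # unfold into an H*W*C array
Wmat = rng.standard_normal((n_classes, H * W * C)) * 0.01  # hypothetical weights
b = np.zeros(n_classes)

scores = Wmat @ x + b                                # weighted sum of pixels per class
print(scores.argmax())                               # predicted class

templates = Wmat.reshape(n_classes, H, W, C)         # each class's weights as an image
```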
What are the main difficulties faced during image classification?
- High dimensionality of the dataset (eg: 32×32×3 = 3072 values per image)
- Label ambiguity: a single label may not fully describe an image (=> multi-label classification)
- (Small) transformations may DRASTICALLY change image values without changing the class of belonging (eg: saturation, contrast, deformations, viewpoint changes, occlusion, background clutter, scale) (=> training must be robust to transformations)
- Intra-class variability: totally different images belonging to same class
- Perceptual similarity: images at a small pixel-wise distance can be perceptually (very) different, and vice versa (=> K-NN helps if K»1, but is impractical because the whole training set must be stored and scanned at test time); see the sketch below
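A tiny NumPy illustration (random textured images, hypothetical sizes): a one-pixel shift leaves the content perceptually unchanged, yet its pixel-wise distance is about as large as the distance to an unrelated image:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))            # fine-textured image
shifted = np.roll(img, 1, axis=1)     # same content, shifted by 1 pixel
other = rng.random((32, 32))          # unrelated image

print(np.linalg.norm(img - shifted))  # large, despite perceptual similarity...
print(np.linalg.norm(img - other))    # ...comparable to an unrelated image
```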
What are features and how can they be found?
High-level patterns representing meaningful information, useful to reduce the dimensionality of the data
1. Hand-crafted: only for very simple and specific problems
PROS: allow embedding of prior knowledge, easy interpretation, require small dataset
CONS: difficult (or impossible) to design for even slightly complex tasks, not general/portable (=> overfitting risk)
2. Data-driven: state of art for feature extraction
PROS: high generality/portability, can capture deeply hidden patterns
CONS: usually not interpretable, difficult to embed prior knowledge
What is a convolutional NN (CNN) and how is it composed?
ANN designed to perform feature extraction on images
Convolution = linear transformation (same as correlation, but with the signs of the displacements switched: IN(r−u, c−v)) used to reduce the volume (=> #parameters) of the input as CNN depth increases
Composed by:
1. Convolutional layers: provide a linear combination of pixel values by applying filters over the whole input image
NB: #parameters = #filters·(Hᶠ·Wᶠ·C+1)
2. Activation layers: introduce non-linearities with scalar functions
Don’t change volume size
3. Pooling layers: reduce volume operating on each channel independently
4. Dense layers: flatten the input (spatial structure lost) and compose an MLP
NB: #parameters = #OUT·(#IN+1)
Usually CNN = feature extraction network (FEN) + FC = (convolutional–>activation–>pooling)⁺–>(dense)⁺ (see the sketch below)
Sub-sampling (which reduces the dimension of the volume) is performed by the activation (thresholding) and pooling (down-sampling) layers => the convolutional part (FEN) needs far fewer parameters than a pure MLP
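A minimal sketch in plain Python (hypothetical layer sizes; 'same' padding and 2×2 pooling assumed, so a 32×32 input shrinks to 8×8 after two poolings) applying the two parameter-count formulas above:

```python
def conv_params(n_filters, Hf, Wf, C):
    # #parameters = #filters * (Hf*Wf*C + 1), the +1 being each filter's bias
    return n_filters * (Hf * Wf * C + 1)

def dense_params(n_in, n_out):
    # #parameters = #OUT * (#IN + 1)
    return n_out * (n_in + 1)

# (conv -> activation -> pooling) x2 -> dense, on a 32x32x3 input
p1 = conv_params(8, 3, 3, 3)          # 224   (activation/pooling layers: 0 parameters)
p2 = conv_params(16, 3, 3, 8)         # 1168
p3 = dense_params(8 * 8 * 16, 10)     # 10250 (volume flattened to 8*8*16 = 1024)
print(p1 + p2 + p3)                   # 11642 parameters in total

# One dense layer straight from the unfolded 32x32x3 input:
print(dense_params(32 * 32 * 3, 10))  # 30730 -- already more than the whole CNN
```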
How does a convolutional layer work?
Padding is used to deal with image boundaries: valid (no padding), same (half padding = frame of ⌊Hᶠ/2⌋ pixels, eg 1 pixel for a 3×3 filter), full (full padding = frame of Hᶠ−1 = Wᶠ−1 pixels); output sizes are shown in the sketch below
Each filter has as many channels as its input (Cᶠ = C) and generates 1 single-channel output map (=> #output maps = #filters)
MEMO: #parameters = #filters·(Hᶠ·Wᶠ·C+1)
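A short sketch (plain Python; stride 1 and odd filter sizes assumed for 'same') of how each padding mode affects the output spatial size, via the standard formula H_out = (H + 2P − Hᶠ)/stride + 1:

```python
def conv_output_size(H, Hf, padding="valid", stride=1):
    # P = width of the frame added on each side, depending on the padding mode
    P = {"valid": 0, "same": (Hf - 1) // 2, "full": Hf - 1}[padding]
    return (H + 2 * P - Hf) // stride + 1

for mode in ("valid", "same", "full"):
    print(mode, conv_output_size(32, 3, mode))   # valid 30, same 32, full 34
```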
How does a pooling layer work?
2 types:
1. Max pooling: takes maximum value of pixels under filter area
2. Avg pooling: takes the average value of pixels under filter area
Filters are applied without overlapping (Hᶠ = Wᶠ = stride), so each spatial dimension shrinks by a factor equal to the stride while the number of channels is unchanged (see the sketch below)
NB: pooling layers have no parameters (nothing is learned)
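A minimal NumPy sketch of non-overlapping pooling on one channel (H and W assumed divisible by the pool size k):

```python
import numpy as np

def pool2d(x, k=2, mode="max"):
    """Non-overlapping k x k pooling on a single channel (stride = k)."""
    H, W = x.shape
    blocks = x.reshape(H // k, k, W // k, k)          # split into k x k tiles
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, "avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```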
What is the receptive field of an Hᵐ×Wᵐ output map?
Region of the input that affects the output map through the applied filters; it gets wider as the network gets deeper
For each layer, going backwards from the output map towards the input (see the sketch below):
1. Convolutional filter: Hʳᶠ = Hᵐ + Hᶠ − 1
2. Pooling layer: Hʳᶠ = Hᵐ·Hᵖ
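A small sketch applying the two recurrences backwards through a hypothetical stack of layers, to obtain the receptive field of a single output pixel (Hᵐ = 1):

```python
def receptive_field(layers, Hm=1):
    """Propagate the region size backwards; layers are ('conv', Hf) or ('pool', Hp)."""
    for kind, size in reversed(layers):
        if kind == "conv":
            Hm = Hm + size - 1        # Hrf = Hm + Hf - 1
        else:                         # "pool"
            Hm = Hm * size            # Hrf = Hm * Hp
    return Hm

# conv 3x3 -> pool 2x2 -> conv 3x3 -> pool 2x2
print(receptive_field([("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2)]))  # 10
```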
What are the main differences between a CNN and an MLP?
- Sparse connectivity: each output pixel is connected only to the input pixels in its receptive field
- Weight sharing: the weights don't change between applications of the same filter (=> all neurons in the slice generated by that filter share the same weights and bias => feature extraction is insensitive to the feature's location); see the sketch below
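A rough illustration of the savings from sparse connectivity and weight sharing: mapping a 32×32×3 input to a 32×32×8 volume with a fully-connected layer vs. an 8-filter 3×3 convolutional layer (hypothetical sizes):

```python
# Dense: every output neuron connects to every input pixel with its own weights
dense = (32 * 32 * 8) * (32 * 32 * 3 + 1)   # 25,174,016 parameters

# Conv: each output pixel sees only its 3x3xC receptive field, and all pixels
# of the same output map share one filter + bias (weight sharing)
conv = 8 * (3 * 3 * 3 + 1)                  # 224 parameters

print(dense, conv)
```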