Matteucci_1 Flashcards by Nicolo' Fontana

When is it correct to talk of machine learning (ML)?

When a program improves its measure of performance on a certain task thanks to experience

Which are the main types of ML techniques?

Supervised: given inputs paired with desired outputs, learn to produce the correct output for new inputs
Unsupervised: exploit regularities in dataset to build representation
Reinforcement: produce environment-affecting actions to maximize reward received as consequence of those actions

What is deep learning (DL)?

Use of ML to learn the data representation from data themselves

What is an artificial neural network (ANN)?

Computing system composed of units called neurons and inspired by biological neural networks in brains

How are neurons modeled?

Each neuron is characterized by an activation function (originally a function with threshold), usually non-linear and differentiable, eg: sigmoid, tanh, ReLU…
The function is applied to input = weighted sum of the previous neurons' outputs
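
A minimal sketch of a single neuron as described on this card, assuming a sigmoid activation. The `bias` term and the names `sigmoid` and `neuron_output` are illustrative additions, not part of the card.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Apply the activation to the weighted sum of the previous neurons' outputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # bias is an assumption
    return activation(z)

# Weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 = 0.0, so the sigmoid sits at its midpoint.
y = neuron_output([1.0, 2.0], [0.5, -0.25])
```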

What is a feed-forward ANN (FFNN)? When is a FFNN said to be fully connected (FC)?

ANN in which connections between neurons don’t form any cycle
FC when each neuron in each layer is connected with each neuron in both the previous and the next layer, ie for each of these pairs a weight exists
NB: definition independent of number of layers

What is a perceptron (PCP)?

Algorithm to learn a binary classifier. ANN composed of a single neuron

What is a multi-layer perceptron (MLP) and why is it needed?

FC FFNN composed of many PCPs arranged in layers; a layer is called input if first, output if last, hidden if in the middle.
Introduced to learn non-linear functions, since the learning algorithm of a single PCP converges iff the input is linearly separable.
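
The linear-separability condition can be illustrated with a small sketch of the classic perceptron learning rule on two toy problems. The function name, the epoch budget and the learning rate are assumptions chosen for the example.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule on 2-D inputs; returns (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - pred
            if delta != 0:
                # Move the decision boundary toward the misclassified sample.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:          # a full pass without mistakes: converged
            return w, b, True
    return w, b, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not linearly separable
_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
```

On AND the rule settles within the epoch budget, while on XOR no pass is ever error-free, which is the case an MLP with a hidden layer is meant to handle.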

What is Hebbian learning (HL)? How does it apply to ANNs?

Fact that simultaneous activation of input neurons and (desired) output neurons leads to strengthening of synapse between them
Locally limited, so it applies only to synaptic level (ie takes into consideration 2 neurons at time), hence doesn’t apply to MLP (cannot take into consideration all weights for a single neuron)

Which are the most common problems for supervised learning?

Regression: learn a continuous function (linear output activation)
Classification: 2-class (sigmoid or tanh output activation) or k-class (softmax output activation = conversion of vector into probability distribution)
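
As a rough illustration of how softmax converts a vector into a probability distribution, here is a small sketch. Subtracting the maximum score is a common numerical-stability trick and an addition of this example, not something stated on the card.

```python
import math

def softmax(scores):
    """Convert a vector of real scores into a probability distribution."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # arbitrary example scores
```

The outputs are positive, sum to 1, and keep the ordering of the input scores.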

What is the universal approximation theorem?
What are the practical limitations of it?

Single hidden layer FFNN with non-linear activation function can approximate ANY measurable function to ANY desired degree of accuracy (on compact set)

Limitations:
1. NO guarantee that a learning algorithm finds the required weights (they may oscillate forever)
2. Exponential number of hidden neurons may be required
3. May fail to generalize

What is a loss function? What are the most common ones?

Measure of error between output and target for each training sample
Common ones:
1. Mean squared error (MSE): regression
2. Mean absolute error (MAE): regression
3. Binary cross-entropy: binary classification
4. Categorical cross-entropy: k-class classification
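
The four losses listed above could be sketched as follows. Averaging over the samples and the `eps` clipping that guards against log(0) are assumptions of this example.

```python
import math

def mse(targets, outputs):
    """Mean squared error (regression)."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def mae(targets, outputs):
    """Mean absolute error (regression)."""
    return sum(abs(t - o) for t, o in zip(targets, outputs)) / len(targets)

def binary_cross_entropy(targets, probs, eps=1e-12):
    """Targets are 0 or 1; probs are predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)       # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

def categorical_cross_entropy(one_hot_targets, prob_rows, eps=1e-12):
    """Targets are one-hot rows; prob_rows are predicted class distributions."""
    total = 0.0
    for target_row, prob_row in zip(one_hot_targets, prob_rows):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target_row, prob_row))
    return total / len(one_hot_targets)
```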

What is gradient descent (GD)? What are its limitations and which are the corresponding mitigations?

ITERATIVE method to search a min of loss function
Each iteration computes the gradient of the loss with respect to the weights, scales it by a learning rate and subtracts it from the current weights:
Wₖ₊₁ = Wₖ - η·∇LOSS(Wₖ)
(η = learning rate (LR))

Limitations:
1. Loss minimization usually has no closed-form solution, so the search is iterative and may oscillate around a minimum => gradually reduce LR
2. May reach local minimum => add momentum
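
A minimal sketch of the update rule Wₖ₊₁ = Wₖ - η·∇LOSS(Wₖ) on a hypothetical one-dimensional loss, LOSS(w) = (w - 3)², chosen only because its minimum (w = 3) is known. The function name and the step counts are assumptions.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly apply w <- w - lr * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def grad_loss(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_small_lr = gradient_descent(grad_loss, w0=0.0, lr=0.1)           # approaches 3
w_large_lr = gradient_descent(grad_loss, w0=0.0, lr=1.0, steps=5)  # bounces between 0 and 6
```

With the larger learning rate the iterate jumps from one side of the minimum to the other without getting closer, which is the situation that reducing the LR is meant to fix.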

How often are the weights updated with respect to the number of samples?

Batch: GD applied after loss computed over all samples, high accuracy, high memory demand
Stochastic (SGD): GD applied after loss computed over single sample, unbiased, BUT high variance
Mini-batch: GD applied after loss computed over some samples (subset), tradeoff bias-variance and memory resources
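
The three schemes differ only in how many samples feed each weight update. A small sketch, with a hypothetical `iter_batches` helper and a toy dataset of 10 samples:

```python
def iter_batches(samples, batch_size):
    """Yield consecutive groups of samples; one weight update per group."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

data = list(range(10))                               # stand-in for 10 training samples
batch_mode = list(iter_batches(data, len(data)))     # batch GD: all samples per update
stochastic_mode = list(iter_batches(data, 1))        # SGD: one sample per update
mini_batch_mode = list(iter_batches(data, 4))        # mini-batch: a subset per update
```

One pass over the data therefore gives 1 update in batch mode, 10 in stochastic mode and 3 with mini-batches of size 4 (the last one holding the 2 leftover samples).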

What is the chain rule and why is it important?

Mathematical property that allows computing the derivative of a composite function as the product of the derivatives of the nested functions
Important because it allows the local derivatives computed during the forward pass to be stored and reused during backpropagation

What is backpropagation?

Computation of gradient starting from output layer, backward to input, using pre-computed derivatives
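
A sketch of the chain rule at work on a hypothetical two-weight network, y = sigmoid(w2 · sigmoid(w1 · x)), with loss 0.5 · (y - t)². The network shape, the loss and all names are assumptions; the analytic gradient is compared against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)   # hidden activation, kept for the backward pass
    y = sigmoid(w2 * h)   # output activation
    return h, y

def loss(x, t, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - t) ** 2

def backprop(x, t, w1, w2):
    """Gradient of the loss w.r.t. (w1, w2), from the output back to the input."""
    h, y = forward(x, w1, w2)
    delta_out = (y - t) * y * (1.0 - y)            # dL/dy * dy/dz_out
    grad_w2 = delta_out * h                        # dz_out/dw2 = h
    delta_hidden = delta_out * w2 * h * (1.0 - h)  # reuse delta_out (chain rule)
    grad_w1 = delta_hidden * x                     # dz_hidden/dw1 = x
    return grad_w1, grad_w2

x, t, w1, w2 = 1.5, 1.0, 0.4, -0.3                 # arbitrary example values
g1, g2 = backprop(x, t, w1, w2)
eps = 1e-6
num_g1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
num_g2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
```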

What is the maximum likelihood estimation (MLE) and how is it used?

Method to estimate parameters of distribution given samples assumed to be generated by that distribution
In particular: assume distribution for given samples (unknown parameters), compute joint probability of samples given parameters (likelihood function, a function of the parameters), find parameters that maximize that probability

Negative log of the obtained likelihood used as loss function

Which are the most common assumed distributions with MLE and which are their resulting loss functions?

Gaussian noise => sum of squared errors (SSE)
Bernoulli (binary classification) => cross-entropy
Categorical distribution (k-class classification) => (k-class) categorical cross-entropy
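
For the Gaussian case, the link to SSE can be sketched as follows. The notation (targets t_n, model output g(x_n | w), noise variance σ²) is an assumption of this derivation, not taken from the card.

```latex
% Assumption: t_n = g(x_n \mid w) + \epsilon_n, with \epsilon_n \sim \mathcal{N}(0, \sigma^2) i.i.d.
\begin{align*}
L(w) &= \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left(-\frac{(t_n - g(x_n \mid w))^2}{2\sigma^2}\right) \\
\log L(w) &= -N \log\!\left(\sqrt{2\pi}\,\sigma\right)
        - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (t_n - g(x_n \mid w))^2 \\
\hat{w} &= \arg\max_w \log L(w)
         = \arg\min_w \sum_{n=1}^{N} (t_n - g(x_n \mid w))^2
\end{align*}
```

The first term of the log-likelihood does not depend on w, so maximizing the likelihood is the same as minimizing the sum of squared errors.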

What are underfitting and overfitting and how can they be spotted during training?

Underfitting: use of too simple model => can’t accurately represent real system (high bias)
Training and validation errors close, but high
Overfitting: use of too complex model => can’t generalize with respect to samples (high variance)
Training and validation errors separate (training goes to 0 more quickly)

Why are datasets usually split, in which way, and what is the role of each split set?

Split to better generalize over samples (using different subsets) and obtain better estimators than training error
Main principle: estimate ONLY on NOT training data

Most common split:
dataset =
{ training_data (80% of dataset) = [training_set (80%) + validation_set (20%)] +
testing_data (20% of dataset) = [test_set] }
(validation_data = validation_set + test_set)

training_set = samples used to train model
validation_set = samples used to obtain more accurate estimation of error during training
(only training_data used during model development)
test_set = samples used when model is fully trained to evaluate performance on unseen samples
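
The 80/20 split applied twice can be sketched like this. Shuffling first, the fixed seed and the function name are assumptions of the example.

```python
import random

def split_dataset(samples, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle, then return (training_set, validation_set, test_set)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    test_set = shuffled[:n_test]             # held back until the model is final
    training_data = shuffled[n_test:]        # everything used during development
    n_val = int(len(training_data) * val_frac)
    validation_set = training_data[:n_val]
    training_set = training_data[n_val:]
    return training_set, validation_set, test_set

train_part, val_part, test_part = split_dataset(list(range(100)))
```

With 100 samples this gives 64 for training, 16 for validation and 20 for testing.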

Which are the most common techniques to split the dataset, and is each one applied to the validation split, the testing split, or both?

Hold-out: just separate in two subsets, use one for training and other for estimation => faster, not very meaningful (for generalization)
ALWAYS testing, RARELY validation
Leave-one-out (LOO): separate 1 sample for estimation and use rest for training, repeat for each sample => extremely meaningful (for generalization), extremely slow (high resource demand)
RARELY both testing and validation
K-fold cross-validation: separate into K subsets, use 1 subset for estimation and others for training, repeat for each subset => good tradeoff generalization-resource use
SOMETIMES testing (more stable evaluation), USUALLY validation
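
A sketch of how the K-fold index sets could be generated; setting K equal to the number of samples gives leave-one-out. The helper name and the handling of uneven fold sizes are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, estimation_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size

folds = list(k_fold_indices(10, 5))   # 5 folds of 2 samples each
loo = list(k_fold_indices(4, 4))      # K = n_samples: leave-one-out
```

Each sample is used for estimation exactly once and for training in the other K - 1 rounds.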

Which are the most common techniques to avoid overfitting?

Early stopping: stop training when validation error stops decreasing/increases
NB: NOT training error because monotonic
Weight decay (regularization): artificially constrain model freedom (keep weights small) using Bayesian approach (prior distribution over the weights)
NB: introduces regularizing term with factor γ∈[0,+∞) (γ=0 no regularization, risk of overfitting; γ→∞ underfitting), tuned using validation error; once fixed, retrain over all training_data
Dropout (stochastic regularization): randomly switch off neurons to obtain weaker network
Train many weaker networks on different mini-batches and average results (similar concept to bagging)
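
The early-stopping rule can be sketched as a scan over the validation error curve. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions made up for the example.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose weights should be kept."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err      # new best: remember it
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for `patience` epochs
    return best_epoch

val_curve = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]  # made-up validation errors
best = early_stopping_epoch(val_curve)
```

Here the validation error bottoms out at epoch 3 and training stops two epochs later, before the remaining epochs are run.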

Which are the most common activation functions and which are their pros and cons?

Logistic (sigmoid) & tanh
PROS: differentiable, zero-centered output (tanh only; sigmoid requires shift)
CONS: tend to saturate => vanishing gradient
Rectified linear unit (RELU) = max{0,x}
PROS: faster SGD convergence, sparse activation (some neurons output 0 => less overfitting), no vanishing gradient for x>0, efficient computation (simple function), scale invariant
CONS: non differentiable in 0 (solved putting f’(0)=0), non zero-centered output, unbounded => exploding gradient, for each x<0 f(x)=0 => dying neurons (f’(x)=0)
Leaky RELU = { x (if x>0) | εx (if x≤0) } (0<ε«1)
PROS: fix dying neurons of RELU
CONS: others of RELU
ELU = { x (if x>0) | α(eˣ-1) (if x≤0) }
PROS: fix dying neurons & non differentiability of RELU
CONS: others of RELU, α needs “by-hand” tuning
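
The functions above, written out as a small sketch. The default values ε = 0.01 and α = 1.0 are assumptions, and `sigmoid_grad` is included only to make the saturation problem visible.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, eps=0.01):
    """Small slope eps for negative inputs keeps the gradient alive."""
    return x if x > 0 else eps * x

def elu(x, alpha=1.0):
    """Smoothly approaches -alpha for very negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid_grad(x):
    """Derivative of the logistic function: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

saturated = sigmoid_grad(10.0)   # tiny gradient far from 0: saturation
centered = sigmoid_grad(0.0)     # largest gradient, 0.25, at x = 0
```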

What are bad and good ways to initialize the weights of an ANN?

BAD:
1. Zeros: no learning from start
2. Big values: difficult convergence
3. Gaussian N(0,ε): small variance => vanishing gradient if deep NN

GOOD:
1. Xavier: for each layer N(0,σ²) with σ² = 1/#IN to keep VAR(OUT)=VAR(IN)·#IN·VAR(W)≈VAR(IN)
2. Glorot&Bengio: for each layer N(0,σ²) with σ² = 2/[#IN+#OUT]
3. He: for each layer N(0,σ²) with σ² = 2/#IN, more useful with RELU
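
A sketch of the three schemes, reading the listed values (1/#IN, 2/[#IN+#OUT], 2/#IN) as variances of the weight distribution, which is what the VAR(W) relation on the card suggests. The scheme labels, function names and seed are assumptions of this example.

```python
import math
import random

def init_std(n_in, n_out, scheme):
    """Standard deviation of the zero-mean Gaussian for one layer."""
    if scheme == "xavier":
        var = 1.0 / n_in
    elif scheme == "glorot_bengio":
        var = 2.0 / (n_in + n_out)
    elif scheme == "he":
        var = 2.0 / n_in
    else:
        raise ValueError("unknown scheme: " + scheme)
    return math.sqrt(var)

def init_weights(n_in, n_out, scheme, seed=0):
    """Return an n_out x n_in weight matrix as a list of rows."""
    rng = random.Random(seed)
    std = init_std(n_in, n_out, scheme)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

W = init_weights(400, 50, "he")   # 50 neurons, each with 400 inputs
```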

Which are common techniques to deal with shifted distributions and what are their pros?

1. Whitening = reshape samples distribution as N(0,1)
PROS: improve convergence speed
2. Uncorrelation = re-weights samples to remove covariate shift (training distribution different from test one)
PROS: improve accuracy
3. Batch normalization = force input as N(0,1) for each batch to remove internal covariate shift
Taken into account by backpropagation and applied at test time using μ and σ computed over all training_data
PROS: directly implemented as layer, improve gradient flow, allow higher LR, reduce dependence on weights initialization, apply light regularization
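
The training-time step of batch normalization for a single feature could look like the sketch below. The scale `gamma`, the shift `beta` and the small constant `eps` are assumptions that follow the usual formulation; the card itself does not mention them.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and unit variance."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n
    # eps avoids division by zero; gamma and beta rescale and shift the result.
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([2.0, 4.0, 6.0, 8.0])   # arbitrary example batch
```

At test time the batch statistics `mu` and `var` would be replaced by the values computed over all training_data, as the card states.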

Which are the 2 main optimizations that can be applied to GD and which are some examples of their implementation?

1. Momentum (-α∇LOSS(Wₖ₋₁)) to avoid being stuck in local minima
Nesterov accelerated gradient (NAG): apply momentum before computing new gradient
2. Adaptive LR (reducing and different for each layer) to avoid endless oscillations over a minimum
Avoid risk of vanishing gradient for early layers, manage LR of different magnitudes
Rprop, AdaGrad, RMSprop, Adam
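
A sketch of the momentum term exactly as written on the card, ie the plain GD step plus an extra -α∇LOSS(Wₖ₋₁) contribution from the previous iterate. The hypothetical loss (w - 3)², the values of η and α, and the choice to start with Wₖ₋₁ = W₀ are assumptions.

```python
def momentum_descent(grad, w0, lr=0.1, alpha=0.05, steps=100):
    """W_{k+1} = W_k - lr * grad(W_k) - alpha * grad(W_{k-1})."""
    w_prev, w = w0, w0                  # assumption: no history at the first step
    for _ in range(steps):
        w_next = w - lr * grad(w) - alpha * grad(w_prev)
        w_prev, w = w, w_next
    return w

def grad_loss(w):
    """Gradient of the hypothetical loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_final = momentum_descent(grad_loss, w0=0.0)   # approaches the minimum at 3
```

The previous gradient keeps pushing in the direction of recent progress, which is what lets the iterate roll through flat regions or shallow local minima.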

Matteucci_1 Flashcards

(26 cards)