Matteucci_1 Flashcards
(26 cards)
When is it correct to talk of machine learning (ML)?
When a program improves its measure of performance on a certain task thanks to experience
Which are the main types of ML techniques?
- Supervised: given inputs paired with desired outputs, learn to produce the correct output for an input
- Unsupervised: exploit regularities in dataset to build representation
- Reinforcement: produce environment-affecting actions to maximize reward received as consequence of those actions
What is deep learning (DL)?
Use of ML to learn the data representation from data themselves
What is an artificial neural network (ANN)?
Computing system composed of units called neurons and inspired by biological neural networks in brains
How are neurons modeled?
Each neuron is characterized by an activation function (originally a threshold function), usually non-linear and differentiable (almost everywhere), eg: sigmoid, tanh, relu…
The function is applied to input = weighted sum of the previous neurons' outputs
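A minimal sketch of the neuron model above, in plain Python. The `bias` term and the function names are assumptions added for illustration; the card itself only mentions the weighted sum and the activation.

```python
import math

def sigmoid(a):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Apply the activation to the weighted sum of the previous neurons' outputs.

    The bias is an assumed extra term, not mentioned in the card.
    """
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(a)

# Weighted sum is 0.4*1.0 + (-0.8)*0.5 = 0, so the sigmoid sits at its midpoint.
out = neuron([1.0, 0.5], [0.4, -0.8])
```

Swapping `activation` for `math.tanh` or a RELU function changes the neuron type without touching the weighted sum.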
What is a feed-forward ANN (FFNN)? When is an FFNN said to be fully connected (FC)?
ANN in which connections between neurons don’t form any cycle
FC when each neuron in each layer is connected with each neuron in both the previous and the next layer, ie for each of these pairs a weight exists
NB: definition independent of number of layers
What is a perceptron (PCP)?
Algorithm to learn a binary classifier. ANN composed of a single neuron
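One possible sketch of the perceptron learning rule on a linearly separable toy problem (logical AND). The step threshold at 0, the learning rate and the epoch cap are assumptions chosen for the example.

```python
def step(a):
    """Threshold activation: fires (1) when the weighted sum is non-negative."""
    return 1 if a >= 0 else 0

def predict(w, b, x):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(samples, lr=1.0, epochs=20):
    """samples: list of (inputs, target) pairs with target in {0, 1}."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in samples:
            y = predict(w, b, x)
            if y != t:
                errors += 1
                # Move the decision boundary toward the misclassified sample.
                for i in range(n):
                    w[i] += lr * (t - y) * x[i]
                b += lr * (t - y)
        if errors == 0:  # a full pass with no mistakes: converged
            break
    return w, b

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(AND)
predictions = [predict(w, b, x) for x, _ in AND]
```

On XOR the same loop would never reach a mistake-free pass, since that dataset is not linearly separable.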
What is a multi-layer perceptron (MLP) and why is it needed?
FC FFNN composed of many PCP-like neurons arranged in layers, called input layer if first, output layer if last, hidden layer if in the middle.
Introduced to learn non-linear functions, since a single PCP converges iff the input is linearly separable.
What is Hebbian learning (HL)? How does it apply to ANNs?
Fact that simultaneous activation of input neurons and (desired) output neurons leads to strengthening of the synapse between them
Locally limited, so it applies only at synaptic level (ie takes into consideration 2 neurons at a time), hence doesn’t apply to MLP (cannot take into consideration all weights for a single neuron)
Which are the most common problems for supervised learning?
- Regression: learn a continuous function (linear output activation)
- Classification: 2-class (sigmoid, tanh) or k-class (softmax=conversion of vector into probability distribution)
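The softmax conversion mentioned above can be sketched as follows. Subtracting the maximum score is a common numerical-stability trick and is an addition not present in the card.

```python
import math

def softmax(scores):
    """Convert a vector of real scores into a probability distribution."""
    m = max(scores)  # shift for numerical stability; does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive, sum to 1, and preserve the ordering of the input scores.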
What is the universal approximation theorem?
What are the practical limitations of it?
Single hidden layer FFNN with non-linear activation function can approximate ANY measurable function to ANY desired degree of accuracy (on compact set)
Limitations:
1. Convergence of weights NOT guaranteed (learning algorithm may oscillate forever and never find them)
2. Exponential number of PCP may be required
3. May fail to generalize
What is a loss function? What are the most common ones?
Measure of error between output and target for each training sample
Common ones:
1. Mean squared error (MSE): regression
2. Mean absolute error (MAE): regression
3. Binary cross-entropy: binary classification
4. Categorical cross-entropy: k-class classification
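A plain-Python sketch of the four losses listed above, averaged over the samples. The function names and the `eps` clipping (to avoid log(0)) are assumptions for the example.

```python
import math

def mse(targets, outputs):
    """Mean squared error."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)

def mae(targets, outputs):
    """Mean absolute error."""
    return sum(abs(t - y) for t, y in zip(targets, outputs)) / len(targets)

def binary_cross_entropy(targets, probs, eps=1e-12):
    """targets in {0, 1}; probs are predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logs finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)

def categorical_cross_entropy(one_hot_targets, prob_rows, eps=1e-12):
    """One-hot targets; each row of prob_rows is a distribution over k classes."""
    total = 0.0
    for target_row, prob_row in zip(one_hot_targets, prob_rows):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(target_row, prob_row))
    return total / len(one_hot_targets)
```

For example, a binary prediction of 0.5 for a positive sample costs log 2, the same as predicting 0.5 for the true class in the k-class case.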
What is gradient descent (GD)? What are its limitations and which are the corresponding mitigations?
ITERATIVE method to search for a minimum of the loss function
Each iteration computes the gradient of the loss with respect to the weights, scales it by a learning rate and subtracts it from the current weights:
Wₖ₊₁ = Wₖ - η·∇LOSS(Wₖ)
(η = learning rate (LR))
Limitations:
1. No closed-form solution to jump to, and a fixed LR may overshoot and oscillate around the minimum => gradually reduce LR
2. May reach local minimum => add momentum
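A minimal sketch of the update rule with optional momentum, on a one-dimensional toy loss. The loss (w-3)², the hyperparameter values and the velocity formulation `v = momentum·v - η·∇LOSS` are assumptions for illustration; with `momentum=0` it reduces to the plain rule Wₖ₊₁ = Wₖ - η·∇LOSS(Wₖ).

```python
def gradient_descent(grad, w0, lr=0.1, momentum=0.0, steps=100):
    """Iterate w <- w + v, where v accumulates past (scaled) gradients."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(w)  # momentum=0 gives plain GD
        w = w + v
    return w

def grad_loss(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

w_plain = gradient_descent(grad_loss, w0=0.0)
w_momentum = gradient_descent(grad_loss, w0=0.0, momentum=0.9, steps=300)
```

This toy loss is convex, so it only illustrates the mechanics; the benefit of momentum against local minima shows up on non-convex losses.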
How often are the weights updated with respect to the number of samples?
- Batch: GD applied after loss computed over all samples, accurate gradient, high memory demand
- Stochastic (SGD): GD applied after loss computed over single sample, unbiased gradient estimate, BUT high variance
- Mini-batch: GD applied after loss computed over some samples (subset), tradeoff between gradient variance and memory resources
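The three variants differ only in how many samples feed each update, which could be sketched with a single batching helper. The helper name and the per-epoch shuffling are assumptions.

```python
import random

def minibatches(samples, batch_size, rng):
    """Yield shuffled subsets of the samples; one GD update would follow each."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]

data = list(range(10))
batches = list(minibatches(data, 4, random.Random(0)))
```

Setting `batch_size=len(data)` gives batch GD (one update per pass), while `batch_size=1` gives SGD (one update per sample).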
What is the chain rule and why is it important?
Mathematical property that allows computing the derivative of a composition of functions as the product of the derivatives of the nested functions
Important because it allows computing, during the forward pass, local derivatives that are stored and reused during backpropagation
What is backpropagation?
Computation of gradient starting from output layer, backward to input, using pre-computed derivatives
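A small sketch of backpropagation via the chain rule, on an assumed toy network with one sigmoid hidden neuron and one linear output, y = w2·sigmoid(w1·x), and loss ½(y-t)². The network shape, the loss and all names are illustrative assumptions. The analytic gradients are compared against finite differences.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2):
    """Forward pass; intermediate values are kept for the backward pass."""
    a = w1 * x
    h = sigmoid(a)
    y = w2 * h
    return a, h, y

def loss(x, t, w1, w2):
    return 0.5 * (forward(x, w1, w2)[2] - t) ** 2

def backward(x, t, w1, w2):
    """Gradient of the loss, propagated from the output back to the input."""
    a, h, y = forward(x, w1, w2)
    dL_dy = y - t                    # derivative of the loss w.r.t. the output
    dL_dw2 = dL_dy * h               # output layer weight
    dL_dh = dL_dy * w2               # pass the error back to the hidden neuron
    dL_da = dL_dh * h * (1.0 - h)    # sigmoid'(a) = h * (1 - h)
    dL_dw1 = dL_da * x               # hidden layer weight
    return dL_dw1, dL_dw2

x, t, w1, w2 = 1.5, 1.0, 0.3, -0.7
g1, g2 = backward(x, t, w1, w2)

# Numerical check with central finite differences.
eps = 1e-6
num_g1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
num_g2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
```

Each line of `backward` multiplies the incoming derivative by one local derivative, which is exactly the product of nested derivatives given by the chain rule.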
What is maximum likelihood estimation (MLE) and how is it used?
Method to estimate parameters of distribution given samples assumed to be generated by that distribution
In particular: assume distribution for given samples (unknown parameters), compute joint probability of the samples given the parameters (likelihood function, a function of the parameters), find parameters that maximize that probability
Negative log of the obtained likelihood used as loss function
Which are the most common assumed distributions with MLE and which are their resulting loss functions?
- Gaussian noise => sum of squared errors (SSE)
- Bernoulli (binary classification) => cross-entropy
- Categorical distribution (k-class classification) => (k-class) categorical cross-entropy
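For the Gaussian case, the step from likelihood to SSE can be written out as below. The symbols are assumptions introduced for the derivation: targets tₙ, inputs xₙ, network output g(xₙ|w), N samples, and noise with fixed variance σ².

```latex
\begin{align*}
t_n &= g(x_n \mid w) + \varepsilon_n, \qquad \varepsilon_n \sim \mathcal{N}(0, \sigma^2) \\
L(w) &= \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left(-\frac{\left(t_n - g(x_n \mid w)\right)^2}{2\sigma^2}\right) \\
\log L(w) &= -N \log\!\left(\sqrt{2\pi}\,\sigma\right)
        - \frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(t_n - g(x_n \mid w)\right)^2 \\
\hat{w} &= \arg\max_w \log L(w)
        = \arg\min_w \sum_{n=1}^{N} \left(t_n - g(x_n \mid w)\right)^2
\end{align*}
```

The first term of the log-likelihood does not depend on w and the factor 1/(2σ²) is a positive constant, so maximizing the likelihood is the same as minimizing the sum of squared errors.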
What are underfitting and overfitting and how can they be spotted during training?
Underfitting: use of too simple model => can’t accurately represent real system (high bias)
Training and validation errors close, but high
Overfitting: use of too complex model => fits the training samples but can’t generalize to unseen ones (high variance)
Training and validation errors diverge (training error keeps decreasing toward 0 while validation error stalls or increases)
Why are datasets usually split, in which way and what is the role of each split set?
Split to estimate how well the model generalizes to unseen samples (using different subsets) and obtain better estimators than training error
Main principle: estimate ONLY on NOT training data
Most common split:
dataset = training_data (80%) + testing_data (20%)
training_data = training_set (80%) + validation_set (20%)
testing_data = test_set
(validation_data = validation_set + test_set)
training_set = samples used to train model
validation_set = samples used to obtain more accurate estimation of error during training
(only training_data used during model development)
test_set = samples used when model is fully trained to evaluate performance on unseen samples
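The 80/20 split described above could be sketched like this. The function name, the shuffling and the fixed seed are assumptions for the example.

```python
import random

def split_dataset(samples, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Return (training_set, validation_set, test_set) after shuffling."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    # First carve out the test set, never touched during model development.
    n_test = int(round(len(shuffled) * test_fraction))
    test_set = shuffled[:n_test]
    training_data = shuffled[n_test:]
    # Then split the remaining training data into training and validation sets.
    n_val = int(round(len(training_data) * validation_fraction))
    validation_set = training_data[:n_val]
    training_set = training_data[n_val:]
    return training_set, validation_set, test_set

train_set, val_set, test_set = split_dataset(list(range(100)))
```

With 100 samples this yields 64 training, 16 validation and 20 test samples, with no sample shared between the sets.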
Which are the most common techniques to split the dataset, and is each one applied to the validation split or to the testing split?
- Hold-out: just separate in two subsets, use one for training and other for estimation => faster, not very meaningful (for generalization)
ALWAYS testing, RARELY validation
- Leave-one-out (LOO): separate 1 sample for estimation and use rest for training, repeat for each sample => extremely meaningful (for generalization), extremely slow (high resource demand)
RARELY both testing and validation
- K-folds cross-validation: separate into K subsets, use 1 subset for estimation and others for training, repeat for each subset => good tradeoff generalization-resource use
SOMETIMES testing (more stable evaluation), USUALLY validation
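A sketch of how the K-fold indices could be generated. The function name is hypothetical and shuffling is omitted for brevity, which is an assumption that the samples are already in random order.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, held_out_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, held_out
        start += size

folds = list(k_fold_indices(10, 3))
```

Setting `k` equal to the number of samples gives leave-one-out, while using only the first pair yielded corresponds to a single hold-out split.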
Which are the most common techniques to avoid overfitting?
- Early stopping: stop training when validation error stops decreasing/increases
NB: NOT training error because it keeps decreasing (roughly monotonic)
- Weight decay (regularization): artificially constrain model freedom (keep weights small) using Bayesian approach (prior on the weights combined with the likelihood, ie maximum a posteriori estimation)
NB: introduces regularizing term with factor γ∈[0,+∞) (0=no regularization, risk of overfitting; ∞=underfitting), tuned using validation error; once fixed, retrain over all training_data
- Dropout (stochastic regularization): randomly switch off neurons to obtain weaker network
Train many weaker networks on different mini-batches and average results (similar concept to bagging)
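The early-stopping rule from the list above could be sketched as a scan over per-epoch validation errors. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions for illustration.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the epoch with the best validation error seen before stopping."""
    best_error = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation curve: improves, then starts rising (overfitting).
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = early_stopping_epoch(val_errors)
```

In a real training loop the weights saved at the returned epoch would be restored as the final model.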
Which are the most common activation functions and which are their pros and cons?
- Logistic (sigmoid) & tanh
PROS: differentiable, zero-centered output (tanh only; sigmoid requires shift)
CONS: tend to saturate => vanishing gradient
- Rectified linear unit (RELU) = max{0,x}
PROS: faster SGD convergence, sparse activation (some neurons output 0 => less overfitting), no vanishing gradient for x>0, efficient computation (simple function), scale invariant
CONS: non differentiable in 0 (solved putting f’(0)=0), non zero-centered output, unbounded => exploding gradient, for each x<0 f(x)=0 => dying neurons (f’(x)=0)
- Leaky RELU = { x (if x>0) | εx (if x≤0) } (0<ε«1)
PROS: fix dying neurons of RELU
CONS: others of RELU
- ELU = { x (if x>0) | α(eˣ-1) (if x≤0) }
PROS: fix dying neurons & non differentiability of RELU (smooth in 0 when α=1)
CONS: others of RELU, α needs “by-hand” tuning
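The activation functions above, written out as a plain-Python sketch. The default values ε=0.01 and α=1.0 are assumptions, not values given in the cards.

```python
import math

def sigmoid(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, output in (-1, 1), zero-centered."""
    return math.tanh(x)

def relu(x):
    """max{0, x}: zero output and zero gradient for negative inputs."""
    return max(0.0, x)

def leaky_relu(x, eps=0.01):
    """Small slope eps for non-positive inputs keeps the gradient non-zero."""
    return x if x > 0 else eps * x

def elu(x, alpha=1.0):
    """Exponential branch for non-positive inputs, saturating at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)
```

For a negative input such as -2, `relu` returns 0 while `leaky_relu` and `elu` return small negative values, which is what keeps their neurons from dying.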
What are bad and good ways to initialize the weights of an ANN?
BAD:
1. Zeros: no learning from start
2. Big values: difficult convergence
3. Gaussian N(0,ε): small variance => vanishing gradient if deep NN
GOOD:
1. Xavier: for each layer N(0,σ²) with σ² = 1/#IN to keep VAR(OUT)=VAR(IN)·#IN·VAR(W)≈VAR(IN)
2. Glorot&Bengio: for each layer N(0,σ²) with σ² = 2/[#IN+#OUT]
3. He: for each layer N(0,σ²) with σ² = 2/#IN, more useful with RELU
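A sketch of the three GOOD schemes, treating the listed quantities as variances of a zero-mean Gaussian. The function name, the scheme labels and the layer sizes are assumptions for the example.

```python
import random

def init_weights(n_in, n_out, scheme="he", rng=None):
    """Return an n_out x n_in weight matrix drawn from N(0, variance)."""
    if rng is None:
        rng = random.Random(0)
    variances = {
        "xavier": 1.0 / n_in,
        "glorot_bengio": 2.0 / (n_in + n_out),
        "he": 2.0 / n_in,
    }
    std = variances[scheme] ** 0.5  # random.gauss expects a standard deviation
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

W = init_weights(200, 100, "he", random.Random(42))
flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)  # should be close to 2/200
```

The empirical variance of the sampled weights should land near 2/#IN = 0.01 for this He example.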