05_supervised learning methods Flashcards
(46 cards)
What is a linear model?
they assume linearity in the underlying data
rather simple, but they convey many of the concepts used in other, more complex models
eg linear regression, linear classification
What does a linear regression do?
find weights w0 and w1
so that the linear function f(x) = w1 · x + w0
with input x and output y
best fits the data containing ground-truth values y’
–> how can we learn w0 and w1 from data?
How do linear regression models learn the weights for function f?
minimize the squared errors of the predictions
with respect to the ground truth
for each data point:
least-squares fitting
for data point j: [yj’ - f(xj, w0, w1)]^2
How does least squares fitting work?
1) define a loss (objective) function that is the sum of the squared errors over all data points
2) find the best-fit model parameters by minimizing the loss function with respect to those two model parameters
(set the first derivatives to zero –> closed-form expressions for the best-fit w0 and w1)
least squares + linear model function: the resulting minimum of the loss function is GLOBAL (the combination makes the loss convex in the weights)
–> the model immediately learns the best possible solution
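a minimal numpy sketch of this closed-form fit (fit_line and the toy data are illustrative, not from the lecture):

```python
import numpy as np

def fit_line(x, y):
    # closed-form least-squares solution for f(x) = w1*x + w0
    x_mean, y_mean = x.mean(), y.mean()
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.1, 2.9])   # roughly y = x
w0, w1 = fit_line(x, y)
print(w0, w1)                        # close to 0 and 1
```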
When can linear functions be used as a classifier?
When the data is linearly separable (and only then!)
How do linear functions work as a classifier?
1) define decision boundary
f(x, w) = w · x = w0 + w1·x1 + w2·x2 (with a constant x0 = 1 absorbed into x)
such that class 1: f(x,w) ≥ 0
and class 0: f(x,w) < 0
2) we can define class assignments through a threshold function
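a minimal sketch of such a threshold classifier (the weights shown are just an example):

```python
def classify(x, w):
    # decision function f(x, w) = w0 + w1*x1 + w2*x2, thresholded at 0
    f = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1 if f >= 0 else 0

w = [-1.0, 1.0, 1.0]             # decision boundary: x1 + x2 = 1
print(classify([2.0, 0.5], w))   # 1 (above the boundary)
print(classify([0.2, 0.3], w))   # 0 (below the boundary)
```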
What is the perceptron learning rule?
weights are adjusted in the direction of the prediction error, w ← w + η (y’ − ŷ) x, where the step size η is called the LEARNING RATE.
by iteratively running this algorithm over your training data multiple times, the weights can be learned so that the model performs properly
–> solution is learned iteratively
–> does not imply that the model is learning something useful (eg dataset might not be suitable)
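a minimal numpy sketch of this iterative rule, assuming the common update w += lr * (y’ − ŷ) * x (function name and toy data are illustrative):

```python
import numpy as np

def perceptron_fit(X, y, lr=0.1, epochs=50):
    # run over the training data multiple times, nudging the weights by
    # lr * (error) * input; the update is zero for correctly classified points
    Xb = np.hstack([np.ones((len(X), 1)), X])   # fold the bias into the weights
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            y_hat = 1 if xi @ w >= 0 else 0
            w += lr * (yi - y_hat) * xi
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                      # linearly separable (logical AND)
print(perceptron_fit(X, y))
```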
Why can linear models often not be applied?
they have low predictive capacity
–> they can only be applied to data with an (approximately) linear structure
How can linear models be fit for more capacity?
the base function can be changed to a polynomial:
f(x) = w0 + w1·x + w2·x^2 + ... = Σi wi·x^i
the resulting regression problem is still linear in the weights to be found, therefore the same properties apply:
we can compute the parameters wi that minimize the loss with a closed-form expression; as before, this is guaranteed to be the best possible solution
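a minimal numpy sketch of this polynomial fit via a closed-form least-squares solve (fit_polynomial is an illustrative name):

```python
import numpy as np

def fit_polynomial(x, y, p):
    # design matrix with columns x^0, x^1, ..., x^p; still linear in the weights
    X = np.vander(x, p + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

x = np.linspace(-1, 1, 20)
y = 1 + 2 * x - 3 * x**2          # noiseless quadratic
print(fit_polynomial(x, y, p=2))  # ~ [1, 2, -3]
```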
What can be changed for a polynomial linear function model in order to get more capacity?
the polynomial degree p (which determines how many weights and powers of x we have; we have to find the best p)
–> when p is too low, the fit is overly rigid and looks almost constant; when p is too high, the model follows the noise and also fits poorly (underfitting vs. overfitting)
What does Occam’s Razor mean?
among models that explain the data equally well, the one with fewer parameters is to be preferred
What is the goal of any regression model?
to minimize the loss over the data, i.e. the prediction errors
What is a way to prevent a model from overfitting?
regularize the loss based on the learned weights
L’(x, w) = (1/N) · Σj L(xj, w) + regularization term
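a minimal numpy sketch of such a regularized loss, assuming mean squared error as the base loss and the squared L2-norm as the penalty (the function name is illustrative):

```python
import numpy as np

def regularized_loss(X, y, w, alpha):
    # L'(w) = mean squared error over the N samples + alpha * ||w||_2^2
    mse = np.mean((y - X @ w) ** 2)
    return mse + alpha * np.sum(w ** 2)
```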
What is the L2-norm?
||w||2 = sqrt(w · w); its square ||w||2^2 = w · w = Σi wi^2 is the form used for regularization
What happens when you add an L2 regularization term to a polynomial model?
the regularization term adds the squared magnitude of the weights to the loss function
–> the weights can no longer grow arbitrarily large just to drive the data loss down
with increasing alpha, all coefficients wi drop in magnitude, leading to smoother fits
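a minimal sketch using scikit-learn's Ridge on illustrative toy data, showing the coefficients shrinking as alpha grows:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.standard_normal(30)

X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)
for alpha in (1e-6, 1e-2, 1.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.abs(model.coef_).max())   # coefficient magnitudes drop as alpha grows
```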
What is L2 regularization also called?
ridge regression
What is L1 regularization also called?
LASSO regression
least absolute shrinkage and selection operator
What is the L1-norm?
||w||1 = Σi |wi|
What does a regularization term consist of?
the regularization parameter alpha times the L1- or L2-norm of the weights
alpha: controls the strength of the regularization
What happens when you add an L1 regularization term to a polynomial model?
while L2 regularization modulates all coefficients wi in the same way,
L1 regularization aims to set less meaningful coefficients to zero
–> performs feature selection
tries to bring as many of the coefficients to 0 as it can
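a minimal sketch using scikit-learn's Lasso on illustrative toy data, showing coefficients being driven to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 0.05 * rng.standard_normal(30)   # truly linear data

X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)
model = Lasso(alpha=0.01, max_iter=100_000).fit(X, y)
print(model.coef_)   # most higher-order coefficients end up at (or very near) 0
```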
How can we get closest to the minimum loss with a L2 norm in a 2D loss space spanned by w1 and w2?
for a fixed weight budget, the reachable weights lie on a circle ||w||2 = const; the solution is the point on that circle closest to the global minimum of the unregularized loss
(if the global minimum lies inside the circle, the constraint has no effect)
How can we get closest to the minimum loss with a L1 norm in a 2D loss space spanned by w1 and w2?
we can only reach points on a diamond |w1| + |w2| = const; its corners lie on the axes, so the closest point to the minimum often sets one weight exactly to zero
Which type of regularization should we use? L1 or L2?
- L2 regularization prevents the model from overfitting by modulating the impact of its input features in a homogeneous way
- L1 regularization prevents the model from overfitting by focusing on those features which seem to be most important
both can be deployed in any machine learning model that minimizes a loss function
What are pros for linear models? (4)
- easy to understand and implement; resource efficient even for large and sparse data sets
- least squares method always provides best-fit results if the data is appropriate
- good interpretability due to linear nature of the model
- easy to regularize