ML exam 1 Flashcards
(23 cards)
What is supervised learning?
to learn a model from labeled training data that allows us to make predictions about unseen or future data
Rosenblatt perceptron
- binary classification task
- positive class (1) vs negative class (-1)
- computes the net input as a dot product of the inputs and the weights
step function
1 if z >= theta
-1 otherwise
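A minimal NumPy sketch of this decision function (the names net_input and step, and passing theta as a parameter, are illustrative choices, not from the cards):

```python
import numpy as np

def net_input(x, w):
    # z is the linear combination (dot product) of the inputs and weights
    return np.dot(x, w)

def step(z, theta=0.0):
    # unit step function: predict class 1 if z >= theta, else -1
    return np.where(z >= theta, 1, -1)
```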
what does z equal
the net input, i.e. the linear combination of the inputs and weights: z = wᵀx = Σ_j w_j x_j
rosenblatt perceptron algorithm
- initialize the weights to 0 or small random numbers
- for each training sample x(i):
  a. compute ŷ(i), the output value
  b. update the weights
weight update rule
w_j := w_j + Δw_j
perceptron learning rule
Δw_j = η(y(i) − ŷ(i)) x_j(i), where η is the learning rate
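A short sketch of this rule in NumPy, assuming the threshold θ is 0 (e.g. folded into a bias weight); the function name perceptron_update and its parameter names are illustrative:

```python
import numpy as np

def perceptron_update(w, x, y, eta=0.01):
    """One weight update for a single training sample.

    w : weight vector, x : feature vector, y : true label (+1 or -1),
    eta : learning rate. Threshold theta is taken as 0 for brevity.
    """
    y_hat = 1 if np.dot(x, w) >= 0 else -1   # predicted class label
    delta_w = eta * (y - y_hat) * x           # Δw_j = η(y − ŷ) x_j
    return w + delta_w                        # w_j := w_j + Δw_j
```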
linear separability
a line (or hyperplane) can be drawn that separates the negative class from the positive class
convergence
convergence is guaranteed if the two classes are linearly separable and the learning rate is sufficiently small
if classes cannot be separated,
Set a maximum number of passes over the training dataset (epochs)
Set a threshold for the number of tolerated misclassifications
Otherwise, it will never stop updating the weights (never converge)
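A self-contained sketch of a training loop using both stopping criteria (function and parameter names are illustrative, and the threshold θ is again taken as 0):

```python
import numpy as np

def train_perceptron(X, y, eta=0.01, max_epochs=50, tol_errors=0):
    """Fit a perceptron with the two stopping criteria above.

    Stops after max_epochs passes over the data, or earlier if the number of
    misclassifications in an epoch drops to tol_errors.
    """
    w = np.zeros(X.shape[1])                       # initialize weights to 0
    for _ in range(max_epochs):                    # cap on passes over the data
        errors = 0
        for xi, yi in zip(X, y):                   # update after each sample
            y_hat = 1 if np.dot(xi, w) >= 0 else -1
            if y_hat != yi:
                w = w + eta * (yi - y_hat) * xi    # perceptron learning rule
                errors += 1
        if errors <= tol_errors:                   # tolerated misclassifications
            break
    return w
```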
diagram of Rosenblatt perceptron
see pic
Adaline
Weights updated based on a linear activation function
Remember that the perceptron used a unit step function
φ(z) is simply the identity function of the net input: φ(wᵀx) = wᵀx
Adaline diagram
see pic
adaline vs rosenblatt
In Adaline, the weight update is computed based on all samples in the training set
The perceptron, in contrast, updates the weights incrementally after each sample
The Adaline approach is known as “batch” gradient descent
cost function and equation
ML algorithms often define an objective function
This function is optimized during learning
It is often a cost function we want to minimize
Adaline uses a cost function J(·) to learn the weights, defined as the sum of squared errors (SSE) between the computed outputs and the true class labels:
J(w) = ½ Σ_i (y(i) − φ(z(i)))²
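A minimal sketch of this cost in NumPy (the name sse_cost is illustrative):

```python
import numpy as np

def sse_cost(w, X, y):
    """SSE cost J(w) for Adaline."""
    output = X.dot(w)                 # linear (identity) activation: φ(z) = z
    errors = y - output               # y(i) − φ(z(i)) for every sample
    return 0.5 * np.sum(errors ** 2)  # J(w) = ½ Σ_i (y(i) − φ(z(i)))²
```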
advantages of adaline cost function
The linear activation function is differentiable
Unlike the unit step function
Why derivatives?
We need to know how much each variable affects the output!
It is convex
Can use gradient descent to learn the weights
gradient descent
- finds a (local) minimum of a given function by repeatedly stepping in the direction opposite to the gradient
- more precisely, the gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction
gradient computation
To compute the gradient of the cost function, we need to compute the partial derivative of the cost function with respect to each weight w_j:
∂J/∂w_j = −Σ_i (y(i) − φ(z(i))) x_j(i)
We update all weights simultaneously, so the Adaline learning rule becomes
w := w + ∆w, with ∆w = −η∇J(w)
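A sketch of one such batch update in NumPy (the name adaline_batch_step and its parameters are illustrative):

```python
import numpy as np

def adaline_batch_step(w, X, y, eta=0.01):
    """One batch gradient-descent step for Adaline.

    The gradient is computed from all training samples at once, and every
    weight is updated simultaneously: w := w + Δw with Δw = −η∇J(w).
    """
    output = X.dot(w)                 # φ(z) = z, the identity activation
    errors = y - output               # y(i) − φ(z(i)) for all samples
    delta_w = eta * X.T.dot(errors)   # η Σ_i (y(i) − φ(z(i))) x_j(i) = −η ∂J/∂w_j
    return w + delta_w
```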
adaline vs rosenblatt
Looks (almost) identical. What is the difference?
φ(z(i)), with z(i) = wᵀx(i), is a real number
and not an integer class label as in the perceptron
The weight update is done based on all samples in the training set
The perceptron updates weights incrementally after each sample
This approach is known as “batch” gradient descent
if the learning rate is too high
the error becomes larger in each epoch because the updates overshoot the global minimum
if the learning rate is too low
takes too many epochs to converge
stochastic gradient descent
an optimization algorithm often used in machine learning applications to find the model parameters that correspond to the best fit between predicted and actual outputs; unlike batch gradient descent, it updates the weights incrementally after each training sample
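A sketch of one stochastic-gradient-descent pass for Adaline (the name adaline_sgd_epoch and its parameters are illustrative):

```python
import numpy as np

def adaline_sgd_epoch(w, X, y, eta=0.01):
    """One epoch of stochastic gradient descent for Adaline.

    In contrast to batch gradient descent, the weights are updated
    incrementally after each training sample.
    """
    for xi, yi in zip(X, y):
        output = np.dot(xi, w)        # linear activation for one sample
        error = yi - output
        w = w + eta * error * xi      # per-sample weight update
    return w
```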