Test 1 Flashcards
What does the model or hypothesis represent in a linear model?
A real-valued function from the instances to some target attribute
Each training instance can be represented as what?
A row vector x = <x1, x2, …, xk>
In the linear model’s equation each θj is a what?
A real-valued constant (a weight)
In the linear model’s equation hθ(x) is what?
The estimated value of y for instance x
What really determines the linear model’s function?
The values we choose for each of the weights
For any training instance x, the weighted sum hθ(x) is what?
The dot product of the weight vector and the instance vector (θ · x)
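A minimal sketch of this (NumPy assumed; the weights and instance are made up, with a leading 1 prepended to x so that θ0 acts as the intercept):

```python
import numpy as np

theta = np.array([0.5, 2.0, -1.0])   # hypothetical weights: theta_0 (bias), theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])        # instance <1, x1, x2>; the leading 1 pairs with theta_0

h = theta @ x                        # h_theta(x) = theta . x (the weighted sum)
print(h)                             # 0.5 + 2.0*3.0 + (-1.0)*4.0 = 2.5
```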
hθ(x) defines a k-dimensional what?
Hyperplane
What is the residual?
The difference between the actual and predicted values, y(i) − hθ(x(i)); the cost function sums the squared residuals (y(i) − hθ(x(i)))^2
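A short sketch of how the squared residuals feed the cost (NumPy assumed; the data are made up, and the 1/2 factor is a common convention that simplifies the gradient; your notes may omit it):

```python
import numpy as np

X = np.array([[1.0, 1.0],            # three instances, each with a leading 1 (bias)
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])        # actual target values
theta = np.array([1.0, 1.0])         # a candidate weight vector

residuals = y - X @ theta            # y(i) - h_theta(x(i)) for every instance
J = 0.5 * np.sum(residuals ** 2)     # cost = (half the) sum of squared residuals
print(residuals, J)                  # [0. 0. 1.] 0.5
```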
What surface is defined over the space of possible values for θ?
The error surface (the value of J(θ) at each point in weight space)
What does the gradient vector ∇J at a given point represent?
The direction of the greatest rate of increase in J at that point
What does the jth component of the gradient vector ∇J at a given point on the error surface represent?
The slope of the surface, at that point, in the jth dimension (∂J/∂θj)
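A sketch of computing the whole gradient at once (NumPy assumed, same made-up data as above; for the sum-of-squares cost, the jth component works out to Σi (hθ(x(i)) − y(i)) · xj(i)):

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta) = 0.5 * sum((y - X @ theta) ** 2)."""
    errors = X @ theta - y           # h_theta(x(i)) - y(i) for every instance
    return X.T @ errors              # one partial derivative dJ/dtheta_j per weight

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])
print(gradient(np.array([1.0, 1.0]), X, y))   # [-1. -3.]
```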
What is α?
A small real-valued constant (the learning rate)
If the gradient vector ∇J at a given point is 0, what does this mean?
A local minimum of J(θ) has been reached: no update can change θ any further, so gradient descent stops at this point
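A toy illustration of why descent halts there, using a made-up one-weight cost J(θ) = (θ − 2)^2: the update subtracts a multiple of the gradient, so a zero gradient leaves θ unchanged:

```python
import numpy as np

theta = np.array([2.0])              # the minimiser of J(theta) = (theta - 2)^2
grad = 2 * (theta - 2.0)             # dJ/dtheta = 2 * (theta - 2), which is 0 here
theta_next = theta - 0.1 * grad      # update rule: theta := theta - alpha * grad
print(grad, theta_next)              # [0.] [2.]  (no further movement)
```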
In the context of linear regression, the cost function J is what?
Convex
If J is convex what does this mean?
There is only one minimum and gradient descent can safely be used to find it
If the original function to be learned is not linear, will gradient descent work?
Not necessarily: there may be many local minima, and you are not guaranteed to find the global minimum.
What is Batch Gradient Descent?
All instances in the data set are examined before each update is made
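A minimal batch version (NumPy assumed; α, the iteration cap, and the data are all made up): the gradient is accumulated over every instance before θ is updated once:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # leading 1 = bias column
y = np.array([2.0, 3.0, 5.0])
theta = np.zeros(2)
alpha = 0.05                          # learning rate

for _ in range(5000):
    grad = X.T @ (X @ theta - y)      # uses ALL instances for one update
    theta = theta - alpha * grad      # theta_j := theta_j - alpha * dJ/dtheta_j
    if np.linalg.norm(grad) < 1e-10:  # gradient is (numerically) 0: minimum reached
        break

print(theta)                          # approx [0.333, 1.5] for this data
```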
What is Stochastic Gradient Descent?
A single randomly chosen instance (or a small random sample) is used for each update instead of the entire data set
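The stochastic variant as a sketch (same made-up data): each update looks at one randomly chosen instance, so updates are cheap but noisy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])
theta = np.zeros(2)
alpha = 0.05

for _ in range(20000):
    i = rng.integers(len(y))                  # one randomly chosen instance
    grad_i = (X[i] @ theta - y[i]) * X[i]     # gradient from that instance alone
    theta = theta - alpha * grad_i            # cheap, frequent, noisy updates

print(theta)   # hovers near [0.333, 1.5] but never settles exactly
```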
What are the benefits of stochastic gradient descent?
The error is reduced more quickly
What are the downsides of using stochastic gradient descent?
You may not reach the exact minimum, only an approximation of it
If a good value for α is chosen then J should what?
Decrease with each iteration.
If α is too large what may happen?
J might not converge: it may increase without bound or oscillate between points.
If α is too small what may happen?
Gradient descent might take a very long time to converge
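A quick way to see all three regimes (same made-up data; the α values are illustrative): track J after a fixed number of iterations for different learning rates:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])

def J_after(alpha, iters=50):
    theta = np.zeros(2)
    for _ in range(iters):
        theta = theta - alpha * (X.T @ (X @ theta - y))
    return 0.5 * np.sum((y - X @ theta) ** 2)

for alpha in (0.2, 0.05, 0.0001):    # too large, reasonable, too small
    print(alpha, J_after(alpha))     # blows up / near the minimum / barely moved
```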
What is typically used to scale the inputs?
The standard score x̄j = (xj − μj) / σj: subtract the input’s mean and divide by its standard deviation
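A sketch of standard-score scaling (NumPy assumed; the raw inputs are made up): each column is shifted and divided so every input has mean 0 and standard deviation 1:

```python
import numpy as np

X = np.array([[1.0, 200.0],          # raw inputs on very different scales
              [2.0, 300.0],
              [3.0, 700.0]])

mu = X.mean(axis=0)                  # per-input mean mu_j
sigma = X.std(axis=0)                # per-input standard deviation sigma_j
X_scaled = (X - mu) / sigma          # standard score: (x_j - mu_j) / sigma_j

print(X_scaled.mean(axis=0))         # ~[0, 0]
print(X_scaled.std(axis=0))          # [1, 1]
```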