Dual SVMs Flashcards
(15 cards)
What is the Lagrange relaxation for SVMs?
min w,b {1/2 ∥w∥^2 + Σ a(n)(1 − y(n)(wTϕ(x(n)) + b))}
Subject to: a(n) ≥ 0, ∀(x(n), y(n)) ∈ 𝒯
What is the primal formulation for SVMs?
min w,b {1/2 ∥w∥^2}
Subject to: y(n)(wTϕ(x(n)) + b) ≥ 1
∀(x(n), y(n)) ∈ 𝒯
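A minimal sketch of this primal QP in code, assuming cvxpy is available, using the identity feature map ϕ(x) = x and a tiny hand-made separable dataset (the data and variable names are illustrative, not from the source):

```python
import cvxpy as cp
import numpy as np

# Tiny, linearly separable toy data (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# min 1/2 ||w||^2  subject to  y(n)(w^T x(n) + b) >= 1 for all n.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)
```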
What is the dual formulation for SVMs?
max a min w,b {1/2 ∥w∥^2 + Σ a(n)(1 − y(n)(wTϕ(x(n)) + b))}
Subject to: a(n) ≥ 0, ∀(x(n), y(n)) ∈ 𝒯
What is the minimax primal formulation for SVMs?
min w,b max a {1/2 ∥w∥^2 + Σ a(n)(1 − y(n)(wTϕ(x(n)) + b))}
Subject to: a(n) ≥ 0, ∀(x(n), y(n)) ∈ 𝒯
How do we get from a primal formulation to a dual formulation?
- Rewrite the constraints into KKT standard form (condition ≤ 0), i.e. 1 − y(n)(wTϕ(x(n)) + b) ≤ 0
- Add a Lagrange multiplier a(n) ≥ 0 for each constraint to form the Lagrangian
- Wrap the Lagrangian in a max over the multipliers and a min over the original variables: max a min w,b (written out below)
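Written out as one chain (a LaTeX restatement of the cards above, nothing new assumed):

```latex
% Primal in KKT standard form:
\min_{w,b}\ \tfrac{1}{2}\|w\|^2
  \quad \text{s.t.} \quad 1 - y^{(n)}\!\left(w^\top\phi(x^{(n)}) + b\right) \le 0
% One multiplier a^{(n)} \ge 0 per constraint gives the Lagrangian:
L(w,b,a) = \tfrac{1}{2}\|w\|^2
  + \sum_n a^{(n)}\!\left(1 - y^{(n)}(w^\top\phi(x^{(n)}) + b)\right)
% Swapping min and max turns the minimax primal into the dual:
\max_{a \ge 0}\ \min_{w,b}\ L(w,b,a)
```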
How do we remove w and b from the dual formulation? What are the equations?
Once a is fixed, the inner minimisation over w and b is unconstrained, so at the optimum ∇wL = 0.
=> w − Σ a(n)y(n)ϕ(x(n)) = 0
=> w = Σ a(n)y(n)ϕ(x(n))
Since the minimisation is also w.r.t. b, setting ∂L/∂b = 0 gives:
=> Σ a(n)y(n) = 0
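Substituting these two results back into the Lagrangian eliminates w and b (the b term vanishes because Σ a(n)y(n) = 0), which yields the objective in the next card:

```latex
L(a) = \sum_n a^{(n)}
  - \tfrac{1}{2} \sum_n \sum_m a^{(n)} a^{(m)} y^{(n)} y^{(m)}
      \,\phi(x^{(n)})^\top \phi(x^{(m)})
```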
What is the dual formulation for SVMs after removing w and b?
argmax a L(a) = Σ a(n) − 1/2 Σ Σ a(n)a(m)y(n)y(m)k(x(n), x(m))
Subject to: a(n) ≥ 0, ∀n ∈ {1,⋯,N} and Σ a(n)y(n) = 0
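A minimal sketch of this dual QP, again assuming cvxpy and the same illustrative toy data as the primal sketch (psd_wrap just asserts that the Gram-based matrix is positive semidefinite so the solver accepts it):

```python
import cvxpy as cp
import numpy as np

# Same illustrative toy data as the primal sketch above.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = (1.0 + X @ X.T) ** 2           # polynomial kernel Gram matrix, p = 2
Q = np.outer(y, y) * K             # Q[n, m] = y(n) y(m) k(x(n), x(m))

a = cp.Variable(len(y))
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, cp.psd_wrap(Q)))
constraints = [a >= 0, cp.sum(cp.multiply(a, y)) == 0]
cp.Problem(objective, constraints).solve()

support = a.value > 1e-6           # a(n) > 0 marks the support vectors
```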
What is the formula for the kernel trick?
k(x, z) = ϕ(x)Tϕ(z)
For example, the polynomial kernel: k(x, z) = (1 + xTz)^p, where p is the polynomial order.
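A quick numerical check of the equality for p = 2 in two dimensions, where the explicit feature map has six basis functions (a sketch; this is the standard feature map for that kernel):

```python
import numpy as np

def poly_kernel(x, z, p=2):
    # Kernel trick: the feature-space inner product, computed in input space.
    return (1.0 + x @ z) ** p

def phi(x):
    # Explicit feature map matching (1 + x^T z)^2 for 2-D inputs.
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

x, z = np.array([0.5, -1.0]), np.array([2.0, 3.0])
assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))  # same value, no ϕ needed
```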
Why do we use the kernel trick?
Because the kernel computes the inner product of the transformed feature vectors directly from the original inputs, so we never have to evaluate the (possibly very high-dimensional) basis functions ϕ explicitly.
Whats the hypothesis for dual SVMs? (Hint: uses the kernel trick)
h(x) = Σ a(n)y(n)k(x, x(n)) + b
(summing over n ∈ S, the support vectors, only)
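A minimal sketch of this hypothesis, assuming the support-vector arrays come from a solver such as the dual sketch above (names are illustrative):

```python
import numpy as np

def poly_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p

def h(x, sv_x, sv_y, sv_a, b, kernel=poly_kernel):
    # h(x) = sum over n in S of a(n) y(n) k(x, x(n)) + b
    return sum(a_n * y_n * kernel(x, x_n)
               for x_n, y_n, a_n in zip(sv_x, sv_y, sv_a)) + b

# The predicted class is sign(h(x)).
```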
What is the value of a(n) when the training example is a support vector?
a(n) ≥ 0 always; when a(n) > 0, the training example is a support vector.
KKT complementary slackness states a(n)(1 − y(n)(wTϕ(x(n)) + b)) = 0, so when a(n) > 0 the second factor must equal 0. This rearranges to y(n)h(x(n)) = 1, meaning the support vector lies exactly on the margin.
Why can we store only the support vectors when making predictions?
Because examples that aren't support vectors have a Lagrange multiplier a(n) = 0, so their terms contribute nothing to the sum in h(x).
How can we calculate the value of b and what is the final formula?
For a given support vector n, y(n)h(x(n)) = 1. We can then take the steps:
- multiply both sides by y(n)
- substitute y(n)^2 = 1 to get h(x(n)) = y(n)
- expand the hypothesis: y(n) = Σ a(m)y(m)k(x(n), x(m)) + b
- rearrange for b
=> b = y(n) − Σ a(m)y(m)k(x(n), x(m))
where m ∈ S
What is the equation for the average b over all support vectors?
b = (1 / Ns) Σ (y(n) − Σ a(m)y(m)k(x(n), x(m)))
where n, m ∈ S and Ns is the number of support vectors
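A minimal sketch of this average, reusing the illustrative support-vector names from the hypothesis sketch above:

```python
import numpy as np

def average_b(sv_x, sv_y, sv_a, kernel):
    # b = (1 / Ns) * sum over n in S of
    #     ( y(n) - sum over m in S of a(m) y(m) k(x(n), x(m)) )
    return np.mean([y_n - sum(a_m * y_m * kernel(x_n, x_m)
                              for x_m, y_m, a_m in zip(sv_x, sv_y, sv_a))
                    for x_n, y_n in zip(sv_x, sv_y)])
```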
What is the benefit of getting an average b?
We get a more numerically stable solution than relying on a single support vector, whose multiplier may carry solver error.