Week 8: Support Vector Machines Flashcards

1
Q

Support Vector Machines (SVM’s)

A

They operate similarly to linear machines with margins. SVM's rely on preprocessing the data into a high-dimensional space using kernel functions. A model isn't trained by iterative updates; instead, the optimal weights are calculated by solving an optimisation problem. They're based on Linear Discriminant Functions and operate similarly to Neural Networks.

2
Q

Linear SVM’s

A

The data must be linearly separable, and exactly 2 classes are handled. The separating hyperplane is f(x) = w^Tx + w_0 = 0. The optimal margin is such that w^Tx + w_0 \ge 1 \; \forall x \in class 1 (+1) and w^Tx + w_0 \le -1 \; \forall x \in class 2 (-1).
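
A minimal sketch of fitting a linear SVM on toy 2-D data (scikit-learn and the data are assumptions, not part of the cards); the fitted coef_ and intercept_ play the roles of w and w_0:

  import numpy as np
  from sklearn.svm import SVC

  # Toy linearly separable data: class +1 around (2, 2), class -1 around (-2, -2)
  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
  y = np.hstack([np.ones(20), -np.ones(20)])

  clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard-margin case
  clf.fit(X, y)

  w, w0 = clf.coef_[0], clf.intercept_[0]   # hyperplane: w^T x + w_0 = 0
  print("w =", w, "w_0 =", w0)
  print("constraints hold:", np.all(y * (X @ w + w0) >= 1 - 1e-6))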

3
Q

Linearly Separable Case

A

When samples of different classes can be separated by linear functions.

4
Q

Margin of Hyperplane

A

The distance of a point from the hyperplane is z = f(x)/||w||. The exact margin can be calculated using a support vector and the hyperplane weights: since support vectors satisfy |f(x)| = 1, the full margin is 2/||w||.
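
A quick numerical illustration (numpy assumed; the weights are made up for the example): the signed distance is f(x)/||w||, and for a canonical hyperplane where support vectors satisfy |f(x)| = 1, the full margin is 2/||w||:

  import numpy as np

  # Hypothetical hyperplane (not from the cards): f(x) = w^T x + w_0
  w = np.array([2.0, 1.0])
  w0 = -1.0

  def distance_to_hyperplane(x, w, w0):
      """Signed distance z = f(x) / ||w||."""
      return (w @ x + w0) / np.linalg.norm(w)

  x_sv = np.array([1.0, 0.0])   # a point on the margin boundary: f(x_sv) = +1
  print("distance of support vector:", distance_to_hyperplane(x_sv, w, w0))  # 1/||w||
  print("full margin width:", 2 / np.linalg.norm(w))                         # 2/||w||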

5
Q

Lagrange Multiplier Method

A

An approach to solve for the weights of the separating hyperplane in the linearly separable case. \mathcal{L}(w, w_0, \lambda) = (1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1), with \lambda_i \ge 0, \forall i = 1,2,…,N

The following dual problem is set up:

\underset{\lambda \ge 0}{\max} \underset{w, w_0}{\min} ((1/2)||w||^2 - \sum_{i=1}^N \lambda_i (y_i (w^Tx_i + w_0) - 1)).

The following partial derivatives must be solved:

\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w} = 0 -> w = \sum_{i=1}^N \lambda_i y_i x_i

\frac{\partial \mathcal{L} (w, w_0, \lambda)}{\partial w_0} = 0 -> \sum_{i=1}^N \lambda_i y_i = 0
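
A small check of the stationarity conditions (scikit-learn assumed; its dual_coef_ attribute stores \lambda_i y_i for the support vectors): summing \lambda_i y_i x_i over the support vectors should reproduce the primal weights w, and \sum_i \lambda_i y_i should be 0:

  import numpy as np
  from sklearn.svm import SVC

  rng = np.random.default_rng(1)
  X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
  y = np.hstack([np.ones(20), -np.ones(20)])

  clf = SVC(kernel="linear", C=10.0).fit(X, y)

  # dual_coef_ holds lambda_i * y_i for the support vectors only, so
  # w = sum_i lambda_i y_i x_i  becomes  dual_coef_ @ support_vectors_
  w_from_dual = clf.dual_coef_ @ clf.support_vectors_
  print(np.allclose(w_from_dual, clf.coef_))        # True: matches the primal weights
  print(np.isclose(clf.dual_coef_.sum(), 0.0))      # True: sum_i lambda_i y_i = 0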

6
Q

Non-Separable Case

A

This is when the data isn’t linearly separable. Let’s assume there are 2 classes. There are 3 categories of input samples.

  • Samples that fall outside the band and are correctly classified. y_i(w^Tx_i + w_0) \ge 1
  • Samples that fall inside the band and are correctly classified. 0 \le y_i(w^Tx_i + w_0) < 1
  • Samples that are misclassified. y_i(w^Tx_i + w_0) < 0

Let’s introduce a slack variable \xi_i. Then y_i(w^Tx_i + w_0) \ge 1 - \xi_i

  • Samples that fall outside the band and are correctly classified. \xi_i=0
  • Samples that fall inside the band and are correctly classified. 0 < \xi_i \le 1
  • Samples that are misclassified. \xi_i > 1

For the second and third categories, the slack is obtained as \xi_i = 1 - y_i(w^Tx_i + w_0). We want to maximise the margin and minimise the number of margin violations (misclassified points and points inside the band). As such, \underset{w, w_0, \xi}{\min} J(w, \xi) = (1/2)||w||^2 + C \sum_{i=1}^N \xi_i, such that y_i(w^Tx_i + w_0) \ge 1 - \xi_i, i = 1,2,…,N, and \xi_i \ge 0, i = 1,2,…,N, where 0 \le C < +\infty is a pre-set constant scalar that controls the influence of the 2 competing terms.

The primal Lagrangian is thus \mathcal{L}(w, w_0, \xi, \lambda, \mu) = (1/2)||w||^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \mu_i \xi_i - \sum_{i=1}^N \lambda_i(y_i(w^Tx_i + w_0) - 1 + \xi_i), with \mu_i \ge 0, \lambda_i \ge 0. The dual problem is \underset{\lambda \ge 0, \mu \ge 0}{\max} \underset{w, w_0, \xi}{\min} \mathcal{L}(w, w_0, \xi, \lambda, \mu). Take the partial derivatives with respect to w, w_0, and \xi_i, set them to 0, and solve the resulting system of equations.

After doing all this, the dual problem reduces to \underset{\lambda}{\max} (\sum_{i=1}^N \lambda_i - (1/2)\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j x_i^T x_j), subject to \sum_{i=1}^N \lambda_i y_i = 0 and 0 \le \lambda_i \le C, i = 1,2,…,N.
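
A hedged sketch (scikit-learn and the toy data are assumptions) of how the constant C trades off the two terms: a small C tolerates more margin violations (larger \xi_i and a wider margin), a large C penalises them:

  import numpy as np
  from sklearn.svm import SVC

  rng = np.random.default_rng(2)
  # Overlapping classes: not linearly separable
  X = np.vstack([rng.normal(1, 1.5, (50, 2)), rng.normal(-1, 1.5, (50, 2))])
  y = np.hstack([np.ones(50), -np.ones(50)])

  for C in (0.01, 1.0, 100.0):
      clf = SVC(kernel="linear", C=C).fit(X, y)
      w = clf.coef_[0]
      f = clf.decision_function(X)
      violations = np.sum(y * f < 1)   # points with xi_i > 0 (inside the band or misclassified)
      print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.2f}, violations = {violations}")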

7
Q

Nonlinear SVM’s

A

We can transform the data using a feature mapping z_i = \Phi(x_i). The resulting transformation can alter the feature space so that the data becomes linearly separable. The separating hyperplane in the mapped space is \sum_{i \in SV's}\lambda_i y_i \Phi(x_i)^T \Phi(x) + w_0 = 0
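
A minimal sketch of an explicit degree-2 feature map that makes XOR-style data linearly separable (a standard illustration; the data, the map, and scikit-learn are assumptions, not taken from the cards):

  import numpy as np
  from sklearn.svm import SVC

  # XOR-style data: not linearly separable in the original 2-D space
  X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
  y = np.array([1, 1, -1, -1])

  def phi(X):
      """Degree-2 monomial map: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
      return np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

  Z = phi(X)                                       # z_i = Phi(x_i)
  clf = SVC(kernel="linear", C=1e3).fit(Z, y)      # a plain linear SVM in the mapped space
  print(clf.predict(phi(X)))                       # [ 1  1 -1 -1]: the XOR pattern is separated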

8
Q

Hard Classifier

A

f(x) = sgn(w^Tx + w_0). Note that w and x may live in a transformed feature space.

9
Q

Soft Classifier

A

f(x) = h(w^Tx + w_0), where h(\cdot) is a soft-limiting (e.g. piecewise-linear or sigmoid-like) function rather than the hard sign function. Note that w and x may live in a transformed feature space.

10
Q

Kernel-based Nonlinear SVM’s

A

Certain kernels that satisfy Mercer’s Theorem map the data into a high-dimensional feature space implicitly: K(x_i, x) = \Phi(x_i)^T \Phi(x), so inner products in the feature space are computed without ever forming \Phi explicitly. This reduces the amount of computation.
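
A quick numerical check (numpy assumed) of the identity K(x_i, x) = \Phi(x_i)^T \Phi(x) for the degree-2 map used above and the homogeneous polynomial kernel K(a, b) = (a^T b)^2:

  import numpy as np

  def phi(x):
      """Explicit degree-2 map for 2-D inputs."""
      return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

  def poly2_kernel(a, b):
      """Homogeneous polynomial kernel of degree 2: K(a, b) = (a^T b)^2."""
      return (a @ b) ** 2

  rng = np.random.default_rng(3)
  a, b = rng.normal(size=2), rng.normal(size=2)

  # The kernel reproduces the inner product in the mapped space
  # without ever forming Phi explicitly.
  print(np.isclose(poly2_kernel(a, b), phi(a) @ phi(b)))   # True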

11
Q

Linear Kernel

A

K(x_i, x) = x_i^Tx+c, with c as an optional constant

12
Q

Polynomial kernel of Degree q:

A

K(x_i, x) = (\alpha x_i^T x + c)^q, with q > 0

13
Q

Radial Basis Function (RBF)

A

K(x_i, x)= e^{-\frac{||x_i - x||^2}{2 \sigma^2}}
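
A small sketch of the RBF kernel in numpy, cross-checked against scikit-learn's rbf_kernel, which is parameterised by gamma = 1/(2\sigma^2) (scikit-learn is an assumption here):

  import numpy as np
  from sklearn.metrics.pairwise import rbf_kernel

  def rbf(x_i, x, sigma=1.0):
      """K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))."""
      return np.exp(-np.linalg.norm(x_i - x) ** 2 / (2 * sigma ** 2))

  rng = np.random.default_rng(4)
  x_i, x = rng.normal(size=3), rng.normal(size=3)
  sigma = 0.7

  ours = rbf(x_i, x, sigma)
  theirs = rbf_kernel(x_i.reshape(1, -1), x.reshape(1, -1), gamma=1 / (2 * sigma ** 2))[0, 0]
  print(np.isclose(ours, theirs))   # True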

14
Q

Multi-quadratic Kernel

A

K(x_i, x) = \sqrt{||x_i - x||^2 + c}

15
Q

Inverse Multi-quadratic Kernel

A

K(x_i, x) = \frac{1}{\sqrt{||x_i - x||^2 + c}}

16
Q

Power Kernel

A

K(x_i, x) = - ||x_i - x||^q

17
Q

Log Kernel

A

K(x_i, x) = - \log(||x_i - x||^q + 1)

18
Q

Sigmoid Function (Hyperbolic Tangent)

A

K(x_i, x) = \tanh(\beta x_i^T x + \gamma)

19
Q

Other Kernels

A

Examples include a linear combination of kernels or a product of kernels.

20
Q

Multi-class SVM Classifiers

A

These are built by combining 2-class classifiers so that more than 2 classes can be handled.

21
Q

One-against-one

A

A binary classifier is trained for each pair of classes, giving R(R-1)/2 classifiers in total. A point is classified by a majority vote, with all SVM's consulted (a sketch follows the pros and cons below).

Pros:
- Training is faster for each SVM since the training set is smaller compared with the whole dataset.
- Classes can be added or removed flexibly. When adding classes, existing SVM's are unaffected; only the SVM's involving the new classes need to be trained. When removing classes, the SVM's for the corresponding classes can simply be removed.

Cons:
- The number of SVM’s is larger compared with other approaches.
- Many SVM’s have to be evaluated to make the final decision.
- The class of some regions of the feature space can’t be determined.
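
A minimal sketch of the one-against-one scheme (scikit-learn and the toy 4-class data are assumptions): OneVsOneClassifier trains the R(R-1)/2 pairwise SVM's and classifies by majority vote:

  from sklearn.datasets import make_blobs
  from sklearn.multiclass import OneVsOneClassifier
  from sklearn.svm import SVC

  # Toy 4-class problem (R = 4), so R(R-1)/2 = 6 pairwise SVMs are trained
  X, y = make_blobs(n_samples=200, centers=4, random_state=0)

  ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
  print(len(ovo.estimators_))   # 6 pairwise classifiers
  print(ovo.predict(X[:5]))     # majority vote over all pairwise SVMs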

22
Q

One-against-all

A

For each class, an SVM is trained to separate that class from the rest of the data. There are a total of R SVM's, all of which are consulted (a sketch follows the pros and cons below).

Pros:
- The number of SVM’s to be trained is relatively few.
- Fewer SVM’s have to be evaluated to make the final decision.

Cons:
- The training time may be longer as the whole dataset is used for training each SVM.
- When adding/removing classes, all SVM’s have to be retrained.
- The class of some regions of the feature space can’t be determined.
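
A matching sketch for one-against-all (same assumptions as above): OneVsRestClassifier trains only R SVM's, one per class against the rest:

  from sklearn.datasets import make_blobs
  from sklearn.multiclass import OneVsRestClassifier
  from sklearn.svm import SVC

  X, y = make_blobs(n_samples=200, centers=4, random_state=0)

  ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
  print(len(ova.estimators_))   # 4 classifiers: one per class vs. the rest
  print(ova.predict(X[:5]))     # all R SVMs are consulted for each point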

23
Q

Binary Decision Tree

A

At each node of the binary tree, the remaining classes are split in half. This creates R-1 SVM's, and only \lceil \log_2 R \rceil SVM's are consulted per decision (a sketch follows the pros and cons below).

Pros:
- Number of SVM’s to be trained is relatively few.
- When adding classes, only some SVM’s need to be trained. The other existing SVM’s can be re-used.
- Number of SVM’s that need to be evaluated to make a decision is further reduced.
- Training time is slightly faster since not all SVM’s use the whole training dataset. Smaller training sets are used as the algorithm descends the decision tree.

Cons:
- When removing classes, all SVM’s need to be re-trained.
- The classification performance heavily depends on the SVM’s in the upper level. Classification error will propagate.
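
A hand-rolled sketch of the idea for R = 4 classes (the class split, the data, and scikit-learn are assumptions): the root SVM separates {0, 1} from {2, 3} and each branch gets its own SVM, so R-1 = 3 SVM's are trained but only \lceil \log_2 4 \rceil = 2 are consulted per sample:

  import numpy as np
  from sklearn.datasets import make_blobs
  from sklearn.svm import SVC

  X, y = make_blobs(n_samples=200, centers=4, random_state=0)

  # R - 1 = 3 SVMs: the root splits {0,1} vs {2,3}; each child splits its own pair
  root = SVC(kernel="linear").fit(X, np.isin(y, [2, 3]).astype(int))
  left = SVC(kernel="linear").fit(X[np.isin(y, [0, 1])], y[np.isin(y, [0, 1])])
  right = SVC(kernel="linear").fit(X[np.isin(y, [2, 3])], y[np.isin(y, [2, 3])])

  def predict_tree(x):
      x = x.reshape(1, -1)
      branch = right if root.predict(x)[0] == 1 else left   # only 2 SVMs consulted
      return branch.predict(x)[0]

  preds = np.array([predict_tree(x) for x in X])
  print("accuracy:", np.mean(preds == y))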

24
Q

Binary Code

A

This method splits the total collection of classes differently for each SVM. A total of \lceil \log_2 R \rceil SVM's are trained, and all of them are consulted. Each class corresponds to a different combination of SVM outputs (a sketch follows the pros and cons below).

Pros:
- Number of SVM’s to be trained is further reduced.
- Number of SVM’s to be evaluated when making a decision is further reduced.

Cons:
- It may lead to undefined classes if the number of classes isn’t exactly 2^{Number of SVM’s}.
- Training time may be longer as the whole dataset is used for training for each SVM.
- When adding/removing classes, all SVM’s have to be re-trained.
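
A small sketch of the binary-code idea for R = 4 classes (the code assignment, the data, and scikit-learn are assumptions): each class gets a distinct 2-bit code, one SVM is trained per bit, and the decoded bit pattern identifies the class:

  import numpy as np
  from sklearn.datasets import make_blobs
  from sklearn.svm import SVC

  X, y = make_blobs(n_samples=200, centers=4, random_state=0)

  # ceil(log2 4) = 2 SVMs; class r is assigned the 2-bit binary code of r
  n_bits = 2
  svms = [SVC(kernel="linear").fit(X, (y >> b) & 1) for b in range(n_bits)]

  def predict_code(x):
      x = x.reshape(1, -1)
      bits = [int(s.predict(x)[0]) for s in svms]         # all SVMs are consulted
      return sum(bit << b for b, bit in enumerate(bits))  # decode back to a class label

  preds = np.array([predict_code(x) for x in X])
  print("accuracy:", np.mean(preds == y))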