Part 2: K nearest neighbors Flashcards

1
Q

Examples of classification tasks

A
  • Detecting abnormalities on screening mammograms.
  • Classifying credit card transactions as legitimate or suspicious.
  • Classifying email as spam or not spam.
  • Categorizing news stories and finance tweets as positive, negative, or neutral (+, -, 0).

2
Q

Classification

A

Given a set of records (the training set), find a model for the class attribute as a function of the values of the other attributes.

  • In the training set, each record contains a set of attributes; one of those attributes is the class.
  • A validation set is used to determine the accuracy of the model. Usually the given dataset is divided into a training set and a validation set: the training set is used to build the model and the validation set to validate it (a split sketch follows below).

Goal: previously unseen records should be assigned a class as accurately as possible.
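
A minimal sketch of such a split. The 80/20 ratio, the fixed seed, and the helper name train_validation_split are illustrative assumptions, not from the source:

```python
import random

def train_validation_split(records, train_fraction=0.8, seed=42):
    """Shuffle labeled records and split them into a training and a validation set."""
    rng = random.Random(seed)   # fixed seed -> reproducible split (an assumption)
    shuffled = records[:]       # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 10 (attributes, class) pairs -> 8 training, 2 validation.
data = [({"id": i}, "spam" if i % 2 else "not-spam") for i in range(10)]
train_set, validation_set = train_validation_split(data)
print(len(train_set), len(validation_set))  # 8 2
```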

3
Q

Nearest-Neighbor Classifiers

A

Requires three things:

  • A set of stored records.
  • A distance metric to compute the distance between records.
  • The value of k, the number of nearest neighbors to retrieve.

To classify an unknown record (see the sketch after this list):

  • Compute its distance to the training records.
  • Identify the k nearest neighbors.
  • Use the class labels of the nearest neighbors to determine the class label of the unknown record.
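
A minimal Python sketch of this procedure, assuming numeric attributes and Euclidean distance; the toy records, labels, and function name are invented for illustration:

```python
import math
from collections import Counter

def knn_classify(train_records, train_labels, unknown, k=3):
    """Classify one unknown record by majority vote among its k nearest neighbors."""
    # 1. Compute the distance from the unknown record to every training record.
    distances = [(math.dist(record, unknown), label)
                 for record, label in zip(train_records, train_labels)]
    # 2. Identify the k nearest neighbors.
    neighbors = sorted(distances)[:k]
    # 3. Use the neighbors' class labels to determine the unknown record's label.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

records = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
labels = ["-", "-", "+", "+"]
print(knn_classify(records, labels, (5.2, 4.8)))  # "+"
```
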
4
Q

K-nearest neighbors

A

The k-nearest neighbors of a record X are the k data points with the smallest distance to X. The predicted class depends on the number of neighbors used: in the standard example figure, the single nearest neighbor is negative (-), the 2 nearest neighbors are split one against one (a "neutral" tie), and the 3 nearest neighbors give a positive (+) majority.

To make sure you do not get such ties, use an odd k; in a two-class problem there is then always a majority (illustrated below).
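
A small illustration of the point above, using hypothetical neighbor labels already ordered by distance (chosen to mirror the -, tie, + example):

```python
from collections import Counter

# Hypothetical class labels of the neighbors of one record, nearest first.
neighbor_labels = ["-", "+", "+", "-"]

for k in (1, 2, 3):
    votes = Counter(neighbor_labels[:k]).most_common()
    tie = len(votes) > 1 and votes[0][1] == votes[1][1]
    print(k, "tie" if tie else votes[0][0])
# k=1 -> "-"; k=2 -> tie (one "-", one "+"); k=3 -> "+" (odd k guarantees a majority)
```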

5
Q

Distance measures

A
  • Euclidean distance: d(x, y) = sqrt(sum_i (x_i - y_i)^2).
  • Manhattan distance: d(x, y) = sum_i |x_i - y_i|.
    Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes (see the sketch below).
  • If an attribute has the same distribution in every class, it does not discriminate.
  • If an attribute has a different distribution per class, then it should count in the distance:
    + a small distance implies that the discriminating attributes are (nearly) equal, which implies the records are probably in the same class;
    + a larger distance implies that the discriminating attributes differ, which implies the records are probably in different classes.
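
A sketch of both measures plus a simple min-max scaling step. The age/income records are invented to show how an attribute with a large range (income) swamps the distance until the attributes are rescaled:

```python
import math

def euclidean(a, b):
    # d(a, b) = sqrt(sum_i (a_i - b_i)^2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # d(a, b) = sum_i |a_i - b_i|
    return sum(abs(x - y) for x, y in zip(a, b))

def min_max_scale(records):
    """Rescale every attribute to [0, 1] so no single attribute dominates."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(rec, lows, highs))
            for rec in records]

raw = [(25, 30000.0), (60, 90000.0), (30, 32000.0)]  # (age, income)
scaled = min_max_scale(raw)
print(euclidean(raw[0], raw[2]))        # ~2000: income alone decides
print(euclidean(scaled[0], scaled[2]))  # ~0.15: both attributes contribute
print(manhattan(scaled[0], scaled[2]))
```
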
6
Q

Choosing the value of k

A
  • If k is too small, the classifier is sensitive to noise points.
  • If k is too large, the neighborhood may include points from other classes.

7
Q

General method to find the weights and k

A

The weights can be found by gradient descent on a validation set: keep adjusting the weights until the error is minimized. Optimizing k is quite different: gradient descent does not apply because k is a discrete variable. Therefore, try many different values of k and look for the optimum; the validation error plotted against k is typically U-shaped, and the best k sits at the minimum (a sketch follows below).
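
A minimal sketch of that search, assuming a k-NN classifier over numeric records; the toy training and validation points are invented:

```python
import math
from collections import Counter

def knn_predict(train, unknown, k):
    """Majority vote among the k training records nearest to `unknown`."""
    dists = sorted((math.dist(rec, unknown), label) for rec, label in train)
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

def error_rate(train, validate, k):
    """Fraction of validation records the k-NN classifier mislabels."""
    wrong = sum(knn_predict(train, rec, k) != label for rec, label in validate)
    return wrong / len(validate)

train = [((0.0, 0.1), "-"), ((0.2, 0.0), "-"), ((0.1, 0.3), "-"),
         ((1.0, 1.1), "+"), ((0.9, 1.0), "+"), ((1.2, 0.8), "+")]
validate = [((0.1, 0.2), "-"), ((1.1, 0.9), "+"), ((0.4, 0.5), "-")]

# k is discrete, so sweep (odd) candidate values and keep the minimum of the
# typically U-shaped validation-error curve.
best_k = min((1, 3, 5), key=lambda k: error_rate(train, validate, k))
print(best_k, error_rate(train, validate, best_k))
```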

8
Q

Curse of dimensionality

A

In high dimensions (many attributes) everything is far apart: there are no points nearby unless a huge amount of data is available (exponential in the number of dimensions). The sketch below illustrates this.

Drawback of storing the training records and using them to predict the class labels of unseen cases: it takes a lot of storage and computing power.
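
A small experiment with invented uniform random data, suggesting how the contrast between the nearest and the average distance shrinks as the dimension grows:

```python
import math
import random

def nearest_vs_mean_ratio(dim, n_points=1000, seed=0):
    """Ratio of the nearest-neighbor distance to the mean distance from one
    query point to uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return min(dists) / (sum(dists) / len(dists))

for dim in (1, 10, 100, 1000):
    print(dim, round(nearest_vs_mean_ratio(dim), 3))
# The ratio climbs toward 1 with dimension: the "nearest" point is barely
# closer than an average point, so "nearby" loses its meaning.
```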

9
Q

Non-parametric (instance based) methods

A

Nearest neighbors

Naive Bayes

10
Q

Parametric (model based) methods

A

Linear regression
