Data Mining - Chapter 7 (k-Nearest Neighbors) Flashcards

1
Q

What is k-Nearest Neighbor?

A

It is an algorithm that can be used for classification of a categorical outcome or prediction of a numerical outcome.

-> It works by finding the records in the training data that are most similar to the new record and assigning the new record to their group.
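
A minimal sketch of this idea, assuming scikit-learn is available (the data and class labels below are made up for illustration):

  # Classify a new record by looking at its k most similar training records.
  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  # Toy training data: two numerical predictors per record, a categorical outcome.
  train_X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
  train_y = np.array(["A", "A", "B", "B"])

  knn = KNeighborsClassifier(n_neighbors=3)
  knn.fit(train_X, train_y)

  new_record = np.array([[1.2, 1.9]])
  print(knn.predict(new_record))  # -> ['A'], the group of its nearest neighbors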

2
Q

What is voting in k-nearest neighbor?

A

Given the k nearest neighbours of a new record, you look at their categories. The new record is assigned to the category held by the majority of those neighbours.
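
A minimal sketch of the voting step itself, assuming the classes of the k neighbours have already been found (the labels are made up):

  from collections import Counter

  # Classes of the k = 3 nearest neighbours of the new record.
  neighbor_classes = ["owner", "owner", "nonowner"]

  votes = Counter(neighbor_classes)
  print(votes.most_common(1)[0][0])  # -> 'owner', the majority class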

3
Q

What kind of method is the k-nearest neighbor?

A

It is a nonparametric method. That means it does not make any assumptions about the form of the relationship between the dependent variable and the predictor variables.
-> It looks at similarities between the predictor values of the records in the dataset.

4
Q

What is the process of the k-Nearest Neighbor?

A
  1. Determine the record’s neighbors (how close each training record is to it).
  2. Choose the number of neighbors you want to include - choose the k.
  3. Compute the classification from those neighbors (see the sketch below).
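
A from-scratch sketch of these three steps, assuming numerical predictors and Euclidean distance (toy data for illustration):

  import numpy as np

  train_X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
  train_y = np.array(["A", "A", "B", "B"])
  new_record = np.array([1.2, 1.9])

  # 1. Determine the record's neighbors: distance to every training record.
  distances = np.sqrt(((train_X - new_record) ** 2).sum(axis=1))

  # 2. Choose k and take the k closest training records.
  k = 3
  nearest = np.argsort(distances)[:k]

  # 3. Compute the classification by majority vote among those neighbors.
  classes, counts = np.unique(train_y[nearest], return_counts=True)
  print(classes[np.argmax(counts)])  # -> 'A'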
5
Q

How do you determine a record’s neighbors?

A

It is based on similarity / closeness between records.

This is often called the distance between records.

6
Q

How do you denote the distance between records?

A

Record i = ri
Record j = rj
Distance between the two records: dij

-> So you measure the distance between two records (rows). Each record can contain multiple predictor variables, and the distance is computed across all of them.

7
Q

What are the properties of distances?

A

P1: Non-negativity: dij >= 0

P2: Self-proximity: dii = 0

P3: Symmetry: dij = dji

P4: Triangle inequality: dij <= dik + dkj
(the distance between any pair cannot exceed the sum of the distances between the other two pairs)

8
Q

What is the Euclidean distance and how does it work?

A

Most popular distance measure for numerical values.

dij = √((Xi1 - Xj1)^2 + (Xi2 - Xj2)^2 + … + (Xip - Xjp)^2)
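
A small sketch of this formula with made-up predictor values for two records:

  import numpy as np

  record_i = np.array([3.0, 120.0, 1.5])  # p = 3 predictor values (illustrative)
  record_j = np.array([5.0, 100.0, 2.5])

  # Square the differences per predictor, sum them, take the square root.
  d_ij = np.sqrt(((record_i - record_j) ** 2).sum())
  print(d_ij)  # same result as np.linalg.norm(record_i - record_j)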

9
Q

What are some remarks on the Euclidean distance?

A
  • Highly scale dependent
    The units of one variable can have a huge influence on the results.
    -> Therefore it is often necessary to standardize the values first (see the sketch below).
  • Sensitive to outliers.
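
A minimal sketch of standardizing before computing distances, assuming scikit-learn's StandardScaler (the values are made up):

  import numpy as np
  from sklearn.preprocessing import StandardScaler

  # One small-scale and one large-scale predictor; without standardization
  # the second column would dominate the Euclidean distance.
  train_X = np.array([[3.0, 120000.0],
                      [5.0, 100000.0],
                      [4.0, 150000.0]])

  scaler = StandardScaler().fit(train_X)    # learn mean and std from the training data
  train_X_std = scaler.transform(train_X)   # each column now has mean 0 and std 1
  new_record_std = scaler.transform([[4.5, 110000.0]])  # scale new records the same way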
10
Q

What is the Manhattan distance?

A

A more robust distance than the Euclidean distance. It looks at the absolute differences instead of the squared differences.

dij = |Xi1 - Xj1| + |Xi2 - Xj2| +…+ |Xip - Xjp|
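
The same two made-up records as in the Euclidean sketch, but using absolute differences:

  import numpy as np

  record_i = np.array([3.0, 120.0, 1.5])
  record_j = np.array([5.0, 100.0, 2.5])

  d_manhattan = np.abs(record_i - record_j).sum()
  print(d_manhattan)  # 2 + 20 + 1 = 23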

11
Q

What happens if K is too low?

A

We may be fitting to the noise (random variation) in the dataset. This can lead to overfitting.

12
Q

What happens if K is too high?

A

You miss out on capturing the local structure of the data -> you average over too many records, which reduces the meaning of the classifications.

13
Q

What happens if K = n ?

A

We assign every record to the majority class of the training data -> oversmoothing.

14
Q

How do you decide on the value of k in a Python model?

A
  • Build a k-nearest neighbors model.
  • Build one for every k you want to try.
  • For every model, compute the accuracy score: accuracy_score(test_y, model.predict(test_X)). Pick the k with the best accuracy (see the sketch below).
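
A sketch of that loop, assuming scikit-learn; the iris data is used only as a stand-in for a real training/test split:

  from sklearn.datasets import load_iris
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  X, y = load_iris(return_X_y=True)
  train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.4, random_state=1)

  # Fit one model per candidate k and score each on the held-out data.
  results = {}
  for k in range(1, 15):
      model = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
      results[k] = accuracy_score(test_y, model.predict(test_X))

  best_k = max(results, key=results.get)
  print(best_k, results[best_k])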
15
Q

What changes if you use K nearest neighbors for numerical outcomes?

A

Everything stays the same, except you do not determine the class through majority voting but take the average outcome value of the k nearest neighbors (see the sketch below).
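
A minimal sketch, assuming scikit-learn's KNeighborsRegressor (toy data):

  import numpy as np
  from sklearn.neighbors import KNeighborsRegressor

  train_X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
  train_y = np.array([100.0, 110.0, 120.0, 300.0, 310.0])

  knn = KNeighborsRegressor(n_neighbors=3).fit(train_X, train_y)
  # Prediction = average outcome of the 3 nearest neighbors (100, 110, 120).
  print(knn.predict([[2.5]]))  # -> [110.]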

16
Q

What are the benefits of k-nearest neighbors?

A
  • Simplicity of the model
  • There are no parametric assumptions
  • It does well with large training sets, especially when each class is characterized by multiple combinations of predictor values
17
Q

What are the disadvantages of k-nearest neighbors?

A
  • Computing the nearest neighbors can be time consuming
  • It is a lazy learner - all the work happens at prediction time, when the new record is compared to the training set; no model is built in advance
  • Curse of dimensionality - the number of records required in the training set grows exponentially with the number of predictors
18
Q

How can we reduce the computation time of the nearest neighbors?

A
  1. Work with fewer dimensions, using dimension reduction techniques such as principal components analysis
  2. Speed up the identification of the nearest neighbors using specialized data structures such as trees (see the sketch below)
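
A sketch of both ideas combined, assuming scikit-learn (PCA for fewer dimensions, a KD-tree for faster neighbor search; iris data as a stand-in):

  from sklearn.datasets import load_iris
  from sklearn.decomposition import PCA
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.pipeline import make_pipeline

  X, y = load_iris(return_X_y=True)

  knn = make_pipeline(
      PCA(n_components=2),                                        # 1. fewer dimensions
      KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree"),   # 2. tree-based search
  )
  knn.fit(X, y)
  print(knn.predict(X[:3]))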
19
Q

How can we deal with the curse of dimensionality?

A
  • Reduce the number of predictor variables.