Nearest neighbor Flashcards
What are eager and lazy learners?
Eager learners are models that learn as soon as the training data becomes available. Once trained, the model can be used to make predictions.
Lazy learners postpone learning until a prediction actually has to be made.
Name an example of a lazy learner
Rote classifier.
The rote classifier memorizes the entire training set. It classifies a test instance only if the instance matches a training instance perfectly on all attributes.
The obvious drawback is that many instances are never classified, because no training instance matches them exactly.
We can make this more flexible by finding all training instances that are RELATIVELY similar to the test instance's attributes. This is the idea behind the "nearest neighbor classifier".
What is the intuition behind nearest neighbor classifier?
“if it walks like a duck and quacks like a duck, it is probably a duck”.
Elaborate on nearest neighbor classifier
The nearest neighbor classifier represents each example as a data point in a d-dimensional space, where d is the number of attributes.
It then uses a proximity measure to compute the distance between a test instance (the instance to predict) and each instance in the training set.
The k-nearest neighbors refers to the k nearest training instances relative to the test instance.
We use "k" to decide how many neighbors the prediction is based on. If k=1, we simply assign the class label of the single closest training instance.
If k>1, we assign the majority label among the k nearest neighbors to the test instance.
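The majority-vote procedure above can be sketched in a few lines. This is a minimal illustration with made-up toy data (the "duck"/"goose" labels are hypothetical), using Euclidean distance as the proximity measure:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify x_test by majority vote among its k nearest training points."""
    # Euclidean distance from the test point to every training point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # → duck
```

With k=1 the same function returns the label of the single closest instance, which is exactly the k=1 case described above.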
What happens if “k” is too small, say 1?
We are likely to overfit: with k=1, a single noisy training point can decide the label, so the model captures a lot of noise.
What happens if “k” is too large?
The neighborhood may span several distinct class "areas", so the vote mixes unrelated regions and we get bad results.
KNN is very sensitive to the choice of k. How can we balance it out a little?
We can weight each neighbor's vote by its distance:
w_i = 1 / d(x', x_i)^2
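The weighting scheme above can be sketched as follows. This is a minimal illustration on hypothetical 1-D data, chosen so that a plain k=3 majority vote would pick "b", but the much closer "a" wins once votes are weighted by 1/d^2:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=3, eps=1e-9):
    """Weighted k-NN: each neighbor votes with weight 1 / d(x', x_i)^2."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + eps)  # eps guards against division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[0.0], [1.0], [1.2]])
y_train = np.array(["a", "b", "b"])
# Unweighted majority over all 3 neighbors would say "b";
# the closer "a" (weight 1/0.4^2 = 6.25) outvotes the two "b"s (~4.34 combined).
print(weighted_knn_predict(X_train, y_train, np.array([0.4]), k=3))  # → a
```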
What is especially important to implement using KNN?
A proximity measure
Elaborate on performance in regards to time
KNN is generally quite expensive at prediction time, as we have to compute the distance to every training instance for each test instance.
In comparison, decision trees spend the bulk of their work in the training phase. Once built, they classify extremely efficiently (roughly linear in tree depth; it doesn't get much better).
Why are KNN more prone to noise than decision trees?
KNN builds local models, while decision trees build a global model. The global nature of decision trees smooths out noise, especially with early stopping.
When we use KNN with a relatively low value for k, we are looking at potentially VERY small areas, which can contain a lot of noise.
Name an advantage of KNN in comparison to decision trees
KNN can split the space more flexibly. Recall that decision trees are generally restricted to rectilinear (axis-parallel) decision boundaries.
KNN can produce circular or more complex boundaries.
So, the most important advantage of KNN is that it can handle complex shapes, like spirals, that decision trees can only dream of managing.
KNN also has an edge over trees when it comes to representing relationships among attributes.
Lastly, what is important to consider when using KNN?
It is a proximity-based procedure. Therefore, we have to be careful about attribute scales. We should normalize the data.
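One common way to do this is min-max normalization, which rescales every attribute to [0, 1]. A minimal sketch on made-up data (the age/income columns are hypothetical):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each attribute (column) of X to the [0, 1] range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# Income in dollars dwarfs age in years; without scaling,
# Euclidean distance would be dominated almost entirely by income.
X = np.array([[25, 40_000.0], [50, 80_000.0], [35, 60_000.0]])
print(min_max_normalize(X))
```

After this step, each attribute contributes on a comparable scale to the proximity measure.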
What is “precision”?
Precision = TruePositives / (TruePositives+FalsePositives)
So, precision says "out of all the ones you predicted to be positive, how many are actually positive?". Note that precision is not about overall correctness; it is the number of true positives relative to the number of instances the model predicted as positive.
Precision tells us more than accuracy in some cases. Accuracy struggles on unbalanced data sets; precision holds up better on such sets.
Example:
The model predicted that 10 instances were positive. However, from the test set we know that only 7 of them were actually positive. That gives a precision of 7/10 = 0.7.
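The worked example above maps directly onto the formula: 10 predicted positives with 7 correct means 7 true positives and 3 false positives. As a quick sketch:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

# 10 predicted positive, 7 actually positive → 3 false positives
print(precision(tp=7, fp=3))  # → 0.7
```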
What is “recall”?
Recall is defined as “the number of true positives divided by the sum of true positives and false negatives”.
Recall = TruePositives / (TruePositives + FalseNegatives)
Recall thus tells us how many of the actually positive instances the model managed to find.
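The recall formula can be sketched the same way; the counts here are made up for illustration:

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

# 8 actual positives found, 2 actual positives missed
print(recall(tp=8, fn=2))  # → 0.8
```

Note the contrast with precision: the denominator here is the number of *actual* positives, not the number of *predicted* positives.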
What is F-measure?