KNN Flashcards
(6 cards)
What is K-Nearest Neighbors (KNN)?
A supervised machine learning algorithm.
Used for both classification and regression.
A non-parametric, lazy learning algorithm (memorizes training data instead of learning a discriminative function).
Principle: A new data point is classified/predicted based on the majority class or average value of its K closest neighbors in the training data.
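Since KNN handles both tasks, a minimal usage sketch (assuming scikit-learn is available; the toy data below is invented purely for illustration):

```python
# Minimal sketch: the same neighbor idea used for classification and regression.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])  # one feature, six points
y_class = np.array([0, 0, 0, 1, 1, 1])                       # class labels
y_reg = np.array([1.1, 1.9, 3.2, 9.8, 11.1, 12.2])           # continuous targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

print(clf.predict([[2.5]]))  # majority class of the 3 nearest neighbors -> 0
print(reg.predict([[2.5]]))  # average of their targets -> ~2.07
```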
Basic steps to classify/predict a new data point?
Choose K: Select the number of nearest neighbors (K) to consider.
Calculate Distances: Calculate the distance between the new data point and ALL points in the training dataset. (Common: Euclidean distance).
Find K Neighbors: Identify the K training data points that are closest (have the smallest distances) to the new data point.
Make a Prediction:
For Classification: Assign the new data point to the majority class among its K nearest neighbors.
For Regression: Predict the value for the new data point as the average (or weighted average) of the target values of its K nearest neighbors.
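To make the steps concrete, a from-scratch sketch in Python/NumPy (the function name and toy data are my own, not from the notes):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 4 (classification): majority class among the k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> "A"
```

For regression, the vote in the last step would be replaced by y_train[nearest].mean().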
Why is the choice of K important?
Small K:
More sensitive to noise and outliers.
Can lead to a more flexible (less smooth) decision boundary.
Higher variance, lower bias.
Large K:
Smoother decision boundary.
More robust to noise.
Can oversmooth and miss local patterns.
Higher bias, lower variance.
K is often chosen via cross-validation. Odd K is preferred for binary classification to avoid ties.
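One common selection procedure, sketched with scikit-learn's cross_val_score on the built-in iris data (the candidate range of odd K values is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each odd K with 5-fold cross-validation and keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```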
KNN: What are common distance metrics used?
Euclidean Distance: Most common for continuous variables. sqrt(Σ(x_i - y_i)^2). (Notes pg 3 implies Euclidean for KNN example).
Manhattan Distance: Σ|x_i - y_i|.
Hamming Distance: For categorical variables (counts positions where attributes differ).
Data normalization/scaling is often important before applying KNN, especially when features are on different scales.
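A small sketch of the three metrics in plain NumPy (the example vectors are arbitrary):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def hamming(x, y):
    # Count of positions where the (categorical) attributes differ.
    return np.sum(x != y)

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(hamming(np.array(["red", "S", "cat"]), np.array(["red", "M", "dog"])))  # 2
```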
KNN: Main advantages and disadvantages?
Pros:
Simple to understand and implement.
No explicit training phase (lazy learner); the model is simply the stored training data.
Can capture complex, non-linear decision boundaries.
Adapts easily to new data: new training points can be added without retraining.
Cons:
Computationally expensive at prediction time (must compute the distance to every training point).
Performance degrades with high-dimensional data (“curse of dimensionality”).
Sensitive to irrelevant features and the scale of data.
Requires careful choice of K.
KNN: Why might normalization be needed before applying KNN?
If features are on different scales, features with larger values can dominate the distance calculation.
Normalization (e.g., scaling to 0-1 range) ensures all features contribute more equally to the distance.
(Notes pg 3, “Data Preparation”)
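For example, min-max scaling with scikit-learn's MinMaxScaler (the feature values below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: income (in currency units) and age (years).
X_train = np.array([[30_000, 25], [60_000, 40], [90_000, 55]], dtype=float)
X_new = np.array([[45_000, 30]], dtype=float)

scaler = MinMaxScaler()                        # scales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)
X_new_scaled = scaler.transform(X_new)         # reuse the training-set min/max

print(X_train_scaled)
print(X_new_scaled)  # [[0.25, 0.167]] -- income no longer dominates the distance
```

The scaler is fitted on the training data only, so new points are mapped with the same minimum and maximum.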