L3 Flashcards
What type of learning does KNN fall under?
Supervised Learning
- learning algorithm provided with input-output pairs
KNN uses input-output pairs for classification and regression tasks.
What are the two main tasks of Supervised Learning?
- Classification: Output discrete, e.g. class labels.
- Regression: Output continuous, e.g. predicting a value like salary or temperature.
What does KNN stand for?
K-Nearest Neighbors
What is a defining characteristic of KNN as a learning algorithm?
Non-parametric, instance-based, lazy learning algorithm.
What does ‘non-parametric’ mean in the context of KNN?
Makes no assumptions about the form of the mapping function.
What does ‘lazy learning’ imply in KNN?
No explicit training phase; algorithm stores the entire dataset and computes output only when a query is made.
How does KNN classify or predict the output for a new data point?
- Compute distance between the new point and all points in the training dataset.
- Identify the k nearest training examples
- Use these neighbors to determine the output (see the sketch below):
- Classification: Take the majority vote
- Regression: Take the mean (or sometimes median) of the target values.
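A minimal NumPy sketch of this procedure (function and variable names are my own illustration, not from the cards; Euclidean distance is just one common choice of metric):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Plain KNN prediction for a single query point.

    X_train: (n, d) array of training features; y_train: (n,) targets.
    """
    # 1. Distance from the query to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Indices of the k nearest training examples.
    nearest = np.argsort(dists)[:k]
    # 3. Combine the neighbors' targets.
    if task == "classification":
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]      # majority vote
    return y_train[nearest].mean()            # regression: mean of targets
```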
What is the method used by KNN for classification?
Take the majority vote of the k nearest training examples.
For regression in KNN, what is the method used to determine the output?
Take the mean (or sometimes median) of the target values.
What is a decision boundary in KNN?
A line or curve that separates different classes in the feature space.
- Linear: Formed in simple models (e.g., logistic regression).
- Non-linear: Formed in KNN with low k due to sensitivity to local data patterns.
What happens to the decision boundary in KNN with low k?
Forms a complex boundary, leading to overfitting.
- KNN forms complex and non-linear boundaries depending on the value of k and data distribution.
What happens to the decision boundary in KNN with high k?
Forms a smooth boundary, leading to underfitting.
- KNN forms complex and non-linear boundaries depending on the value of k and data distribution.
What is the difference between classification and regression in KNN?
- Classification: assign a discrete class label (e.g., cat / dog); the model learned from the data defines a decision boundary that separates the classes; output based on the majority vote of the neighbours.
- Regression: predict a numeric target (e.g., predicting weight from height); the model fits the data to describe the relation between the features and a continuous label; output based on the average of the neighbours' values.
What is the hyperparameter k in KNN?
Represents the number of labeled neighbors to consider.
What is the risk associated with a small value of k (e.g., k=1)?
High variance, very flexible, risk of overfitting.
What is the risk associated with a large value of k (e.g., k=N)?
Low variance, oversmooth, risk of underfitting.
**k = N: since all datapoints are considered, the predicted label for a test point will always be the majority label of all datapoints. Equivalent to a majority classifier.
What is the effect of ties in KNN classification?
Random selection from the tied labels is common.
**Ties: in case of a tie between predicted labels, there are different possibilities. The most common one is random selection from the tied labels.
What is the recommended value for k in relation to the number of training samples?
Generally, k = √n (n = number of training samples).
**Use odd values for binary classification to avoid ties.
**Use cross-validation on a validation set to choose the optimal k (see the sketch below).
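A hedged scikit-learn sketch of choosing k by cross-validation (the toy dataset and the candidate range of odd k values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy binary dataset standing in for the real training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Odd candidate values of k, which avoid ties in binary classification.
param_grid = {"n_neighbors": list(range(1, 32, 2))}

# 5-fold cross-validation picks the k with the best validation accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```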
What distinguishes weighted KNN from standard KNN?
In standard KNN, each neighbor contributes equally; in weighted KNN, neighbors closer to the test point contribute more to the decision.
Effect: Improves performance, especially when data is dense and neighbors vary in quality.
**With distance weighting, k = N is no longer equivalent to a majority classifier.
What are the types of weighting used in weighted KNN?
- Inverse distance: each neighbor's vote is weighted by the inverse of its distance to the point to be classified, so nearer neighbors count more (see the sketch below)
- Inverse squared distance
- Kernel functions (e.g., Gaussian kernel)
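A sketch of inverse-distance weighting on top of the basic vote (names are illustrative; the small eps guards against division by zero when a neighbor coincides with the query point):

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, x_query, k=3, eps=1e-9):
    """Classify one query point; closer neighbors get larger votes."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)        # inverse-distance weights
    votes = {}
    for label, w in zip(y_train[nearest], weights):
        votes[label] = votes.get(label, 0.0) + w  # accumulate weighted votes
    return max(votes, key=votes.get)
```

For comparison, scikit-learn's KNeighborsClassifier(weights="distance") applies the same inverse-distance idea.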
What does the distance function determine in KNN?
How ‘closeness’ is measured between points.
What are the two distance types mentioned in KNN?
- Euclidean: straight-line distance (use for continuous numeric data): d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
- Manhattan: sum of the absolute differences along each axis: d = |x2 - x1| + |y2 - y1|
Effect:
Different distance metrics can change which points are considered neighbors (see the example below).
Affects classification and regression outcomes.
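A small example (points invented for illustration) showing that the two metrics can disagree about which point is nearest:

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

q  = np.array([0.0, 0.0])   # query point
p1 = np.array([2.0, 2.0])
p2 = np.array([0.0, 3.0])

# Euclidean: p1 is nearer (~2.83 vs 3.0); Manhattan: p2 is nearer (3.0 vs 4.0).
print(euclidean(q, p1), euclidean(q, p2))
print(manhattan(q, p1), manhattan(q, p2))
```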
What is the Nearest Centroid Classifier?
For each class, compute the centroid (mean vector of all feature values) and classify new instances into the class whose centroid is closest (using Euclidean distance).
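A minimal sketch of this classifier (function names are illustrative):

```python
import numpy as np

def fit_centroids(X_train, y_train):
    """One mean vector (centroid) per class."""
    return {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def nearest_centroid_predict(centroids, x_query):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: np.linalg.norm(x_query - centroids[c]))
```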
What is the purpose of the Nearest Shrunken Centroid?
To reduce the effect of irrelevant or noisy features.
- Each class centroid is shrunk toward the overall mean by a threshold, so features that barely differ between classes stop influencing the classification.
- Common in high-dimensional spaces, e.g., gene expression, text data (see the sketch below)
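scikit-learn's NearestCentroid exposes this shrinkage through its shrink_threshold parameter; a quick sketch on toy data (the threshold value 0.5 is an arbitrary example):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid

# Toy data with many features, standing in for e.g. gene-expression profiles.
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           random_state=0)

# shrink_threshold pulls each class centroid toward the overall mean,
# effectively removing features that differ little between classes.
clf = NearestCentroid(shrink_threshold=0.5)
clf.fit(X, y)
print(clf.predict(X[:5]))
```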