Topic 5: Machine Learning: Classification & Clustering Flashcards

(33 cards)

1
Q

Calculate the Euclidean distance

A

Euclidean distance = √((Xb − Xa)² + (Yb − Ya)²)
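
A minimal Python sketch of the same formula (the function name and example points are illustrative):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate sequences."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

# Example: distance between A = (1, 2) and B = (4, 6) is sqrt(9 + 16) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```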

2
Q

Define nearest neighbors and combining function.

A

Nearest neighbours are the instances most similar to the one being predicted; a combining function turns their known target values into a prediction (through voting for classification or averaging for regression).

3
Q

Explain how combining functions can be used for classification.

A

Look at the nearest neighbours and apply a combining function, such as majority vote, to determine which class the new instance belongs to.
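
A minimal sketch of nearest-neighbour classification by majority vote (the helper names and toy data are illustrative, not from the source):

```python
from collections import Counter
import math

def knn_classify(query, data, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.
    `data` is a list of (point, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(data, key=lambda pl: dist(query, pl[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

data = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B"), ((2, 1), "A")]
print(knn_classify((1.5, 1.5), data, k=3))  # -> "A"
```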

4
Q
Calculate the probability of belonging to a class based on nearest
neighbor classification.
A

p(class c) = number of the k nearest neighbours that belong to class c / k (the total number of neighbours considered)
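
Worked example (illustrative numbers): with k = 5, if 3 of the 5 nearest neighbours belong to class c, then p(c) = 3/5 = 0.6.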

5
Q

Explain weighted voting (scoring) or similarity moderated voting (scoring)

A

Weighted voting (scoring): each neighbour's vote or score is weighted by its similarity to the instance, so a neighbour's influence drops the further it is from the instance.

Similarity-moderated voting (scoring) is the same idea: neighbours' contributions are scaled by similarity rather than counted equally.
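
A sketch of similarity-moderated voting, using inverse-distance weights as one common weighting choice (an assumption, not prescribed by the source):

```python
from collections import defaultdict
import math

def weighted_knn_classify(query, data, k=3, eps=1e-9):
    """Each of the k nearest neighbours votes with weight 1 / distance,
    so closer neighbours count for more. `data` is a list of (point, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(data, key=lambda pl: dist(query, pl[0]))[:k]
    scores = defaultdict(float)
    for point, label in neighbours:
        scores[label] += 1.0 / (dist(query, point) + eps)  # eps avoids divide-by-zero
    return max(scores, key=scores.get)
```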

6
Q

Explain how k in k-NN can be used to address overfitting.

A

1-NN memorizes the training data (a very complex model). To address overfitting, try different values of k, choose the one that gives the best performance on held-out data (e.g., via cross-validation), and then evaluate that choice on the test data. Larger k smooths the predictions and reduces overfitting.
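
One way to run that search, sketched with scikit-learn; cross-validation stands in for the held-out evaluation, and `X`, `y` are assumed to be an already-loaded feature matrix and label vector:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def best_k(X, y, candidates=(1, 3, 5, 7, 9, 15, 25)):
    """Return the k with the best mean cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in candidates}
    return max(scores, key=scores.get)
```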

7
Q

Discuss issues with nearest-neighbor methods with a focus on
• Intelligibility
• Dimensionality and domain knowledge
• Computational efficiency.

A
  • Intelligibility - if intelligibility and justification of decisions are critical, nearest-neighbour methods should be avoided (there is no explicit model to inspect)
  • Dimensionality and domain knowledge - curse of dimensionality (all attributes contribute to the distance, but not all attributes are relevant); domain knowledge helps select the features that matter
  • Computational efficiency - training is very fast (just store the data), but prediction/classification of a new instance is inefficient/costly, since it requires computing distances to the stored instances.
8
Q

Describe feature selection.

A

Selecting the features that should be included in the model; this can be done manually by someone with domain (industry) knowledge.

9
Q

Define and discuss the curse of dimensionality.

A

Some features are irrelevant, but all features contribute to the distance calculation, so irrelevant attributes add noise that misleads and confuses the model; as the number of dimensions grows, distances become less informative.

10
Q

Calculate the Manhattan distance and the Cosine distance

A

Manhattan distance = |Xa − Xb| + |Ya − Yb| (the sum of absolute coordinate differences)

Cosine distance = 1 − (A · B) / (‖A‖ ‖B‖), i.e., one minus the cosine of the angle between the two vectors.
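
Both distances as short Python sketches (the function names are illustrative):

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm
```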

11
Q

Define the Jaccard distance

A

Jaccard similarity = overlapping items / total unique items (|A ∩ B| / |A ∪ B|); the Jaccard distance is 1 minus this similarity.
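
A short sketch over Python sets (the example sets are illustrative):

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two collections treated as sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance({"cat", "dog", "fish"}, {"cat", "dog", "bird"}))  # 1 - 2/4 = 0.5
```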

12
Q

Calculate edit distance or the Levenshtein metric

A

The number of changes it takes to turn one text into another using three actions: insert, modify, or delete. It is used when order is important.

CAT to FAT (one modify action)

LD = 1
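
A compact dynamic-programming sketch of the Levenshtein distance (one standard way to compute it; the implementation details are not from the source):

```python
def levenshtein(s, t):
    """Minimum number of insert/modify/delete operations to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete
                            curr[j - 1] + 1,      # insert
                            prev[j - 1] + cost))  # modify (or match)
        prev = curr
    return prev[-1]

print(levenshtein("CAT", "FAT"))  # 1
```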

13
Q

Define clustering, hierarchical clustering, and dendrogram

A

clustering: unsupervised segmentation of data into groups of similar instances

hierarchical clustering: clustering in which clusters may contain other clusters, forming a nested hierarchy

dendrogram: a tree diagram showing the hierarchy of the clusters

14
Q

Describe how a dendrogram can help decide the number of clusters.

A

A horizontal line can cut across the dendrogram at any height; the number of branches it crosses is the number of clusters, so you cut at whatever height yields the desired number of clusters.
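
A SciPy sketch of building a hierarchy and "cutting" it down to a desired number of clusters (the toy data and the Ward linkage choice are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])
Z = linkage(X, method="ward")                     # build the hierarchy (the dendrogram's data)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut so that at most 3 clusters remain
print(labels)
```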

15
Q

Describe the advantage of hierarchical clustering.

A

It allows you to see the groupings at every level of granularity (i.e., the landscape of data similarity) without committing to a number of clusters in advance.

16
Q

Define linkage functions.

A

Distance functions between instances or clusters; they define how the distance between two clusters is measured (e.g., single linkage uses the minimum pairwise distance between members, complete linkage the maximum).

17
Q

Describe how distance measures can be used to decide the number of clusters
in a dendrogram.

A
  1. Choose the cut line that removes the longest distances (the largest merge heights): groups joined only at long distances are dissimilar, so cutting there yields natural clusters.
  2. Very long distances indicate outliers (usually each forming its own cluster).
18
Q

Define “cluster center” or centroid and k-means clustering

A

cluster center (centroid): the geometric center of a group of instances

k-means clustering: the "means" are the centroids, i.e., the arithmetic mean of the values along each dimension for the instances in the cluster.

19
Q

Compare and contrast k-means clustering with hierarchical clustering

A

k-means starts with a pre-chosen number of clusters k and produces a single flat partition, whereas hierarchical clustering requires no k up front and produces a whole hierarchy (dendrogram) from which any number of clusters can be read off; k-means is generally more efficient on large datasets.

20
Q

Describe the k-means algorithm

A
  1. Assign each point to the closest of the k chosen centers (initial centers are often picked at random).
  2. Recompute each center as the actual centroid (mean) of the cluster found in the first step.
  3. Repeat steps 1-2 until the cluster centers stop moving, as in the sketch below.
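
A minimal NumPy sketch of those steps plus the repeat-until-convergence loop (random initialization is one common choice; the empty-cluster edge case is ignored):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Step 1: assign each point to its closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Step 2: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels
```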
21
Q

Describe the reason for running the k-means algorithm many times.

A

The result of a single run is a local optimum that depends on the initial centroid locations; running k-means many times with different random starts and keeping the best (lowest-distortion) clustering mitigates this.

22
Q

Define a cluster’s distortion

A

sum of the squared differences between each data point and its corresponding centroid.

23
Q

Describe the method for selecting k in the k-means algorithm.

A
  1. Experiment with different values of k.
  2. Plot distortion against k (an "elbow" plot) and select the k at the elbow, where the decrease in distortion begins to level off; see the sketch below.
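
A sketch of that experiment with scikit-learn, whose `inertia_` attribute is the clustering's total distortion (`X` is assumed to be an already-loaded data matrix):

```python
from sklearn.cluster import KMeans

distortions = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in range(1, 11)}
# Plot distortions vs. k and pick the k at the "elbow", where the curve flattens.
```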
24
Q

Define and calculate the accuracy and error rate

A

A general measure of classifier performance.

Accuracy = number of correct decisions / total number of decisions (equal to 1 − error rate)
Error rate = number of incorrect decisions / total number of decisions (equal to 1 − accuracy)

25
Q

Describe a confusion matrix.

A

A summary of prediction results on a classification problem (an n × n matrix of predicted vs. actual classes).
26
Q

Define false positives and false negatives

A

False positives: negative instances incorrectly classified as positive. False negatives: positive instances incorrectly classified as negative.
27
Q

Describe unbalanced data and the problems with unbalanced data

A

Unbalanced data: data where one class is rare. Evaluation based on plain accuracy breaks down, because a classifier that always predicts the majority class still scores highly.
28
Q

Discuss the problems with unequal costs and benefits of errors.

A

Simple classification accuracy as a metric makes no distinction between false positives and false negatives (it treats them as equally important). Ideally you would estimate the cost or benefit of each decision a classifier can make.
29
Q

Calculate expected value and expected benefit.

A

Expected benefit = p_response(x) × (value of response) + [1 − p_response(x)] × (value of no response)
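
Worked example (illustrative numbers): if p_response(x) = 0.05, the value of a response is $100, and the value of no response is −$1 (the cost of contact), then expected benefit = 0.05 × 100 + 0.95 × (−1) = 5 − 0.95 = $4.05. Since $4.05 > 0, targeting x is worthwhile.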
30
Q

Describe how expected value can be used to frame classifier use

A

If the expected value (benefit) of targeting a customer is greater than 0, target the customer.
31
Q

Describe how expected value can be used to frame classifier evaluation

A

You can use expected value to compare models: compute the expected value of each model's decisions over the same data and prefer the model with the higher expected value.
32
Q

Define and interpret precision and recall.

A

Precision = TP / (TP + FP): of all the instances the model predicted positive (e.g., predicted to be cancer patients), how many actually are? Recall = TP / (TP + FN): of all the actual positives (e.g., all cancer patients), how many did the model correctly identify?
33
Q

Calculate the value of the F-measure.

A

F-measure (F1) = 2 × (precision × recall) / (precision + recall), the harmonic mean of precision and recall.
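
Worked example (illustrative counts): with TP = 8, FP = 2, FN = 4: precision = 8/10 = 0.80, recall = 8/12 ≈ 0.67, so F1 = 2 × (0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73.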