Topic 5: Machine Learning: Classification & Clustering Flashcards

(33 cards)

1
Q

Calculate the Euclidean distance

A

Euclidean distance = √((Xb − Xa)² + (Yb − Ya)²)
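
A minimal Python sketch of the same formula (the function name and example points are illustrative):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points given as coordinate sequences."""
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

# Example: distance between A = (1, 2) and B = (4, 6) is sqrt(9 + 16) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```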

2
Q

Define nearest neighbors and combining function.

A

Nearest neighbours are the instances most similar to the one being predicted; a combining function turns their known target values into a prediction (through voting for classification or averaging for regression).

3
Q

Explain how combining functions can be used for classification.

A

Look at the nearest neighbours and apply a combining function, such as majority vote, to determine which class the new instance belongs to.
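
A minimal sketch of nearest-neighbour classification by majority vote (the helper names and toy data are illustrative, not from the source):

```python
from collections import Counter
import math

def knn_classify(query, data, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.
    `data` is a list of (point, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(data, key=lambda pl: dist(query, pl[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

data = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B"), ((2, 1), "A")]
print(knn_classify((1.5, 1.5), data, k=3))  # -> "A"
```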

4
Q
Calculate the probability of belonging to a class based on nearest
neighbor classification.
A

p(class c) = number of the k nearest neighbours that belong to class c / k (the total number of neighbours considered)
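
Worked example (illustrative numbers): with k = 5, if 3 of the 5 nearest neighbours belong to class c, then p(c) = 3/5 = 0.6.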

5
Q

Explain weighted voting (scoring) or similarity moderated voting (scoring)

A

Weighted voting (scoring): each neighbour's vote or score is weighted by its similarity to the instance, so a neighbour's influence drops the further it is from the instance.

Similarity-moderated voting (scoring) is the same idea: neighbours' contributions are scaled by similarity rather than counted equally.
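
A sketch of similarity-moderated voting, using inverse-distance weights as one common weighting choice (an assumption, not prescribed by the source):

```python
from collections import defaultdict
import math

def weighted_knn_classify(query, data, k=3, eps=1e-9):
    """Each of the k nearest neighbours votes with weight 1 / distance,
    so closer neighbours count for more. `data` is a list of (point, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(data, key=lambda pl: dist(query, pl[0]))[:k]
    scores = defaultdict(float)
    for point, label in neighbours:
        scores[label] += 1.0 / (dist(query, point) + eps)  # eps avoids divide-by-zero
    return max(scores, key=scores.get)
```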

6
Q

Explain how k in k-NN can be used to address overfitting.

A

1-NN memorizes the training data (a very complex model). To address overfitting, try different values of k, choose the one that gives the best performance on held-out data (e.g., via cross-validation), and then evaluate that choice on the test data. Larger k smooths the predictions and reduces overfitting.
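
One way to run that search, sketched with scikit-learn; cross-validation stands in for the held-out evaluation, and `X`, `y` are assumed to be an already-loaded feature matrix and label vector:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def best_k(X, y, candidates=(1, 3, 5, 7, 9, 15, 25)):
    """Return the k with the best mean cross-validated accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in candidates}
    return max(scores, key=scores.get)
```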

7
Q

Discuss issues with nearest-neighbor methods with a focus on
• Intelligibility
• Dimensionality and domain knowledge
• Computational efficiency.

A
  • Intelligibility - if intelligibility and justification of decisions are critical, nearest-neighbour methods should be avoided (there is no explicit model to inspect)
  • Dimensionality and domain knowledge - curse of dimensionality (all attributes contribute to the distance, but not all attributes are relevant); domain knowledge helps select the features that matter
  • Computational efficiency - training is very fast (just store the data), but prediction/classification of a new instance is inefficient/costly, since it requires computing distances to the stored instances.
8
Q

Describe feature selection.

A

Selecting the features that should be included in the model; this can be done manually by someone with domain (industry) knowledge.

9
Q

Define and discuss the curse of dimensionality.

A

Some features are irrelevant, but all features contribute to the distance calculation, so irrelevant attributes add noise that misleads and confuses the model; as the number of dimensions grows, distances become less informative.

10
Q

Calculate the Manhattan distance and the Cosine distance

A

Manhattan distance = |Xa − Xb| + |Ya − Yb| (the sum of absolute coordinate differences)

Cosine distance = 1 − (A · B) / (‖A‖ ‖B‖), i.e., one minus the cosine of the angle between the two vectors.
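
Both distances as short Python sketches (the function names are illustrative):

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm
```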

11
Q

Define the Jaccard distance

A

Jaccard similarity = overlapping items / total unique items (|A ∩ B| / |A ∪ B|); the Jaccard distance is 1 minus this similarity.
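
A short sketch over Python sets (the example sets are illustrative):

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two collections treated as sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

print(jaccard_distance({"cat", "dog", "fish"}, {"cat", "dog", "bird"}))  # 1 - 2/4 = 0.5
```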

12
Q

Calculate edit distance or the Levenshtein metric

A

The number of changes it takes to turn one text into another using three actions: insert, modify, or delete. It is used when order is important.

CAT to FAT (one modify action)

LD = 1
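
A compact dynamic-programming sketch of the Levenshtein distance (one standard way to compute it; the implementation details are not from the source):

```python
def levenshtein(s, t):
    """Minimum number of insert/modify/delete operations to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete
                            curr[j - 1] + 1,      # insert
                            prev[j - 1] + cost))  # modify (or match)
        prev = curr
    return prev[-1]

print(levenshtein("CAT", "FAT"))  # 1
```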

13
Q

Define clustering, hierarchical clustering, and dendrogram

A

clustering: unsupervised segmentation of data into groups of similar instances

hierarchical clustering: clustering in which clusters may contain other clusters, forming a nested hierarchy

dendrogram: a tree diagram showing the hierarchy of the clusters

14
Q

Describe how a dendrogram can help decide the number of clusters.

A

A horizontal line can cut across the dendrogram at any height; the number of branches it crosses is the number of clusters, so you cut at whatever height yields the desired number of clusters.
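
A SciPy sketch of building a hierarchy and "cutting" it down to a desired number of clusters (the toy data and the Ward linkage choice are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])
Z = linkage(X, method="ward")                     # build the hierarchy (the dendrogram's data)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut so that at most 3 clusters remain
print(labels)
```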

15
Q

Describe the advantage of hierarchical clustering.

A

It allows you to see the groupings at every level of granularity (i.e., the landscape of data similarity) without committing to a number of clusters in advance.

16
Q

Define linkage functions.

A

Distance functions between instances or clusters; they define how the distance between two clusters is measured (e.g., single linkage uses the minimum pairwise distance between members, complete linkage the maximum).

17
Q

Describe how distance measures can be used to decide the number of clusters
in a dendrogram.

A
  1. Choose the cut line that removes the longest distances (the largest merge heights): groups joined only at long distances are dissimilar, so cutting there yields natural clusters.
  2. Very long distances indicate outliers (usually each forming its own cluster).
18
Q

Define “cluster center” or centroid and k-means clustering

A

cluster center (centroid): the geometric center of a group of instances

k-means clustering: the "means" are the centroids, i.e., the arithmetic mean of the values along each dimension for the instances in the cluster.

19
Q

Compare and contrast k-means clustering with hierarchical clustering

A

k-means starts with a pre-chosen number of clusters k and produces a single flat partition, whereas hierarchical clustering requires no k up front and produces a whole hierarchy (dendrogram) from which any number of clusters can be read off; k-means is generally more efficient on large datasets.

20
Q

Describe the k-means algorithm

A
  1. Assign each point to the closest of the k chosen centers (initial centers are often picked at random).
  2. Recompute each center as the actual centroid (mean) of the cluster found in the first step.
  3. Repeat steps 1-2 until the cluster centers stop moving, as in the sketch below.
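
A minimal NumPy sketch of those steps plus the repeat-until-convergence loop (random initialization is one common choice; the empty-cluster edge case is ignored):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Step 1: assign each point to its closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Step 2: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels
```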
21
Q

Describe the reason for running the k-means algorithm many times.

A

The result of a single run is a local optimum that depends on the initial centroid locations; running k-means many times with different random starts and keeping the best (lowest-distortion) clustering mitigates this.

22
Q

Define a cluster’s distortion

A

sum of the squared differences between each data point and its corresponding centroid.

23
Q

Describe the method for selecting k in the k-means algorithm.

A
  1. Experiment with different values of k.
  2. Plot distortion against k (an "elbow" plot) and select the k at the elbow, where the decrease in distortion begins to level off; see the sketch below.
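
A sketch of that experiment with scikit-learn, whose `inertia_` attribute is the clustering's total distortion (`X` is assumed to be an already-loaded data matrix):

```python
from sklearn.cluster import KMeans

distortions = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in range(1, 11)}
# Plot distortions vs. k and pick the k at the "elbow", where the curve flattens.
```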
24
Q

Define and calculate the accuracy and error rate

A

A general measure of classifier performance.

Accuracy = number of correct decisions / total number of decisions (equal to 1 − error rate)
Error rate = number of incorrect decisions / total number of decisions (equal to 1 − accuracy)

25
Q

Describe a confusion matrix.

A

A summary of prediction results on a classification problem (an n × n matrix of predicted vs. actual classes).
26
Q

Define false positives and false negatives

A

False positives: negative instances incorrectly classified as positive. False negatives: positive instances incorrectly classified as negative.
27
Q

Describe unbalanced data and the problems with unbalanced data

A

Unbalanced data: data where one class is rare. Evaluation based on plain accuracy breaks down, because a classifier that always predicts the majority class still scores highly.
28
Q

Discuss the problems with unequal costs and benefits of errors.

A

Simple classification accuracy as a metric makes no distinction between false positives and false negatives (it treats them as equally important). Ideally you would estimate the cost or benefit of each decision a classifier can make.
29
Q

Calculate expected value and expected benefit.

A

Expected benefit = p_response(x) × (value of response) + [1 − p_response(x)] × (value of no response)
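
Worked example (illustrative numbers): if p_response(x) = 0.05, the value of a response is $100, and the value of no response is −$1 (the cost of contact), then expected benefit = 0.05 × 100 + 0.95 × (−1) = 5 − 0.95 = $4.05. Since $4.05 > 0, targeting x is worthwhile.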
30
Q

Describe how expected value can be used to frame classifier use

A

If the expected value (benefit) of targeting a customer is greater than 0, target the customer.
31
Q

Describe how expected value can be used to frame classifier evaluation

A

You can use expected value to compare models: compute the expected value of each model's decisions over the same data and prefer the model with the higher expected value.
32
Q

Define and interpret precision and recall.

A

Precision = TP / (TP + FP): of all the instances the model predicted positive (e.g., predicted to be cancer patients), how many actually are? Recall = TP / (TP + FN): of all the actual positives (e.g., all cancer patients), how many did the model correctly identify?
33
Q

Calculate the value of the F-measure.

A

F-measure (F1) = 2 × (precision × recall) / (precision + recall), the harmonic mean of precision and recall.
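
Worked example (illustrative counts): with TP = 8, FP = 2, FN = 4: precision = 8/10 = 0.80, recall = 8/12 ≈ 0.67, so F1 = 2 × (0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73.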