Module 4 Flashcards

1
Q

What is clustering?

A

Clustering means grouping similar data objects of a dataset together: objects are placed in the same group based on their similarity.

2
Q

Is clustering a method of supervised or unsupervised learning?

A

Unsupervised learning

3
Q

What is the main use of clustering?

A

It is mostly used to explore data, because it groups similar data objects together. It can also search large amounts of data and then let us focus on one target cluster and its related groups, so it can act as a sort of filtering mechanism.

4
Q

Why is clustering not supervised learning?

A

It assumes that we do not have a clear formulation (e.g. labels) for a dataset

5
Q

What do we call a step-by-step processing of the data with different algorithms until we get our desired output?

A

Machine Learning Pipeline

6
Q

What method can we use to standardize data for clustering, based on the recommendation from Han et al. (2011)?

A

We can calculate the mean absolute deviation (MAD), because it is more robust toward outliers than the standard deviation. Then we can calculate a z-score using the MAD instead of the SD. After that, we can compute the distances between each pair of data objects and present them in a matrix called a dissimilarity matrix.
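A minimal NumPy sketch of this MAD-based standardization and the resulting dissimilarity matrix (the function names and toy data are our own illustration):

```python
import numpy as np

def mad_zscore(X):
    """Standardize each column with the mean absolute deviation (MAD)
    instead of the standard deviation; MAD is less affected by outliers."""
    mean = X.mean(axis=0)
    mad = np.abs(X - mean).mean(axis=0)   # mean absolute deviation per attribute
    return (X - mean) / mad               # MAD-based z-score

def dissimilarity_matrix(Z):
    """Pairwise Euclidean distances between all data objects."""
    diff = Z[:, None, :] - Z[None, :, :]      # shape (n, n, d)
    return np.sqrt((diff ** 2).sum(axis=-1))  # symmetric n x n matrix, zero diagonal

X = np.array([[170.0, 65.0], [180.0, 90.0], [160.0, 55.0]])
print(dissimilarity_matrix(mad_zscore(X)))
```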

7
Q

What do we call functions used to calculate distances between data objects?

A

Similarity measures, similarity metrics, or similarity functions

8
Q

What is Euclidean distance?

A

The most widely used distance measure for numerical objects. We use it when we have numerical objects and want to measure the distance between two points with n attributes. When certain attributes are more important than others, we can assign a weight to each attribute. It is most useful when the data is low-dimensional and not sparse (not too many unknown values).
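As an illustration, a small (optionally weighted) Euclidean distance function in NumPy (the helper name and sample points are ours):

```python
import numpy as np

def euclidean(x, y, w=None):
    """Weighted Euclidean distance; w defaults to equal weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    return np.sqrt((w * (x - y) ** 2).sum())

print(euclidean([1, 2, 3], [4, 6, 3]))              # sqrt(9 + 16 + 0) = 5.0
print(euclidean([1, 2, 3], [4, 6, 3], w=[2, 1, 1]))  # first attribute counts double
```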

9
Q

What is Manhattan distance?

A

Manhattan distance mimics Manhattan's street grid: there is no direct connection between two points in a 2D plane, and we are only allowed to move horizontally or vertically.
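A small sketch for comparison with the Euclidean example (the helper name is ours):

```python
import numpy as np

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences,
    i.e. horizontal plus vertical moves on a grid."""
    return np.abs(np.asarray(x, float) - np.asarray(y, float)).sum()

print(manhattan([0, 0], [3, 4]))  # 7.0, versus a Euclidean distance of 5.0
```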

10
Q

What are the differences between Manhattan and Euclidean distance?

A

Euclidean distance is useful for low-dimensional data, while Manhattan distance is often preferred for high-dimensional data. Euclidean squares the differences (and takes a square root) while Manhattan sums absolute values, so Manhattan is less sensitive to outliers and noise. Manhattan is called the L1 norm, while Euclidean is called the L2 norm.

11
Q

What is the time complexity of Manhattan and Euclidean distance?

A

O(n), where n is the number of attributes (dimensions)

12
Q

What is Lp norm?

A

This is the Minkowski distance, which we use in general for numerical data. It is a generalization of the Manhattan and Euclidean distances in which we can also consider the weight of each feature.
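A sketch of the general weighted form (p = 1 recovers Manhattan, p = 2 Euclidean; the helper name is ours):

```python
import numpy as np

def minkowski(x, y, p=2, w=None):
    """Lp norm of (x - y) with optional per-feature weights w."""
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    w = np.ones_like(d) if w is None else np.asarray(w, float)
    return (w * d ** p).sum() ** (1.0 / p)

print(minkowski([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(minkowski([0, 0], [3, 4], p=2))  # 5.0 (Euclidean)
```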

13
Q

How does Mahalanobis distance work?

A

It operates based on the distribution of the data points and their correlations, so the data itself constructs the coordinate system for the measurement, and distance is measured relative to the centroid.

  1. Identify the centroid of the data points
  2. Draw an axis along the spine of the data points, in the direction where the variance is greatest
  3. Draw a second axis perpendicular to the first axis
  4. Based on the size of the two axes, segment the dataset into the areas they cover, so that all data points are covered
14
Q

What is Hamming Distance?

A

Hamming distance is a metric similar to Manhattan distance, but between two bit vectors: it measures the number of bits that need to be changed to convert bit string x into bit string y. It can also be used for strings (of equal length). The time complexity is also O(n).
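A minimal Python sketch (the function name is ours):

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(a != b for a, b in zip(x, y))

print(hamming("10110", "11100"))      # 2 bits differ
print(hamming("karolin", "kathrin"))  # also works on strings: 3
```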

15
Q

What is Levenshtein (Edit) distance?

A

This distance also measures the similarity between two strings: it counts the minimum number of edits (insertions, deletions, and substitutions) required to transform the source string into the target string. With dynamic programming its time complexity is polynomial, O(m·n) for strings of lengths m and n.
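A standard dynamic-programming sketch (the function name is ours):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions to turn s into t; O(len(s) * len(t)) time."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from the empty prefix of s
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution (free on a match)
        prev = cur
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3
```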

16
Q

What is cosine distance?

A

It is a similarity metric used to calculate the distance between two data vectors. It is most often used for text document comparison.

17
Q

What is an example of using cosine distance?

A

A customer relations team can use clustering to group similar customer reports based on the words used in the text. In other words, to measure similarity between textual documents, we can treat each text as a bag of words; by counting the frequency of shared words between two documents, we can measure their similarity. For non-negative word counts, cosine similarity ranges between 0 and 1.
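A minimal bag-of-words sketch (the documents and helper names are our own illustration):

```python
import numpy as np
from collections import Counter

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||); cosine distance = 1 - similarity."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def bag_of_words(doc, vocab):
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

a, b = "the food was great", "the food was terrible"
vocab = sorted(set(a.split()) | set(b.split()))
print(cosine_similarity(bag_of_words(a, vocab), bag_of_words(b, vocab)))  # 0.75
```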

18
Q

What is the symbol used to denote a norm?

A

|| ||, so for example:

||x||1 is the L1 norm, i.e. Manhattan distance

19
Q

What is Jaccard Distance or Jaccard Index?

A

This measure is based on the union and intersection of two sets: the Jaccard index is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 minus the index. It has linear computational complexity, O(n).
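A minimal sketch using Python sets (the sample menus are our own illustration, anticipating the example in card 21):

```python
def jaccard_index(a, b):
    """|A intersect B| / |A union B|; Jaccard distance is 1 minus this."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

menu_a = {"pizza", "pasta", "salad"}
menu_b = {"pizza", "salad", "soup", "steak"}
print(jaccard_index(menu_a, menu_b))      # 2 shared / 5 total = 0.4
print(1 - jaccard_index(menu_a, menu_b))  # Jaccard distance = 0.6
```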

20
Q

When is Jaccard index useful?

A

It is useful when we are dealing with non-numerical data objects whose semantic similarity is important.

21
Q

Give an example of using Jaccard index.

A

We collect the restaurants a user has visited to identify their taste for food recommendation. In this case, we don't intend to compare the names of the visited places; instead, we make a binary comparison of each restaurant's food menu.

This method is not always recommended, because we need to define the similarity for the binary comparison ourselves, but it can be useful when two sets are significantly unequal in size.

22
Q

What is Dynamic Time Warping?

A

An effective distance measure for time series similarity.

23
Q

What is a time series?

A

A numerical vector of data in a specific temporal duration

ex. heart rate per minute

24
Q

How can we use DTW?

A

Assume you have two similar time series that are shifted or stretched in time. If you use Euclidean distance rather than DTW, the two series will get a high dissimilarity (low similarity) score, because points are compared position by position. DTW handles this distance calculation significantly better.

25
Q

How does DTW work?

A

It matches peaks to peaks and valleys to valleys, providing a more appropriate distance mapping between two similar time series than Euclidean distance does.

26
Q

How are Minkowski distances (e.g. Euclidean and Manhattan) different from DTW?

A

Minkowski distances measure distance based on a one-to-one mapping between data points; DTW measures it based on a one-to-many mapping. It uses a cumulative cost matrix: after the matrix is filled, the algorithm traces a path from one corner of the matrix to the other, following the minimum cumulative values along and near the diagonal.
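A minimal sketch of the cumulative-matrix computation (NumPy; the dtw function and sample series are our own illustration, without a warping-window constraint):

```python
import numpy as np

def dtw(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping: each cell of the
    cumulative matrix adds the cheapest of three predecessor moves."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # one-to-many: repeat y[j-1]
                                 D[i, j - 1],      # one-to-many: repeat x[i-1]
                                 D[i - 1, j - 1])  # one-to-one match
    return D[n, m]

a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1]  # same shape, shifted in time
print(dtw(a, b))           # small DTW distance despite the shift
```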

27
Q

What do we call the path through the cumulative matrix that constructs the DTW distance?

A

The warping path

28
Q

What does the warping constant represent?

A

The parameter w represents the maximum amount of deviation allowed from the diagonal.

29
Q

What is the time complexity of Standard DTW?

A

O(n^2), which is not efficient, since the algorithm compares every point of one series with every point of the other; however, there are optimized implementations that achieve roughly O(n).

30
Q

What are two partition-based clustering methods?

A

K-means and k-medoids

31
Q

What is k-means partition-based clustering?

A

It is the most commonly used clustering algorithm. To use it, we must specify the number of expected clusters, k, before running the algorithm; the k-means algorithm then partitions the dataset into k distinct clusters.

The second hyperparameter of k-means is the maximum number of iterations (user input).

32
Q

What are the specific steps in k-means clustering?

A

The algorithm chooses k random points, called seed points, based on the number of expected clusters.

Then the algorithm computes the distances of all other data points to the k seed points.

Then the algorithm assigns each data point to its nearest seed point, so for each seed point we get one cluster whose data points are nearest to that seed point.

In the fourth step, the algorithm calculates the center of all data points in each cluster and moves each seed point to the center of its cluster.

In the fifth step, the algorithm repeats this process until the seed points stay at the centers of their clusters and do not move any further. These final points are called the centroids of the clusters.
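A minimal NumPy sketch of these five steps (a sketch only, with our own function name and toy data; empty clusters are not handled):

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=np.random.default_rng(0)):
    """Minimal k-means: random seeds, assign to nearest seed,
    move each seed to its cluster's mean, repeat until stable."""
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random seed points
    for _ in range(max_iter):                               # second hyperparameter
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                       # steps 2-3: nearest seed
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):               # step 5: seeds stopped moving
            break
        centers = new_centers                               # step 4: move seeds to means
    return labels, centers                                  # centers are the centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = k_means(X, k=2)
print(centroids)
```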

33
Q

What is the time complexity of k-means partition-based clustering?

A

Assuming n data points, k clusters, and i iterations, the complexity is O(n x k x i)

35
Q

What is k-medoids?

A

Similar to k-means, k-medoids is another data-partitioning algorithm. It tries to keep data points with minimum distance in the same cluster and to create clusters with maximum distance from each other. It is more robust to noise because, rather than Euclidean distances between data points, it uses pairwise dissimilarities.

36
Q

What are the steps of k-medoids partition clustering?

A

First, the algorithm selects k random data points (e.g. two, for k = 2) from the existing data points.

Then the algorithm calculates the distances of all other data points to these points; based on the calculated distances, it assigns each data point to a cluster.

Next, it finds the medoid data point of each cluster, which is the most central actual data point, not the mean of the data points.

Then it calculates the clusters' medoids and keeps repeating steps 2 and 3 until there is no change in the medoid of any cluster.
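A minimal sketch of this procedure over a precomputed n x n pairwise dissimilarity matrix D (NumPy; the function name and sample data are ours, and empty clusters are assumed not to occur):

```python
import numpy as np

def k_medoids(D, k, max_iter=100, rng=np.random.default_rng(0)):
    """Minimal k-medoids on a dissimilarity matrix; medoids are real data points."""
    n = len(D)
    medoids = rng.choice(n, size=k, replace=False)  # step 1: random starting points
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)       # step 2: assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # step 3: the medoid minimizes total dissimilarity within its cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                   # medoids unchanged: stop
        medoids = new_medoids
    return labels, medoids

X = np.random.randn(30, 2)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
labels, medoids = k_medoids(D, k=3)
print(medoids)
```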

37
Q

How is k-medoids different from k-means?

A

It is more tolerant of outliers, but its computational complexity is O(k(n-k)^2), assuming k clusters and n data points, so k-means is more efficient.

38
Q

What is the last type of k-representative algorithm?

A

K-medians, which uses the Manhattan distance, while k-means uses the Euclidean distance and k-medoids uses the medoid (an actual data point) as the cluster representative.

39
Q

Name the general steps in the clustering process

A

Select features for clustering

Normalize, clean, … features

Create the dissimilarity matrix

Feed dissimilarity matrix into the clustering algorithm

40
Q

When do we want to use the Mahalanobis distance method?

A

In some cases, the dataset distribution affects the similarity measurement, so we want to incorporate that into the distance calculation. Mahalanobis is similar to Euclidean except it normalizes the data on the basis of the inter-attribute correlations.

41
Q

How is the distance between data points calculated in the Mahalanobis distance calculation?

A

First, it finds the centroid point: it identifies the center (mean) of the data points.

Then it tries to figure out how the data points are spread in the dataset, so it calculates the covariance matrix, which captures how each dimension (variable) in the dataset varies and how the dimensions are correlated with each other.

Next, it measures the distance between each data point and the mean point, using the inverse of the covariance matrix to scale each data point's distance to the mean. Simply put, instead of measuring distance in a straight line like Euclidean does, it measures it in a way that considers how the points are spread out.

Finally, it adjusts the distances based on the spread of the data points: if points are spread out in one direction, distance in that direction counts for less. To do that, it takes the scaled distances for each dimension, squares them, and adds them up; this sum is then square-rooted to get the distance.
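A minimal NumPy sketch of these steps (the function name and sample data are our own illustration):

```python
import numpy as np

def mahalanobis(x, X):
    """Distance of point x from the distribution of the rows of X:
    sqrt((x - mu)^T  Sigma^-1  (x - mu))."""
    mu = X.mean(axis=0)                    # step 1: centroid (mean)
    cov = np.cov(X, rowvar=False)          # step 2: covariance (spread + correlations)
    diff = x - mu                          # step 3: offset from the centroid
    # step 4: scale by the inverse covariance, square, sum, square-root
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

X = np.random.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=500)
print(mahalanobis(np.array([1.0, 1.0]), X))
```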

42
Q

What is the time complexity of Mahalanobis distance?

A

Since we must invert the covariance matrix, the calculation is expensive: matrix inversion is roughly cubic in the number of dimensions, O(d^3), so Mahalanobis distance is costly for high-dimensional data.
