Module 4 Flashcards
What is clustering?
Clustering means grouping the data objects of a dataset so that objects in the same group are more similar to each other than to objects in other groups.
Is clustering a method of supervised or unsupervised learning?
Unsupervised learning
What is the main use of clustering?
It is mostly used to explore data, because it groups similar data together. It can also search large amounts of data and then let us focus on one target cluster and its related groups, so it can act as a sort of filter mechanism.
Why is clustering not supervised learning?
It assumes that we do not have a clear formulation (e.g. labels) for the dataset, so there is no known target to learn from.
What do we call the step-by-step processing of data with different algorithms until we get our desired output?
Machine Learning Pipeline
What method can we use to standardize data for clustering, based on the recommendation from Han et al. (2011)?
We can calculate the mean absolute deviation (MAD), because it is more robust to outliers than the standard deviation (SD). We then calculate a z-score for each value using the MAD in place of the SD. After that, we can compute the distance between each pair of data objects and present the results in a matrix called a dissimilarity matrix.
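The standardization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mad_standardize` and `dissimilarity_matrix` and the sample values are made up for this card, and Euclidean distance is assumed as the distance used to fill the matrix.

```python
def mad_standardize(values):
    """Return z-scores of one numeric attribute, using the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: average absolute distance from the mean.
    mad = sum(abs(v - mean) for v in values) / n
    if mad == 0:
        # All values are identical, so every z-score is 0.
        return [0.0 for _ in values]
    return [(v - mean) / mad for v in values]


def dissimilarity_matrix(rows):
    """Pairwise Euclidean distances between rows (assumed already standardized)."""
    size = len(rows)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i):
            d = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) ** 0.5
            matrix[i][j] = matrix[j][i] = d  # the matrix is symmetric
    return matrix


# Hypothetical single-attribute data.
z_scores = mad_standardize([2, 4, 6, 8])
matrix = dissimilarity_matrix([[z] for z in z_scores])
```

Here the mean is 5 and the MAD is 2, so the z-scores come out as -1.5, -0.5, 0.5 and 1.5.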
What do we call functions used to calculate distances between data objects?
Similarity measures, similarity metrics, or similarity functions
What is Euclidean distance?
The most widely used distance measurement for numerical objects. We can use it to measure the straight-line distance between two points that each have “n” attributes. When certain attributes are more important than others, we can assign a weight to each attribute. It is useful when the data is low dimensional and not sparse (not too many unknown values).
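A minimal sketch of Euclidean distance with optional attribute weights. The function name `euclidean` and the `weights` parameter are assumptions made for this card, not a library API.

```python
import math


def euclidean(x, y, weights=None):
    """Euclidean distance between two points with the same number of attributes."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of attributes")
    if weights is None:
        weights = [1.0] * len(x)  # unweighted case: every attribute counts equally
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


print(euclidean([0, 0], [3, 4]))  # 5.0
```

Passing `weights=[4, 0]` for the same two points would count only the first attribute, scaled by 4 before the square root.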
What is Manhattan distance?
Manhattan distance models movement on a grid, like the streets of Manhattan: there is no direct (diagonal) connection between two points in a 2D plane, and we are only allowed to move horizontally or vertically.
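As a small illustration, Manhattan distance adds up the horizontal and vertical steps. The function name `manhattan` is chosen for this card only.

```python
def manhattan(x, y):
    """Sum of absolute differences across all attributes."""
    if len(x) != len(y):
        raise ValueError("both points must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(x, y))


print(manhattan((0, 0), (3, 4)))  # 7: three steps across plus four steps up
```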
What are the differences between Manhattan and Euclidean distance?
Euclidean distance is useful for low dimensional data, while Manhattan distance is more useful for high dimensional data. Euclidean distance squares each difference before taking the square root, so large deviations dominate the result; Manhattan distance sums absolute differences, so it is less sensitive to noise and outliers. Manhattan is called the L1 norm, while Euclidean is called the L2 norm.
What is the time complexity of Manhattan and Euclidean distance?
O(n)
What is Lp norm?
This is the Minkowski distance, which we use in general for numerical data. It is a generalization of the Manhattan (p = 1) and Euclidean (p = 2) distances, and it can take the weight of features into account as well.
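A sketch of the Lp norm as a single function, to show how p = 1 and p = 2 recover the two earlier distances. The name `minkowski` and the optional `weights` argument are assumptions for illustration.

```python
def minkowski(x, y, p=2, weights=None):
    """Minkowski (Lp) distance; p=1 is Manhattan, p=2 is Euclidean."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(x) != len(y):
        raise ValueError("both points must have the same number of attributes")
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y))
    return total ** (1.0 / p)


l1 = minkowski([0, 0], [3, 4], p=1)  # same value as Manhattan distance
l2 = minkowski([0, 0], [3, 4], p=2)  # same value as Euclidean distance
```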
How does Mahalanobis distance work?
It operates based on the distribution of the data points and their correlations. The data itself constructs the coordinate system for the measurement, and distance is measured relative to the centroid.
- Identifies the centroid of the data points
- Draws an axis along the spine of the data points, in which the variance is the greatest
- Draws the second axis perpendicular to the first axis
- Based on the size of the two axes, segments the data set into areas that cover all the data points
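The idea above can be sketched for the two-attribute case, where the covariance matrix is 2x2 and can be inverted by hand. The function name `mahalanobis_2d` and the sample points are hypothetical, and the sample covariance (dividing by n - 1) is an assumption.

```python
def mahalanobis_2d(point, data):
    """Mahalanobis distance from a 2D point to the centroid of 2D data."""
    n = len(data)
    cx = sum(p[0] for p in data) / n  # centroid, x coordinate
    cy = sum(p[1] for p in data) / n  # centroid, y coordinate
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - cx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - cy) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - cx, point[1] - cy
    # Squared distance: [dx dy] * inverse(covariance) * [dx dy]^T
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return squared ** 0.5


# Hypothetical data whose "spine" runs along the diagonal.
data = [(0, 0), (1, 1), (2, 2), (3, 3), (1, 2), (2, 1)]
along_spine = mahalanobis_2d((2.5, 2.5), data)
across_spine = mahalanobis_2d((2.5, 0.5), data)
```

Both test points are the same Euclidean distance from the centroid (1.5, 1.5), but the point lying across the spine gets a larger Mahalanobis distance because the data varies less in that direction.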
What is Hamming Distance?
Hamming distance is a metric that is similar to the Manhattan distance, but between two vectors of bits of the same length. It measures the number of bits that need to be changed to convert the x string of bits into the y string of bits. It can also be used for strings of equal length. The time complexity is also O(n).
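A short sketch that counts the positions where two equal-length sequences differ. The function name `hamming` is chosen for this card only.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


print(hamming("10110", "11100"))      # 2
print(hamming("karolin", "kathrin"))  # 3
```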
What is Levenshtein (Edit) distance?
This distance is also used to measure the similarity between two strings. It counts the minimum number of edits (insertions, deletions, or substitutions) required to transform the source string into the target string. The time complexity is usually polynomial: the standard dynamic programming solution takes O(m × n) time for strings of lengths m and n.
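One common way to compute it is dynamic programming over prefixes of the two strings. The sketch below keeps only two rows of the table; the function name `levenshtein` is an assumption for this card.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    # previous[j] = edits needed to turn the empty prefix into target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```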
What is cosine distance?
It is a similarity metric that compares two vectors of data by the angle between them rather than by their magnitudes. It is usually used for text document comparison.
What is an example of using cosine distance?
A customer relations team can use clustering to group similar customer reports based on the words used in the text. In other words, to measure similarities between textual documents, we can treat each text as a bag of words and, by counting the frequency of the words shared by two documents, measure their similarity. For word-count vectors, cosine similarity ranges from 0 to 1.
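The bag-of-words comparison can be sketched as follows. This is a simplified illustration: the function name `cosine_similarity` is an assumption, the sample reports are invented, and splitting on whitespace stands in for real tokenization.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two texts treated as bags of words."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product over the words that appear in both documents.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


# Hypothetical customer reports.
related = cosine_similarity("the app crashes on login", "login crashes the app")
unrelated = cosine_similarity("the app crashes on login", "refund my order please")
```

Reports that share many words score close to 1, and reports with no words in common score 0.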