Chapter 5 Flashcards
(20 cards)
explain the difference between the euclidean and the manhattan distance
> how to choose?
euclidean: the straight-line distance between two points
manhattan: points can only be connected by moving horizontally or vertically, not diagonally (hence the name, after Manhattan's street grid)
> choice highly depends on dataset and can mostly only be determined after experimentation
what is the minkowski-distance?
a generalization of the euclidean and manhattan distances: minkowski(q, x, y) = (sum over attributes k of |xk - yk|^q)^(1/q)
> minkowski(1,x,y) = manhattan distance
> minkowski(2,x,y) = euclidean distance
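A minimal Python sketch of this relationship (the example vectors are illustrative):

```python
import numpy as np

def minkowski(q, x, y):
    """Minkowski distance of order q between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** q) ** (1.0 / q))

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(1, x, y))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(2, x, y))  # Euclidean: sqrt(9 + 4 + 0) ~ 3.61
```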
what to consider when using euclidean or manhattan distance?
argue whether scaling is needed
> risk that attributes with a high magnitude and spread dominate the distance calculation
explain Gower's similarity
> when can this only be used?
> difference between binary/categorical/numerical attributes?
Gower's similarity: attribute k can only be used when it has a value for both instances i and j
- compute the similarity s(xi, xj) for every usable attribute k: the outcome is always scaled to [0, 1]
binary: 1 if the attribute is present (equal to 1) for both xi and xj, 0 otherwise
categorical: 1 if xi = xj, 0 otherwise
numeric: 1 - |xi - xj| / Rk, where Rk is the range of attribute k
- compute Gower's similarity:
> sum s(xi, xj) over all usable attributes and divide by the number of possible comparisons
- calculate the distance: 1 - Gower's similarity
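A minimal Python sketch of Gower's similarity and distance; the attribute types, example instances and the range of the numeric attribute are assumptions for illustration:

```python
import numpy as np

# attribute types and the numeric range are illustrative assumptions;
# missing values are encoded as None
TYPES = ['binary', 'categorical', 'numeric']
RANGES = {2: 10.0}          # R_k = max - min of numeric attribute k over the dataset

def gower_similarity(xi, xj, types=TYPES, ranges=RANGES):
    total, comparisons = 0.0, 0
    for k, t in enumerate(types):
        if xi[k] is None or xj[k] is None:
            continue        # attribute k is only used if both instances have a value
        if t == 'binary':
            s = 1.0 if (xi[k] == 1 and xj[k] == 1) else 0.0
        elif t == 'categorical':
            s = 1.0 if xi[k] == xj[k] else 0.0
        else:               # numeric: 1 - |xi_k - xj_k| / R_k, scaled to [0, 1]
            s = 1.0 - abs(xi[k] - xj[k]) / ranges[k]
        total += s
        comparisons += 1
    return total / comparisons

def gower_distance(xi, xj):
    return 1.0 - gower_similarity(xi, xj)

print(gower_distance([1, 'walking', 3.0], [1, 'running', 7.0]))   # ~0.47
```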
person level distance metrics:
> options for non-temporal distance metrics
non-temporal distance metrics between datasets:
- summarize the values of each attribute using the mean, std, max, etc., reducing each person's dataset to a single instance
> then apply distance metric
- describe each attribute by distribution of values
> difference in distribution parameters represents the distance
- apply a statistical test to check whether the attribute values of person x can be assumed to stem from the same distribution as those of person y
> take 1-p as distance metric
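A minimal sketch of the statistical-test option, assuming a two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp) and illustrative generated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heart_rate_x = rng.normal(70, 5, 200)   # illustrative attribute values of person x
heart_rate_y = rng.normal(75, 6, 180)   # illustrative attribute values of person y

# test whether both samples could stem from the same distribution
_, p_value = stats.ks_2samp(heart_rate_x, heart_rate_y)
distance = 1.0 - p_value                # 1 - p as the distance for this attribute
print(distance)                         # close to 1 when the distributions clearly differ
```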
temporal person level distance metrics:
name 2 raw-data based distance metrics (not DTW)
> assumptions?
temporal person level distance metrics:
- sum of the difference between individual points, e.g. by euclidean distance
- cross-correlation coefficient
> the data series might be shifted; define a lag tau to synchronize the series (minimization problem: find the value of the lag that minimizes the distance over all attributes)
> cross-correlation distance: 1 / (the sum of the pairwise products of the individual points in the two series), summed over all attributes
>>> both assume an equal number of points
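A minimal sketch of a lag-synchronized cross-correlation distance for two equally long series; the lag search range and the example series are illustrative, and a positive best correlation is assumed so that 1/ccc is well defined:

```python
import numpy as np

def cross_correlation(x, y, tau):
    """Sum of pairwise products after shifting y by a lag of tau steps."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if tau > 0:
        x, y = x[tau:], y[:len(y) - tau]
    elif tau < 0:
        x, y = x[:len(x) + tau], y[-tau:]
    return float(np.sum(x * y))

def ccc_distance(x, y, max_lag=3):
    # pick the lag that maximizes the correlation, i.e. minimizes 1/ccc
    best = max(cross_correlation(x, y, tau) for tau in range(-max_lag, max_lag + 1))
    return 1.0 / best

a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
b = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]   # same shape as a, shifted by one step
print(ccc_distance(a, b))                 # small distance once the lag is synchronized
```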
explain dynamic time warping
> advantage over ccc or euclidean?
> constraints?
dynamic time warping
> allows not only for a shift, but also for a difference in speed between the time series
> DTW creates pairs of instances, matching each time point in series a to a time point in series b
> constraints:
- monotonicity - the time order needs to be preserved
- boundary condition - the first and the last points need to be matched
(in the search space the path may only move up, to the right, or diagonally up and to the right)
> find the cheapest path - calculate the minimum cost for each position in the search space:
the difference between the values in that square
+ the cost of the cheapest path leading to that square
> the cost of the top-right square is the DTW distance
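A minimal sketch of the DTW cost computation described above; the absolute difference as local cost and the example series are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW between two 1-D series of possibly different lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0                                   # boundary: start at the first pair
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])               # difference between the matched values
            # monotonicity: a cell can only be reached from the left, below, or diagonally
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]                                  # cost of the top-right cell

# the second series runs slower at the start; DTW still finds a perfect match
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))   # 0.0
```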
explain k-means clustering
k-means:
- initialize k cluster centers randomly
- assign each datapoint to the cluster that minimizes the distance between the point and the cluster center
- recompute the center of each cluster
- repeat steps 2 and 3 until convergence
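A minimal sketch of these four steps, assuming Euclidean distance; k, the iteration cap, the seed and the generated data are illustrative (empty-cluster handling is omitted):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # 1. random init
    for _ in range(n_iter):
        # 2. assign each point to the closest center (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # 3. recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):                       # 4. stop at convergence
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
# two well-separated 2-D blobs as illustrative data
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = k_means(X, k=2)
print(centers)   # one center near (0, 0), the other near (5, 5)
```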
in what cases is k-means clustering inappropriate?
> solution?
> raw-data based k-means clustering is not appropriate for person level (what does the center of a set of datasets mean?)
>>> we can apply k-medoids clustering
(select the point that minimizes the distance to all other points in the cluster as the center)
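A minimal sketch of the medoid step only: given a precomputed (e.g. person-level) distance matrix, pick as center the member with the smallest summed distance to the rest of the cluster. The matrix and cluster memberships are illustrative:

```python
import numpy as np

def medoid(dist_matrix, members):
    """Cluster member with the smallest summed distance to all other members."""
    members = np.asarray(members)
    sub = dist_matrix[np.ix_(members, members)]
    return int(members[np.argmin(sub.sum(axis=1))])

# illustrative pairwise person-level distances between four persons
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.7, 0.9],
              [0.9, 0.7, 0.0, 0.1],
              [0.8, 0.9, 0.1, 0.0]])
print(medoid(D, [0, 1]), medoid(D, [2, 3]))   # medoids of the two clusters
```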
two requirements for non-hierarchical clustering?
- predefined number of clusters
- only consider non-overlapping clusters
explain agglomerative clustering
> what are 4 metrics for cluster distance?
agglomerative clustering: start with each datapoint being an individual cluster, combine them based on distance between clusters until everything is in one cluster
> distance between clusters:
- single linkage: distance between the two closest points as distance between clusters
- complete linkage: distance between the two points that are furthest apart as distance between clusters
- group average: average distance between the points in the clusters
- Ward’s method: the distance between clusters is the increase in within-cluster variance (squared error) caused by merging them
>>> to obtain a clustering, stop merging once the distance between the closest clusters exceeds a maximum distance threshold
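A minimal sketch using SciPy's hierarchical clustering, where method='single'/'complete'/'average'/'ward' correspond to the four cluster distance metrics above and the dendrogram is cut at an illustrative distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # illustrative data

# method can be 'single', 'complete', 'average' or 'ward'
Z = linkage(X, method='single')

# cut the dendrogram at a maximum distance threshold: merges above it are undone
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)  # two groups of ten points each
```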
explain divisive clustering
> which cluster to select?
divisive clustering: opposite of agglomerative clustering
> start with a single cluster: split cluster C by moving the point with the greatest dissimilarity to the rest of C into a new cluster C’
> keep moving points that are more similar to C’ than to the rest of C; stop splitting when no such point remains
>>> select the cluster with the largest diameter (maximum distance between points in the cluster)
what are 3 disadvantages of (non)-hierarchical clustering methods for the quantified self setting?
we might have a huge attribute space, causing:
- long computational times
- calculating distances over a large number of attributes becomes problematic (curse of dimensionality)
- results will not be very insightful
explain subspace clustering
subspace clustering:
- create “units” that partition the data space by splitting the range of each variable into epsilon distinct intervals
- points belong to a unit when they fall within the bounds (for all attributes)
- only consider “dense” units
- units form a cluster when they share a common face (indirect connections are allowed)
subspace clustering: what is selectivity?
> when do we call a unit dense?
selectivity: fraction of instances belonging to the unit out of all instances
> a unit is dense when its selectivity exceeds a certain threshold tau
subspace clustering: what is the minimal description of a cluster?
minimal description:
> smallest set of boundary conditions that describes all units in the cluster
subspace clustering: how to efficiently find units that are dense over all dimensions?
- a unit u can only be dense in k dimensions if all (k-1)-dimensional units whose constraints are a subset of u's constraints are also dense
> a k-dimensional unit u adds one more constraint than its (k-1)-dimensional subsets: it occupies a smaller proportion of the data space and can therefore never contain more data instances
>>> start at 1 dimension and work up to k dimensions
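A minimal sketch of the bottom-up step from 1-D to 2-D dense units; the number of intervals per attribute and the selectivity threshold tau are illustrative:

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, n_intervals, tau):
    """Dense 1-D units as (attribute, interval, membership mask) triples."""
    units = []
    for attr in range(X.shape[1]):
        edges = np.linspace(X[:, attr].min(), X[:, attr].max(), n_intervals + 1)
        idx = np.clip(np.digitize(X[:, attr], edges) - 1, 0, n_intervals - 1)
        for interval in range(n_intervals):
            mask = idx == interval
            if mask.mean() > tau:            # selectivity exceeds the threshold
                units.append((attr, interval, mask))
    return units

def dense_units_2d(units_1d, tau):
    """A 2-D unit is only checked if both of its 1-D projections are dense."""
    dense = []
    for u, v in combinations(units_1d, 2):
        if u[0] == v[0]:
            continue                         # must constrain different attributes
        if (u[2] & v[2]).mean() > tau:       # joint selectivity
            dense.append(((u[0], u[1]), (v[0], v[1])))
    return dense

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 3))               # illustrative 3-attribute data
units_1d = dense_units_1d(X, n_intervals=5, tau=0.2)
print(len(units_1d), len(dense_units_2d(units_1d, tau=0.1)))
```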
what two strong assumptions do all clustering algorithms make? (except datastream clustering)
- assume that an unlimited amount of data can be stored
- all data should be treated in the same way
> underlying mechanisms generating the data might change, requiring new models
> this is called temporal locality or concept drift
datastream mining: 3 clustering approaches that do not require the typical clustering assumptions
- maintain a window of size n and cluster only the last n elements: each newly arriving instance replaces an element in our window (e.g. oldest)
- store cluster centers for chunks of data >>> apply clustering to the cluster centers, storing only cluster centers and weights of clusters
- estimate a mixture model of n data points
> update estimates for newly arriving data
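A minimal sketch of the sliding-window option; the window size, the tiny 1-D two-means routine and the simulated stream with concept drift are illustrative:

```python
import numpy as np
from collections import deque

def two_means_1d(values, n_iter=20):
    """Tiny illustrative 1-D clustering routine (two-means)."""
    x = np.asarray(values, dtype=float)
    centers = np.array([x.min(), x.max()])
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == c].mean() for c in (0, 1)])
    return centers

rng = np.random.default_rng(0)
# simulated stream whose generating mechanism changes halfway (concept drift)
stream = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])

window = deque(maxlen=50)                    # the oldest element is dropped automatically
for value in stream:
    window.append(value)
    if len(window) == window.maxlen:
        centers = two_means_1d(window)       # cluster only the last n elements
print(centers)                               # both centers reflect the recent regime (~10)
```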
how to evaluate clustering performance?
evaluate clustering performance using silhouette score
- calculate the average distance of a point x to the other points in its own cluster: a(x)
- calculate the average distance of x to the points in the nearest other cluster: b(x)
- compute (b(x) - a(x)) / max(a(x), b(x)) and average this over all points
>>> provides a measure of how tight the clusters themselves are relative to the distance to the cluster closest to them
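A minimal sketch of the silhouette for a single point using Euclidean distance; the two-blob data and labels are illustrative:

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette score of point i: (b - a) / max(a, b), Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()                           # a(x)
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])  # b(x)
    return (b - a) / max(a, b)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])  # illustrative blobs
labels = np.array([0] * 15 + [1] * 15)
print(np.mean([silhouette(X, labels, i) for i in range(len(X))]))   # close to 1 for tight clusters
```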