explain the difference between the euclidean and the manhattan distance
> how to choose?
euclidean: simply the distance between two points
manhattan: points can only be connected by moving horizontally or vertically and not diagonally (hence manhattan)
> choice highly depends on dataset and can mostly only be determined after experimentation
what is the minkowski-distance?
a combination of euclidean and manhattan distance
> minkowski(1,x,y) = manhattan distance
> minkowski(2,x,y) = euclidean distance
what to consider when using euclidean or manhattan distance?
argue whether scaling is needed
> risk of certain attributes with high magnitude and spread might dominate distance calculations
explain Gowers similarity
> when can this only be used?
> difference between binary/categorical/numerical attributes?
Gowers similarity: attribute k can only be used when it has values for both instances i and j
1. compute similarity for all attributes k: outcome is always scaled to [0, 1]
binary: 1 if xi and xj are both present, 0 otherwise
categorical: 1 if xi = xj, 0 otherwise
2. compute gowers similarity:
> calculate sum of s(xi,xj) over all attributes and divide by the number of possible comparisons
3. calculate the distance: 1 - gowers similarity
person level distance metrics:
> options for non-temporal distance metrics
non-temporal distance metrics between datasets:
1. summarize values using mean, std, max etc, reducing each attribute to one instance
> then apply distance metric
2. describe each attribute by distribution of values
> difference in distribution parameters represents the distance
3. apply statistic test to test whether we can assume that the attribute distributions of person x stem from the same distribution as those of person y
> take 1-p as distance metric
temporal person level distance metrics:
name 2 raw-data based distance metrics (not DTW)
temporal person level distance metrics:
1. sum of the difference between individual points, e.g. by euclidean distance
2. cross-correlation coefficient
> dataseries might be shifted, define lag tau to synchronize the series (minimization problem: find value for lag that minimizes distance over all attributes)
> cross-correlation distance: 1/ sum over pairwise products of the individual points in both series over all attributes
>>> both assume equal number of points
explain dynamic time warping
> advantage over ccc or euclidean?
dynamic time warping
> not only allows for shit, but also for difference in speed between time series
> DWT creates pairs of instances, matching each time point in series a to a point in series b
1. monotonicity - time order needs to be preserved
2. boundary condition - first and last points need to be matched
(only allowed to move up or right)
> find cheapest path - calculate minimum cost for each position in search space:
difference between values in that square
+ cheapest path that leads towards square
> cost of top right square is distance
explain k-means clustering
1. initialize k cluster centers randomly
2. assign each datapoint to the cluster that minimizes the distance between the point and the cluster center
3. recompute the center of each cluster
4. repeat 2+3 until converge
in what cases is k-means clustering inappropriate?
> raw-data based k-means clustering is not appropriate for person level (what does the center of a set of datasets mean?)
>>> we can apply k-medoids clustering
(select point that minimizes the distance to all other points in cluster as center)
two requirements for non-hierarchical clustering?
1. predefined number of clusters
2. only consider non-overlapping clusters
explain agglomerative clustering
> what are 4 metrics for cluster distance?
agglomerative clustering: start with each datapoint being an individual cluster, combine them based on distance between clusters until everything is in one cluster
> distance between clusters:
1.single linkage: distance between the two closest points as distance between clusters
2. complete linkage: distance between the two points that are furthest apart as distance between clusters
3. group average: average distance between the points in the clusters
4. Ward's method: distance between clusters is the increase in standard deviation when merging the clusters
>>> merge clusters by considering maximum distance threshold for which cluster still should be merged
explain divisive clustering
> which cluster to select?
divisive clustering: opposite of agglomerative clustering
> start with single cluster: split cluster C by moving the point with the greatest dissimililarity to C to a new cluster C'
> stop when no point is more dissimilar to C than C'
>>> select the cluster with the largest diameter (maximum distance between points in the cluster)
what are 3 disadvantages of (non)-hierarchical clustering methods for the quantified self setting?
we might have huge attribute space, causing:
1. long computational times
2. calculating distance over large number of attributes problematic (curse of dimensionality)
3. results will not be very insightful
explain subspace clustering
1. create "units" that partition the data space by splitting the range of each variable into epsilon distinct intervals
2. points belong to a unit when they fall within the bounds (for all attributes)
3. only consider "dense" units
4. units form a cluster when they share a common face (indirect connections are allowed)
subspace clustering: what is selectivity?
> when do we call a unit dense?
selectivity: fraction of instances belonging to the unit out of all instances
> a unit is dense when its selectivity exceeds a certain threshold tau
subspace clustering: what is the minimal description of a cluster?
> smallest set of boundary conditions that describes all units in the cluster
subspace clustering: how to efficiently find units that are dense over all dimensions?
1. unit u can only be dense in k dimensions if all units of k-1 dimensions that are a subset of the constraints for unit u are also dense
> unit u in k dimensions is split up into one more dimension than unit u in k-1 dimensions: it occupies a smaller proportion of the data space and can thus never have more data instances in it
>>> start at 1 dimension and work up to k dimensions
what two strong assumptions do all clustering algoritms make? (except datastream clustering)
1. assume that an unlimited amount of data can be stored
2. all data should be treated in the same way
> underlying mechanisms generating the data might change, requiring new models
> this is called temporal locality or concept drift
datastream mining: 3 clustering approaches that do not require the typical clustering assumptions
1. maintain a window of size n and cluster only the last n elements: each newly arriving instance replaces an element in our window (e.g. oldest)
2. store cluster centers for chunks of data >>> apply clustering to the cluster centers, storing only cluster centers and weights of clusters
3. estimate a mixture model of n data points
> update estimates for newly arriving data
how to evaluate clustering performance?
evaluate clustering performance using silhouette score
1. calculate average distance of a point to the other points of a cluster a(x)
2. calculate the average distance to the points in the nearest other cluster b(x)
3. compare a(x) and b(x) and divide by the maximum of the two
>>> provides a measure of how tight the clusters themselves are relative to the distance to the cluster closest to them