Chapter 5 Flashcards
(20 cards)
explain the difference between the euclidean and the manhattan distance
> how to choose?
euclidean: the straight-line distance between two points
manhattan: points can only be connected by moving horizontally or vertically, not diagonally (hence the name, after Manhattan's street grid)
> choice highly depends on dataset and can mostly only be determined after experimentation
what is the minkowski-distance?
a generalization of the euclidean and manhattan distances: minkowski(q, x, y) = (sum over attributes k of |xk - yk|^q)^(1/q)
> minkowski(1,x,y) = manhattan distance
> minkowski(2,x,y) = euclidean distance
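A minimal Python sketch of this relationship (the example vectors are illustrative):

```python
import numpy as np

def minkowski(q, x, y):
    """Minkowski distance of order q between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** q) ** (1.0 / q))

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(1, x, y))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(2, x, y))  # Euclidean: sqrt(9 + 4 + 0) ~ 3.61
```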
what to consider when using euclidean or manhattan distance?
argue whether scaling is needed
> risk that attributes with a high magnitude and spread dominate the distance calculation
explain Gower's similarity
> when can this only be used?
> difference between binary/categorical/numerical attributes?
Gower's similarity: attribute k can only be used when it has a value for both instances i and j
- compute the similarity s(xi, xj) for every usable attribute k: the outcome is always scaled to [0, 1]
binary: 1 if the attribute is present (equal to 1) for both xi and xj, 0 otherwise
categorical: 1 if xi = xj, 0 otherwise
numeric: 1 - |xi - xj| / Rk, where Rk is the range of attribute k
- compute Gower's similarity:
> sum s(xi, xj) over all usable attributes and divide by the number of possible comparisons
- calculate the distance: 1 - Gower's similarity
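A minimal Python sketch of Gower's similarity and distance; the attribute types, example instances and the range of the numeric attribute are assumptions for illustration:

```python
import numpy as np

# attribute types and the numeric range are illustrative assumptions;
# missing values are encoded as None
TYPES = ['binary', 'categorical', 'numeric']
RANGES = {2: 10.0}          # R_k = max - min of numeric attribute k over the dataset

def gower_similarity(xi, xj, types=TYPES, ranges=RANGES):
    total, comparisons = 0.0, 0
    for k, t in enumerate(types):
        if xi[k] is None or xj[k] is None:
            continue        # attribute k is only used if both instances have a value
        if t == 'binary':
            s = 1.0 if (xi[k] == 1 and xj[k] == 1) else 0.0
        elif t == 'categorical':
            s = 1.0 if xi[k] == xj[k] else 0.0
        else:               # numeric: 1 - |xi_k - xj_k| / R_k, scaled to [0, 1]
            s = 1.0 - abs(xi[k] - xj[k]) / ranges[k]
        total += s
        comparisons += 1
    return total / comparisons

def gower_distance(xi, xj):
    return 1.0 - gower_similarity(xi, xj)

print(gower_distance([1, 'walking', 3.0], [1, 'running', 7.0]))   # ~0.47
```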
person level distance metrics:
> options for non-temporal distance metrics
non-temporal distance metrics between datasets:
- summarize the values of each attribute using the mean, std, max, etc., reducing each person's dataset to a single instance
> then apply distance metric
- describe each attribute by distribution of values
> difference in distribution parameters represents the distance
- apply a statistical test to check whether the attribute values of person x can be assumed to stem from the same distribution as those of person y
> take 1-p as distance metric
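A minimal sketch of the statistical-test option, assuming a two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp) and illustrative generated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heart_rate_x = rng.normal(70, 5, 200)   # illustrative attribute values of person x
heart_rate_y = rng.normal(75, 6, 180)   # illustrative attribute values of person y

# test whether both samples could stem from the same distribution
_, p_value = stats.ks_2samp(heart_rate_x, heart_rate_y)
distance = 1.0 - p_value                # 1 - p as the distance for this attribute
print(distance)                         # close to 1 when the distributions clearly differ
```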
temporal person level distance metrics:
name 2 raw-data based distance metrics (not DTW)
> assumptions?
temporal person level distance metrics:
- sum of the difference between individual points, e.g. by euclidean distance
- cross-correlation coefficient
> the data series might be shifted; define a lag tau to synchronize the series (minimization problem: find the value of the lag that minimizes the distance over all attributes)
> cross-correlation distance: 1 / (the sum of the pairwise products of the individual points in the two series), summed over all attributes
>>> both assume an equal number of points
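A minimal sketch of a lag-synchronized cross-correlation distance for two equally long series; the lag search range and the example series are illustrative, and a positive best correlation is assumed so that 1/ccc is well defined:

```python
import numpy as np

def cross_correlation(x, y, tau):
    """Sum of pairwise products after shifting y by a lag of tau steps."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if tau > 0:
        x, y = x[tau:], y[:len(y) - tau]
    elif tau < 0:
        x, y = x[:len(x) + tau], y[-tau:]
    return float(np.sum(x * y))

def ccc_distance(x, y, max_lag=3):
    # pick the lag that maximizes the correlation, i.e. minimizes 1/ccc
    best = max(cross_correlation(x, y, tau) for tau in range(-max_lag, max_lag + 1))
    return 1.0 / best

a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
b = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]   # same shape as a, shifted by one step
print(ccc_distance(a, b))                 # small distance once the lag is synchronized
```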
explain dynamic time warping
> advantage over ccc or euclidean?
> constraints?
dynamic time warping
> allows not only for a shift, but also for a difference in speed between the time series
> DTW creates pairs of instances, matching each time point in series a to a time point in series b
> constraints:
- monotonicity - the time order needs to be preserved
- boundary condition - the first and the last points need to be matched
(in the search space the path may only move up, to the right, or diagonally up and to the right)
> find the cheapest path - calculate the minimum cost for each position in the search space:
the difference between the values in that square
+ the cost of the cheapest path leading to that square
> the cost of the top-right square is the DTW distance
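A minimal sketch of the DTW cost computation described above; the absolute difference as local cost and the example series are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW between two 1-D series of possibly different lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0                                   # boundary: start at the first pair
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])               # difference between the matched values
            # monotonicity: a cell can only be reached from the left, below, or diagonally
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]                                  # cost of the top-right cell

# the second series runs slower at the start; DTW still finds a perfect match
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1]))   # 0.0
```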
explain k-means clustering
k-means:
- initialize k cluster centers randomly
- assign each datapoint to the cluster that minimizes the distance between the point and the cluster center
- recompute the center of each cluster
- repeat steps 2 and 3 until convergence
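A minimal sketch of these four steps, assuming Euclidean distance; k, the iteration cap, the seed and the generated data are illustrative (empty-cluster handling is omitted):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]          # 1. random init
    for _ in range(n_iter):
        # 2. assign each point to the closest center (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # 3. recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):                       # 4. stop at convergence
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
# two well-separated 2-D blobs as illustrative data
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = k_means(X, k=2)
print(centers)   # one center near (0, 0), the other near (5, 5)
```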
in what cases is k-means clustering inappropriate?
> solution?
> raw-data based k-means clustering is not appropriate for person level (what does the center of a set of datasets mean?)
>>> we can apply k-medoids clustering
(select the point that minimizes the distance to all other points in the cluster as the center)
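A minimal sketch of the medoid step only: given a precomputed (e.g. person-level) distance matrix, pick as center the member with the smallest summed distance to the rest of the cluster. The matrix and cluster memberships are illustrative:

```python
import numpy as np

def medoid(dist_matrix, members):
    """Cluster member with the smallest summed distance to all other members."""
    members = np.asarray(members)
    sub = dist_matrix[np.ix_(members, members)]
    return int(members[np.argmin(sub.sum(axis=1))])

# illustrative pairwise person-level distances between four persons
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.7, 0.9],
              [0.9, 0.7, 0.0, 0.1],
              [0.8, 0.9, 0.1, 0.0]])
print(medoid(D, [0, 1]), medoid(D, [2, 3]))   # medoids of the two clusters
```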
two requirements for non-hierarchical clustering?
- predefined number of clusters
- only consider non-overlapping clusters
explain agglomerative clustering
> what are 4 metrics for cluster distance?
agglomerative clustering: start with each datapoint being an individual cluster, combine them based on distance between clusters until everything is in one cluster
> distance between clusters:
- single linkage: distance between the two closest points as distance between clusters
- complete linkage: distance between the two points that are furthest apart as distance between clusters
- group average: average distance between the points in the clusters
- Ward’s method: the distance between clusters is the increase in within-cluster variance (squared error) caused by merging them
>>> to obtain a clustering, stop merging once the distance between the closest clusters exceeds a maximum distance threshold
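A minimal sketch using SciPy's hierarchical clustering, where method='single'/'complete'/'average'/'ward' correspond to the four cluster distance metrics above and the dendrogram is cut at an illustrative distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # illustrative data

# method can be 'single', 'complete', 'average' or 'ward'
Z = linkage(X, method='single')

# cut the dendrogram at a maximum distance threshold: merges above it are undone
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)  # two groups of ten points each
```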
explain divisive clustering
> which cluster to select?
divisive clustering: opposite of agglomerative clustering
> start with a single cluster: split cluster C by moving the point with the greatest dissimilarity to the rest of C into a new cluster C’
> keep moving points that are more similar to C’ than to the rest of C; stop splitting when no such point remains
>>> select the cluster with the largest diameter (maximum distance between points in the cluster)
what are 3 disadvantages of (non)-hierarchical clustering methods for the quantified self setting?
we might have a huge attribute space, causing:
- long computational times
- calculating distances over a large number of attributes becomes problematic (curse of dimensionality)
- results will not be very insightful
explain subspace clustering
subspace clustering:
- create “units” that partition the data space by splitting the range of each variable into epsilon distinct intervals
- points belong to a unit when they fall within the bounds (for all attributes)
- only consider “dense” units
- units form a cluster when they share a common face (indirect connections are allowed)
subspace clustering: what is selectivity?
> when do we call a unit dense?
selectivity: fraction of instances belonging to the unit out of all instances
> a unit is dense when its selectivity exceeds a certain threshold tau
subspace clustering: what is the minimal description of a cluster?
minimal description:
> smallest set of boundary conditions that describes all units in the cluster
subspace clustering: how to efficiently find units that are dense over all dimensions?
- a unit u can only be dense in k dimensions if all (k-1)-dimensional units whose constraints are a subset of u's constraints are also dense
> a k-dimensional unit u adds one more constraint than its (k-1)-dimensional subsets: it occupies a smaller proportion of the data space and can therefore never contain more data instances
>>> start at 1 dimension and work up to k dimensions
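A minimal sketch of the bottom-up step from 1-D to 2-D dense units; the number of intervals per attribute and the selectivity threshold tau are illustrative:

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, n_intervals, tau):
    """Dense 1-D units as (attribute, interval, membership mask) triples."""
    units = []
    for attr in range(X.shape[1]):
        edges = np.linspace(X[:, attr].min(), X[:, attr].max(), n_intervals + 1)
        idx = np.clip(np.digitize(X[:, attr], edges) - 1, 0, n_intervals - 1)
        for interval in range(n_intervals):
            mask = idx == interval
            if mask.mean() > tau:            # selectivity exceeds the threshold
                units.append((attr, interval, mask))
    return units

def dense_units_2d(units_1d, tau):
    """A 2-D unit is only checked if both of its 1-D projections are dense."""
    dense = []
    for u, v in combinations(units_1d, 2):
        if u[0] == v[0]:
            continue                         # must constrain different attributes
        if (u[2] & v[2]).mean() > tau:       # joint selectivity
            dense.append(((u[0], u[1]), (v[0], v[1])))
    return dense

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 3))               # illustrative 3-attribute data
units_1d = dense_units_1d(X, n_intervals=5, tau=0.2)
print(len(units_1d), len(dense_units_2d(units_1d, tau=0.1)))
```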
what two strong assumptions do all clustering algorithms make? (except datastream clustering)
- assume that an unlimited amount of data can be stored
- all data should be treated in the same way
> underlying mechanisms generating the data might change, requiring new models
> this is called temporal locality or concept drift
datastream mining: 3 clustering approaches that do not require the typical clustering assumptions
- maintain a window of size n and cluster only the last n elements: each newly arriving instance replaces an element in our window (e.g. oldest)
- store cluster centers for chunks of data >>> apply clustering to the cluster centers, storing only cluster centers and weights of clusters
- estimate a mixture model of n data points
> update estimates for newly arriving data
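A minimal sketch of the sliding-window option; the window size, the tiny 1-D two-means routine and the simulated stream with concept drift are illustrative:

```python
import numpy as np
from collections import deque

def two_means_1d(values, n_iter=20):
    """Tiny illustrative 1-D clustering routine (two-means)."""
    x = np.asarray(values, dtype=float)
    centers = np.array([x.min(), x.max()])
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == c].mean() for c in (0, 1)])
    return centers

rng = np.random.default_rng(0)
# simulated stream whose generating mechanism changes halfway (concept drift)
stream = np.concatenate([rng.normal(0, 1, 100), rng.normal(10, 1, 100)])

window = deque(maxlen=50)                    # the oldest element is dropped automatically
for value in stream:
    window.append(value)
    if len(window) == window.maxlen:
        centers = two_means_1d(window)       # cluster only the last n elements
print(centers)                               # both centers reflect the recent regime (~10)
```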
how to evaluate clustering performance?
evaluate clustering performance using silhouette score
- calculate the average distance of a point x to the other points in its own cluster: a(x)
- calculate the average distance of x to the points in the nearest other cluster: b(x)
- compute (b(x) - a(x)) / max(a(x), b(x)) and average this over all points
>>> provides a measure of how tight the clusters themselves are relative to the distance to the cluster closest to them
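A minimal sketch of the silhouette for a single point using Euclidean distance; the two-blob data and labels are illustrative:

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette score of point i: (b - a) / max(a, b), Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()                           # a(x)
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])  # b(x)
    return (b - a) / max(a, b)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])  # illustrative blobs
labels = np.array([0] * 15 + [1] * 15)
print(np.mean([silhouette(X, labels, i) for i in range(len(X))]))   # close to 1 for tight clusters
```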