Chapter 5 Flashcards

(20 cards)

1
Q

explain the difference between the euclidean and the manhattan distance

> how to choose?

A

euclidean: the straight-line distance between two points (square root of the sum of squared per-attribute differences)
manhattan: the sum of absolute per-attribute differences; points can only be connected by moving horizontally or vertically, not diagonally (like a city grid, hence manhattan)

> the choice depends highly on the dataset and can mostly only be determined through experimentation
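A minimal sketch of both metrics in Python (NumPy assumed; the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# euclidean: square root of the sum of squared per-attribute differences
euclidean = np.sqrt(np.sum((x - y) ** 2))   # ~3.61

# manhattan: sum of absolute per-attribute differences (axis-aligned moves only)
manhattan = np.sum(np.abs(x - y))           # 5.0
```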

2
Q

what is the minkowski-distance?

A

a generalization of the euclidean and manhattan distances: minkowski(q, x, y) = (sum over attributes k of |x_k - y_k|^q)^(1/q)

> minkowski(1,x,y) = manhattan distance

> minkowski(2,x,y) = euclidean distance
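A small sketch showing that the Minkowski distance reduces to the manhattan distance for q = 1 and the euclidean distance for q = 2 (example vectors are arbitrary):

```python
import numpy as np

def minkowski(q, x, y):
    # generalized distance: (sum_k |x_k - y_k|^q)^(1/q)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(1, x, y))  # manhattan distance: 5.0
print(minkowski(2, x, y))  # euclidean distance: ~3.61
```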

3
Q

what to consider when using euclidean or manhattan distance?

A

argue whether scaling is needed

> risk that attributes with a high magnitude and spread dominate the distance calculation

4
Q

explain Gowers similarity

> when can this only be used?

> difference between binary/categorical/numerical attributes?

A

Gower's similarity: attribute k can only be used when it has a value for both instances i and j

  1. compute the similarity s(xi, xj) for every attribute k: the outcome is always scaled to [0, 1]
    binary: 1 if xi and xj are both present, 0 otherwise
    categorical: 1 if xi = xj, 0 otherwise
    numeric: 1 - |xi - xj| / (range of attribute k over all instances)
  2. compute Gower's similarity:

> calculate the sum of s(xi, xj) over all attributes and divide by the number of possible comparisons

  3. calculate the distance: 1 - Gower's similarity
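A rough sketch of the computation, assuming instances are stored as NumPy arrays with np.nan for missing values and that the per-attribute types and numeric ranges are given; names and encoding choices are illustrative:

```python
import numpy as np

def gower_similarity(xi, xj, types, ranges):
    """xi, xj: attribute values of two instances (np.nan = missing),
    types: 'binary', 'categorical' or 'numeric' per attribute,
    ranges: range of each numeric attribute over the whole dataset."""
    s_sum, comparisons = 0.0, 0
    for k, t in enumerate(types):
        # attribute k is only used when both instances have a value
        if np.isnan(xi[k]) or np.isnan(xj[k]):
            continue
        if t == 'binary':
            s = 1.0 if xi[k] == 1 and xj[k] == 1 else 0.0
        elif t == 'categorical':
            s = 1.0 if xi[k] == xj[k] else 0.0
        else:  # numeric: scaled to [0, 1] by the attribute's range
            s = 1.0 - abs(xi[k] - xj[k]) / ranges[k]
        s_sum += s
        comparisons += 1
    return s_sum / comparisons

# distance = 1 - gower_similarity(xi, xj, types, ranges)
```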
5
Q

person level distance metrics:

> options for non-temporal distance metrics

A

non-temporal distance metrics between datasets:

  1. summarize the values per attribute using the mean, std, max, etc., reducing each dataset to a single instance

> then apply an ordinary distance metric

  2. describe each attribute by the distribution of its values

> the difference in distribution parameters represents the distance

  3. apply a statistical test to test whether the attribute distributions of person x stem from the same distribution as those of person y

> take 1 - p as the distance metric
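A sketch of option 3 using a two-sample Kolmogorov-Smirnov test per attribute (SciPy assumed); averaging 1 - p over the attributes is an illustrative choice:

```python
import numpy as np
from scipy import stats

def person_distance(data_x, data_y):
    """data_x, data_y: 2D arrays (time points x attributes) of two persons."""
    distances = []
    for k in range(data_x.shape[1]):
        # test whether attribute k of both persons stems from the same distribution
        _, p = stats.ks_2samp(data_x[:, k], data_y[:, k])
        distances.append(1.0 - p)  # low p -> distributions differ -> large distance
    return np.mean(distances)
```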

6
Q

temporal person level distance metrics:

name 2 raw-data based distance metrics (not DTW)

> assumptions?

A

temporal person level distance metrics:

  1. sum of the difference between individual points, e.g. by euclidean distance
  2. cross-correlation coefficient

> the data series might be shifted; define a lag tau to synchronize the series (minimization problem: find the lag value that minimizes the distance over all attributes)

> cross-correlation distance: 1 / (sum of the pairwise products of the individual points in both series, over all attributes)

>>> both assume an equal number of points in both series
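An illustrative sketch of the lag-based cross-correlation for a single attribute, assuming equally long series and a positive sum of products; the grid search over the lag is a simple stand-in for the minimization:

```python
import numpy as np

def ccc(x, y, lag):
    # sum of the pairwise products after shifting series y by the given lag
    if lag > 0:
        x, y = x[lag:], y[:-lag]
    elif lag < 0:
        x, y = x[:lag], y[-lag:]
    return np.sum(x * y)

def cc_distance(x, y, max_lag=5):
    # minimize the distance 1/ccc over the lag, i.e. maximize the cross-correlation
    best = max(ccc(x, y, lag) for lag in range(-max_lag, max_lag + 1))
    return 1.0 / best   # assumes the best sum of products is positive
```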

7
Q

explain dynamic time warping

> advantage over ccc or euclidean?

> constraints?

A

dynamic time warping

> not only allows for a shift, but also for differences in speed between the time series

> DTW creates pairs of instances, matching each time point in series a to a point in series b

> constraints:

  1. monotonicity - time order needs to be preserved
  2. boundary condition - first and last points need to be matched

(only allowed to move up or right)

> find the cheapest path - calculate the minimum cost for each position in the search space: the difference between the values in that square + the cost of the cheapest path leading to that square

> the cost of the top-right square is the distance
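A compact dynamic-programming sketch of DTW for two one-dimensional series; note that it also includes the diagonal step of the common formulation:

```python
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0                               # boundary: path starts at the first pair
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])           # difference between the matched values
            cost[i, j] = d + min(cost[i - 1, j],   # cheapest path leading to this square
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]                              # cost of the top-right square
```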

8
Q

explain k-means clustering

A

k-means:

  1. initialize k cluster centers randomly
  2. assign each datapoint to the cluster that minimizes the distance between the point and the cluster center
  3. recompute the center of each cluster
  4. repeat 2 + 3 until convergence
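A minimal NumPy sketch of the loop (random initialization by sampling data points; convergence is checked on the centers; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. initialize k cluster centers randomly (here: k random data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute the center of each cluster (empty clusters not handled here)
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # 4. repeat 2 + 3 until the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```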
9
Q

in what cases is k-means clustering inappropriate?

> solution?

A

> raw-data-based k-means clustering is not appropriate at the person level (what would the center of a set of datasets mean?)

>>> we can apply k-medoids clustering instead

(select the point that minimizes the distance to all other points in the cluster as the center)

10
Q

two requirements for non-hierarchical clustering?

A
  1. predefined number of clusters
  2. only consider non-overlapping clusters
11
Q

explain agglomerative clustering

> what are 4 metrics for cluster distance?

A

agglomerative clustering: start with each datapoint being an individual cluster, combine them based on distance between clusters until everything is in one cluster

> distance between clusters:

  1. single linkage: distance between the two closest points as distance between clusters
  2. complete linkage: distance between the two points that are furthest apart as distance between clusters
  3. group average: average distance between the points in the clusters
  4. Ward’s method: distance between clusters is the increase in standard deviation when merging the clusters

>>> merge clusters by considering a maximum distance threshold up to which clusters should still be merged
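A short sketch using SciPy's hierarchical clustering, where the method argument selects the linkage criterion; the data and the distance threshold t are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 3)                           # 20 instances, 3 attributes

Z = linkage(X, method='ward')                       # also: 'single', 'complete', 'average'
labels = fcluster(Z, t=1.5, criterion='distance')   # cut at a maximum distance threshold
```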

12
Q

explain divisive clustering

> which cluster to select?

A

divisive clustering: opposite of agglomerative clustering

> start with a single cluster: split cluster C by moving the point with the greatest dissimilarity to C to a new cluster C'

> stop when no remaining point is more dissimilar to C than to C'

>>> select the cluster with the largest diameter (maximum distance between points in the cluster)

13
Q

what are 3 disadvantages of (non)-hierarchical clustering methods for the quantified self setting?

A

we might have a huge attribute space, causing:

  1. long computational times
  2. calculating distances over a large number of attributes is problematic (curse of dimensionality)
  3. results that are not very insightful
14
Q

explain subspace clustering

A

subspace clustering:

  1. create “units” that partition the data space by splitting the range of each variable into epsilon distinct intervals
  2. points belong to a unit when they fall within the bounds (for all attributes)
  3. only consider “dense” units
  4. units form a cluster when they share a common face (indirect connections are allowed)
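A small sketch of the first steps (building the units and checking density), assuming epsilon intervals per attribute, a selectivity threshold tau, and non-constant attributes; the face-sharing merge step is omitted:

```python
import numpy as np
from collections import Counter

def dense_units(X, epsilon, tau):
    """Partition each attribute's range into epsilon intervals and
    return the units (one interval index per attribute) that are dense."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / epsilon
    # assign every point to the unit it falls into (for all attributes)
    idx = np.floor((X - mins) / widths).astype(int)
    idx = np.clip(idx, 0, epsilon - 1)          # put maximum values in the last interval
    counts = Counter(map(tuple, idx))
    # a unit is dense when its selectivity exceeds the threshold tau
    return [u for u, c in counts.items() if c / len(X) > tau]
```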
15
Q

subspace clustering: what is selectivity?

> when do we call a unit dense?

A

selectivity: fraction of instances belonging to the unit out of all instances

> a unit is dense when its selectivity exceeds a certain threshold tau

16
Q

subspace clustering: what is the minimal description of a cluster?

A

minimal description:

> smallest set of boundary conditions that describes all units in the cluster

17
Q

subspace clustering: how to efficiently find units that are dense over all dimensions?

A
  1. unit u can only be dense in k dimensions if all (k-1)-dimensional units whose constraints are a subset of the constraints of unit u are also dense

> a unit in k dimensions is split along one more dimension than a unit in k-1 dimensions: it occupies a smaller proportion of the data space and can thus never contain more data instances

>>> start at 1 dimension and work up to k dimensions

18
Q

what two strong assumptions do all clustering algorithms make? (except datastream clustering)

A
  1. assume that an unlimited amount of data can be stored
  2. all data should be treated in the same way

> underlying mechanisms generating the data might change, requiring new models

> this is called temporal locality or concept drift

19
Q

datastream mining: 3 clustering approaches that do not require the typical clustering assumptions

A
  1. maintain a window of size n and cluster only the last n elements: each newly arriving instance replaces an element in our window (e.g. oldest)
  2. store cluster centers for chunks of data >>> apply clustering to the cluster centers, storing only cluster centers and weights of clusters
  3. estimate a mixture model of n data points

> update estimates for newly arriving data
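A sketch of approach 1 (keep a window of the last n elements and re-cluster on arrival); k-means via scikit-learn is only an example clusterer, and the window size is illustrative:

```python
from collections import deque
import numpy as np
from sklearn.cluster import KMeans

window = deque(maxlen=500)          # keep only the last n = 500 instances

def on_new_instance(x, k=3):
    window.append(x)                # the newest instance replaces the oldest one
    if len(window) < k:
        return None
    # cluster only the instances currently in the window
    return KMeans(n_clusters=k, n_init=10).fit(np.array(window)).labels_
```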

20
Q

how to evaluate clustering performance?

A

evaluate clustering performance using silhouette score

  1. calculate the average distance of a point to the other points in its own cluster, a(x)
  2. calculate the average distance to the points in the nearest other cluster, b(x)
  3. take the difference and divide by the maximum of the two: s(x) = (b(x) - a(x)) / max(a(x), b(x))

>>> provides a measure of how tight the clusters themselves are relative to the distance to the cluster closest to them
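In practice the (mean) silhouette s(x) = (b(x) - a(x)) / max(a(x), b(x)) can be computed with scikit-learn; the data and clusterer below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(100, 2)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# mean of (b(x) - a(x)) / max(a(x), b(x)) over all points; closer to 1 = tighter clusters
print(silhouette_score(X, labels))
```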