# Approximations Flashcards

1
Q

Provide two important parameters of probabilistic bounds

A

Epsilon (ϵ): Epsilon represents the error bound on the estimate. The computed answer will be within a factor of ϵ of the actual answer with a probability of at least 1-δ. In other words, a smaller epsilon value means a more accurate estimate, while a larger epsilon value indicates a less accurate estimate.
(Depending on the sketch, the error bound is proportional to a global quantity such as the stream length rather than to the actual answer, so the relative error becomes smaller as the actual value increases.)

Delta (δ): Delta represents the confidence level of the estimate. With a probability of at least 1-δ, the computed answer will be within the error bounds specified by epsilon. In other words, a smaller delta value indicates a higher confidence level in the accuracy of the estimate, while a larger delta value represents a lower confidence level.

2
Q

Summarize the difference between Equi-width and Equi-depth histograms.

A

Equi-width: Divide the distinct values of your dataset into equally sized buckets in order. e.g. for width = 4, [1..8] -> [[1..4], [5..8]]. If we’re interested in frequencies, we count the frequencies of the whole bucket. The information that is stored per bucket is the starting and ending number and the frequency. We lose information as we cannot tell anything about the internal distribution of the values contained in a bucket.

Equi-depth: Instead of grouping on the number of distinct values in the dataset, we group by the weight (frequency) of each distinct value, so that we get equally large buckets. The bucket size is fixed, so if one value has a high frequency, then the bucket will contain a larger portion of this value.
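A minimal Python sketch of both constructions (function names and bucket parameters are illustrative):

```python
from collections import Counter

def equi_width(values, width):
    """Bucket the sorted distinct values, `width` distinct values per bucket."""
    freq = Counter(values)
    distinct = sorted(freq)
    buckets = []
    for i in range(0, len(distinct), width):
        chunk = distinct[i:i + width]
        # per bucket we keep only (start, end, total frequency);
        # the internal distribution of the bucket is lost
        buckets.append((chunk[0], chunk[-1], sum(freq[v] for v in chunk)))
    return buckets

def equi_depth(values, depth):
    """Bucket the sorted values so that every bucket holds `depth` items."""
    ordered = sorted(values)
    chunks = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(c[0], c[-1], len(c)) for c in chunks]
```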

3
Q

Is the equi-width or equi-depth histogram more accurate? Which can be used for guaranteeing accuracy?

A

Equi-depth is more accurate and can be used to guarantee accuracy: every bucket holds a fixed total frequency (the depth), so the maximum error of an estimate is bounded by that depth. Equi-width cannot do that. The estimate it gives is the size of the bucket over the length of the bucket; for example, if the size is 16 (the frequency is 16) and the length is 2 (there are 2 distinct values in this bucket), the maximum error is related to the size of the bucket, which is unbounded.

4
Q

How can histograms be used to estimate the join cardinality? What is the issue with this?

A

When you have two columns of data, the join cardinality can be estimated (inaccurately) by calculating a histogram over each column. Then, sum up the element-wise products of the histograms to get an estimate of the join cardinality.

The problem is that we have no accuracy guarantees as we lose information about the distribution of the samples when creating histograms.
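A toy illustration of the estimate and its inaccuracy (bucketing values by `v // bucket` is an assumption made for this example):

```python
from collections import Counter

def join_cardinality_estimate(col_a, col_b, bucket):
    """Sum of element-wise products of the two bucketed histograms."""
    ha = Counter(v // bucket for v in col_a)
    hb = Counter(v // bucket for v in col_b)
    # all values falling into the same bucket are assumed to match,
    # which is why the estimate carries no accuracy guarantee
    return sum(ha[b] * hb[b] for b in ha.keys() & hb.keys())
```

For `col_a = [1, 2, 3]`, `col_b = [2, 2, 5]` and bucket width 2 this estimates 4 matches, while the true join size is 2 (only the value 2 joins): the histograms cannot tell which values inside a bucket actually coincide.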

5
Q

Why do multi-dimensional histograms not save a lot of space? Give an example.

A

When adding more dimensions to the histogram, the histogram becomes sparser: the probability that two points share the same combination of bucket values shrinks with every dimension added.

An example:
We have three columns, a price column, a product column and a quantity column. The product has a price, and the quantity specifies how many of these products were bought in a single order. We can then create a 2-D histogram to see the distribution of quantities per order for products in a specific price-range.

             | Quantity: 1-2 | Quantity: 3-4 | Quantity: 5-6
Price: 0-10  |       1       |       1       |       0
Price: 11-20 |       2       |       0       |       0
Price: 21-30 |       1       |       0       |       0

In this case the table grows exponentially in the number of dimensions, which reduces the benefit of using histograms for this purpose.

6
Q

Why is it difficult to construct histograms in a streaming fashion? Give an argument for both equi-width and equi-depth histograms.

A

Equi-depth: We need to know the distribution a-priori. We cannot set appropriate boundaries if we do not know anything about the distribution of the data. If the boundaries do not capture the distribution of values, the histogram is less useful.

Equi-width: We need to know the domain (range of values), e.g. min and max salary. Equi-width histograms divide the data into equally sized intervals (buckets) based on the range of values in the data. To determine the appropriate bucket boundaries, you need to know the domain (min and max) of the data. We would also need to know the domain so that we include all possible values into the histogram.

7
Q

Provide three characteristics of sketches

A
1. They require only a single pass over the data (each record is examined at most once)
2. They are logarithmic in space (sublinear in the number of items)
3. They have probabilistic guarantees on the quality of the approximate answer
8
Q

Provide the steps you take when designing your own sketch

A
1. Define the functionality (e.g. support containment queries)
2. Define the input domain. What should the sketch accept? What is it we want to summarize. (e.g. integers: [0..2^32-1])
3. Define the output domain. (e.g. map it to 8 bits) (sketch)
4. Define the mapping function from the input to the output domain (e.g. hash function h(x) = |(3x + 2) % 8|)
9
Q

Name four important properties of a hash function

A
1. It should evenly distribute items to the output domain
2. This should be done independent of the previously hashed elements
3. If using multiple hash functions for the same task, make sure they are pairwise independent, namely they don’t use the same components. Example of pairwise dependent hash functions:

h1(x) = |(7x + 11) % 17 | % 8
h2(x) = |(3x + 11) % 13 | % 8

4. Make sure to use large prime numbers
10
Q

How do you create a very simple sketch for a containment query, how do you query it?

A

Create:
1. Turn your input into an integer
2. Initialize a bitvector with size m
3. hash your input so that it maps to an index <= m
4. Set the bit at the index to 1
5. If there is already a 1 there, do nothing

Query:
1. hash your input with the same hash function
2. check if the bit in the bitvector at the hash index is set to 1

Better methods are Bloom filters or Cuckoo filters
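A minimal sketch of these steps (the hash function and size m = 8 echo the toy examples in these cards and are not prescriptive):

```python
M = 8  # size of the bitvector (illustrative)

def h(x):
    return abs(3 * x + 2) % M  # toy hash mapping an integer to an index < M

def insert(bits, x):
    bits[h(x)] = 1  # an already-set bit stays 1, so insertion is idempotent

def query(bits, x):
    # collisions can make this a false positive, never a false negative
    return bits[h(x)] == 1

bits = [0] * M
insert(bits, 5)
```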

11
Q

How do you calculate the space complexity for a simple containment query sketch?

A

O(m) = O(n / ln(1 − Pr_collision))
- Where n is the number of inputs
- and Pr_collision is the probability of a collision.

12
Q

What is meant with composability and idempotency?

A

Composability: a sketch is composable if it can be built in a distributed fashion and combined later.

Idempotency: Repeated insertions do not change the synopsis.

13
Q

Explain the benefits of repetition in approximations.

A

Let’s say we have a randomized algorithm with error probability p (and correctness probability 1-p)

If we repeat this algorithm k times, the probability that every run fails is p · p · … · p = p^k, so the correctness probability evaluates to 1 − p^k.

The correctness probability increases with the number of independent trials.

14
Q

Explain the main components of bloom filters and how it can be used for a containment query.

A
• A bit array of length m, with all bits initially set to 0
• k pairwise independent hash functions

When we observe a new value, we hash the value with each hash function to obtain an index and we set the resulting index of the bit array to 1. Then when we want to query if a value is contained in the set, we hash the value with each hash function and check if the elements of the resulting indices are 1 in the bit vector. If any of them are not 1, we return false, else we return true.
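A compact sketch of this, using salted SHA-256 digests as a stand-in for k pairwise independent hash functions (an assumption, not the construction from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, [0] * m

    def _indices(self, item):
        # one salted digest per "hash function"
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # false negatives are impossible; false positives are not
        return all(self.bits[i] for i in self._indices(item))
```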

15
Q

What is the probability of a false positive in a bloom filter?

A

pr_fp = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k

The FP probability of a bloom filter scales exponentially with the number of items given a set number of hash functions.

16
Q

How do you determine the optimal number of hash functions for a bloom filter?

A

k = (m / nmax) ln 2
where:
- k is the number of hash functions
- m is the number of bits in your array
- and n_max is the maximum number of distinct items

17
Q

How can we estimate the (optimal) length of the bit array of a bloom filter, given the estimated number of distinct items we can encounter and the false positive rate of the filter?

A

(Indirectly mentioned in the videos, but we had to figure it out ourselves)

The slides mention:
pr_fp = 0.6185^(m/n)

So:
m = n * log(pr_fp) / log(0.6185)
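Putting this card and the previous one together, a small helper (names are mine) that sizes a filter from n and the target false-positive rate:

```python
import math

def bloom_size(n, p_fp):
    # m from the slide formula pr_fp = 0.6185^(m/n)
    m = math.ceil(n * math.log(p_fp) / math.log(0.6185))
    # optimal k from card 16: k = (m / n) * ln 2
    k = round((m / n) * math.log(2))
    return m, k
```

For example, `bloom_size(1000, 0.01)` gives roughly 9.6 bits per item and 7 hash functions.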

18
Q

Can we remove items from a bloom filter? How should we modify the basic bloom filter data structure?

A

No; for that we need a counting bloom filter, where instead of a single bit per position in the filter, we keep a small counter. We again hash the inserted value with each hash function and increment the counters at the resulting indices by 1. When we want to delete an item, we decrement the counters at the indices its hashes point to.
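A counting variant, again with salted SHA-256 as a stand-in for the k hash functions:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.counters = m, k, [0] * m

    def _indices(self, item):
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.counters[i] += 1

    def remove(self, item):
        # only safe for items that were actually inserted
        for i in self._indices(item):
            self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indices(item))
```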

19
Q

Give the space and computational complexity of (counting) bloom filters

A

Space: O(m)
Computation: O(k) per item insert / query

Counting bloom filters:
Space: O(nr bits per counter * m)

Where:
- m is the size of the filter
- k are the number of hash functions

20
Q

Provide the most important components of Cuckoo filters and how they work

A
1. A signature function: a hash function that reduces the input in size (hash to a small integer)
2. h(): a hash function that maps an input to a bucket index in [0..m−1] (m = number of rows)
3. A pair of hash functions derived from the previous hash function:
- h1(x) = h(x)
- h2(x) = h(x) ⊕ h(sign(x))
(and similarly, h1(x) = h2(x) ⊕ h(sign(x)))

We hash the value using h1(x) and h2(x) and check if either bucket has space.
- yes -> store the signature in an empty position
- no -> remove and rehash one of the resident signatures.

Recursively apply until insertion has succeeded, or we have reached the MAX_RETRIES parameter. In the latter case, return failure.
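A sketch of the insert/lookup logic under a few assumptions: a power-of-two number of buckets so the XOR stays in range, SHA-256 stand-ins for the hash and signature functions, and 2 slots per bucket.

```python
import hashlib, random

BUCKETS, SLOTS, MAX_RETRIES = 8, 2, 16  # illustrative; BUCKETS is a power of two

def _h(x):
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

def signature(x):
    return _h(("sig", x)) % 255 + 1  # small fingerprint, never 0

def partner(index, sig):
    # power-of-two BUCKETS keeps the XOR in range and makes
    # partner(partner(i, s), s) == i
    return index ^ (_h(sig) % BUCKETS)

def insert(table, x):
    sig = signature(x)
    i1 = _h(x) % BUCKETS
    i2 = partner(i1, sig)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(sig)
            return True
    # both buckets full: evict a resident signature and relocate it
    i = random.choice((i1, i2))
    for _ in range(MAX_RETRIES):
        j = random.randrange(SLOTS)
        sig, table[i][j] = table[i][j], sig
        i = partner(i, sig)  # the evicted signature's other bucket
        if len(table[i]) < SLOTS:
            table[i].append(sig)
            return True
    return False  # give up: the filter is considered full

def contains(table, x):
    sig = signature(x)
    i1 = _h(x) % BUCKETS
    return sig in table[i1] or sig in table[partner(i1, sig)]
```

The XOR trick is what lets us recover an item's other bucket from whichever bucket its signature currently sits in, without knowing the original item.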

21
Q

How do collisions form in Cuckoo filters? How can we lower the false positive rate?

A

Two signatures of different numbers can be the same, hence when querying it is possible to get false-positives. They can be reduced by increasing the size of the signature (more bits) or by increasing the size of the cuckoo filter.

22
Q

What are the 6 most important properties of FM-sketches?

A
• Single pass: each record examined at most once
• Small space: O(1/ε² · log(1/δ))
• Real-time: O(1/ε² · log(1/δ)) per record
• Not delete-proof
• Composable: Bitwise OR of the bits is possible if the hash functions used to construct the bit arrays are the same.
• Idempotent: Repeated insertions do not change the synopsis.

(number of hash functions and bit arrays = 1/ε² · log(1/δ))

23
Q

Provide the basic estimation technique (query) of distinct values when using FM sketches.

A

After the bit array has been created from the datastream, we find the first 0 from the least significant bit. Take the index of that element, and insert into function:
d = c · 2^R

Where
- d is the estimation of distinct values of the stream
- R is the index of the first 0 from the LSB
- c is a scaling constant (e.g. 1.3)

Averaging many copies (with different hash functions) improves accuracy. With O(1/ε² · log(1/δ)) copies, we get an (ε, δ)-approximation.

24
Q

Explain how values are hashed in FM sketches.

A

An FM sketch uses a hash function that maps a value to the bit array defined in the sketch. This hash function can be constructed from a uniform hash function that maps a value into a certain domain (e.g. [0..2^32]); the position of the least significant set bit of the hashed value is then taken as the index for the bit vector. That is, in the binary representation of the hashed value, we look at the index of the right-most bit that is set to 1, and set that bit to 1 in the FM sketch. Example:
Uniform hash function h1(x) = (3x + 7 ) % 11

h1(1) = 10 = 0b1010
LSB index = 1 (second bit from the right)

h1(6) = 3 = 0b11
LSB = 0

So our FM sketch looks like: …000000000011
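The hashing and the estimate from the previous card, sketched with the toy hash from this example (c = 1.3 as in card 23):

```python
def lsb_index(v):
    # index of the right-most set bit (i.e. the number of trailing zeros)
    return (v & -v).bit_length() - 1

def fm_insert(bits, value, a=3, b=7, p=11):
    # toy uniform hash h(x) = (a*x + b) % p, matching the example above
    h = (a * value + b) % p
    if h:  # h == 0 has no set bit; skip it
        bits[lsb_index(h)] = 1

def fm_estimate(bits, c=1.3):
    # R = index of the first 0 counted from the least significant bit
    R = next(i for i, bit in enumerate(bits + [0]) if bit == 0)
    return c * 2 ** R
```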

25
Q

Explain the basics of the count-min sketch

A

Construct a 2D array of counters with d rows and w columns.
For each of the d rows, construct a hash function that maps a value to a column index.
When a new value is encountered, hash it with each row's hash function and increment the counter at the resulting index by the weight of the value.

To query the count of an element, hash the element for each row again, and take the counter that has the minimum value.
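A minimal implementation, with one salted SHA-256 digest per row standing in for the per-row hash functions:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, weight=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += weight

    def query(self, item):
        # the minimum over rows removes as much collision noise as possible;
        # the estimate can overcount but never undercount
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```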

26
Q

Why do we take the minimum value in CM sketches?

A

The primary reason is that we want to remove noise from our estimation. There will always be some collisions in each row given enough seen elements, meaning that a counter may not be accurate. By taking the minimum value of all counters of that particular value, we minimize the noise as a result from collisions.

(We maximize the signal-to-noise ratio)

27
Q

How can we control the accuracy of the CM sketch? What is the trade-off?

A

By setting the width of the 2D array to w = ⌈e/ε⌉ and the number of rows to d = ⌈ln(1/δ)⌉, the probability of a significant error is at most delta: Pr(f_pred(j) − f_actual(j) ≥ ε · ‖S‖) ≤ δ, where ‖S‖ is the number of items being summarized (the stream length).

Epsilon (ε) (width) determines the error bound in the estimate
Delta (δ) (depth) determines the probability that the error exceeds the error bound

If you make epsilon smaller, then the 2D array becomes wider, reducing the chance for collisions and thus reducing the margin of error. By reducing delta, we add more rows to the sketch, which reduces the noise, and thus lowers the probability of getting outside of the error margins.

The trade-off is that by lowering ε and δ we increase computational and memory cost.

28
Q

In which cases are CM sketches most useful and why?

A

If our data streams contain frequently occurring values. The error term in the probability of significant errors is a fixed fraction of the stream length, not of the answer itself. This means that the error on the count of infrequently occurring values can be relatively large. Example:
If our ε is 0.05 and our stream length is 1000, then our margin of error is 50. If we see some values only once, then 50 is a large margin of error.

Thus, CM sketches are most useful when dealing with data streams that have a high frequency of recurring values, as their constant error term provides more accurate estimations for these values

29
Q

How can we combine CM sketches to support range queries?

A

To use CM sketches for range queries, we must first count the number of possible dyadic interval scales over the expected domain range. For example, when our domain consists of all integer values, the maximum scale is 32, as integers represent 2^32 values.

For each scale, we create a separate CM sketch. Each sketch represents a particular granularity over the dyadic intervals. For instance, sketch 0 (scale 2^0) summarizes all individual numbers in the integer domain, while sketch 2 (scale 2^2) summarizes the ranges (if starting at 0) [0-3], [4-7], [8-11], and so on.

When a new data point arrives, we calculate the dyadic interval to which the arrival belongs for each sketch. We then increment the counter for this range in the sketch and repeat the process for the other sketches.

To answer a range query, we break the range query into dyadic intervals, retrieve the count for each interval in the corresponding sketch (based on the scale), and sum them up.
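The decomposition step can be sketched as follows; each returned interval `[k·2^s, (k+1)·2^s − 1]` would then be looked up in the sketch for scale s:

```python
def dyadic_intervals(lo, hi):
    """Split [lo, hi] into the minimal cover of dyadic intervals."""
    out = []
    while lo <= hi:
        # grow to the largest scale s where lo is aligned and the interval fits
        s = 0
        while lo % (2 ** (s + 1)) == 0 and lo + 2 ** (s + 1) - 1 <= hi:
            s += 1
        out.append((lo, lo + 2 ** s - 1))
        lo += 2 ** s
    return out
```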

30
Q

What does composability mean in CM sketches and how can they be composed.

A

Composability means that you can construct two CM sketches on two sets of (similar) data and combine them later on. This assumes both CM sketches are constructed using the same hash functions!

The counters can be summed in element-wise fashion to combine two sketches.

31
Q

How can you estimate the inner join size of a set of observed values?

A

We can do that using CM sketches. Given the two sets you want to perform an inner join on, build a count-min sketch for each (with the same hash functions) and element-wise multiply the counters. Then, for each row, sum the products together. The estimate of the join size is the minimum over the resulting per-row sums.
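Given the two sketches' raw counter arrays (rows of counters, built with identical hash functions), the estimate is:

```python
def join_size_estimate(cms_a, cms_b):
    """Per row, sum the element-wise products; take the minimum over rows."""
    row_sums = [sum(x * y for x, y in zip(row_a, row_b))
                for row_a, row_b in zip(cms_a, cms_b)]
    return min(row_sums)
```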

32
Q

How do jumping windows work in data streams?

A

A window is a subpart of the stream for which we want to perform some operation. For example we would like to count the number of true bits in a window of a binary string.

A jumping window partitions the window into smaller parts (of size z). For each sub-window we calculate the statistic, such as the count. Every time a sub-window completes, we update the total count by summing the counts of all the sub-windows.

Using more sub-windows provides better results, as the granularity increases. The maximum error is z.

Problems: We don’t know the window size beforehand. One user might want a count of the last 3 arrivals, while another of the last 100. The error is also not relative to the answer, but only to the window size.
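A minimal sketch of a jumping-window counter for true bits (class and parameter names are mine):

```python
from collections import deque

class JumpingWindowCount:
    """Count of true bits over the last n_sub sub-windows of size z each."""
    def __init__(self, z, n_sub):
        self.z, self.n_sub = z, n_sub
        self.counts = deque(maxlen=n_sub)  # completed sub-window counts
        self.current, self.seen = 0, 0

    def observe(self, bit):
        self.current += bit
        self.seen += 1
        if self.seen == self.z:  # sub-window completes: jump
            self.counts.append(self.current)
            self.current, self.seen = 0, 0

    def query(self):
        # total over completed sub-windows; off by at most z at the boundary
        return sum(self.counts)
```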

33
Q

What’s the space complexity of exponential histograms?

A

O((1/ε) · log² W)
Where W is the maximum size of the sliding window that we support and ε is the maximum relative error.

34
Q

How can Exponential Histograms be queried? What is the maximum absolute error?

A

Example Query: count all true bits arrived in the last tq arrivals.

Start from the current time and walk back through the buckets until tq is covered. Sum up the counts of all fully covered buckets; for the last bucket, which is not fully covered, add half of its value to the rest.

The maximum absolute error is therefore b / 2, where b is the size of the bucket that was not fully covered.

35
Q

In Exponential histograms we have approximation guarantees based on two invariants. Explain invariant 1 and show that the maximum relative error <= epsilon

A

b_j / (2 · (1 + Σ_{i=j+1}^{last} b_i)) ≤ ε
Where:
- Σ_{i=j+1}^{last} b_i is the size (summed counts) of the buckets that come after j
- b_j is the size (max count) of bucket j

To show that the maximum relative error is at most ε: the only uncertainty in a query is the oldest, partially covered bucket j, which contributes an absolute error of at most b_j / 2. The true answer is at least 1 + Σ_{i=j+1}^{last} b_i, since the newer buckets are fully covered and at least one element of bucket j falls inside the window. The relative error is therefore at most (b_j / 2) / (1 + Σ_{i=j+1}^{last} b_i), which is ≤ ε by invariant 1.

36
Q

In Exponential histograms we have approximation guarantees based on two invariants. Explain invariant 2 and what role it play in updating the exponential histogram.

A

Invariant 2: For every bucket size other than the oldest, there are at most (k/2 + 1) and at least k/2 buckets of that size, where k = 1 / epsilon. For the oldest bucket size, there are at most (k/2 + 1) buckets of that size.

For new arrivals, add a new bucket of size 1 to the right and check invariant 2: if it is violated at buckets of size i, merge the two oldest buckets of size i to create a new bucket of size 2i. Recursively check invariant 2 until it is satisfied.
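The update rule can be sketched as follows, representing the histogram as a list of bucket sizes, oldest first (a simplification: a real exponential histogram also stores bucket timestamps so old buckets can expire):

```python
def eh_insert(buckets, bit, k):
    """buckets: list of bucket sizes, oldest (largest) first; k ≈ 1/epsilon."""
    if bit != 1:
        return
    buckets.append(1)  # new arrivals start as a size-1 bucket on the right
    size = 1
    while buckets.count(size) > k // 2 + 1:  # invariant 2 violated at `size`
        i = buckets.index(size)        # the two oldest buckets of this size
        buckets[i:i + 2] = [2 * size]  # merge them into one of twice the size
        size *= 2                      # merging may cascade to the next size
```

With k = 2 (ε = 0.5), feeding five 1-bits yields buckets [2, 2, 1]: never more than k/2 + 1 = 2 buckets of any one size.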