# Approximations Flashcards

1
Q

Provide two important parameters of probabilistic bounds

A

Epsilon (ϵ): Epsilon represents the error bound on the estimate. The computed answer will be within a factor of ϵ of the actual answer with a probability of at least 1-δ. In other words, a smaller epsilon value means a more accurate estimate, while a larger epsilon value indicates a less accurate estimate.
(Depending on the sketch, the error bound is proportional to a global quantity such as the stream length rather than to the actual answer, so the relative error becomes smaller as the actual value increases.)

Delta (δ): Delta represents the confidence level of the estimate. With a probability of at least 1-δ, the computed answer will be within the error bounds specified by epsilon. In other words, a smaller delta value indicates a higher confidence level in the accuracy of the estimate, while a larger delta value represents a lower confidence level.

2
Q

Summarize the difference between Equi-width and Equi-depth histograms.

A

Equi-width: Divide the distinct values of your dataset into equally sized buckets in order. e.g. for width = 4, [1..8] -> [[1..4], [5..8]]. If we’re interested in frequencies, we count the frequencies of the whole bucket. The information that is stored per bucket is the starting and ending number and the frequency. We lose information as we cannot tell anything about the internal distribution of the values contained in a bucket.

Equi-depth: Instead of grouping on the number of distinct values in the dataset, we group by the weight (frequency) of each distinct value, so that we get equally large buckets. The bucket size is fixed, so if one value has a high frequency, then the bucket will contain a larger portion of this value.
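A minimal Python sketch of both constructions (function names and bucket parameters are illustrative):

```python
from collections import Counter

def equi_width(values, width):
    """Bucket the sorted distinct values, `width` distinct values per bucket."""
    freq = Counter(values)
    distinct = sorted(freq)
    buckets = []
    for i in range(0, len(distinct), width):
        chunk = distinct[i:i + width]
        # per bucket we keep only (start, end, total frequency);
        # the internal distribution of the bucket is lost
        buckets.append((chunk[0], chunk[-1], sum(freq[v] for v in chunk)))
    return buckets

def equi_depth(values, depth):
    """Bucket the sorted values so that every bucket holds `depth` items."""
    ordered = sorted(values)
    chunks = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(c[0], c[-1], len(c)) for c in chunks]
```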

3
Q

Is the equi-width or equi-depth histogram more accurate? Which can be used for guaranteeing accuracy?

A

Equi-depth is more accurate and can be used to guarantee accuracy: every bucket holds a fixed total frequency (the depth), so the maximum error of an estimate is bounded by that depth. Equi-width cannot do that. The estimate it gives is the size of the bucket over the length of the bucket; for example, if the size is 16 (the frequency is 16) and the length is 2 (there are 2 distinct values in this bucket), the maximum error is related to the size of the bucket, which is unbounded.

4
Q

How can histograms be used to estimate the join cardinality? What is the issue with this?

A

When you have two columns of data, the join cardinality can be estimated (inaccurately) by calculating a histogram over each column. Then, sum up the element-wise products of the histograms to get an estimate of the join cardinality.

The problem is that we have no accuracy guarantees as we lose information about the distribution of the samples when creating histograms.
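A toy illustration of the estimate and its inaccuracy (bucketing values by `v // bucket` is an assumption made for this example):

```python
from collections import Counter

def join_cardinality_estimate(col_a, col_b, bucket):
    """Sum of element-wise products of the two bucketed histograms."""
    ha = Counter(v // bucket for v in col_a)
    hb = Counter(v // bucket for v in col_b)
    # all values falling into the same bucket are assumed to match,
    # which is why the estimate carries no accuracy guarantee
    return sum(ha[b] * hb[b] for b in ha.keys() & hb.keys())
```

For `col_a = [1, 2, 3]`, `col_b = [2, 2, 5]` and bucket width 2 this estimates 4 matches, while the true join size is 2 (only the value 2 joins): the histograms cannot tell which values inside a bucket actually coincide.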

5
Q

Why do multi-dimensional histograms not save a lot of space? Give an example.

A

When adding more dimensions to the histogram, the histogram becomes sparser: the probability that two points share the same combination of bucket values shrinks with every dimension added.

An example:
We have three columns, a price column, a product column and a quantity column. The product has a price, and the quantity specifies how many of these products were bought in a single order. We can then create a 2-D histogram to see the distribution of quantities per order for products in a specific price-range.

             | Quantity: 1-2 | Quantity: 3-4 | Quantity: 5-6
Price: 0-10  |       1       |       1       |       0
Price: 11-20 |       2       |       0       |       0
Price: 21-30 |       1       |       0       |       0

In this case the table grows exponentially in the number of dimensions, which reduces the benefit of using histograms for this purpose.

6
Q

Why is it difficult to construct histograms in a streaming fashion? Give an argument for both equi-width and equi-depth histograms.

A

Equi-depth: We need to know the distribution a-priori. We cannot set appropriate boundaries if we do not know anything about the distribution of the data. If the boundaries do not capture the distribution of values, the histogram is less useful.

Equi-width: We need to know the domain (range of values), e.g. min and max salary. Equi-width histograms divide the data into equally sized intervals (buckets) based on the range of values in the data. To determine the appropriate bucket boundaries, you need to know the domain (min and max) of the data. We would also need to know the domain so that we include all possible values into the histogram.

7
Q

Provide three characteristics of sketches

A
1. They require only a single pass over the data (each record is examined at most once)
2. They are logarithmic in space (sublinear in the number of items)
3. They have probabilistic guarantees on the quality of the approximate answer
8
Q

Provide the steps you take when designing your own sketch

A
1. Define the functionality (e.g. support containment queries)
2. Define the input domain. What should the sketch accept? What is it we want to summarize. (e.g. integers: [0..2^32-1])
3. Define the output domain. (e.g. map it to 8 bits) (sketch)
4. Define the mapping function from the input to the output domain (e.g. hash function h(x) = |(3x + 2) % 8|)
9
Q

Name four important properties of a hash function

A
1. It should evenly distribute items to the output domain
2. This should be done independent of the previously hashed elements
3. If using multiple hash functions for the same task, make sure they are pairwise independent, namely they don’t use the same components. Example of pairwise dependent hash functions:

h1(x) = |(7x + 11) % 17 | % 8
h2(x) = |(3x + 11) % 13 | % 8

4. Make sure to use large prime numbers
10
Q

How do you create a very simple sketch for a containment query, how do you query it?

A

Create:
1. Turn your input into an integer
2. Initialize a bitvector with size m
3. hash your input so that it maps to an index <= m
4. Set the bit at the index to 1
5. If there is already a 1 there, do nothing

Query:
1. hash your input with the same hash function
2. check if the bit in the bitvector at the hash index is set to 1

Better methods are Bloom filters or Cuckoo filters
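A minimal sketch of these steps (the hash function and size m = 8 echo the toy examples in these cards and are not prescriptive):

```python
M = 8  # size of the bitvector (illustrative)

def h(x):
    return abs(3 * x + 2) % M  # toy hash mapping an integer to an index < M

def insert(bits, x):
    bits[h(x)] = 1  # an already-set bit stays 1, so insertion is idempotent

def query(bits, x):
    # collisions can make this a false positive, never a false negative
    return bits[h(x)] == 1

bits = [0] * M
insert(bits, 5)
```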

11
Q

How do you calculate the space complexity for a simple containment query sketch?

A

O(m) = O(n / ln(1 − Pr_collision))
- Where n is the number of inputs
- and Pr_collision is the probability of a collision.

12
Q

What is meant with composability and idempotency?

A

Composability: a sketch is composable if it can be built in a distributed fashion and combined later.

Idempotency: Repeated insertions do not change the synopsis.

13
Q

Explain the benefits of repetition in approximations.

A

Let’s say we have a randomized algorithm with error probability p (and correctness probability 1-p)

If we repeat this algorithm k times, the probability that every run fails is p · p · … · p = p^k, so the correctness probability evaluates to 1 − p^k.

The correctness probability increases with the number of independent trials.

14
Q

Explain the main components of bloom filters and how it can be used for a containment query.

A
• A bit array of length m, with all bits initially set to 0
• k pairwise independent hash functions

When we observe a new value, we hash the value with each hash function to obtain an index and we set the resulting index of the bit array to 1. Then when we want to query if a value is contained in the set, we hash the value with each hash function and check if the elements of the resulting indices are 1 in the bit vector. If any of them are not 1, we return false, else we return true.
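A compact sketch of this, using salted SHA-256 digests as a stand-in for k pairwise independent hash functions (an assumption, not the construction from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, [0] * m

    def _indices(self, item):
        # one salted digest per "hash function"
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # false negatives are impossible; false positives are not
        return all(self.bits[i] for i in self._indices(item))
```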

15
Q

What is the probability of a false positive in a bloom filter?

A

pr_fp = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k

The FP probability of a bloom filter scales exponentially with the number of items given a set number of hash functions.

16
Q

How do you determine the optimal number of hash functions for a bloom filter?

A

k = (m / nmax) ln 2
where:
- k is the number of hash functions
- m is the number of bits in your array
- and n_max is the maximum number of distinct items

17
Q

How can we estimate the (optimal) length of the bit array of a bloom filter, given the estimated number of distinct items we can encounter and the false positive rate of the filter?

A

(Indirectly mentioned in the videos, but we had to figure it out ourselves)

The slides mention:
pr_fp = 0.6185^(m/n)

So:
m = n * log(pr_fp) / log(0.6185)
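Putting this card and the previous one together, a small helper (names are mine) that sizes a filter from n and the target false-positive rate:

```python
import math

def bloom_size(n, p_fp):
    # m from the slide formula pr_fp = 0.6185^(m/n)
    m = math.ceil(n * math.log(p_fp) / math.log(0.6185))
    # optimal k from card 16: k = (m / n) * ln 2
    k = round((m / n) * math.log(2))
    return m, k
```

For example, `bloom_size(1000, 0.01)` gives roughly 9.6 bits per item and 7 hash functions.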

18
Q

Can we remove items from a bloom filter? How should we modify the basic bloom filter data structure?

A

No; for that we need a counting bloom filter, where instead of a single bit per position in the filter, we keep a small counter. We again hash the inserted value with each hash function and increment the counters at the resulting indices by 1. When we want to delete an item, we decrement the counters at the indices its hashes point to.
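A counting variant, again with salted SHA-256 as a stand-in for the k hash functions:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.counters = m, k, [0] * m

    def _indices(self, item):
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.counters[i] += 1

    def remove(self, item):
        # only safe for items that were actually inserted
        for i in self._indices(item):
            self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indices(item))
```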

19
Q

Give the space and computational complexity of (counting) bloom filters

A

Space: O(m)
Computation: O(k) per item insert / query

Counting bloom filters:
Space: O(nr bits per counter * m)

Where:
- m is the size of the filter
- k are the number of hash functions

20
Q

Provide the most important components of Cuckoo filters and how they work

A
1. A signature function: a hash function that reduces the input in size (hash to a small integer)
2. h(): a hash function that maps an input to a bucket index in [0..m−1] (m = number of rows)
3. A pair of hash functions derived from the previous hash function:
- h1(x) = h(x)
- h2(x) = h(x) ⊕ h(sign(x))
(and similarly, h1(x) = h2(x) ⊕ h(sign(x)))

We hash the value using h1(x) and h2(x) and check if either bucket has space.
- yes -> store the signature in an empty position
- no -> remove and rehash one of the resident signatures.

Recursively apply until insertion has succeeded, or we have reached the MAX_RETRIES parameter. In the latter case, return failure.
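A sketch of the insert/lookup logic under a few assumptions: a power-of-two number of buckets so the XOR stays in range, SHA-256 stand-ins for the hash and signature functions, and 2 slots per bucket.

```python
import hashlib, random

BUCKETS, SLOTS, MAX_RETRIES = 8, 2, 16  # illustrative; BUCKETS is a power of two

def _h(x):
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

def signature(x):
    return _h(("sig", x)) % 255 + 1  # small fingerprint, never 0

def partner(index, sig):
    # power-of-two BUCKETS keeps the XOR in range and makes
    # partner(partner(i, s), s) == i
    return index ^ (_h(sig) % BUCKETS)

def insert(table, x):
    sig = signature(x)
    i1 = _h(x) % BUCKETS
    i2 = partner(i1, sig)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(sig)
            return True
    # both buckets full: evict a resident signature and relocate it
    i = random.choice((i1, i2))
    for _ in range(MAX_RETRIES):
        j = random.randrange(SLOTS)
        sig, table[i][j] = table[i][j], sig
        i = partner(i, sig)  # the evicted signature's other bucket
        if len(table[i]) < SLOTS:
            table[i].append(sig)
            return True
    return False  # give up: the filter is considered full

def contains(table, x):
    sig = signature(x)
    i1 = _h(x) % BUCKETS
    return sig in table[i1] or sig in table[partner(i1, sig)]
```

The XOR trick is what lets us recover an item's other bucket from whichever bucket its signature currently sits in, without knowing the original item.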

21
Q

How do collisions form in Cuckoo filters? How can we lower the false positive rate?

A

Two signatures of different numbers can be the same, hence when querying it is possible to get false-positives. They can be reduced by increasing the size of the signature (more bits) or by increasing the size of the cuckoo filter.

22
Q

What are the 6 most important properties of FM-sketches?

A
• Single pass: each record examined at most once
• Small space: O(1/ε² · log(1/δ))
• Real-time: O(1/ε² · log(1/δ)) per record
• Not delete-proof
• Composable: Bitwise OR of the bits is possible if the hash functions used to construct the bit arrays are the same.
• Idempotent: Repeated insertions do not change the synopsis.

(number of hash functions and bit arrays = 1/ε² · log(1/δ))

23
Q

Provide the basic estimation technique (query) of distinct values when using FM sketches.

A

After the bit array has been created from the datastream, we find the first 0 from the least significant bit. Take the index of that element, and insert into function:
d = c · 2^R

Where
- d is the estimation of distinct values of the stream
- R is the index of the first 0 from the LSB
- c is a scaling constant (e.g. 1.3)

Averaging many copies (with different hash functions) improves accuracy. With O(1/ε² · log(1/δ)) copies, we get an (ε, δ)-approximation.

24
Q

Explain how values are hashed in FM sketches.

A

An FM sketch uses a hash function that maps a value to the bit array defined in the sketch. This hash function can be constructed from a uniform hash function that maps a value into a certain domain (e.g. [0..2^32]); the position of the least significant set bit of the hashed value is then taken as the index for the bit vector. That is, in the binary representation of the hashed value, we look at the index of the right-most bit that is set to 1, and set that bit to 1 in the FM sketch. Example:
Uniform hash function h1(x) = (3x + 7 ) % 11

h1(1) = 10 = 0b1010
LSB index = 1 (second bit from the right)

h1(6) = 3 = 0b11
LSB = 0

So our FM sketch looks like: …000000000011
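The hashing and the estimate from the previous card, sketched with the toy hash from this example (c = 1.3 as in card 23):

```python
def lsb_index(v):
    # index of the right-most set bit (i.e. the number of trailing zeros)
    return (v & -v).bit_length() - 1

def fm_insert(bits, value, a=3, b=7, p=11):
    # toy uniform hash h(x) = (a*x + b) % p, matching the example above
    h = (a * value + b) % p
    if h:  # h == 0 has no set bit; skip it
        bits[lsb_index(h)] = 1

def fm_estimate(bits, c=1.3):
    # R = index of the first 0 counted from the least significant bit
    R = next(i for i, bit in enumerate(bits + [0]) if bit == 0)
    return c * 2 ** R
```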

25
Q

Explain the basics of the count-min sketch

A

Construct a 2D array of counters with d rows and w columns.
For each of the d rows, construct a hash function that maps a value to a column index.
When a new value is encountered, hash it with each row's hash function and increment the counter at the resulting index by the weight of the value.

To query the count of an element, hash the element for each row again, and take the counter that has the minimum value.
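A minimal implementation, with one salted SHA-256 digest per row standing in for the per-row hash functions:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, weight=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += weight

    def query(self, item):
        # the minimum over rows removes as much collision noise as possible;
        # the estimate can overcount but never undercount
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```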

26
Q

Why do we take the minimum value in CM sketches?

A

The primary reason is that we want to remove noise from our estimation. There will always be some collisions in each row given enough seen elements, meaning that a counter may not be accurate. By taking the minimum value of all counters of that particular value, we minimize the noise as a result from collisions.

(We maximize the signal-to-noise ratio)

27
Q

How can we control the accuracy of the CM sketch? What is the trade-off?

A

By setting the width of the 2D array to w = ⌈e/ε⌉ and the number of rows to d = ⌈ln(1/δ)⌉, the probability of a significant error is at most delta: Pr(f_pred(j) − f_actual(j) ≥ ε · ‖S‖) ≤ δ, where ‖S‖ is the number of items being summarized (the stream length).

Epsilon (ε) (width) determines the error bound in the estimate
Delta (δ) (depth) determines the probability that the error exceeds the error bound

If you make epsilon smaller, then the 2D array becomes wider, reducing the chance for collisions and thus reducing the margin of error. By reducing delta, we add more rows to the sketch, which reduces the noise, and thus lowers the probability of getting outside of the error margins.

The trade-off is that by lowering ε and δ we increase computational and memory cost.

28
Q

In which cases are CM sketches most useful and why?

A

If our data streams contain frequently occurring values. The error term in the probability of significant errors is a fixed fraction of the stream length, not of the answer itself. This means that the error on the count of infrequently occurring values can be relatively large. Example:
If our ε is 0.05 and our stream length is 1000, then our margin of error is 50. If we see some values only once, then 50 is a large margin of error.

Thus, CM sketches are most useful when dealing with data streams that have a high frequency of recurring values, as their constant error term provides more accurate estimations for these values

29
Q

How can we combine CM sketches to support range queries?

A

To use CM sketches for range queries, we must first count the number of possible dyadic interval scales over the expected domain range. For example, when our domain consists of all integer values, the maximum scale is 32, as integers represent 2^32 values.

For each scale, we create a separate CM sketch. Each sketch represents a particular granularity over the dyadic intervals. For instance, sketch 0 (scale 2^0) summarizes all individual numbers in the integer domain, while sketch 2 (scale 2^2) summarizes the ranges (if starting at 0) [0-3], [4-7], [8-11], and so on.

When a new data point arrives, we calculate the dyadic interval to which the arrival belongs for each sketch. We then increment the counter for this range in the sketch and repeat the process for the other sketches.

To answer a range query, we break the range query into dyadic intervals, retrieve the count for each interval in the corresponding sketch (based on the scale), and sum them up.
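The decomposition step can be sketched as follows; each returned interval `[k·2^s, (k+1)·2^s − 1]` would then be looked up in the sketch for scale s:

```python
def dyadic_intervals(lo, hi):
    """Split [lo, hi] into the minimal cover of dyadic intervals."""
    out = []
    while lo <= hi:
        # grow to the largest scale s where lo is aligned and the interval fits
        s = 0
        while lo % (2 ** (s + 1)) == 0 and lo + 2 ** (s + 1) - 1 <= hi:
            s += 1
        out.append((lo, lo + 2 ** s - 1))
        lo += 2 ** s
    return out
```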

30
Q

What does composability mean in CM sketches and how can they be composed.

A

Composability means that you can construct two CM sketches on two sets of (similar) data and combine them later on. This assumes both CM sketches are constructed using the same hash functions!

The counters can be summed in element-wise fashion to combine two sketches.

31
Q

How can you estimate the inner join size of a set of observed values?

A

We can do that using CM sketches. Given the two sets you want to perform an inner join on, build a count-min sketch for each (with the same hash functions) and element-wise multiply the counters. Then, for each row, sum the products together. The estimate of the join size is the minimum over the resulting per-row sums.
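Given the two sketches' raw counter arrays (rows of counters, built with identical hash functions), the estimate is:

```python
def join_size_estimate(cms_a, cms_b):
    """Per row, sum the element-wise products; take the minimum over rows."""
    row_sums = [sum(x * y for x, y in zip(row_a, row_b))
                for row_a, row_b in zip(cms_a, cms_b)]
    return min(row_sums)
```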

32
Q

How do jumping windows work in data streams?

A

A window is a subpart of the stream for which we want to perform some operation. For example we would like to count the number of true bits in a window of a binary string.

A jumping window partitions the window into smaller parts (of size z). For each sub-window we calculate the statistic, such as the count. Every time a sub-window completes, we update the total count by summing the counts of all the sub-windows.

Using more sub-windows provides better results, as the granularity increases. The maximum error is z.

Problems: We don’t know the window size beforehand. One user might want a count of the last 3 arrivals, while another of the last 100. The error is also not relative to the answer, but only to the window size.
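A minimal sketch of a jumping-window counter for true bits (class and parameter names are mine):

```python
from collections import deque

class JumpingWindowCount:
    """Count of true bits over the last n_sub sub-windows of size z each."""
    def __init__(self, z, n_sub):
        self.z, self.n_sub = z, n_sub
        self.counts = deque(maxlen=n_sub)  # completed sub-window counts
        self.current, self.seen = 0, 0

    def observe(self, bit):
        self.current += bit
        self.seen += 1
        if self.seen == self.z:  # sub-window completes: jump
            self.counts.append(self.current)
            self.current, self.seen = 0, 0

    def query(self):
        # total over completed sub-windows; off by at most z at the boundary
        return sum(self.counts)
```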

33
Q

What’s the space complexity of exponential histograms?

A

O((1/ε) · log² W)
Where W is the maximum size of the sliding window that we support and ε is the maximum relative error.

34
Q

How can Exponential Histograms be queried? What is the maximum absolute error?

A

Example Query: count all true bits arrived in the last tq arrivals.

Start from the current time and walk back through the buckets until tq is covered. Sum up the counts of all fully covered buckets; for the last bucket, which is not fully covered, add half of its value to the rest.

The maximum absolute error is therefore b / 2, where b is the size of the bucket that was not fully covered.

35
Q

In Exponential histograms we have approximation guarantees based on two invariants. Explain invariant 1 and show that the maximum relative error <= epsilon

A

b_j / (2 · (1 + Σ_{i=j+1}^{last} b_i)) ≤ ε
Where:
- Σ_{i=j+1}^{last} b_i is the size (summed counts) of the buckets that come after j
- b_j is the size (max count) of bucket j

To show that the maximum relative error is at most ε: the only uncertainty in a query is the oldest, partially covered bucket j, which contributes an absolute error of at most b_j / 2. The true answer is at least 1 + Σ_{i=j+1}^{last} b_i, since the newer buckets are fully covered and at least one element of bucket j falls inside the window. The relative error is therefore at most (b_j / 2) / (1 + Σ_{i=j+1}^{last} b_i), which is ≤ ε by invariant 1.

36
Q

In Exponential histograms we have approximation guarantees based on two invariants. Explain invariant 2 and what role it play in updating the exponential histogram.

A

Invariant 2: For every bucket size other than the oldest, there are at most (k/2 + 1) and at least k/2 buckets of that size, where k = 1 / epsilon. For the oldest bucket size, there are at most (k/2 + 1) buckets of that size.

For new arrivals, add a new bucket of size 1 to the right and check invariant 2: if it is violated at buckets of size i, merge the two oldest buckets of size i to create a new bucket of size 2i. Recursively check invariant 2 until it is satisfied.
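The update rule can be sketched as follows, representing the histogram as a list of bucket sizes, oldest first (a simplification: a real exponential histogram also stores bucket timestamps so old buckets can expire):

```python
def eh_insert(buckets, bit, k):
    """buckets: list of bucket sizes, oldest (largest) first; k ≈ 1/epsilon."""
    if bit != 1:
        return
    buckets.append(1)  # new arrivals start as a size-1 bucket on the right
    size = 1
    while buckets.count(size) > k // 2 + 1:  # invariant 2 violated at `size`
        i = buckets.index(size)        # the two oldest buckets of this size
        buckets[i:i + 2] = [2 * size]  # merge them into one of twice the size
        size *= 2                      # merging may cascade to the next size
```

With k = 2 (ε = 0.5), feeding five 1-bits yields buckets [2, 2, 1]: never more than k/2 + 1 = 2 buckets of any one size.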