# Approximations Flashcards

1
Q

Provide two important parameters of probabilistic bounds

A

Epsilon (ϵ): Epsilon represents the error bound on the estimate. The computed answer will be within a factor of ϵ of the actual answer with a probability of at least 1-δ. In other words, a smaller epsilon value means a more accurate estimate, while a larger epsilon value indicates a less accurate estimate.
(Depending on the sketch, the error bound is proportional to a global quantity such as the stream length rather than to the actual answer, so the relative error becomes smaller as the actual value increases.)

Delta (δ): Delta represents the confidence level of the estimate. With a probability of at least 1-δ, the computed answer will be within the error bounds specified by epsilon. In other words, a smaller delta value indicates a higher confidence level in the accuracy of the estimate, while a larger delta value represents a lower confidence level.

2
Q

Summarize the difference between Equi-width and Equi-depth histograms.

A

Equi-width: Divide the distinct values of your dataset into equally sized buckets in order. e.g. for width = 4, [1..8] -> [[1..4], [5..8]]. If we’re interested in frequencies, we count the frequencies of the whole bucket. The information that is stored per bucket is the starting and ending number and the frequency. We lose information as we cannot tell anything about the internal distribution of the values contained in a bucket.

Equi-depth: Instead of grouping on the number of distinct values in the dataset, we group by the weight (frequency) of each distinct value, so that we get equally large buckets. The bucket size is fixed, so if one value has a high frequency, then the bucket will contain a larger portion of this value.
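A minimal Python sketch of both constructions (function names and bucket parameters are illustrative):

```python
from collections import Counter

def equi_width(values, width):
    """Bucket the sorted distinct values, `width` distinct values per bucket."""
    freq = Counter(values)
    distinct = sorted(freq)
    buckets = []
    for i in range(0, len(distinct), width):
        chunk = distinct[i:i + width]
        # per bucket we keep only (start, end, total frequency);
        # the internal distribution of the bucket is lost
        buckets.append((chunk[0], chunk[-1], sum(freq[v] for v in chunk)))
    return buckets

def equi_depth(values, depth):
    """Bucket the sorted values so that every bucket holds `depth` items."""
    ordered = sorted(values)
    chunks = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(c[0], c[-1], len(c)) for c in chunks]
```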

3
Q

Is the equi-width or equi-depth histogram more accurate? Which can be used for guaranteeing accuracy?

A

Equi-depth is more accurate and can be used to guarantee accuracy: every bucket holds a fixed total frequency (the depth), so the maximum error of an estimate is bounded by that depth. Equi-width cannot do that. The estimate it gives is the size of the bucket over the length of the bucket; for example, if the size is 16 (the frequency is 16) and the length is 2 (there are 2 distinct values in this bucket), the maximum error is related to the size of the bucket, which is unbounded.

4
Q

How can histograms be used to estimate the join cardinality? What is the issue with this?

A

When you have two columns of data, the join cardinality can be estimated (inaccurately) by calculating a histogram over each column. Then, sum up the element-wise products of the histograms to get an estimate of the join cardinality.

The problem is that we have no accuracy guarantees as we lose information about the distribution of the samples when creating histograms.
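A toy illustration of the estimate and its inaccuracy (bucketing values by `v // bucket` is an assumption made for this example):

```python
from collections import Counter

def join_cardinality_estimate(col_a, col_b, bucket):
    """Sum of element-wise products of the two bucketed histograms."""
    ha = Counter(v // bucket for v in col_a)
    hb = Counter(v // bucket for v in col_b)
    # all values falling into the same bucket are assumed to match,
    # which is why the estimate carries no accuracy guarantee
    return sum(ha[b] * hb[b] for b in ha.keys() & hb.keys())
```

For `col_a = [1, 2, 3]`, `col_b = [2, 2, 5]` and bucket width 2 this estimates 4 matches, while the true join size is 2 (only the value 2 joins): the histograms cannot tell which values inside a bucket actually coincide.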

5
Q

Why do multi-dimensional histograms not save a lot of space? Give an example.

A

When adding more dimensions to the histogram, the histogram becomes sparser: the probability that two points share the same combination of bucket values shrinks with every dimension added.

An example:
We have three columns, a price column, a product column and a quantity column. The product has a price, and the quantity specifies how many of these products were bought in a single order. We can then create a 2-D histogram to see the distribution of quantities per order for products in a specific price-range.

             | Quantity: 1-2 | Quantity: 3-4 | Quantity: 5-6
Price: 0-10  |       1       |       1       |       0
Price: 11-20 |       2       |       0       |       0
Price: 21-30 |       1       |       0       |       0

In this case the table grows exponentially in the number of dimensions, which reduces the benefit of using histograms for this purpose.

6
Q

Why is it difficult to construct histograms in a streaming fashion? Give an argument for both equi-width and equi-depth histograms.

A

Equi-depth: We need to know the distribution a-priori. We cannot set appropriate boundaries if we do not know anything about the distribution of the data. If the boundaries do not capture the distribution of values, the histogram is less useful.

Equi-width: We need to know the domain (range of values), e.g. min and max salary. Equi-width histograms divide the data into equally sized intervals (buckets) based on the range of values in the data. To determine the appropriate bucket boundaries, you need to know the domain (min and max) of the data. We would also need to know the domain so that we include all possible values into the histogram.

7
Q

Provide three characteristics of sketches

A
1. They require only a single pass over the data (each record is examined at most once)
2. They are logarithmic in space (sublinear in the number of items)
3. They have probabilistic guarantees on the quality of the approximate answer
8
Q

Provide the steps you take when designing your own sketch

A
1. Define the functionality (e.g. support containment queries)
2. Define the input domain. What should the sketch accept? What is it we want to summarize. (e.g. integers: [0..2^32-1])
3. Define the output domain. (e.g. map it to 8 bits) (sketch)
4. Define the mapping function from the input to the output domain (e.g. hash function h(x) = |(3x + 2) % 8|)
9
Q

Name four important properties of a hash function

A
1. It should evenly distribute items to the output domain
2. This should be done independent of the previously hashed elements
3. If using multiple hash functions for the same task, make sure they are pairwise independent, namely they don’t use the same components. Example of pairwise dependent hash functions:

h1(x) = |(7x + 11) % 17 | % 8
h2(x) = |(3x + 11) % 13 | % 8

4. Make sure to use large prime numbers
10
Q

How do you create a very simple sketch for a containment query, how do you query it?

A

Create:
1. Turn your input into an integer
2. Initialize a bitvector with size m
3. hash your input so that it maps to an index <= m
4. Set the bit at the index to 1
5. If there is already a 1 there, do nothing

Query:
1. hash your input with the same hash function
2. check if the bit in the bitvector at the hash index is set to 1

Better methods are Bloom filters or Cuckoo filters
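A minimal sketch of these steps (the hash function and size m = 8 echo the toy examples in these cards and are not prescriptive):

```python
M = 8  # size of the bitvector (illustrative)

def h(x):
    return abs(3 * x + 2) % M  # toy hash mapping an integer to an index < M

def insert(bits, x):
    bits[h(x)] = 1  # an already-set bit stays 1, so insertion is idempotent

def query(bits, x):
    # collisions can make this a false positive, never a false negative
    return bits[h(x)] == 1

bits = [0] * M
insert(bits, 5)
```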

11
Q

How do you calculate the space complexity for a simple containment query sketch?

A

O(m) = O(n / ln(1 − Pr_collision))
- Where n is the number of inputs
- and Pr_collision is the probability of a collision.

12
Q

What is meant with composability and idempotency?

A

Composability: a sketch is composable if it can be built in a distributed fashion and combined later.

Idempotency: Repeated insertions do not change the synopsis.

13
Q

Explain the benefits of repetition in approximations.

A

Let’s say we have a randomized algorithm with error probability p (and correctness probability 1-p)

If we repeat this algorithm k times, the probability that every run fails is p · p · … · p = p^k, so the correctness probability evaluates to 1 − p^k.

The correctness probability increases with the number of independent trials.

14
Q

Explain the main components of bloom filters and how it can be used for a containment query.

A
• A bit array of length m, with all bits initially set to 0
• k pairwise independent hash functions

When we observe a new value, we hash the value with each hash function to obtain an index and we set the resulting index of the bit array to 1. Then when we want to query if a value is contained in the set, we hash the value with each hash function and check if the elements of the resulting indices are 1 in the bit vector. If any of them are not 1, we return false, else we return true.
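A compact sketch of this, using salted SHA-256 digests as a stand-in for k pairwise independent hash functions (an assumption, not the construction from the slides):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, [0] * m

    def _indices(self, item):
        # one salted digest per "hash function"
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item):
        # false negatives are impossible; false positives are not
        return all(self.bits[i] for i in self._indices(item))
```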

15
Q

What is the probability of a false positive in a bloom filter?

A

pr_fp = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k

The FP probability of a bloom filter scales exponentially with the number of items given a set number of hash functions.

16
Q

How do you determine the optimal number of hash functions for a bloom filter?

A

k = (m / nmax) ln 2
where:
- k is the number of hash functions
- m is the number of bits in your array
- and n_max is the maximum number of distinct items

17
Q

How can we estimate the (optimal) length of the bit array of a bloom filter, given the estimated number of distinct items we can encounter and the false positive rate of the filter?

A

(Indirectly mentioned in the videos, but we had to figure it out ourselves)

The slides mention:
pr_fp = 0.6185^(m/n)

So:
m = n * log(pr_fp) / log(0.6185)
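Putting this card and the previous one together, a small helper (names are mine) that sizes a filter from n and the target false-positive rate:

```python
import math

def bloom_size(n, p_fp):
    # m from the slide formula pr_fp = 0.6185^(m/n)
    m = math.ceil(n * math.log(p_fp) / math.log(0.6185))
    # optimal k from card 16: k = (m / n) * ln 2
    k = round((m / n) * math.log(2))
    return m, k
```

For example, `bloom_size(1000, 0.01)` gives roughly 9.6 bits per item and 7 hash functions.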

18
Q

Can we remove items from a bloom filter? How should we modify the basic bloom filter data structure?

A

No; for that we need a counting bloom filter, where instead of a single bit per position in the filter, we keep a small counter. We again hash the inserted value with each hash function and increment the counters at the resulting indices by 1. When we want to delete an item, we decrement the counters at the indices its hashes point to.
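A counting variant, again with salted SHA-256 as a stand-in for the k hash functions:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k, self.counters = m, k, [0] * m

    def _indices(self, item):
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.counters[i] += 1

    def remove(self, item):
        # only safe for items that were actually inserted
        for i in self._indices(item):
            self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indices(item))
```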

19
Q

Give the space and computational complexity of (counting) bloom filters

A

Space: O(m)
Computation: O(k) per item insert / query

Counting bloom filters:
Space: O(nr bits per counter * m)

Where:
- m is the size of the filter
- k are the number of hash functions

20
Q

Provide the most important components of Cuckoo filters and how they work

A
1. A signature function: a hash function that reduces the input in size (hash to a small integer)
2. h(): a hash function that maps an input to a bucket index in [0..m−1] (m = number of rows)
3. A pair of hash functions derived from the previous hash function:
- h1(x) = h(x)
- h2(x) = h(x) ⊕ h(sign(x))
(and similarly, h1(x) = h2(x) ⊕ h(sign(x)))

We hash the value using h1(x) and h2(x) and check if either bucket has space.
- yes -> store the signature in an empty position
- no -> remove and rehash one of the resident signatures.

Recursively apply until insertion has succeeded, or we have reached the MAX_RETRIES parameter. In the latter case, return failure.
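A sketch of the insert/lookup logic under a few assumptions: a power-of-two number of buckets so the XOR stays in range, SHA-256 stand-ins for the hash and signature functions, and 2 slots per bucket.

```python
import hashlib, random

BUCKETS, SLOTS, MAX_RETRIES = 8, 2, 16  # illustrative; BUCKETS is a power of two

def _h(x):
    return int(hashlib.sha256(str(x).encode()).hexdigest(), 16)

def signature(x):
    return _h(("sig", x)) % 255 + 1  # small fingerprint, never 0

def partner(index, sig):
    # power-of-two BUCKETS keeps the XOR in range and makes
    # partner(partner(i, s), s) == i
    return index ^ (_h(sig) % BUCKETS)

def insert(table, x):
    sig = signature(x)
    i1 = _h(x) % BUCKETS
    i2 = partner(i1, sig)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(sig)
            return True
    # both buckets full: evict a resident signature and relocate it
    i = random.choice((i1, i2))
    for _ in range(MAX_RETRIES):
        j = random.randrange(SLOTS)
        sig, table[i][j] = table[i][j], sig
        i = partner(i, sig)  # the evicted signature's other bucket
        if len(table[i]) < SLOTS:
            table[i].append(sig)
            return True
    return False  # give up: the filter is considered full

def contains(table, x):
    sig = signature(x)
    i1 = _h(x) % BUCKETS
    return sig in table[i1] or sig in table[partner(i1, sig)]
```

The XOR trick is what lets us recover an item's other bucket from whichever bucket its signature currently sits in, without knowing the original item.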

21
Q

How do collisions form in Cuckoo filters? How can we lower the false positive rate?

A

Two signatures of different numbers can be the same, hence when querying it is possible to get false-positives. They can be reduced by increasing the size of the signature (more bits) or by increasing the size of the cuckoo filter.

22
Q

What are the 6 most important properties of FM-sketches?

A
• Single pass: each record examined at most once
• Small space: O(1/ε² · log(1/δ))
• Real-time: O(1/ε² · log(1/δ)) per record
• Not delete-proof
• Composable: Bitwise OR of the bits is possible if the hash functions used to construct the bit arrays are the same.
• Idempotent: Repeated insertions do not change the synopsis.

(number of hash functions and bit arrays = 1/ε² · log(1/δ))

23
Q

Provide the basic estimation technique (query) of distinct values when using FM sketches.

A

After the bit array has been created from the datastream, we find the first 0 from the least significant bit. Take the index of that element, and insert into function:
d = c · 2^R

Where
- d is the estimation of distinct values of the stream
- R is the index of the first 0 from the LSB
- c is a scaling constant (e.g. 1.3)

Averaging many copies (with different hash functions) improves accuracy. With O(1/ε² · log(1/δ)) copies, we get an (ε, δ)-approximation.

24
Q

Explain how values are hashed in FM sketches.

A

An FM sketch uses a hash function that maps a value to the bit array defined in the sketch. This hash function can be constructed from a uniform hash function that maps a value into a certain domain (e.g. [0..2^32]); the position of the least significant set bit of the hashed value is then taken as the index for the bit vector. That is, in the binary representation of the hashed value, we look at the index of the right-most bit that is set to 1, and set that bit to 1 in the FM sketch. Example:
Uniform hash function h1(x) = (3x + 7 ) % 11

h1(1) = 10 = 0b1010
LSB index = 1 (second bit from the right)

h1(6) = 3 = 0b11
LSB = 0

So our FM sketch looks like: …000000000011
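The hashing and the estimate from the previous card, sketched with the toy hash from this example (c = 1.3 as in card 23):

```python
def lsb_index(v):
    # index of the right-most set bit (i.e. the number of trailing zeros)
    return (v & -v).bit_length() - 1

def fm_insert(bits, value, a=3, b=7, p=11):
    # toy uniform hash h(x) = (a*x + b) % p, matching the example above
    h = (a * value + b) % p
    if h:  # h == 0 has no set bit; skip it
        bits[lsb_index(h)] = 1

def fm_estimate(bits, c=1.3):
    # R = index of the first 0 counted from the least significant bit
    R = next(i for i, bit in enumerate(bits + [0]) if bit == 0)
    return c * 2 ** R
```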

25
Q

Explain the basics of the count-min sketch

A

Construct a 2D array of counters with d rows and w columns.
For each of the d rows, construct a hash function that maps a value to a column index.
When a new value is encountered, hash it with each row's hash function and increment the counter at the resulting index by the weight of the value.

To query the count of an element, hash the element for each row again, and take the counter that has the minimum value.
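A minimal implementation, with one salted SHA-256 digest per row standing in for the per-row hash functions:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, weight=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += weight

    def query(self, item):
        # the minimum over rows removes as much collision noise as possible;
        # the estimate can overcount but never undercount
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```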

26
Q

Why do we take the minimum value in CM sketches?

A

The primary reason is that we want to remove noise from our estimation. There will always be some collisions in each row given enough seen elements, meaning that a counter may not be accurate. By taking the minimum value of all counters of that particular value, we minimize the noise as a result from collisions.

(We maximize the signal-to-noise ratio)

27
Q

How can we control the accuracy of the CM sketch? What is the trade-off?

A

By setting the width of the 2D array to w = ⌈e/ε⌉ and the number of rows to d = ⌈ln(1/δ)⌉, the probability of a significant error is at most delta: Pr(f_pred(j) − f_actual(j) ≥ ε · ‖S‖) ≤ δ, where ‖S‖ is the number of items being summarized (the stream length).

Epsilon (ε) (width) determines the error bound in the estimate
Delta (δ) (depth) determines the probability that the error exceeds the error bound

If you make epsilon smaller, then the 2D array becomes wider, reducing the chance for collisions and thus reducing the margin of error. By reducing delta, we add more rows to the sketch, which reduces the noise, and thus lowers the probability of getting outside of the error margins.

The trade-off is that by lowering ε and δ we increase computational and memory cost.

28
Q

In which cases are CM sketches most useful and why?

A

If our data streams contain frequently occurring values. The error term in the probability of significant errors is a fixed fraction of the stream length, not of the answer itself. This means that the error on the count of infrequently occurring values can be relatively large. Example:
If our ε is 0.05 and our stream length is 1000, then our margin of error is 50. If we see some values only once, then 50 is a large margin of error.

Thus, CM sketches are most useful when dealing with data streams that have a high frequency of recurring values, as their constant error term provides more accurate estimations for these values

29
Q

How can we combine CM sketches to support range queries?

A

To use CM sketches for range queries, we must first count the number of possible dyadic interval scales over the expected domain range. For example, when our domain consists of all integer values, the maximum scale is 32, as integers represent 2^32 values.

For each scale, we create a separate CM sketch. Each sketch represents a particular granularity over the dyadic intervals. For instance, sketch 0 (scale 2^0) summarizes all individual numbers in the integer domain, while sketch 2 (scale 2^2) summarizes the ranges (if starting at 0) [0-3], [4-7], [8-11], and so on.

When a new data point arrives, we calculate the dyadic interval to which the arrival belongs for each sketch. We then increment the counter for this range in the sketch and repeat the process for the other sketches.

To answer a range query, we break the range query into dyadic intervals, retrieve the count for each interval in the corresponding sketch (based on the scale), and sum them up.
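The decomposition step can be sketched as follows; each returned interval `[k·2^s, (k+1)·2^s − 1]` would then be looked up in the sketch for scale s:

```python
def dyadic_intervals(lo, hi):
    """Split [lo, hi] into the minimal cover of dyadic intervals."""
    out = []
    while lo <= hi:
        # grow to the largest scale s where lo is aligned and the interval fits
        s = 0
        while lo % (2 ** (s + 1)) == 0 and lo + 2 ** (s + 1) - 1 <= hi:
            s += 1
        out.append((lo, lo + 2 ** s - 1))
        lo += 2 ** s
    return out
```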

30
Q

What does composability mean in CM sketches and how can they be composed.

A

Composability means that you can construct two CM sketches on two sets of (similar) data and combine them later on. This assumes both CM sketches are constructed using the same hash functions!

The counters can be summed in element-wise fashion to combine two sketches.

31
Q

How can you estimate the inner join size of a set of observed values?

A

We can do that using CM sketches. Given the two sets you want to perform an inner join on, build a count-min sketch for each (with the same hash functions) and element-wise multiply the counters. Then, for each row, sum the products together. The estimate of the join size is the minimum over the resulting per-row sums.
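Given the two sketches' raw counter arrays (rows of counters, built with identical hash functions), the estimate is:

```python
def join_size_estimate(cms_a, cms_b):
    """Per row, sum the element-wise products; take the minimum over rows."""
    row_sums = [sum(x * y for x, y in zip(row_a, row_b))
                for row_a, row_b in zip(cms_a, cms_b)]
    return min(row_sums)
```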

32
Q

How do jumping windows work in data streams?

A

A window is a subpart of the stream for which we want to perform some operation. For example we would like to count the number of true bits in a window of a binary string.

A jumping window partitions the window into smaller parts (of size z). For each sub-window we calculate the statistic, such as the count. Every time a sub-window completes, we update the total count by summing the counts of all the sub-windows.

Using more sub-windows provides better results, as the granularity increases. The maximum error is z.

Problems: We don’t know the window size beforehand. One user might want a count of the last 3 arrivals, while another of the last 100. The error is also not relative to the answer, but only to the window size.
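A minimal sketch of a jumping-window counter for true bits (class and parameter names are mine):

```python
from collections import deque

class JumpingWindowCount:
    """Count of true bits over the last n_sub sub-windows of size z each."""
    def __init__(self, z, n_sub):
        self.z, self.n_sub = z, n_sub
        self.counts = deque(maxlen=n_sub)  # completed sub-window counts
        self.current, self.seen = 0, 0

    def observe(self, bit):
        self.current += bit
        self.seen += 1
        if self.seen == self.z:  # sub-window completes: jump
            self.counts.append(self.current)
            self.current, self.seen = 0, 0

    def query(self):
        # total over completed sub-windows; off by at most z at the boundary
        return sum(self.counts)
```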

33
Q

What’s the space complexity of exponential histograms?

A

O((1/ε) · log² W)
Where W is the maximum size of the sliding window that we support and ε is the maximum relative error.

34
Q

How can Exponential Histograms be queried? What is the maximum absolute error?

A

Example Query: count all true bits arrived in the last tq arrivals.

Start from the current time and walk back through the buckets until tq is covered. Sum up the counts of all fully covered buckets; for the last bucket, which is not fully covered, add half of its value to the rest.

The maximum absolute error is therefore b / 2, where b is the size of the bucket that was not fully covered.

35
Q

In Exponential histograms we have approximation guarantees based on two invariants. Explain invariant 1 and show that the maximum relative error <= epsilon

A

b_j / (2 · (1 + Σ_{i=j+1}^{last} b_i)) ≤ ε
Where:
- Σ_{i=j+1}^{last} b_i is the size (summed counts) of the buckets that come after j
- b_j is the size (max count) of bucket j

To show that the maximum relative error is at most ε: the only uncertainty in a query is the oldest, partially covered bucket j, which contributes an absolute error of at most b_j / 2. The true answer is at least 1 + Σ_{i=j+1}^{last} b_i, since the newer buckets are fully covered and at least one element of bucket j falls inside the window. The relative error is therefore at most (b_j / 2) / (1 + Σ_{i=j+1}^{last} b_i), which is ≤ ε by invariant 1.

36
Q

In Exponential histograms we have approximation guarantees based on two invariants. Explain invariant 2 and what role it play in updating the exponential histogram.

A

Invariant 2: For every bucket size other than the oldest, there are at most (k/2 + 1) and at least k/2 buckets of that size, where k = 1 / epsilon. For the oldest bucket size, there are at most (k/2 + 1) buckets of that size.

For new arrivals, add a new bucket of size 1 to the right and check invariant 2: if it is violated at buckets of size i, merge the two oldest buckets of size i to create a new bucket of size 2i. Recursively check invariant 2 until it is satisfied.
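The update rule can be sketched as follows, representing the histogram as a list of bucket sizes, oldest first (a simplification: a real exponential histogram also stores bucket timestamps so old buckets can expire):

```python
def eh_insert(buckets, bit, k):
    """buckets: list of bucket sizes, oldest (largest) first; k ≈ 1/epsilon."""
    if bit != 1:
        return
    buckets.append(1)  # new arrivals start as a size-1 bucket on the right
    size = 1
    while buckets.count(size) > k // 2 + 1:  # invariant 2 violated at `size`
        i = buckets.index(size)        # the two oldest buckets of this size
        buckets[i:i + 2] = [2 * size]  # merge them into one of twice the size
        size *= 2                      # merging may cascade to the next size
```

With k = 2 (ε = 0.5), feeding five 1-bits yields buckets [2, 2, 1]: never more than k/2 + 1 = 2 buckets of any one size.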