Approximation Flashcards

1
Q

Bloom Filter Interface

A

probabilistic with one-sided errors:
* if the key is not in the set, contains will return false
(no false negatives)
* if the key is in the set, contains may return true or false
(false positives are possible)
10 bits for 1% false positive

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Multiple Hash Functions

A

number of elements in set: n
* filter size (in bits): m
* number of hash functions: k
* insert sets all k bits to 1, remove not possible
* contains returns true if all k bits are 1, false otherwise

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Blocked Bloom Filter

A
  • low false positive rates require larger k
  • this increases number of memory accesses
  • idea: hash-partition filter and perform only one access
  • cache line blocks of 64 bytes (512 bits)
  • register blocks of 8 bytes (64 bits)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Count-Distinct Estimation: Basic Idea

A
  • How many distinct values are there in a sequence (or stream) of keys?
  • hash each key and count the number of consecutive leading zeros (0011011 -> 2)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Count-Distinct Estimation: problem and solution

A

two problems:
* using only one maximum leads to estimates with high variance
* a single outlier that happens to have many leading zeros will lead to severe
overestimation
solution:
* use multiple sketches (maxima), partitioned by hash

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Succinct data structure: space consumption

A

space bound L
2L is compact
L + √L is succinct
L + 3 is implicit

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Level-Ordered Unary Degree Sequence (LOUDS)

A
  • stores nodes in breadth-first order
  • encodes each node’s degree (# of children) using unary code
  • no pointers, the tree is one bit string
  • LOUDS needs 2 bits per node (2n)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Fast Succinct Trie (FST)

A
  • span of 8 bits (maximum fanout is 256)
  • nodes are encoded in level order (like LOUDS)
  • two encodings: LOUDS-Dense and LOUDS-Sparse
  • space-efficient and reasonable performance
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

FST: LOUDS-Dense

A

four arrays:
* 256-bit sequence D-Labels: branch present?
* 256-bit sequence D-HasChild: child or leaf?
* 256-bit sequence D-IsPrefixKey: is the key a prefix?
* sequence for the values (D-Values)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

FST: LOUDS-Sparse

A

four arrays:
* byte sequence S-Labels: records all the branching labels for each trie node
* bit sequence S-HasChild: inner or leaf node?
* bit sequence S-LOUDS: 1 bit for each label, indicates node boundaries
* sequence for the values (S-Values)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Learned Indexes

A
  • idea: use statistical model (or ML) to predict location of sorted data
  • models cumulative distribution function (CDF) of keys
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q
  • what three parameters n, m and k determine the false positive rate of a
    bloom filter?
A

k: number of hash functions
n: number of elements in set
m: filter size in bits

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q
  • how can we improve the cache locality of a bloom filter?
A

blocked bloom filter

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Learned indexes: issues

A
  • bad cache locality: every lookup activates entire model
  • “last mile”: predicts approx. location well, but getting error down more
    requires more effort
How well did you know this?
1
Not at all
2
3
4
5
Perfectly