Hash Tables Flashcards

1
Q

Direct Addressing

A

keys are just integers -> allocate an array as large as the key domain -> random access memory: lookup, insert, remove in O(1)
Problem: huge waste of memory if the key domain is much larger than the set of keys actually stored
best approach for small and dense key ranges
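A direct-address table is essentially a plain array indexed by the key. A minimal sketch (class and method names are made up for illustration):

```python
# Direct addressing sketch: one array slot per possible key in the domain,
# so lookup, insert, and remove are all O(1) array accesses.
class DirectAddressTable:
    def __init__(self, domain_size):
        self.slots = [None] * domain_size  # one slot per key -> wasteful for sparse keys

    def insert(self, key, value):
        self.slots[key] = value

    def lookup(self, key):
        return self.slots[key]

    def remove(self, key):
        self.slots[key] = None

t = DirectAddressTable(256)  # works well because the domain 0..255 is small and dense
t.insert(42, "answer")
```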

2
Q

Hash function and hash table, desired properties

A

A hash function h maps keys from a domain U to positions in a hash table of size m
Desired properties: fast hash computation, small probability of two keys mapping to the same hash value (a collision)

3
Q

how to handle collisions?

A

With chaining:
each position in the hash table has a linked list of entries
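A chaining sketch in Python (illustrative; real implementations use proper linked lists, here each chain is a Python list):

```python
# Chained hash table sketch: each position ("bucket") holds a chain of
# (key, value) entries; colliding keys simply extend the chain.
class ChainedHashTable:
    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _bucket(self, key):
        return self.buckets[hash(key) % self.m]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key (or collision): append

    def lookup(self, key):
        for k, v in self._bucket(key):    # walk the chain
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert("a", 1)
t.insert("b", 2)
```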

4
Q

how large should m be (to keep chains short)?

A

the average chain length must be constant for constant average-time access -> m must be proportional to n
m > f * n
f: fill factor parameter, typically close to 1

5
Q

if final size of hash table is not known, how to handle it?

A

it must grow dynamically (rehashing):
1. create a new, larger table
2. iterate over the entries in the old table and insert them into the new hash table
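The two steps can be sketched directly (for a chaining table whose buckets are lists of (key, value) pairs; names are illustrative):

```python
# Rehashing sketch: build a larger table and re-insert every old entry.
def rehash(old_buckets, new_m):
    new_buckets = [[] for _ in range(new_m)]     # 1. create new, larger table
    for bucket in old_buckets:                   # 2. iterate over old entries
        for key, value in bucket:
            new_buckets[hash(key) % new_m].append((key, value))  # re-insert
    return new_buckets

old = [[("a", 1)], [("b", 2)], [], []]   # table of size 4
new = rehash(old, 8)                     # grown to size 8
```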

6
Q

how large should new table be?

A

to keep the amortized insertion time constant, the table must grow exponentially by a factor g > 1

7
Q

Smooth growth without rehashing

A

linear hashing or extendible hashing

8
Q

what makes a good hash function?

A

a hash function should look random but be deterministic

9
Q

Modulo prime hash function/fast modulo

A

modulo uses integer division, which is expensive
the division can be replaced by multiplication with a precomputed magic value (multiplicative inverse)
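One concrete form of this trick is Lemire and Kaser's direct-remainder computation (a sketch, not necessarily the exact variant meant here; it assumes a 64-bit word, a divisor d < 2^32, and 32-bit inputs n):

```python
# Fast modulo via a precomputed magic constant: two multiplications and a
# shift replace the hardware division on the hot path.
MASK64 = (1 << 64) - 1

def make_magic(d):
    # precomputed once per divisor d: ceil(2^64 / d)
    return (1 << 64) // d + 1

def fast_mod(n, d, magic):
    lowbits = (magic * n) & MASK64   # low 64 bits of magic * n
    return (lowbits * d) >> 64       # equals n % d for 32-bit n

d = 101                  # e.g. a prime table size
magic = make_magic(d)
print(fast_mod(123456, d, magic))  # same result as 123456 % 101
```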

10
Q

Alternative to prime hashing, trick to avoid modulo or magic numbers

A

Powers of two
most real-world hash functions compute a 32-bit or 64-bit hash
to be used in a hash table, this must be reduced to h(x) % m

trick:
restrict m to powers of two (implies g = 2)
the modulo becomes a fast bit mask: h(x) & (m - 1)
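The mask identity in one line (standard for any power-of-two m):

```python
# For m a power of two, n % m equals n & (m - 1): the mask keeps exactly
# the low log2(m) bits of the hash.
m = 1 << 4            # m = 16
h = 0xDEADBEEF        # some 32-bit hash value
assert h % m == h & (m - 1)
print(h & (m - 1))    # -> 15 (the low 4 bits, hex digit F)
```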

11
Q

randomized properties of hash functions

A

Universal hashing: two distinct keys collide with probability at most 1/m
Pairwise independence: the probability that two distinct keys receive some particular pair of hash values is 1/m^2
k-independence: the probability that any k distinct keys receive some particular hash values is 1/m^k
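A classic universal family is the Carter-Wegman construction h(x) = ((a*x + b) mod p) mod m, with a prime p larger than any key and random a, b (a sketch; the specific constants are my own choice):

```python
# Universal hashing sketch: drawing random (a, b) picks one function from
# the family; any two distinct keys collide with probability ~1/m or less.
import random

p = (1 << 61) - 1      # a Mersenne prime, assumed larger than all keys
m = 1 << 10            # table size

def random_hash():
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

h = random_hash()
```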

12
Q

Tabulation hashing, adv and disadv

A

break up a key of r * t bits into t small fixed-size chunks of r bits each
for each chunk, look up a table of 2^r random entries (one table per chunk position, so the random state is t tables of 2^r entries)
XOR the lookup results
example with a 32-bit key x and 8-bit chunks x0..x3:
h(x) = t[0][x0] XOR t[1][x1] XOR t[2][x2] XOR t[3][x3]

Adv: 3-independent, can be implemented efficiently in hardware
Disadv: the lookup tables consume CPU cache; the random state grows with the key size
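The 32-bit example as runnable code (the tables are filled with random 32-bit values, seeded here only for reproducibility):

```python
# Tabulation hashing: one lookup table per 8-bit chunk, results XORed.
import random

random.seed(1)
# 4 tables (one per byte position), each with 2^8 random 32-bit entries
T = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tab_hash(x):
    h = 0
    for i in range(4):
        chunk = (x >> (8 * i)) & 0xFF   # extract the i-th 8-bit chunk
        h ^= T[i][chunk]                # XOR the per-chunk table entries
    return h

print(hex(tab_hash(0xDEADBEEF)))
```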

13
Q

Open Addressing

A

chaining needs at least two dependent memory accesses even for a hit
open addressing stores the entries directly in the hash table

14
Q

how to handle collisions in open addressing?

A

Open addressing with linear probing
on a collision, insert into the next free position:
pos = (pos + 1) % m
if m is a power of 2, this is cheaper (bit mask instead of division)
a lookup must check entries until the key or a free slot is found
the fill factor f must be below 1 for good performance
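A linear-probing sketch (illustrative; it omits deletion and assumes the table never fills up):

```python
# Open addressing with linear probing: entries live directly in the array;
# a collision moves to the next slot, wrapping around via a bit mask.
class LinearProbingTable:
    def __init__(self, m=16):              # m must be a power of two here
        self.m = m
        self.slots = [None] * m            # each slot: None or (key, value)

    def insert(self, key, value):
        pos = hash(key) & (self.m - 1)
        while self.slots[pos] is not None and self.slots[pos][0] != key:
            pos = (pos + 1) & (self.m - 1)   # probe the next position
        self.slots[pos] = (key, value)

    def lookup(self, key):
        pos = hash(key) & (self.m - 1)
        while self.slots[pos] is not None:   # stop at the first free slot
            if self.slots[pos][0] == key:
                return self.slots[pos][1]
            pos = (pos + 1) & (self.m - 1)
        return None

t = LinearProbingTable()
t.insert("x", 1)
t.insert("y", 2)
```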

15
Q

at high fill factors, OA leads to long probe sequence lengths (clustering effect)
what is a way to get better space utilization?

A

have fixed-size groups of entries (e.g. each group can hold four 16-byte entries)
typically a group has the size of a cache line
may require a counter for each group
SIMD can be used to speed up the comparison with several elements (compare your key against all keys in a group with one instruction)

16
Q

Open Addressing plus Chaining

A

store first elem in array
chain on collision

17
Q

Chaining vs Open Addressing

A

different-size entries: easier to support with chaining
growth: the rehashing operation is faster with OA
chaining: lots of cache misses, as items are spread across memory
duplicate keys: the chain length grows, but this is no problem for chaining
high fill factors: chaining is better (no exploding probe sequence lengths), but it has overhead through pointers; OA doesn't have pointers
entry copying:
large entries -> if the hash table grows, OA has to copy everything
with chaining, entries are moved (relinked) from the old hash table to the new one, no copying

18
Q

linear probing leads to clusters of entries and therefore long probe sequences. What is a solution?

A

Robin Hood hashing
re-order elements during insertion based on their probe sequence length
(an element with a long probe sequence displaces an element that has a shorter one)
similar performance to linear probing for small fill factors
works better than linear probing at larger fill factors
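The insertion rule can be sketched like this (a toy version over a power-of-two table; duplicate keys and table-full handling are omitted):

```python
# Robin Hood insertion: while probing, if the resident entry is closer to
# its preferred slot than the entry in hand ("richer"), the two swap.
def rh_insert(slots, key, value, m):
    pos = hash(key) & (m - 1)
    dist = 0                         # probe distance of the entry in hand
    entry = (key, value)
    while slots[pos] is not None:
        rk, rv = slots[pos]
        resident_dist = (pos - (hash(rk) & (m - 1))) & (m - 1)
        if resident_dist < dist:     # resident is richer: take its slot
            slots[pos], entry = entry, slots[pos]
            dist = resident_dist     # continue inserting the evicted entry
        pos = (pos + 1) & (m - 1)
        dist += 1
    slots[pos] = entry

m = 16
slots = [None] * m
for i, k in enumerate(["a", "b", "c", "d"]):
    rh_insert(slots, k, i, m)
```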

19
Q

What is probe sequence length?

A

Distance of key from its preferred location

20
Q

Cuckoo Hashing

A

Open-addressing approach that guarantees O(1) lookup and remove performance in the worst case
idea: two hash tables with different hash functions
lookup: check exactly 2 positions
insert: pick either of the 2 positions; if one is free, put the key there. If both are occupied, try to rearrange keys
the rearrangement may fail due to a cycle: rebuild using different hash functions
only works for low fill factors
can be fast: the two memory accesses are independent
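A minimal cuckoo sketch (two tables, illustrative hash functions; a real implementation would rebuild with fresh hash functions when insert reports a cycle):

```python
# Cuckoo hashing: every key has exactly one slot in each of two tables.
# Insert places the key and evicts the resident, which then moves to its
# alternative table, and so on; too many kicks signal a likely cycle.
M = 16
MAX_KICKS = 32

def h0(k): return hash(("left", k)) & (M - 1)    # hash for table 0
def h1(k): return hash(("right", k)) & (M - 1)   # hash for table 1

def lookup(tables, key):
    for t, h in ((0, h0), (1, h1)):              # exactly two probes
        e = tables[t][h(key)]
        if e is not None and e[0] == key:
            return e[1]
    return None

def insert(tables, key, value):
    entry, t = (key, value), 0
    for _ in range(MAX_KICKS):
        pos = (h0, h1)[t](entry[0])
        entry, tables[t][pos] = tables[t][pos], entry  # place, evict resident
        if entry is None:
            return True
        t = 1 - t                    # evicted entry tries its other table
    return False                     # likely a cycle: caller must rebuild

tables = [[None] * M, [None] * M]
insert(tables, "a", 1)
insert(tables, "b", 2)
```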

21
Q

cuckoo hashing: insert(z)

A

z hashes exactly to the locations of x and y. We have to put it in one of those locations -> kick out y, since y also has another possible location that may be free (each key has two possible spots)
if the other spot for y is also occupied -> continue and kick out the key occupying it -> this can run into a cycle

22
Q

Static perfect hashing

A

FKS hashing: O(1) worst-case lookup time and amortized O(1) insertion time
bulk creation, two levels (looks like a tree):
1. 2(n-1) partitions (n: number of elements)
2. each partition is a hash table of size r^2 for the r entries in that partition; on a collision, pick another hash function

problem: large space consumption, two dependent cache misses

23
Q

Concise Hash Table: idea, construction

A

static hash table optimized for space
idea: open addressing with linear probing, but compress away the empty slots

Construction: three phases, assuming unique keys
1. hash each key, go to that bit in a bitmap and set it; if the bit is already set -> collision, go to the next bit
2. compute prefix popcounts for the bitmap
3. hash each key again and insert it into a dense array at the position given by the prefix popcount
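The three phases as a toy implementation (unique keys, one list entry per bit; real concise hash tables pack the bitmap into machine words and use hardware popcount):

```python
# Concise hash table construction: a bitmap records which probing slots
# are used; prefix popcounts then map each used slot to a dense array index.
def build(keys, m):
    bitmap = [0] * m
    # phase 1: set one bit per key, probing linearly on collisions
    for k in keys:
        pos = hash(k) % m
        while bitmap[pos]:
            pos = (pos + 1) % m
        bitmap[pos] = 1
    # phase 2: prefix popcount = number of set bits before each position
    prefix = [0] * m
    for i in range(1, m):
        prefix[i] = prefix[i - 1] + bitmap[i - 1]
    # phase 3: hash again, re-probe the same way (same key order), and
    # place each key densely at index prefix[pos] (no empty slots remain)
    dense = [None] * len(keys)
    for k in keys:
        pos = hash(k) % m
        while dense[prefix[pos]] is not None:
            pos = (pos + 1) % m
        dense[prefix[pos]] = k
    return bitmap, prefix, dense

bitmap, prefix, dense = build(["a", "b", "c"], 16)
```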

24
Q

should hashes be stored or not?

A

hash functions can have collisions -> for correctness, the table must store the full keys and compare them on lookup
the hash can always be recomputed from the key -> no need to store it
but storing it speeds up rehashing, can be faster than comparing keys when the comparison would cause cache misses, and can speed up negative comparisons for complex keys

25
Q

Table growth: cache locality

A

in power-of-two tables, doubling means we use one more hash bit
two possibilities:
least significant bit: index 01 may become 001 or 101
most significant bit: index 01 may become 010 or 011
MSB: the front bits stay the same, we just gain more resolution -> entries keep their relative order, so the access pattern during rehashing is better
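The two growth schemes from the example, made concrete (function names are illustrative):

```python
# Doubling a power-of-two table consumes one more hash bit. Where that bit
# goes decides how an old index splits into two new ones.
def lsb_growth(old_idx, new_bit, old_bits):
    # the new bit becomes the highest index bit: 01 -> 001 or 101
    return (new_bit << old_bits) | old_idx

def msb_growth(old_idx, new_bit):
    # the new bit becomes the lowest index bit: 01 -> 010 or 011
    return (old_idx << 1) | new_bit

# MSB growth keeps the front bits (and thus the order of entries) intact:
print(bin(msb_growth(0b01, 0)), bin(msb_growth(0b01, 1)))  # 0b10 0b11
```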