Search Trees Flashcards
(37 cards)
Ordered Data Sets
hashing stores items in random order
data should be ordered by key while still supporting insert, remove, lookup
additional operations enabled by ordering: lower bound, upper bound, range scan
Standard Binary Search (arrays)
log2(n) steps if the array is sorted
0.5 branch misses per comparison on avg for random searches
access pattern is not cache friendly
on large arrays it is slower than a search in a B-tree
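A minimal lower-bound binary search over a sorted array, as described above; the data-dependent branch on each comparison is the one the predictor misses about half the time for random keys:

```python
def lower_bound(arr, key):
    """Return the index of the first element >= key (len(arr) if none)."""
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < key:   # outcome is ~50/50 for random searches
            lo = mid + 1
        else:
            hi = mid
    return lo
```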
Branch-Free Binary Search
the conditional branch can be transformed into a data dependency (e.g., a conditional move)
no branch misses for small arrays
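A sketch of the branch-free variant: the comparison result is used as an arithmetic 0/1 inside the index computation instead of steering a branch, so a compiler can emit a conditional move (the function name `lower_bound_branchfree` is illustrative, not from the source):

```python
def lower_bound_branchfree(arr, key):
    """Lower-bound search with the comparison folded into the arithmetic:
    the loop keeps a base index and a remaining length, halving the length
    each step; there is no if/else on the comparison outcome."""
    base, n = 0, len(arr)
    while n > 1:
        half = n // 2
        # (arr[...] < key) evaluates to 0 or 1 -- a data dependency
        base += (arr[base + half - 1] < key) * half
        n -= half
    # final step: move past the last candidate if it is still < key
    return base + (n > 0 and arr[base] < key)
```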
Interpolation Search
assumes keys are uniformly distributed (distances between elems are the same or close to it), e.g., 10 13 15 26 28 30
a single, very large or very small outlier will destroy interpolation
can be combined with binary search
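A minimal interpolation search sketch: instead of probing the middle, it guesses the position of the key from its value relative to the endpoints, which works well exactly when keys are near-uniform:

```python
def interpolation_search(arr, key):
    """Point lookup in a sorted array; returns an index of key or -1."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= key <= arr[hi]:
        if arr[hi] == arr[lo]:
            pos = lo
        else:
            # linear interpolation between the two endpoints
            pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == key:
            return pos
        if arr[pos] < key:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

A single large outlier endpoint skews the interpolated `pos` toward one end, which is why outliers destroy the logarithmic-like behavior.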
Linear Search
O(n) steps
can be improved using SIMD instructions
good locality
array doesn’t have to be sorted for point search
best solution for small arrays
Blocking Layout
store hot keys together in blocks
recursive structure
improves locality
allows use of SIMD instructions (k-ary search)
Growing Arrays
better locality than linked lists
one downside is growth, which is needed because of the limited capacity
standard approach: exponential growth plus copy
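A sketch of the standard scheme: double the capacity and copy on overflow, which makes appends amortized O(1). (Python lists already do this internally; the class below just makes the mechanism explicit.)

```python
class GrowingArray:
    """Fixed backing array that is doubled and copied when it overflows."""
    def __init__(self):
        self.capacity = 4
        self.size = 0
        self.data = [None] * self.capacity

    def append(self, x):
        if self.size == self.capacity:
            self.capacity *= 2                 # exponential growth ...
            new_data = [None] * self.capacity
            new_data[:self.size] = self.data   # ... plus copy
            self.data = new_data
        self.data[self.size] = x
        self.size += 1
```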
Growing Arrays without Copying
a directory of 64 pointers, each pointing to an array of values
good locality
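A sketch of one common way to realize this (an assumption about the exact layout): directory slot i points to a chunk of 2^i elems, so 64 slots cover 2^64 - 1 elements and existing chunks are never moved or copied:

```python
class NoCopyGrowingArray:
    """Growth without copying: chunk i holds 2**i elements; element j
    lives in chunk floor(log2(j+1)) at offset j + 1 - 2**chunk.
    Old chunks never move, so references into the array stay valid."""
    def __init__(self):
        self.chunks = [None] * 64   # 64 pointers cover 2**64 - 1 elems
        self.size = 0

    @staticmethod
    def _locate(j):
        chunk = (j + 1).bit_length() - 1
        return chunk, j + 1 - (1 << chunk)

    def append(self, x):
        chunk, off = self._locate(self.size)
        if self.chunks[chunk] is None:
            self.chunks[chunk] = [None] * (1 << chunk)  # allocate, no copy
        self.chunks[chunk][off] = x
        self.size += 1

    def __getitem__(self, j):
        chunk, off = self._locate(j)
        return self.chunks[chunk][off]
```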
Comparison-based Search Trees (binary search trees)
compare one elem at a time
each comparison rules out half the elems (if tree is balanced)
log2(n) comparisons and height
Binary Search Trees
each node stores one key/value pair and two child pointers
complex rebalancing logic
bad cache locality
pointer-based: elems don't have to be moved physically during rebalancing
Advantages:
inserting an element only allocates a new entry and links it into the tree
rebalancing does not copy entries; only the pointer structure of the tree changes
Skip List
elems in sorted linked list + additional shortcuts on top
probabilistic data structure (no worst case guarantees)
all items are stored in sorted order by key
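A minimal skip list sketch; the dict-based node layout, promotion probability 1/2, and `MAX_LEVEL = 16` are illustrative choices, not from the source:

```python
import random

class SkipList:
    """Sorted linked list at level 0 plus probabilistic express lanes:
    each node is promoted one level with probability 1/2, giving
    O(log n) expected search, but no worst-case guarantee."""
    MAX_LEVEL = 16

    def __init__(self):
        # sentinel head with forward pointers at every level
        self.head = {'key': None, 'next': [None] * self.MAX_LEVEL}
        self.level = 1

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL  # rightmost node < key per level
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node['next'][i] is not None and node['next'][i]['key'] < key:
                node = node['next'][i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = {'key': key, 'next': [None] * lvl}
        for i in range(lvl):
            new['next'][i] = update[i]['next'][i]
            update[i]['next'][i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node['next'][i] is not None and node['next'][i]['key'] < key:
                node = node['next'][i]
        node = node['next'][0]
        return node is not None and node['key'] == key
```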
B+-tree
variant of the B-tree where inner nodes store only keys and child pointers, no values
leaf nodes are often chained to simplify range scan
Sorted Array
worst case for n elems:
lookup: log2(n)
insert: n
Sorted Blocks of Fixed Size
split array into small sorted arrays of b elems, b is fixed
Inner Nodes
indicate path to leaf node that has actual data
tree height: logb(n)
comparisons: log2(n)
B-tree logic decomposition
two parts:
1. structure-modification operations (split, merge, new root) determine the shape of the B-tree
2. per-node operations (lower bound, insert, delete, iterate) work on nodes locally
(2) can be changed without affecting high level logic (1)
Physical Node Layouts
most common layout: array of sorted key/value pairs
binary search within each node
insert/delete moves half of the elems on avg (shifting to the right)
alternatives: store keys and values separately; unsorted leaf nodes but sorted inner nodes; encode a binary search tree, trie, or hash table within the node
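A sketch of the most common layout: binary search to find the slot, then a shifting insert (here via `list.insert`, which moves the tail entries one slot, i.e., half of them on average):

```python
import bisect

def node_insert(keys, values, key, value):
    """Insert into a node stored as a sorted array of key/value pairs.
    bisect_left is the binary search within the node; list.insert
    shifts all entries to the right of the slot."""
    pos = bisect.bisect_left(keys, key)
    keys.insert(pos, key)
    values.insert(pos, value)
    return pos
```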
what is the best node size?
in main memory, node size (b) can be freely chosen
node size should be at least cache line size
optimal page size for lookup is bigger than for inserts
B-trees with variable-length data
bad cache locality, because the pointer must be dereferenced and its target generally will not be in cache
lots of memory allocations
memory fragmentation from the memory allocator
Slotted page layout
enables variable-length data in sorted order
slots decouple logical from physical ordering
slots store offset and length
Compaction
deletion in slotted page leaves holes
keep holes after deletion, but keep track of how many bytes are wasted
when running out of space, compaction can be triggered if it's beneficial
best implemented by copying all keys to new empty page
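A combined sketch of a slotted page with deferred compaction; the buffer size, the 8-byte slot cost, and all method names are illustrative assumptions:

```python
class SlottedPage:
    """Records grow downward from the end of a fixed byte buffer; the
    slot array (offset, length) grows from the front. Deletion only
    invalidates the slot and tracks the wasted bytes; compaction copies
    the live records into a fresh empty page."""
    def __init__(self, size=4096):
        self.size = size
        self.data = bytearray(size)
        self.slots = []        # (offset, length); None marks a deleted slot
        self.free_end = size   # records are appended downward from here
        self.wasted = 0        # bytes lost to holes from deletions

    def _slot_space(self):
        return 8 * len(self.slots)  # assume each slot costs 8 bytes

    def insert(self, record):
        need = self._slot_space() + 8
        if self.free_end - len(record) < need and self.wasted > 0:
            self._compact()     # only now is the copying work paid for
        if self.free_end - len(record) < need:
            raise MemoryError("page full")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1

    def delete(self, slot_id):
        off, length = self.slots[slot_id]
        self.slots[slot_id] = None
        self.wasted += length   # keep the hole, just account for it

    def read(self, slot_id):
        off, length = self.slots[slot_id]
        return bytes(self.data[off:off + length])

    def _compact(self):
        fresh = bytearray(self.size)  # copy live records to an empty page
        end = self.size
        for i, slot in enumerate(self.slots):
            if slot is None:
                continue
            off, length = slot
            end -= length
            fresh[end:end + length] = self.data[off:off + length]
            self.slots[i] = (end, length)
        self.data, self.free_end, self.wasted = fresh, end, 0
```

The slot indirection is what decouples logical from physical ordering: compaction moves the bytes but slot ids stay valid.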
Separator Choice
when splitting node: choose shortest key around median as separator
Suffix Truncation of Separator
keys in inner nodes are only used as signposts and do not have to equal keys stored in the leaf nodes
if separator is ABCDE and next key is AXYZ, separator can be truncated to AX
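A small sketch of suffix truncation: take the shortest prefix of the right node's smallest key that still separates it from the left node's largest key (the function name is illustrative):

```python
def shortest_separator(left_max, right_min):
    """Shortest s with left_max < s <= right_min, taken as a prefix of
    right_min. E.g. between b'ABCDE' and b'AXYZ' the prefix b'AX'
    already separates the two leaves."""
    for i in range(1, len(right_min) + 1):
        candidate = right_min[:i]
        if candidate > left_max:
            return candidate
    return right_min
```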
Prefix Truncation, its difficulty
extract common prefix between smallest and largest key on node
on lookup, compare prefix first
speeds up comparison and saves space
issue: insertion of a key may cause the prefix to shrink -> all keys need to be enlarged and may no longer fit into the node, causing a split
solution: store fence keys on each page
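A sketch of the prefix-truncation mechanics: extract the common prefix of the node's boundary keys (with fence keys, the prefix is fixed per page), then compare search keys against the prefix first and the stored suffix second. Both function names are illustrative:

```python
def common_prefix(lo, hi):
    """Common prefix of the node's smallest and largest key; every key
    in the node shares it, so it can be stored once per node."""
    n = 0
    while n < min(len(lo), len(hi)) and lo[n] == hi[n]:
        n += 1
    return lo[:n]

def node_compare(prefix, truncated_key, search_key):
    """Three-way compare of a full search key against a stored,
    prefix-truncated key: prefix first, then the suffix."""
    head, tail = search_key[:len(prefix)], search_key[len(prefix):]
    if head != prefix:
        return -1 if head < prefix else 1
    return (tail > truncated_key) - (tail < truncated_key)
```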