InIndexes Implementation Flashcards
(11 cards)
How can we use an array to represent an index dictionary?
Create a list of structs sorted by the term name. Compress the dictionary and postings list.
Binary search to look up the term and then follow the pointer to the list
What are the main 2 methods for implementing a dictionary for an index?
- Hashtables
- Search trees
How does a hash table work for an index?
Each term is hashed to an integer key which is used to determine the spot in the array where the key-value pair will be stored
What are the pros and cons of using hash tables for indexes?
Pros:
1. Lookup is faster than the search tree methof
Cons:
1. There is no easy way to find minor variants
2. Cannot perform prefix search
3. If the vocabulary grows, you have to do an expensive rehashing operation to avoid collisions
What are some methods for dealing with hash table collisions?
- Linear probing
- Quadratic probing
- Double hashing
- Separate chaining
What is separate chaining?
A method of avoiding hash collisions where any time there is a collision, a new entry is added to a linked list at that spot. This is how the inverted index functions
What are some issues associated with search trees?
- Not all search trees are balanced so some nodes will require near O(n) access time.
- Record deletions may leave some nodes in the tree nearly empty which results in performance deterioration
Why is it important to have a balanced tree?
Guarantees nodes will be at a high level. Limits the disk accesses necessary for search
What is a B-Tree?
A “balanced tree”. It ensures that the tree is always balanced and space wasted by deletions never becomes excessive
How many disk accesses does searching require for a B Tree?
Successful search: costs at most h+1 disk accesses for level h tree
Unsuccessful search: Always requires h disk accesses
What are the pros and cons of using a B-Tree for an index?
Pros:
1. Solves the prefix search problem
Cons:
1. Slower O(log N)