Week 3 Flashcards
Two methods to optimize TR system
Compression
- document indexes consume a huge amount of memory and disk
- losslessly compress the index integers to save disk space and I/O
Caching
- Cache sits between the front end (FE) and the index on disk
- Saves I/O
What is a Tokenizer
- represents a doc in the TR system
- segments the raw string into terms and counts them
Input - My name is Mustafa
Output - { my: 1, name: 1, is: 1, mustafa: 1 } - assign term and doc IDs
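The card's input/output pair can be sketched in a few lines of Python (a minimal sketch assuming simple lowercasing and whitespace splitting; real tokenizers also strip punctuation and may stem terms):

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace, then count each term.
    return Counter(text.lower().split())

print(tokenize("My name is Mustafa"))
# Counter({'my': 1, 'name': 1, 'is': 1, 'mustafa': 1})
```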
What are tokenizer docs represented by
Feature vectors; the process is called feature generation
What is Indexer
Builds the index, loading only part of the raw corpus into memory at a time
Enables parsing queries fast enough for a usable search engine
What is inverted index
Quick lookup of documents given query terms - make full text search possible
What is TF-IDF? Reduced what? Measures what?
Semantic representation method that reduces a side effect of bag of words (TF)
- TF over-emphasizes raw word counts, so high-frequency terms become over-weighted semantic features
- IDF measures a term's role in the entire set of documents
tf(t,d) = f_d(t) / max_w f_d(w)
idf(t,D) = ln( |D| / |{d ∈ D : t ∈ d}| )
tfidf(t,d,D) = tf(t,d) × idf(t,D)
f_d(t) = frequency of term t in document d, D = corpus of documents
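The three formulas above translate directly to Python (a sketch assuming each document is a term-count dict and the corpus is a list of such dicts):

```python
import math

def tf(t, d):
    # d is a term-count dict for one document: f_d(t) / max f_d(w)
    return d[t] / max(d.values())

def idf(t, D):
    # D is the corpus: ln(|D| / number of docs containing t)
    df = sum(1 for d in D if t in d)
    return math.log(len(D) / df)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)
```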
What is compression? What does it exploit? What does it support?
Compresses the large posting file; the compressed index is both smaller and faster.
Exploits the non-uniform distribution of values
Supports random-access decoding
Inverted index entries are sorted by DocID, so only the gaps need be stored
- DocIDs = {23, 25, 34, 35, 39, 43}
- Compressed gaps = {23, 2, 9, 1, 4, 4}
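The gap (delta) encoding shown above can be sketched like this:

```python
def gap_encode(doc_ids):
    # Keep the first ID, then store differences between consecutive IDs.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    # Running sum reconstructs the original DocIDs.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(gap_encode([23, 25, 34, 35, 39, 43]))  # [23, 2, 9, 1, 4, 4]
```

The gaps are small, so they compress well with the variable-length codes described next.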
What is bitwise compression
Uses raw binary codes with no fixed length or width.
Performs unary encoding, gamma encoding, or delta encoding
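As one example of a bitwise code, Elias gamma encoding writes a unary length prefix followed by the binary digits of the number (a sketch using Python strings of '0'/'1' for readability; a real index packs these into machine words):

```python
def gamma_encode(n):
    # Elias gamma for n >= 1: (len-1) zeros, then n in binary.
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    # Count leading zeros, then read that many more bits as binary.
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))  # '0001001'
```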
What is Block compression
Reads bytes at a time rather than bits
Only one bitwise operation per byte is required
Reduces # of CPU instructions in decoding by using more memory
Decoder sums each byte; bytes are chained backwards.
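A common block scheme is variable-byte encoding (a sketch of one standard variant, assuming 7 payload bits per byte with the high bit marking a number's last byte):

```python
def vbyte_encode(n):
    # 7 payload bits per byte; set the high bit on the final byte.
    out = [n % 128]
    n //= 128
    while n:
        out.append(n % 128)
        n //= 128
    out[0] |= 128                 # mark the terminating byte
    return list(reversed(out))

def vbyte_decode(stream):
    # Accumulate 7 bits per byte; the high bit ends one number.
    nums, n = [], 0
    for b in stream:
        if b & 128:
            nums.append(n * 128 + (b & 127))
            n = 0
        else:
            n = n * 128 + b
    return nums

print(vbyte_encode(300))  # [2, 172]
```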
What is caching
- keep frequently accessed objects in memory (e.g. the posting table)
- Two goals
- most frequent terms are fast to access
- set a max size for cache
What is Zipf’s law
Term frequency is inversely proportional to rank, so a relatively small number of posting lists will be accessed most of the time
Least recently used cache
When the posting list for term ID x is to be retrieved
- Search the cache for the term ID
- If it exists, return the posting list for the ID
- If not, retrieve it from disk and add it to the cache
- If the cache size is exceeded, remove the least recently used posting list
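The steps above can be sketched with an OrderedDict (a minimal sketch; `fetch` is a hypothetical stand-in for loading a posting list from disk):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # loads a posting list from disk
        self.cache = OrderedDict()

    def get(self, term_id):
        if term_id in self.cache:
            self.cache.move_to_end(term_id)   # mark as recently used
            return self.cache[term_id]
        postings = self.fetch(term_id)        # miss: go to disk
        self.cache[term_id] = postings
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return postings
```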
Double barrel LRU cache
Simplified approximation of an LRU cache
Uses two hash tables - primary + secondary
When term ID x is to be retrieved
1. Search in primary; if the ID exists, return it
2. If not, search secondary
3. If in secondary, delete from secondary, insert into primary, then return
4. If primary size is exceeded, empty the secondary table and swap the tables
5. If not in secondary, retrieve from disk and insert into secondary
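The five steps above can be sketched as follows (a sketch under the card's description; `fetch` is a hypothetical disk-load function):

```python
class DoubleBarrelCache:
    """Approximate LRU using two plain hash tables."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity      # max size of the primary table
        self.fetch = fetch            # loads a posting list from disk
        self.primary, self.secondary = {}, {}

    def get(self, term_id):
        if term_id in self.primary:               # step 1
            return self.primary[term_id]
        if term_id in self.secondary:             # steps 2-3
            postings = self.secondary.pop(term_id)
            self.primary[term_id] = postings
            if len(self.primary) > self.capacity:  # step 4
                # Discard old secondary; demote full primary to secondary.
                self.primary, self.secondary = {}, self.primary
            return postings
        postings = self.fetch(term_id)            # step 5
        self.secondary[term_id] = postings
        return postings
```

Recently hit entries survive the swap because they get re-promoted from secondary before it is emptied; only one table ever needs eviction bookkeeping, which is the simplification over true LRU.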
Scorer / Ranker
Design ranking function
Design ranking function, key challenge
f(q,d)
q - query, d - document
A good ranking function ranks relevant documents above non-relevant ones
Key challenge - finding a measure of the likelihood that document d is relevant to query q
Types of retrieval models
Boolean Model, similarity based model, probabilistic model, probabilistic inference model, axiomatic model
What is the Boolean model
Query and documents as a set of terms
Use Boolean expression to join and match terms
What is the similarity based model
f(q,d) = similarity(q,d)
Vector space model
Probabilistic models
Classic probabilistic
Language model
Divergence from randomness model
What are common ideas of TR models
Tokenization, Term Frequency, Document Length, Document frequency.
Similarity based vector space model
Represents a doc/query by a term vector
A term denotes a word/phrase/concept and defines one dimension; N terms define an N-dimensional space
Relevance is measured by the similarity of the query vector and the doc vector
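A common choice for that similarity is the cosine of the angle between the two vectors (a sketch assuming sparse vectors stored as term-weight dicts):

```python
import math

def cosine(q, d):
    # q and d are term-weight dicts (sparse vectors).
    dot = sum(w * d.get(t, 0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd)
```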
Challenges of VSM
Matching keywords with concrete meaning deserves more credit
Matching common/auxiliary words should be downgraded
Solutions of VSM
Use term frequency (TF) weighting - higher counts of key words give more weight
Add inverse document frequency (IDF) to downgrade the credit of common terms
What is TF transformation BM25
- Sublinear TF transformation to avoid dominance by terms with big counts
- Has an upper bound to prevent the score from rising too high
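The BM25 TF transformation is a one-liner (k is the standard tuning parameter; 1.2 is a commonly used default):

```python
def bm25_tf(tf, k=1.2):
    # Sublinear: grows with tf but is bounded above by k + 1.
    return (k + 1) * tf / (tf + k)
```

At tf = 0 the score is 0, at tf = 1 it is exactly 1, and as tf grows it approaches but never exceeds k + 1, which is the upper bound the card mentions.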