Week 3 Flashcards
Two methods to optimize TR system
Compression
- document indexes consume a huge amount of memory and disk
- losslessly compress the index integers to save disk space and I/O
Caching
- Cache sits between the front end (FE) and the index on disk
- Saves I/O
What is a Tokenizer
- represents a doc in the TR system
- segments the raw string into terms and counts them
Input - My name is Mustafa
Output - { my: 1, name: 1, is: 1, mustafa: 1 } - assign term and doc IDs
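The card's input/output pair can be sketched in a few lines of Python (a minimal sketch assuming simple lowercasing and whitespace splitting; real tokenizers also strip punctuation and may stem terms):

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace, then count each term.
    return Counter(text.lower().split())

print(tokenize("My name is Mustafa"))
# Counter({'my': 1, 'name': 1, 'is': 1, 'mustafa': 1})
```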
What are tokenizer docs represented by
Feature vectors; the process is called feature generation
What is Indexer
Builds the index, loading only part of the raw corpus into memory at a time
Enables parsing queries fast enough for a usable search engine
What is inverted index
Quick lookup of documents given query terms - make full text search possible
What is TF-IDF? Reduced what? Measures what?
Semantic representation method that reduces a side effect of bag of words (TF)
- TF over-emphasizes raw word counts, so high-frequency terms become over-weighted semantic features
- IDF measures a term's role in the entire set of documents
tf(t,d) = f_d(t) / max_w f_d(w)
idf(t,D) = ln( |D| / |{d ∈ D : t ∈ d}| )
tfidf(t,d,D) = tf(t,d) × idf(t,D)
f_d(t) = frequency of term t in document d, D = corpus of documents
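The three formulas above translate directly to Python (a sketch assuming each document is a term-count dict and the corpus is a list of such dicts):

```python
import math

def tf(t, d):
    # d is a term-count dict for one document: f_d(t) / max f_d(w)
    return d[t] / max(d.values())

def idf(t, D):
    # D is the corpus: ln(|D| / number of docs containing t)
    df = sum(1 for d in D if t in d)
    return math.log(len(D) / df)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)
```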
What is compression? What does it exploit? What does it support?
Compresses the large posting file; the compressed index is both smaller and faster.
Exploits the non-uniform distribution of values
Supports random-access decoding
Inverted index entries are sorted by DocID, so only the gaps need be stored
- DocIDs = {23, 25, 34, 35, 39, 43}
- Compressed gaps = {23, 2, 9, 1, 4, 4}
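The gap (delta) encoding shown above can be sketched like this:

```python
def gap_encode(doc_ids):
    # Keep the first ID, then store differences between consecutive IDs.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def gap_decode(gaps):
    # Running sum reconstructs the original DocIDs.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(gap_encode([23, 25, 34, 35, 39, 43]))  # [23, 2, 9, 1, 4, 4]
```

The gaps are small, so they compress well with the variable-length codes described next.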
What is bitwise compression
Uses raw binary codes with no fixed length or width.
Performs unary encoding, gamma encoding, or delta encoding
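As one example of a bitwise code, Elias gamma encoding writes a unary length prefix followed by the binary digits of the number (a sketch using Python strings of '0'/'1' for readability; a real index packs these into machine words):

```python
def gamma_encode(n):
    # Elias gamma for n >= 1: (len-1) zeros, then n in binary.
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def gamma_decode(bits):
    # Count leading zeros, then read that many more bits as binary.
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

print(gamma_encode(9))  # '0001001'
```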
What is Block compression
Reads bytes at a time rather than bits
Only one bitwise operation per byte is required
Reduces # of CPU instructions in decoding by using more memory
Decoder sums each byte; bytes are chained backwards.
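A common block scheme is variable-byte encoding (a sketch of one standard variant, assuming 7 payload bits per byte with the high bit marking a number's last byte):

```python
def vbyte_encode(n):
    # 7 payload bits per byte; set the high bit on the final byte.
    out = [n % 128]
    n //= 128
    while n:
        out.append(n % 128)
        n //= 128
    out[0] |= 128                 # mark the terminating byte
    return list(reversed(out))

def vbyte_decode(stream):
    # Accumulate 7 bits per byte; the high bit ends one number.
    nums, n = [], 0
    for b in stream:
        if b & 128:
            nums.append(n * 128 + (b & 127))
            n = 0
        else:
            n = n * 128 + b
    return nums

print(vbyte_encode(300))  # [2, 172]
```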
What is caching
- keep frequently accessed objects in memory (e.g. the posting table)
- Two goals
- most frequent terms are fast to access
- set a max size for cache
What is Zipf’s law
Term frequency is inversely proportional to rank, so a relatively small number of posting lists will be accessed most of the time
Least recently used cache
When the posting list for term ID x is to be retrieved
- Search the cache for the term ID
- If it exists, return the posting list for the ID
- If not, retrieve it from disk and add it to the cache
- If the cache size is exceeded, remove the least recently used posting list
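The steps above can be sketched with an OrderedDict (a minimal sketch; `fetch` is a hypothetical stand-in for loading a posting list from disk):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # loads a posting list from disk
        self.cache = OrderedDict()

    def get(self, term_id):
        if term_id in self.cache:
            self.cache.move_to_end(term_id)   # mark as recently used
            return self.cache[term_id]
        postings = self.fetch(term_id)        # miss: go to disk
        self.cache[term_id] = postings
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return postings
```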
Double barrel LRU cache
Simplified approximation of an LRU cache
Uses two hash tables - primary + secondary
When term ID x is to be retrieved
1. Search in primary; if the ID exists, return it
2. If not, search secondary
3. If in secondary, delete from secondary, insert into primary, then return
4. If primary size is exceeded, empty the secondary table and swap the tables
5. If not in secondary, retrieve from disk and insert into secondary
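The five steps above can be sketched as follows (a sketch under the card's description; `fetch` is a hypothetical disk-load function):

```python
class DoubleBarrelCache:
    """Approximate LRU using two plain hash tables."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity      # max size of the primary table
        self.fetch = fetch            # loads a posting list from disk
        self.primary, self.secondary = {}, {}

    def get(self, term_id):
        if term_id in self.primary:               # step 1
            return self.primary[term_id]
        if term_id in self.secondary:             # steps 2-3
            postings = self.secondary.pop(term_id)
            self.primary[term_id] = postings
            if len(self.primary) > self.capacity:  # step 4
                # Discard old secondary; demote full primary to secondary.
                self.primary, self.secondary = {}, self.primary
            return postings
        postings = self.fetch(term_id)            # step 5
        self.secondary[term_id] = postings
        return postings
```

Recently hit entries survive the swap because they get re-promoted from secondary before it is emptied; only one table ever needs eviction bookkeeping, which is the simplification over true LRU.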
Scorer / Ranker
Design ranking function
Design ranking function, key challenge
f(q,d)
q - query, d - document
A good ranking function ranks relevant documents above non-relevant ones
Key challenge - finding a measure of the likelihood that document d is relevant to query q
Types of retrieval models
Boolean Model, similarity based model, probabilistic model, probabilistic inference model, axiomatic model
What is the Boolean model
Query and documents as a set of terms
Use Boolean expression to join and match terms
What is the similarity based model
f(q,d) = similarity(q,d)
Vector space model
Probabilistic models
Classic probabilistic
Language model
Divergence from randomness model
What are common ideas of TR models
Tokenization, Term Frequency, Document Length, Document frequency.
Similarity based vector space model
Represents a doc/query by a term vector
A term denotes a word/phrase/concept and defines one dimension; N terms define an N-dimensional space
Relevance is measured by the similarity of the query vector and the doc vector
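A common choice for that similarity is the cosine of the angle between the two vectors (a sketch assuming sparse vectors stored as term-weight dicts):

```python
import math

def cosine(q, d):
    # q and d are term-weight dicts (sparse vectors).
    dot = sum(w * d.get(t, 0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd)
```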
Challenges of VSM
Matching keywords with concrete meaning deserves more credit
Matching common/auxiliary words should be downgraded
Solutions of VSM
Use term frequency (TF) weighting - higher counts of key words give more weight
Add inverse document frequency (IDF) to downgrade the credit of common terms
What is TF transformation BM25
- Sublinear TF transformation to avoid dominance by terms with big counts
- Has an upper bound to prevent the score from rising too high
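The BM25 TF transformation is a one-liner (k is the standard tuning parameter; 1.2 is a commonly used default):

```python
def bm25_tf(tf, k=1.2):
    # Sublinear: grows with tf but is bounded above by k + 1.
    return (k + 1) * tf / (tf + k)
```

At tf = 0 the score is 0, at tf = 1 it is exactly 1, and as tf grows it approaches but never exceeds k + 1, which is the upper bound the card mentions.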