Week 3 Flashcards
Information retrieval
Finding unstructured documents that satisfy an information need within large collections
Technologies for document indexing, search and retrieval
- e.g. search engines
Not only Google or Bing
- specialised search
Not only English
IR is not only text
- speech question answering - Hey Siri
- music retrieval - Shazam
Query
What the user conveys to the computer in an attempt to communicate the information need
Unstructured Data
In contrast to rigidly structured DBs, there is no obvious structure that is easy for a computer to deal with
- Could be some structure
- E.g. titles, paragraphs, HTML, XML, etc…
Information need
The topic the user wants to address (search for) on the web
Navigational Need
Query to take the user to a page (~10% of queries)
- typically company/business/org name
- Domain suffix
- Length of query < 3
Transactional need
~10% of queries focus on doing something on the web (e.g. buying, downloading, booking, etc…)
Informational need
Collect/retrieve/obtain some information
Use of question words, many phrases
Neither navigational nor transactional
Query length > 2
When is a document relevant?
If the user perceives that it contains information of value with respect to their information need
Measuring document relevance
Does it satisfy the user's need?
Has the user clicked on it?
How long did they spend on it?
Have they clicked on another?
Have they reformulated the query?
Is this a new query?
Representing queries
Keywords to model information need
Representing documents
Reduce each document to a set of index terms
- ‘meaning’ of document approximated by index terms
What should be in the index?
Words? Names? Dates? Set of pre-defined tags?
Search is then focused on the index
- Map user queries to the index i.e. determine the similarity of a query to a document via index terms
Indexing process
Acquire documents through crawling, feeds, deposits, etc…
Generate (compute) an index for each document
Store the index in an easy-to-search form
Index Search (retrieval) process
- Assist the user in formulating the query
- Transform the query (“index” it)
- Map the query’s representation against the index
- Retrieve and rank the results
- Help users browse the results (and re-formulate the query)
- Log user’s actions, etc…
How to choose index terms
(Most) important words?
Using titles? Words in bold? Quotes?
What (if any) background knowledge did you use?
What about combining or structuring words?
Have you thought about possible queries?
Have you “ranked” your words?
- How to decide on importance/relevance
- more generic vs more specific words
How about stop words? (the,and,of,it,…)
Term-document frequency matrix
rows of terms, columns of documents, each cell is the frequency of a term in the document
Sparse, massive matrices
Term-document incidence matrix
rows of terms, columns of documents, each cell is 1 if the term is in the document or 0 otherwise
Sparse, massive matrices
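A minimal Python sketch of building both matrices from an invented toy collection (the documents and whitespace tokenisation are assumptions for illustration):

```python
from collections import Counter

# Toy collection (invented for illustration)
docs = {
    "d1": "to be or not to be",
    "d2": "brutus killed caesar",
    "d3": "caesar was ambitious and brutus was noble",
}

# Vocabulary: all distinct terms in the collection
vocab = sorted({term for text in docs.values() for term in text.split()})

# Term-document frequency matrix: rows = terms, columns = documents,
# each cell = how often the term occurs in that document
freq_matrix = {
    term: {doc_id: Counter(text.split())[term] for doc_id, text in docs.items()}
    for term in vocab
}

# Term-document incidence matrix: 1 if the term occurs in the document, else 0
incidence_matrix = {
    term: {doc_id: int(count > 0) for doc_id, count in row.items()}
    for term, row in freq_matrix.items()
}

print(freq_matrix["to"])           # {'d1': 2, 'd2': 0, 'd3': 0}
print(incidence_matrix["caesar"])  # {'d1': 0, 'd2': 1, 'd3': 1}
```

For real collections these matrices are far too sparse and large to store explicitly, which motivates the inverted index below.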
Inverted Index
For each term t, store a list of all the documents that contain t; it is also useful to store frequency and positions.
Sorted by term first, then by docID
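A minimal Python sketch of building such an index, storing (docID, term frequency, positions) per posting; the toy documents are invented for illustration:

```python
from collections import defaultdict

# Toy collection (invented for illustration)
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "new home sales rise in new home market",
}

# Inverted index: term -> list of (docID, term frequency, positions)
inverted_index = defaultdict(list)
for doc_id in sorted(docs):                      # postings end up sorted by docID
    positions = defaultdict(list)
    for pos, term in enumerate(docs[doc_id].split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        inverted_index[term].append((doc_id, len(pos_list), pos_list))

print(inverted_index["home"])
# [(1, 1, [1]), (2, 1, [0]), (3, 2, [1, 6])]
```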
Boolean Queries
Simplest query model: find all documents in the collection that fully match the query
- Binary outcome for each document yes/no
Use operators:
AND
OR
NOT
Can combine operators e.g. AND NOT
Many systems hide the operators (e.g. space is used as AND)
Pros:
- Simple model: everything is considered as a set of terms
- Easy/efficient to implement
- Precise (document either matches or doesn’t match)
- Widely used for commercial, legal retrieval and for specialist searches (long, precise queries)
- Works well when we know what we want: user feels in control
Cons:
- Users are not good at formulating queries: possible confusion with natural language
- Feast or famine: AND gives too few results, OR gives too many
- No relevance ordering / ranking of results
- Ignores word order, word frequency, etc…
- Documents that are ‘close’ are not retrieved
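Treating each term's postings as a set of docIDs, the operators map directly onto set operations; the postings below are invented toy values:

```python
# Hypothetical postings (sets of docIDs) for three terms
brutus    = {1, 2, 4, 11, 31, 45, 173, 174}
caesar    = {1, 2, 4, 5, 6, 16, 57, 132}
calpurnia = {2, 31, 54, 101}

print(sorted(brutus & caesar))                # AND      -> [1, 2, 4]
print(sorted(brutus | caesar))                # OR       -> all docs with either term
print(sorted((brutus & caesar) - calpurnia))  # AND NOT  -> [1, 4]
```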
Boolean Queries AND Merge
Term1 AND Term2
Find Term1 and Term2 in the dictionary
Merge the results so that you are left with the documents that occur in both
Crucial that postings are sorted by docID so this merge can be done efficiently
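A sketch of the classic two-pointer merge over sorted postings lists (the docIDs below are hypothetical):

```python
def intersect(p1, p2):
    """Merge two postings lists (sorted by docID) for Term1 AND Term2."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])    # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

# Hypothetical postings, already sorted by docID
print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]
```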
Extended boolean model
Proximity operators
Embed term positions in the inverted index (proximity index)
Queries can then refer to:
/n - within n words (e.g. /3 = within 3 words)
/s - in the same sentence, /p in the same paragraph
+s = term1 must precede term2 in the same sentence
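One way a /k check could be implemented, given the positions of two terms inside the same document from a proximity index (a sketch, not the exact course algorithm):

```python
def within_k(positions1, positions2, k):
    """Return True if some occurrence of term1 is within k words of term2.
    Both lists hold sorted word offsets of the terms in the same document."""
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= k:
            return True
        if positions1[i] < positions2[j]:
            i += 1          # advance the pointer at the smaller position
        else:
            j += 1
    return False

# Hypothetical positions of two terms inside one document
print(within_k([4, 17, 40], [8, 33], 3))  # False (closest pair is 4 apart)
print(within_k([4, 17, 40], [8, 33], 4))  # True  (positions 4 and 8)
```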
Ranked retrieval
Introduce similarity and ranking of matched documents
- Score each document to say how well it matches a query
- Typically aim to get the top 10 results right (users will rarely look further)
Idea: Use vector representation for both documents and queries, and calculate their similarity
Vector representations
Represent both documents and queries as vectors in the same space
Rank (all) documents according to their proximity to the query in that space
Do this to get away from the either-you’re-in-or-out model
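A minimal cosine-similarity sketch; the term weights for the query and documents are made-up values:

```python
import math

def cosine_similarity(vec_q, vec_d):
    """Cosine of the angle between a query vector and a document vector
    (dicts mapping term -> weight); higher means closer in the space."""
    dot = sum(w * vec_d.get(term, 0.0) for term, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical term weights for a query and two documents
query = {"caesar": 1.0, "brutus": 1.0}
doc_a = {"caesar": 2.0, "brutus": 1.0, "rome": 3.0}
doc_b = {"weather": 4.0, "forecast": 2.0}

# Rank all documents by proximity to the query in the shared space
ranked = sorted([("doc_a", doc_a), ("doc_b", doc_b)],
                key=lambda pair: cosine_similarity(query, pair[1]),
                reverse=True)
print([name for name, _ in ranked])  # ['doc_a', 'doc_b']
```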
Term frequency
tf of term t in document d is the number of times t appears in document d
Relevance does not increase proportionally with term frequency
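One standard way to reflect this (an assumption here, not stated in these notes) is log-scaled tf weighting:

```python
import math

def log_tf(tf):
    """Log-scaled term frequency: dampens the effect of repeated occurrences."""
    if tf <= 0:
        return 0.0
    return 1 + math.log10(tf)

print(log_tf(1))     # 1.0
print(log_tf(10))    # 2.0
print(log_tf(1000))  # 4.0 -- 1000 occurrences is not 1000x more relevant
```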
Document frequency
The number of documents in which a term appears
Rare terms are more informative and discriminative than frequent terms for information retrieval
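The usual way to capture this is inverse document frequency (idf); the formula below is the standard textbook choice (assumed here, not given in the notes), and the index is a toy example:

```python
import math

def idf(term, index, total_docs):
    """Inverse document frequency: rare terms get higher weight.
    `index` maps term -> postings list (docIDs)."""
    df = len(index.get(term, []))   # document frequency of the term
    if df == 0:
        return 0.0
    return math.log10(total_docs / df)

# Hypothetical inverted index over a collection of 1000 documents
index = {"the": list(range(1000)), "caesar": [3, 57, 120, 998]}
print(idf("the", index, 1000))     # 0.0  -- appears everywhere, not discriminative
print(idf("caesar", index, 1000))  # ~2.4 -- rare, highly informative
```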