Week 3 Flashcards
Information retrieval
Finding unstructured documents that satisfy an information need within large collections
Technologies for document indexing, search and retrieval
- e.g. search engines
Not only Google or Bing
- specialised search
Not only English
IR is not only text
- speech question answering - Hey Siri
- music retrieval - Shazam
Query
What the user conveys to the computer in an attempt to communicate the information need
Unstructured Data
In contrast to rigidly structured DBs, there is no obvious structure that is easy for a computer to deal with
- Could be some structure
- E.g. titles, paragraphs, HTML, XML, etc…
Information need
The topic the user wants to address (search for) on the web
Navigational Need
Query to take the user to a page (~10% of queries)
- typically company/business/org name
- Domain suffix
- Length of query < 3
Transactional need
~10% of queries focus on doing something on the web (e.g. buying, downloading, booking, etc…)
Informational need
Collect/retrieve/obtain some information
Use of question words, many phrases
Neither navigational nor transactional
Query length > 2
When is a document relevant?
If the user perceives that it contains information of value with respect to their information need
Measuring document relevance
Does it satisfy the user's need?
Has the user clicked on it?
How long did they spend on it?
Have they clicked on another?
Have they reformulated the query?
Is this a new query?
Representing queries
Keywords to model information need
Representing documents
Reduce each document to a set of index terms
- ‘meaning’ of document approximated by index terms
What should be in the index?
Words? Names? Dates? Set of pre-defined tags?
Search is then focused on the index
- Map user queries to the index i.e. determine the similarity of a query to a document via index terms
Indexing process
Acquire documents through crawling, feeds, deposits, etc…
Generate (compute) an index for each document
Store the index in an easy-to-search form
Index Search (retrieval) process
- Assist the user in formulating the query
- Transform the query (“index” it)
- Map the query’s representation against the index
- Retrieve and rank the results
- Help users browse the results (and re-formulate the query)
- Log user’s actions, etc…
How to choose index terms
(Most) important words?
Using titles? Words in bold? Quotes?
What (if any) background knowledge did you use?
What about combining or structuring words?
Have you thought about possible queries?
Have you “ranked” your words?
- How to decide on importance/relevance
- more generic vs more specific words
How about stop words? (the,and,of,it,…)
Term-document frequency matrix
rows of terms, columns of documents, each cell is the frequency of a term in the document
Sparse, massive matrices
Term-document incidence matrix
rows of terms, columns of documents, each cell is 1 if the term is in the document or 0 otherwise
Sparse, massive matrices
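A minimal Python sketch of building both matrices from an invented toy collection (the documents and whitespace tokenisation are assumptions for illustration):

```python
from collections import Counter

# Toy collection (invented for illustration)
docs = {
    "d1": "to be or not to be",
    "d2": "brutus killed caesar",
    "d3": "caesar was ambitious and brutus was noble",
}

# Vocabulary: all distinct terms in the collection
vocab = sorted({term for text in docs.values() for term in text.split()})

# Term-document frequency matrix: rows = terms, columns = documents,
# each cell = how often the term occurs in that document
freq_matrix = {
    term: {doc_id: Counter(text.split())[term] for doc_id, text in docs.items()}
    for term in vocab
}

# Term-document incidence matrix: 1 if the term occurs in the document, else 0
incidence_matrix = {
    term: {doc_id: int(count > 0) for doc_id, count in row.items()}
    for term, row in freq_matrix.items()
}

print(freq_matrix["to"])           # {'d1': 2, 'd2': 0, 'd3': 0}
print(incidence_matrix["caesar"])  # {'d1': 0, 'd2': 1, 'd3': 1}
```

For real collections these matrices are far too sparse and large to store explicitly, which motivates the inverted index below.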
Inverted Index
For each term t, store a list of all the documents that contain t; it is also useful to store frequency and positions.
Sorted by term first, then by docID
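A minimal Python sketch of building such an index, storing (docID, term frequency, positions) per posting; the toy documents are invented for illustration:

```python
from collections import defaultdict

# Toy collection (invented for illustration)
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "new home sales rise in new home market",
}

# Inverted index: term -> list of (docID, term frequency, positions)
inverted_index = defaultdict(list)
for doc_id in sorted(docs):                      # postings end up sorted by docID
    positions = defaultdict(list)
    for pos, term in enumerate(docs[doc_id].split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        inverted_index[term].append((doc_id, len(pos_list), pos_list))

print(inverted_index["home"])
# [(1, 1, [1]), (2, 1, [0]), (3, 2, [1, 6])]
```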
Boolean Queries
Simplest query model: find all documents in the collection that fully match the query
- Binary outcome for each document yes/no
Use operators:
AND
OR
NOT
Can combine operators e.g. AND NOT
Many systems hide the operators (e.g. space is used as AND)
Pros:
- Simple model: everything is considered as a set of terms
- Easy/efficient to implement
- Precise (document either matches or doesn’t match)
- Widely used for commercial, legal retrieval and for specialist searches (long, precise queries)
- Works well when we know what we want: user feels in control
Cons:
- Users are not good at formulating queries: possible confusion with natural language
- Feast or famine: AND gives too few results, OR gives too many
- No relevance ordering / ranking of results
- Ignores word order, word frequency, etc…
- Documents that are ‘close’ are not retrieved
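Treating each term's postings as a set of docIDs, the operators map directly onto set operations; the postings below are invented toy values:

```python
# Hypothetical postings (sets of docIDs) for three terms
brutus    = {1, 2, 4, 11, 31, 45, 173, 174}
caesar    = {1, 2, 4, 5, 6, 16, 57, 132}
calpurnia = {2, 31, 54, 101}

print(sorted(brutus & caesar))                # AND      -> [1, 2, 4]
print(sorted(brutus | caesar))                # OR       -> all docs with either term
print(sorted((brutus & caesar) - calpurnia))  # AND NOT  -> [1, 4]
```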
Boolean Queries AND Merge
Term1 AND Term2
Find Term1 and Term2 in the dictionary
Merge the results so that you are left with the documents that occur in both
Crucial that postings are sorted by docID so this merge can be done efficiently
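A sketch of the classic two-pointer merge over sorted postings lists (the docIDs below are hypothetical):

```python
def intersect(p1, p2):
    """Merge two postings lists (sorted by docID) for Term1 AND Term2."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])    # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

# Hypothetical postings, already sorted by docID
print(intersect([1, 2, 4, 11, 31, 45, 173], [2, 31, 54, 101]))  # [2, 31]
```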
Extended boolean model
Proximity operators
Embed term positions in the inverted index (proximity index)
Queries can then refer to:
/n - within n words (e.g. /3 = within 3 words)
/s - in the same sentence, /p in the same paragraph
+s = term1 must precede term2 in the same sentence
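One way a /k check could be implemented, given the positions of two terms inside the same document from a proximity index (a sketch, not the exact course algorithm):

```python
def within_k(positions1, positions2, k):
    """Return True if some occurrence of term1 is within k words of term2.
    Both lists hold sorted word offsets of the terms in the same document."""
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= k:
            return True
        if positions1[i] < positions2[j]:
            i += 1          # advance the pointer at the smaller position
        else:
            j += 1
    return False

# Hypothetical positions of two terms inside one document
print(within_k([4, 17, 40], [8, 33], 3))  # False (closest pair is 4 apart)
print(within_k([4, 17, 40], [8, 33], 4))  # True  (positions 4 and 8)
```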
Ranked retrieval
Introduce similarity and ranking of matched documents
- Score each document to say how well it matches a query
- Typically aim to get the top 10 results right (users will rarely look further)
Idea: Use vector representation for both documents and queries, and calculate their similarity
Vector representations
Represent both documents and queries as vectors in the same space
Rank (all) documents according to their proximity to the query in that space
Do this to get away from the either-you’re-in-or-out model
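A minimal cosine-similarity sketch; the term weights for the query and documents are made-up values:

```python
import math

def cosine_similarity(vec_q, vec_d):
    """Cosine of the angle between a query vector and a document vector
    (dicts mapping term -> weight); higher means closer in the space."""
    dot = sum(w * vec_d.get(term, 0.0) for term, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical term weights for a query and two documents
query = {"caesar": 1.0, "brutus": 1.0}
doc_a = {"caesar": 2.0, "brutus": 1.0, "rome": 3.0}
doc_b = {"weather": 4.0, "forecast": 2.0}

# Rank all documents by proximity to the query in the shared space
ranked = sorted([("doc_a", doc_a), ("doc_b", doc_b)],
                key=lambda pair: cosine_similarity(query, pair[1]),
                reverse=True)
print([name for name, _ in ranked])  # ['doc_a', 'doc_b']
```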
Term frequency
tf of term t in document d is the number of times t appears in document d
Relevance does not increase proportionally with term frequency
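One standard way to reflect this (an assumption here, not stated in these notes) is log-scaled tf weighting:

```python
import math

def log_tf(tf):
    """Log-scaled term frequency: dampens the effect of repeated occurrences."""
    if tf <= 0:
        return 0.0
    return 1 + math.log10(tf)

print(log_tf(1))     # 1.0
print(log_tf(10))    # 2.0
print(log_tf(1000))  # 4.0 -- 1000 occurrences is not 1000x more relevant
```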
Document frequency
The number of documents in which a term appears
Rare terms are more informative and discriminative than frequent terms for information retrieval
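The usual way to capture this is inverse document frequency (idf); the formula below is the standard textbook choice (assumed here, not given in the notes), and the index is a toy example:

```python
import math

def idf(term, index, total_docs):
    """Inverse document frequency: rare terms get higher weight.
    `index` maps term -> postings list (docIDs)."""
    df = len(index.get(term, []))   # document frequency of the term
    if df == 0:
        return 0.0
    return math.log10(total_docs / df)

# Hypothetical inverted index over a collection of 1000 documents
index = {"the": list(range(1000)), "caesar": [3, 57, 120, 998]}
print(idf("the", index, 1000))     # 0.0  -- appears everywhere, not discriminative
print(idf("caesar", index, 1000))  # ~2.4 -- rare, highly informative
```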