Week 3 Flashcards

1
Q

Information retrieval

A

Finding unstructured documents that satisfy an information need within large collections

Technologies for document indexing, search and retrieval
- e.g. search engines

Not only Google or Bing
- specialised search

Not only English

IR is not only text
- speech question answering - Hey Siri
- music retrieval - Shazam

2
Q

Query

A

What the user conveys to the computer in an attempt to communicate the information need

3
Q

Unstructured Data

A

In contrast to rigidly structured databases, there is no obvious structure that is easy for a computer to deal with
- There could still be some structure
  - e.g. titles, paragraphs, HTML, XML, etc…

4
Q

Information need

A

The topic about which the user desires to find information (search) on the web

5
Q

Navigational Need

A

Query to take the user to a page (~10% of queries)
- Typically a company/business/organisation name
- May include a domain suffix
- Query length < 3 terms

6
Q

Transactional need

A

~10% of queries focus on doing something on the web (e.g. buying, downloading, booking, etc…)

7
Q

Informational need

A

Collect/retrieve/obtain some information
Use of question words, many phrases
Neither navigational nor transactional
Query length > 2 terms

8
Q

When is a document relevant?

A

If the user perceives that it contains information of value with respect to their information need

9
Q

Measuring document relevance

A

Does it satisfy the user’s need?

Has the user clicked on it?
How long did they spend on it?
Have they clicked on another?
Have they reformulated the query?
Is this a new query?

10
Q

Representing queries

A

Keywords to model information need

11
Q

Representing documents

A

Reduce each document to a set of index terms
- ‘meaning’ of document approximated by index terms

What should be in the index?
Words? Names? Dates? Set of pre-defined tags?

Search is then focused on the index
- Map user queries to the index i.e. determine the similarity of a query to a document via index terms

12
Q

Indexing process

A

Acquire documents through crawling, feeds, deposits, etc…

Generate (compute) an index for each document

Store the index in an easy-to-search form

13
Q

Index Search (retrieval) process

A
  • Assist the user in formulating the query
  • Transform the query (“index” it)
  • Map the query’s representation against the index
  • Retrieve and rank the results
  • Help users browse the results (and re-formulate the query)
  • Log user’s actions, etc…

14
Q

How to choose index terms

A

“(Most) important words”?
Using titles? Words in bold? Quotes?
What (if any) background knowledge did you use?
What about combining or structuring words?
Have you thought about possible queries?

Have you “ranked” your words?
- How to decide on importance/relevance?
- More generic vs more specific words

How about stop words? (the, and, of, it, …)

15
Q

Term-document frequency matrix

A

rows of terms, columns of documents, each cell is the frequency of a term in the document

Sparse, massive matrices
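
A minimal sketch (with a made-up two-document corpus) of how such a matrix can be built:

    from collections import Counter

    # Toy corpus: docID -> text (made-up example data)
    docs = {
        0: "to be or not to be",
        1: "the question is to be",
    }

    # Vocabulary = all distinct terms in the collection
    vocab = sorted({t for text in docs.values() for t in text.split()})

    # Rows are terms, columns are documents; each cell is the term's frequency in that document
    counts = {d: Counter(text.split()) for d, text in docs.items()}
    matrix = [[counts[d][term] for d in sorted(docs)] for term in vocab]

    for term, row in zip(vocab, matrix):
        print(f"{term:10s} {row}")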

16
Q

Term-document incidence matrix

A

rows of terms, columns of documents, each cell is 1 if the term is in the document or 0 otherwise

Sparse, massive matrices
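
The incidence matrix is the same construction with each cell reduced to 0/1; a tiny sketch with the same kind of made-up corpus:

    docs = {0: "to be or not to be", 1: "the question is to be"}  # toy corpus
    vocab = sorted({t for text in docs.values() for t in text.split()})

    # 1 if the term occurs in the document at all, 0 otherwise
    incidence = [[1 if term in text.split() else 0 for _, text in sorted(docs.items())]
                 for term in vocab]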

17
Q

Inverted Index

A

For each term t, store a list of all the documents that contain t; it is also useful to store frequency and positions

Sorted by term first, then by docID
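
A minimal sketch of building such an index (made-up corpus; the (docID, frequency, positions) posting layout is an assumption for illustration):

    from collections import defaultdict

    docs = {1: "new home sales top forecasts", 2: "home sales rise in july"}  # toy corpus

    # term -> list of postings (docID, frequency, [positions]), one posting per document
    index = defaultdict(list)
    for doc_id in sorted(docs):              # visit docIDs in order so each postings list stays sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))

    for term in sorted(index):               # dictionary sorted by term
        print(term, index[term])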

18
Q

Boolean Queries

A

Simplest query model: find all documents in the collection that fully match the query
- Binary outcome for each document yes/no

Use operators:
AND
OR
NOT

Can combine operators e.g. AND NOT

Many systems hide the operators (e.g. space is used as AND)

Pros:
- Simple model: everything is considered as a set of terms
- Easy/efficient to implement
- Precise (document either matches or doesn’t match)
- Widely used for commercial, legal retrieval and for specialist searches (long, precise queries)
- Works well when we know what we want: user feels in control

Cons:
- Users are often not good at formulating queries: possible confusion with natural language
- Feast or famine: AND gives too few results, OR gives too many
- No relevance ordering / ranking of results
- Ignores word order, word frequency, etc…
- Documents that are ‘close’ to the query (but do not match it exactly) are not retrieved

19
Q

Boolean Queries AND Merge

A

Term1 AND Term2
Find Term1 and Term2 in the dictionary

Intersect (merge) the two postings lists so that you are left with the documents that occur in both

It is crucial that postings are sorted by docID so this merge can be done efficiently
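
A sketch of the two-pointer AND merge, assuming postings are plain lists of docIDs sorted in increasing order:

    def intersect(p1, p2):
        """Intersect two postings lists sorted by docID (linear-time two-pointer merge)."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:          # docID occurs in both lists -> matches Term1 AND Term2
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:         # advance the pointer at the smaller docID
                i += 1
            else:
                j += 1
        return answer

    # e.g. postings for Term1 and Term2 (made-up docIDs)
    print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54, 101]))   # -> [2, 31]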

20
Q

Extended boolean model

A

Proximity operators
Embed term positions in the inverted index (proximity index)

Queries can then refer to:
/n - within n words (e.g. /3 = within 3 words)
/s - in the same sentence, /p - in the same paragraph
+s - term1 must precede term2 in the same sentence
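
A rough sketch of the idea behind a /n check, assuming we already have the positional postings of both terms in one document (a simplification, not the full positional-intersection algorithm):

    def within_n_words(positions1, positions2, n):
        """True if some occurrence of term1 and some occurrence of term2 are at most n words apart."""
        return any(abs(p1 - p2) <= n for p1 in positions1 for p2 in positions2)

    # term1 at word positions 3 and 17, term2 at positions 5 and 40 (made-up)
    print(within_n_words([3, 17], [5, 40], 3))   # True, because |3 - 5| <= 3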

21
Q

Ranked retrieval

A

Introduce similarity and ranking of matched documents
- Score each document to say how well it matches a query
- Typically aim to get the top 10 results right (users typically will not look further)

Idea: Use vector representation for both documents and queries, and calculate their similarity

22
Q

Vector representations

A

Represent both documents and queries as vectors in the same space

Rank (all) documents according to their proximity in that given space

Do this to get away from the either-you’re-in-or-out model
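
A minimal sketch of ranking by proximity in the shared space, here using cosine similarity with made-up query and document vectors:

    import math

    def cosine(u, v):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    query = [1.0, 0.0, 1.0]                                  # query vector over a 3-term vocabulary
    docs = {"d1": [0.5, 0.2, 0.9], "d2": [0.0, 1.0, 0.1]}    # document vectors in the same space

    # Rank all documents by their proximity (cosine) to the query
    ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    print(ranking)   # ['d1', 'd2']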

23
Q

Term frequency

A

tf of term t and document d is the number of times that t appears in document d

Relevance does not increase proportionally with term frequency
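
Because of this, raw counts are usually dampened; one common choice (not stated on this card, shown only as an illustration) is log-scaled term frequency:

    import math

    def log_tf(tf):
        """Log-scaled term frequency: repeated occurrences add less and less weight."""
        return 1 + math.log10(tf) if tf > 0 else 0.0

    print(log_tf(1), log_tf(10), log_tf(1000))   # 1.0  2.0  4.0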

24
Q

Document frequency

A

In how many documents a term appears

Rare terms are more informative and discriminative than frequent terms for information retrieval

25
Q

Collection frequency

A

The number of occurrences of t in the collection (counting multiple occurrences in the same document)

With equal collection frequency, the term with lower document frequency is the more informative “better” term

26
Q

Inverse document Frequency

A

idf = log(N/df) where N is the number of documents

The base of the log is immaterial; the log is used to dampen the effect
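
A worked example of the formula, assuming a hypothetical collection of N = 1,000,000 documents; the rarer the term, the higher its idf:

    import math

    N = 1_000_000                                   # number of documents in the collection
    for term, df in [("calpurnia", 1), ("animal", 100), ("under", 100_000), ("the", 1_000_000)]:
        print(term, round(math.log10(N / df), 2))   # idf = log(N/df): 6.0, 4.0, 1.0, 0.0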

27
Q

Tf-Idf

A

Term frequency × inverse document frequency

tf - the number of times a term appears in a document
idf - log(N/df), where N is the number of documents
df - the number of documents a term appears in

Increases
- With the number of occurrences within a document
- With the rarity of the term in the collection

28
Q

tf-idf document ranking for a query

A

The score of a document d for a query q is the sum of the tf-idf weights of each query term t found in d

Indexing: for each term and document, calculate tf × idf

Typical output: a list of documents ranked according to score(q, d)
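
A minimal end-to-end sketch of this scoring, using a made-up three-document corpus, raw tf, and a base-10 log for idf (all of these choices are assumptions for illustration):

    import math
    from collections import Counter

    docs = {1: "new home sales top forecasts", 2: "home sales rise", 3: "guns do not kill"}
    N = len(docs)
    tf = {d: Counter(text.split()) for d, text in docs.items()}    # tf(t, d)
    df = Counter(t for counts in tf.values() for t in counts)      # number of documents containing t

    def score(query, d):
        """score(q, d) = sum over the query terms t found in d of tf(t, d) * idf(t)."""
        return sum(tf[d][t] * math.log10(N / df[t])
                   for t in query.split() if t in tf[d])

    query = "home sales"
    ranking = sorted(docs, key=lambda d: score(query, d), reverse=True)
    print([(d, round(score(query, d), 3)) for d in ranking])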

29
Q

What’s wrong with VSMs for information retrieval

A

Vector space models work reasonably well, but:
- Based on bag of words, so ignore context
- Heuristic in nature
  - No notion of relevance, only similarity
  - Weights seem to work, but this is mostly engineering (rather than theory)
- Don’t adapt to the user or collection
  - Term weights should be user and domain specific
  - How to “train” on a particular collection?

Does not model uncertainty

30
Q

PageRank

A

See Page Rank slides

31
Q

Adapt ranking to user clicks?

A

Higher positions receive more user attention (eye fixation) and clicks than lower positions

This is true even when the order of the results is reversed

Clicks are therefore “informative but biased”

32
Q

Document relevance feedback

A

User feeds back on relevance of docs in an initial set of results
- User issues a (short, simple) query
- The user marks some results as relevant or non-relevant
- The system computes a better representation of the information need based on the feedback
- Relevance feedback can go through one or more iterations

Idea: it may be difficult to formulate a good query when you don’t know the collection well, so iterate

We can modify the query based on relevance feedback and apply a standard vector space model

Relevance feedback can improve precision and recall.

It is most useful for improving recall in situations where recall is important (users can be expected to review results and to take time to iterate)

Problems:
- Users are often reluctant to give explicit feedback
- Often hard to understand why a particular document was retrieved after applying relevance feedback

Long queries are inefficient for a typical IR engine
- Long response times for the user
- High cost for the retrieval system
- Partial solution:
  - Only reweight certain prominent terms
    - Perhaps the top 20 by term frequency
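
One standard way to modify the query vector from the relevance judgements is a Rocchio-style update (Rocchio is not named on this card, so treat this sketch as illustrative): move the query towards the centroid of the documents marked relevant and away from the centroid of those marked non-relevant:

    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Rocchio-style update of a query vector from user relevance judgements."""
        def centroid(vectors):
            if not vectors:
                return [0.0] * len(query)
            return [sum(col) / len(vectors) for col in zip(*vectors)]
        rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
        # Negative weights are usually clipped to zero
        return [max(0.0, alpha * q + beta * r - gamma * n)
                for q, r, n in zip(query, rel_c, nonrel_c)]

    new_query = rocchio([1.0, 0.0, 0.5],               # original query vector
                        relevant=[[0.8, 0.4, 0.0]],    # documents the user marked relevant
                        nonrelevant=[[0.0, 0.9, 0.9]]) # documents the user marked non-relevant
    print(new_query)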

33
Q

Query Expansion

A

In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search terms) on words or phrases

Add more words of the same (or more specific) meaning to your query for better retrieval

Could use a thesaurus, an automatically derived thesaurus, or refinements based on query log mining (common on the web)

34
Q

Thesaurus based query expansion

A

For each term t in a query, expand the query with synonyms and related words of t from a thesaurus

feline -> feline cat

May weight added terms less than original query terms

Generally increases recall but may significantly decrease precision, particularly with ambiguous terms.
“interest rate” -> “interest rate fascinate evaluate”
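
A minimal sketch of thesaurus-based expansion, using a small made-up in-memory thesaurus and down-weighting the added terms:

    # Made-up thesaurus mapping terms to synonyms / related words
    thesaurus = {"feline": ["cat"], "interest": ["fascinate"], "rate": ["evaluate"]}

    def expand(query, weight_original=1.0, weight_added=0.5):
        """Expand each query term with thesaurus entries, weighting added terms less."""
        weighted = {}
        for term in query.split():
            weighted[term] = weight_original
            for synonym in thesaurus.get(term, []):
                weighted.setdefault(synonym, weight_added)
        return weighted

    print(expand("feline"))          # {'feline': 1.0, 'cat': 0.5}
    print(expand("interest rate"))   # also adds 'fascinate' and 'evaluate', which can hurt precision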