W1 Intro Flashcards

1
Q

What’s Information Retrieval?

A

Information retrieval (IR) is finding material (usually documents) of
an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).

Key words:
* Collection of unstructured documents
* User
* Information need
* Relevance
* Query

2
Q

Generally, what does an IR system do?

A
  1. Analyze queries & docs
  2. Retrieve docs: compute a relevance score for each (query, doc) pair
  3. Rank docs by score
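The three steps above can be sketched as a toy pipeline (the function names and the overlap-count scorer are illustrative, not a real IR system):

```python
# Toy IR pipeline: analyze, score, rank (illustrative only).

def analyze(text):
    """Step 1: split text into lowercase terms."""
    return text.lower().split()

def score(query_terms, doc_terms):
    """Step 2: a crude relevance score -- count of shared terms."""
    return len(set(query_terms) & set(doc_terms))

def search(query, docs):
    """Step 3: rank docs by descending score, dropping non-matches."""
    q = analyze(query)
    scored = [(score(q, analyze(d)), d) for d in docs]
    return [d for s, d in sorted(scored, key=lambda x: -x[0]) if s > 0]

docs = ["cats chase mice", "dogs chase cats", "birds sing"]
print(search("cats and dogs", docs))  # ['dogs chase cats', 'cats chase mice']
```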
3
Q

How do search engines and recommender systems compare?

A

Search Engines:
* User has information need
* User types query
* System returns docs ranked by relevance to the query (based on match/popularity)

Recommender System:
* User has an interest and visits a certain platform
* System returns items relevant to the user (based on profile/popularity)

4
Q

4 principles of IR

A

Principles of…

  1. Relevance
  2. Ranking
  3. Text Processing
  4. User Interaction
5
Q

What’s Two-stage Retrieval?

A

Query -> initial retrieval from the full collection -> re-ranking of the top-n results with a trained re-ranker

1st stage:
retrieves from the large collection; unsupervised, term-based (sparse)
priority is RECALL

2nd stage:
re-ranks the top-n docs from the 1st stage; supervised, embedding-based (dense)
priority is PRECISION
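A minimal sketch of the two stages, assuming a term-overlap first stage and a stand-in second-stage scorer (in a real system the re-ranker is a trained, embedding-based model):

```python
# Two-stage retrieval sketch: cheap first stage for recall,
# costlier re-ranker applied only to the top-n for precision.

def first_stage(query, docs, n=3):
    """Stage 1: sparse, term-based -- keep any doc sharing a term."""
    q = set(query.lower().split())
    hits = [(len(q & set(d.lower().split())), d) for d in docs]
    hits = [h for h in hits if h[0] > 0]
    return [d for _, d in sorted(hits, key=lambda x: -x[0])[:n]]

def rerank(query, candidates, scorer):
    """Stage 2: apply a (here: stand-in) trained scorer to top-n only."""
    return sorted(candidates, key=lambda d: -scorer(query, d))

# Stand-in for a learned dense scorer: prefers shorter, focused docs.
toy_scorer = lambda q, d: -len(d)

docs = ["bm25 term weighting", "dense embeddings for ranking",
        "cooking pasta", "ranking"]
cands = first_stage("ranking documents", docs)
print(rerank("ranking documents", cands, toy_scorer))
# ['ranking', 'dense embeddings for ranking']
```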

6
Q

1) What’s the most used term-based retrieval model?
2) What mechanism does it use?
3) What’s the problem of the model?

A

1) BM25
2) exact match, term weighting (tf-idf)
3) BM25 relies on exact matching; searchers sometimes use different terms to describe their information needs than the authors of the relevant documents used (vocabulary mismatch)
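A minimal BM25 sketch (k1 and b are its usual free hyperparameters; the whitespace tokenization is deliberately crude):

```python
import math

# Minimal BM25: idf-weighted, length-normalized term frequency.
def bm25(query, docs, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(t) for t in toks) / N

    def idf(term):
        df = sum(1 for t in toks if term in t)
        return math.log((N - df + 0.5) / (df + 0.5) + 1)

    scores = []
    for t in toks:
        s = 0.0
        for term in query.lower().split():
            tf = t.count(term)
            s += idf(term) * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["the cat sat", "the cat and the cat sat", "dogs bark"]
print(bm25("cat", docs))
# the doc repeating "cat" scores highest; "dogs bark" scores 0
```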

7
Q

How can the exact term matching problem be solved?

A

Use semantic matching

Semantic matching models are embedding-based (low-dimensional, dense vector representations)
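The core operation of embedding-based matching is vector similarity, typically cosine. A sketch with made-up toy vectors (real systems use learned embeddings):

```python
import math

# Cosine similarity between dense vectors -- the core operation of
# embedding-based (semantic) matching. The vectors below are made up;
# real systems use learned embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query_vec = [0.9, 0.1, 0.0]          # e.g. "car"
doc_vecs = {
    "automobile": [0.8, 0.2, 0.1],   # different word, close meaning
    "banana":     [0.0, 0.1, 0.9],   # unrelated
}
for name, vec in doc_vecs.items():
    print(name, round(cosine(query_vec, vec), 3))
# "automobile" scores far higher despite zero term overlap with "car"
```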

8
Q

Relevance factors: how do you measure relevance?

A
  1. Term overlap: return docs containing the query terms
  2. Doc importance: PageRank
  3. Result popularity: clicks by other users
  4. Diversification: different types of results
  5. Semantic similarity: docs with semantic representation close to the query
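Factor 2 (doc importance) can be illustrated with a tiny PageRank power iteration; the link graph is made up, and d=0.85 is the conventional damping factor:

```python
# Tiny PageRank power iteration (doc importance).
# links maps each page to the pages it links out to.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum the rank flowing in from pages that link to p.
            inbound = sum(rank[q] / len(links[q])
                          for q in pages if p in links[q])
            new[p] = (1 - d) / len(pages) + d * inbound
        rank = new
    return rank

links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # "a": the most-linked page ranks highest
```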
9
Q

How do index time and query time compare?

A

Index Time:
* Collect new docs
* Pre-process docs
* Create doc representations
* Store docs in the index
* Indexing can take time

Query Time:
* Process user query
* Match query to index
* Retrieve potentially relevant docs
* Rank docs by relevance score
* Needs to be real-time
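The split can be sketched with an inverted index: the expensive build happens at index time, and query time is a cheap lookup (illustrative only):

```python
from collections import defaultdict

# Index time: build an inverted index (term -> set of doc ids).
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Query time: look terms up in the index instead of scanning all docs.
def query_index(index, query):
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.union(*postings) if postings else set()

docs = ["term based retrieval", "dense retrieval", "web search"]
idx = build_index(docs)
print(sorted(query_index(idx, "retrieval")))  # [0, 1]
```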

10
Q

How are documents processed in a term-based retrieval model?

A
  • Split doc & query into terms
  • Term normalization:
    lowercase, remove diacritics, remove stop words, stem/lemmatize, map similar words together

All word forms that have a place in the index are called terms.
All terms together form the vocabulary of the index.
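A sketch of such a normalization pipeline, with a tiny illustrative stop-word list and a naive plural-stripping stand-in for a real stemmer/lemmatizer:

```python
import unicodedata

STOP_WORDS = {"the", "a", "an", "of", "and"}  # tiny illustrative list

def normalize(text):
    """Lowercase, strip diacritics, drop stop words, crude stemming."""
    # Lowercase, then decompose and drop combining marks ("café" -> "cafe").
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    terms = []
    for term in text.split():
        if term in STOP_WORDS:
            continue
        # Naive stand-in for a real stemmer: strip a plural -s.
        terms.append(term[:-1] if term.endswith("s") and len(term) > 3 else term)
    return terms

print(normalize("The Cafés of Paris"))  # ['cafe', 'pari']
# Note how crude stemming over-truncates "paris" -- real stemmers are smarter.
```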

11
Q

What’s the role of users in IR?
What are possible challenges when users are involved?

A
  • User has information need
  • User defines what’s relevant
  • User interacts with the search engine

Challenges:
* The real user need is hidden
* User queries are short & ambiguous
* Natural language is unstructured, noisy, sparse, multilingual

12
Q

Fill in the blanks:

The most widely used application of ranking is web search. In web search, a user enters a ____(1)____ and the search engine returns a list of results. The results are retrieved from an ____(2)____. The result page of the search engine shows a list of short descriptions of the documents. These short descriptions are called ____(3)____.

A

The most widely used application of ranking is web search. In web search, a user enters a (1) query and
the search engine returns a list of results. The results are retrieved from an (2) index. The result page of
the search engine shows a list of short descriptions of the documents. These short descriptions are called
(3) snippets.

13
Q

Fill in the blanks:
The results are ranked by their ____(4)____ as estimated by the search engine. An important part of this estimation is exact term matching. The most successful and most used scoring function for exact term matching is ____(5)____. It is still commonly used as initial ranking model, both in commercial and academic contexts.

A

The results are ranked by their (4) relevance as estimated by the search engine. An important part of this
estimation is exact term matching. The most successful and most used scoring function for exact term
matching is (5) BM25. It is still commonly used as initial ranking model, both in commercial and academic
contexts.

14
Q

Fill in the blanks:
One shortcoming of exact matching is the ____(6)____ mismatch problem, the problem that a relevant document uses different words than the query and is not retrieved. For that reason, exact matching models are commonly combined with “soft” matching or ____(7)____ matching models. These models are based on dense, continuous vector representations called ____(8)____.

A

One shortcoming of exact matching is the (6) vocabulary mismatch problem, the problem that a relevant
document uses different words than the query and is not retrieved. For that reason, exact matching
models are commonly combined with “soft” matching or (7) semantic matching models. These models are
based on dense, continuous vector representations called (8) embeddings.

15
Q

Fill in the blanks:

The common architecture for ranking is a two-stage approach: in the first stage, an exact matching model is used to retrieve an initial set of items, and in the second stage these items are re-ranked with a less efficient but more effective ranker. While first-stage rankers are generally ____(9)____ (with only a few free hyperparameters to tune), the second stage rankers are ____(10)____, trained on data with relevance labels.

A

The common architecture for ranking is a two-stage approach: in the first stage, an exact matching model
is used to retrieve an initial set of items, and in the second stage these items are re-ranked with a less
efficient but more effective ranker. While first-stage rankers are generally (9) unsupervised (with only a
few free hyperparameters to tune), the second stage rankers are (10) supervised, trained on data with
relevance labels.
