C1 Flashcards
information retrieval
finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
how do IR systems support the search process?
- analyzing queries and documents
- retrieving documents by computing a relevance score for each document given a query
- ranking the documents by that relevance score (true relevance is typically binary); see the sketch below
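A minimal sketch of score-then-rank; the toy collection and the simple term-overlap scoring function are illustrative assumptions, not a specific system:

```python
# Toy collection and term-overlap scoring; both are illustrative assumptions.
def score(query_terms, doc_terms):
    """Relevance score = number of query terms that also appear in the document."""
    return len(set(query_terms) & set(doc_terms))

collection = {
    "d1": "cheap flights to rome",
    "d2": "rome travel guide",
    "d3": "python flight simulator",
}

query = "flights to rome".split()
ranking = sorted(
    collection,
    key=lambda doc_id: score(query, collection[doc_id].split()),
    reverse=True,
)
print(ranking)  # document ids sorted by descending relevance score
```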
how does a search engine work? (basic)
- user has an information need
- user types a query
- system returns items sorted by relevance to that query, based on how well they match the query and on their popularity
how do recommender systems work? (basic)
- user has an interest
- user goes to a service/app/website
- system provides items that are relevant to the user based on user history/profile and popularity
4 principles of IR
principles of
- relevance
- ranking
- text processing
- user interaction
principles of relevance
- term overlap: return documents that contain the query terms (or related terms)
- document importance, e.g. computed with PageRank (sketch below)
- result popularity: clicks by previous users on the document
- diversification: different types of results for different interpretations of the query
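Document importance via PageRank can be sketched with a small power-iteration loop; the tiny link graph and the damping factor below are illustrative assumptions:

```python
# Tiny link graph and damping factor are illustrative assumptions.
links = {            # document -> documents it links to
    "d1": ["d2", "d3"],
    "d2": ["d3"],
    "d3": ["d1"],
}
damping = 0.85
rank = {d: 1 / len(links) for d in links}   # start from a uniform distribution

for _ in range(50):                         # power iteration
    new_rank = {d: (1 - damping) / len(links) for d in links}
    for source, targets in links.items():
        share = rank[source] / len(targets)
        for target in targets:
            new_rank[target] += damping * share
    rank = new_rank

print({d: round(r, 3) for d, r in rank.items()})  # higher = more important
```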
evaluation of relevance
needed:
- a set of queries
- a document collection
- relevance assessment: set of documents for each query that are labelled as relevant or non-relevant
the retrieval system returns a ranking of all documents, which is then scored against these assessments (sketch below)
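A hedged sketch of how relevance assessments can be used to score a returned ranking, here with precision at rank k; the query, the ranking, and the judged documents are made up for illustration:

```python
# Query id, ranking, and judged relevant documents are made-up examples.
def precision_at_k(ranking, relevant_docs, k):
    """Fraction of the top-k returned documents that were assessed as relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant_docs) / k

assessments = {"q1": {"d2", "d7"}}                  # judged relevant docs per query
system_ranking = {"q1": ["d2", "d5", "d7", "d1"]}   # ranking returned by the system

print(precision_at_k(system_ranking["q1"], assessments["q1"], k=3))  # 0.67
```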
principles of ranking
- estimate relevance for each document
- we need a numeric score to sort the documents by
- term weighting (basic notion; sketch below)
- PageRank
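One common way to turn term weighting into a score is TF-IDF; the sketch below uses a toy collection and a smoothed IDF variant, both of which are assumptions rather than the card's definition:

```python
import math

# Toy collection; TF-IDF with a smoothed IDF is one possible weighting scheme.
docs = {
    "d1": "information retrieval finds relevant documents".split(),
    "d2": "documents are ranked by a relevance score".split(),
    "d3": "page rank measures document importance".split(),
}

def idf(term):
    df = sum(1 for terms in docs.values() if term in terms)  # document frequency
    return math.log(len(docs) / (1 + df)) + 1                # smoothed inverse df

def tfidf_score(query_terms, doc_terms):
    return sum(doc_terms.count(t) * idf(t) for t in query_terms)

query = "relevance score".split()
for doc_id, terms in docs.items():
    print(doc_id, round(tfidf_score(query, terms), 3))
```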
machine learning for ranking
Idea: learn the relevance of a document based on human-labelled training data (relevance assessments)
Why is machine learning for ranking different from machine learning for classification?
Relevance depends on the query, so we cannot train a global classifier over all relevant and irrelevant documents in a labelled dataset
=> we need a machine learning paradigm that includes the query (sketch below)
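A minimal pointwise learning-to-rank sketch: each training example is a (query, document) pair described by query-dependent features, so the query is part of the learning problem. It assumes scikit-learn is installed; the feature values and labels are invented:

```python
# Assumes scikit-learn is installed; feature values and labels are invented.
from sklearn.linear_model import LogisticRegression

# Each training example is a (query, document) pair described by query-dependent
# features, e.g. [term-overlap count, BM25 score, PageRank of the document].
X = [
    [3, 12.4, 0.7],  # query q1, document judged relevant
    [0,  1.1, 0.9],  # query q1, document judged non-relevant
    [2,  8.3, 0.2],  # query q2, document judged relevant
    [1,  2.0, 0.6],  # query q2, document judged non-relevant
]
y = [1, 0, 1, 0]     # relevance assessments

model = LogisticRegression().fit(X, y)

# At query time: compute the same features for new (query, document) pairs and
# sort the documents by the predicted probability of relevance.
candidates = [[2, 9.0, 0.5], [0, 0.5, 0.8]]
print(model.predict_proba(candidates)[:, 1])
```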
two-stage retrieval: first stage
- from large collection
- unsupervised
- often term-based (sparse)
- priority: recall
two-stage retrieval: second stage
- ranking top-n documents from first stage
- supervised
- often based on embeddings (dense)
- priority: precision at high ranks (sketch of both stages below)
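A sketch of both stages under toy assumptions: stage one is a crude term filter tuned for recall, stage two re-ranks the candidates by a dense similarity. The embed() function is only a stand-in for a real neural encoder:

```python
import math
from collections import Counter

# Toy collection; embed() is only a stand-in for a real neural text encoder.
collection = {
    "d1": "cheap flights to rome",
    "d2": "rome city travel guide",
    "d3": "python flight simulator",
    "d4": "hotels and flights in rome",
}

def first_stage(query, n=3):
    """Recall-oriented stage: keep up to n docs sharing at least one query term."""
    q = set(query.split())
    scored = [(len(q & set(text.split())), d) for d, text in collection.items()]
    return [d for overlap, d in sorted(scored, reverse=True) if overlap > 0][:n]

def embed(text):
    """Toy 'dense' representation: L2-normalized character-bigram counts."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()}

def cosine(a, b):
    return sum(a[k] * b.get(k, 0.0) for k in a)

def second_stage(query, candidates):
    """Precision-oriented stage: re-rank candidates in the embedding space."""
    q = embed(query)
    return sorted(candidates, key=lambda d: cosine(q, embed(collection[d])), reverse=True)

query = "flights to rome"
print(second_stage(query, first_stage(query)))
```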
2 different relevance models for similarity
- term overlap: find documents that contain query words
- semantic similarity: find documents whose semantic representation is close to the query's
index time
- collect (new) documents
- pre-process the documents
- create the document representations
- store the documents in the index
- indexing can take time (sketch below)
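A minimal index-time sketch: pre-process each document and build an inverted index from terms to document ids. The documents and the very light normalization are illustrative:

```python
from collections import defaultdict

# Document ids, texts, and the (very light) normalization are illustrative.
def preprocess(text):
    return text.lower().split()   # lowercase + whitespace tokenization

documents = {
    "d1": "Cheap flights to Rome",
    "d2": "Rome travel guide",
}

inverted_index = defaultdict(set)         # term -> ids of documents containing it
for doc_id, text in documents.items():
    for term in preprocess(text):
        inverted_index[term].add(doc_id)

print(dict(inverted_index))  # {'cheap': {'d1'}, 'flights': {'d1'}, 'to': {'d1'}, ...}
```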
query time
- process the user query
- match the query to the index
- retrieve the documents that are potentially relevant
- rank the documents by relevance score
- retrieval must not take long (< 1 sec); sketch below
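A matching query-time sketch against a small hard-coded inverted index (the index contents are illustrative): look up the query terms, collect candidate documents, and rank them by the number of matching terms:

```python
# Hard-coded index contents are illustrative (cf. the index-time sketch above).
inverted_index = {
    "cheap":   {"d1"},
    "flights": {"d1"},
    "rome":    {"d1", "d2"},
    "travel":  {"d2"},
    "guide":   {"d2"},
}

query_terms = "flights to Rome".lower().split()

# Retrieve every document that contains at least one query term ...
candidates = set().union(*(inverted_index.get(t, set()) for t in query_terms))

# ... then rank the candidates by the number of query terms they contain.
ranked = sorted(
    candidates,
    key=lambda d: sum(d in inverted_index.get(t, set()) for t in query_terms),
    reverse=True,
)
print(ranked)  # ['d1', 'd2']
```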
term-based retrieval models
- split documents and queries into terms
- normalize terms to a common form (lowercase, remove diacritics and stop words, stemming/lemmatization, map similar terms together)
- the most widely used model is BM25 (scoring sketch below)
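BM25 sums, over the query terms, an IDF weight multiplied by a saturated, length-normalized term frequency. A compact sketch with the standard k1 and b parameters follows; the toy collection is an assumption:

```python
import math

# Toy collection; k1 and b are the usual BM25 parameters.
docs = {
    "d1": "cheap flights to rome rome".split(),
    "d2": "rome travel guide".split(),
    "d3": "python flight simulator".split(),
}
N = len(docs)
avgdl = sum(len(terms) for terms in docs.values()) / N   # average document length

def bm25(query_terms, doc_terms, k1=1.5, b=0.75):
    score = 0.0
    for t in query_terms:
        df = sum(1 for terms in docs.values() if t in terms)   # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(t)                                # term frequency
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

query = "flights to rome".split()
for doc_id, terms in docs.items():
    print(doc_id, round(bm25(query, terms), 3))
```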
problem with term-based retrieval
a vocabulary mismatch between query and document
solution: semantic matching, based on embedding representations of texts (sketch below)
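A sketch of semantic matching for a vocabulary-mismatch case; it assumes the sentence-transformers package and the all-MiniLM-L6-v2 model are available, and the query and document are made up:

```python
# Assumes the sentence-transformers package and this particular model are available.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "cheap automobile"
document = "affordable second-hand cars for sale"   # shares no term with the query

# Term overlap fails here, but the embeddings of query and document are close.
q_emb, d_emb = model.encode([query, document])
print(util.cos_sim(q_emb, d_emb))   # relatively high cosine similarity
```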
the role of users in IR
- user has an information need
- user defines what is relevant
- user interacts with search engine
challenges with user interaction
- the real user is hidden from the system
- user queries are short and ambiguous (what does the query refer to? what should be the mode of the answer? what type of information is the user interested in?)
- natural text is unstructured, noisy, redundant, infinite, sparse and multilingual