C1 Flashcards
information retrieval
finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
how do IR systems support the search process?
- analyzing queries and documents
- retrieving documents by computing a relevance score for each document given a query
- ranking the documents by that relevance score (true relevance is typically binary); see the sketch below
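A minimal sketch of score-then-rank; the toy collection and the simple term-overlap scoring function are illustrative assumptions, not a specific system:

```python
# Toy collection and term-overlap scoring; both are illustrative assumptions.
def score(query_terms, doc_terms):
    """Relevance score = number of query terms that also appear in the document."""
    return len(set(query_terms) & set(doc_terms))

collection = {
    "d1": "cheap flights to rome",
    "d2": "rome travel guide",
    "d3": "python flight simulator",
}

query = "flights to rome".split()
ranking = sorted(
    collection,
    key=lambda doc_id: score(query, collection[doc_id].split()),
    reverse=True,
)
print(ranking)  # document ids sorted by descending relevance score
```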
how does a search engine work? (basic)
- user has an information need
- user types a query
- system returns items sorted by relevance to that query, based on how well they match the query and on their popularity
how do recommender systems work? (basic)
- user has an interest
- user goes to a service/app/website
- system provides items that are relevant to the user based on user history/profile and popularity
4 principles of IR
principles of
- relevance
- ranking
- text processing
- user interaction
principles of relevance
- term overlap: return documents that contain the query terms (or related terms)
- document importance, e.g. computed with PageRank (sketch below)
- result popularity: clicks by previous users on the document
- diversification: different types of results for different interpretations of the query
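Document importance via PageRank can be sketched with a small power-iteration loop; the tiny link graph and the damping factor below are illustrative assumptions:

```python
# Tiny link graph and damping factor are illustrative assumptions.
links = {            # document -> documents it links to
    "d1": ["d2", "d3"],
    "d2": ["d3"],
    "d3": ["d1"],
}
damping = 0.85
rank = {d: 1 / len(links) for d in links}   # start from a uniform distribution

for _ in range(50):                         # power iteration
    new_rank = {d: (1 - damping) / len(links) for d in links}
    for source, targets in links.items():
        share = rank[source] / len(targets)
        for target in targets:
            new_rank[target] += damping * share
    rank = new_rank

print({d: round(r, 3) for d, r in rank.items()})  # higher = more important
```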
evaluation of relevance
needed:
- a set of queries
- a document collection
- relevance assessment: set of documents for each query that are labelled as relevant or non-relevant
the retrieval system returns a ranking of all documents, which is then scored against these assessments (sketch below)
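A hedged sketch of how relevance assessments can be used to score a returned ranking, here with precision at rank k; the query, the ranking, and the judged documents are made up for illustration:

```python
# Query id, ranking, and judged relevant documents are made-up examples.
def precision_at_k(ranking, relevant_docs, k):
    """Fraction of the top-k returned documents that were assessed as relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant_docs) / k

assessments = {"q1": {"d2", "d7"}}                  # judged relevant docs per query
system_ranking = {"q1": ["d2", "d5", "d7", "d1"]}   # ranking returned by the system

print(precision_at_k(system_ranking["q1"], assessments["q1"], k=3))  # 0.67
```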
principles of ranking
- estimate relevance for each document
- we need a numeric score to sort the documents by
- term weighting (basic notion; sketch below)
- PageRank
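One common way to turn term weighting into a score is TF-IDF; the sketch below uses a toy collection and a smoothed IDF variant, both of which are assumptions rather than the card's definition:

```python
import math

# Toy collection; TF-IDF with a smoothed IDF is one possible weighting scheme.
docs = {
    "d1": "information retrieval finds relevant documents".split(),
    "d2": "documents are ranked by a relevance score".split(),
    "d3": "page rank measures document importance".split(),
}

def idf(term):
    df = sum(1 for terms in docs.values() if term in terms)  # document frequency
    return math.log(len(docs) / (1 + df)) + 1                # smoothed inverse df

def tfidf_score(query_terms, doc_terms):
    return sum(doc_terms.count(t) * idf(t) for t in query_terms)

query = "relevance score".split()
for doc_id, terms in docs.items():
    print(doc_id, round(tfidf_score(query, terms), 3))
```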
machine learning for ranking
Idea: learn the relevance of a document based on human-labelled training data (relevance assessments)
Why is machine learning for ranking different from machine learning for classification?
Relevance depends on the query, so we cannot train a global classifier over all relevant and irrelevant documents in a labelled dataset
=> we need a machine learning paradigm that includes the query (sketch below)
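A minimal pointwise learning-to-rank sketch: each training example is a (query, document) pair described by query-dependent features, so the query is part of the learning problem. It assumes scikit-learn is installed; the feature values and labels are invented:

```python
# Assumes scikit-learn is installed; feature values and labels are invented.
from sklearn.linear_model import LogisticRegression

# Each training example is a (query, document) pair described by query-dependent
# features, e.g. [term-overlap count, BM25 score, PageRank of the document].
X = [
    [3, 12.4, 0.7],  # query q1, document judged relevant
    [0,  1.1, 0.9],  # query q1, document judged non-relevant
    [2,  8.3, 0.2],  # query q2, document judged relevant
    [1,  2.0, 0.6],  # query q2, document judged non-relevant
]
y = [1, 0, 1, 0]     # relevance assessments

model = LogisticRegression().fit(X, y)

# At query time: compute the same features for new (query, document) pairs and
# sort the documents by the predicted probability of relevance.
candidates = [[2, 9.0, 0.5], [0, 0.5, 0.8]]
print(model.predict_proba(candidates)[:, 1])
```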
two-stage retrieval: first stage
- from large collection
- unsupervised
- often term-based (sparse)
- priority: recall
two-stage retrieval: second stage
- ranking top-n documents from first stage
- supervised
- often based on embeddings (dense)
- priority: precision at high ranks (sketch of both stages below)
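A sketch of both stages under toy assumptions: stage one is a crude term filter tuned for recall, stage two re-ranks the candidates by a dense similarity. The embed() function is only a stand-in for a real neural encoder:

```python
import math
from collections import Counter

# Toy collection; embed() is only a stand-in for a real neural text encoder.
collection = {
    "d1": "cheap flights to rome",
    "d2": "rome city travel guide",
    "d3": "python flight simulator",
    "d4": "hotels and flights in rome",
}

def first_stage(query, n=3):
    """Recall-oriented stage: keep up to n docs sharing at least one query term."""
    q = set(query.split())
    scored = [(len(q & set(text.split())), d) for d, text in collection.items()]
    return [d for overlap, d in sorted(scored, reverse=True) if overlap > 0][:n]

def embed(text):
    """Toy 'dense' representation: L2-normalized character-bigram counts."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()}

def cosine(a, b):
    return sum(a[k] * b.get(k, 0.0) for k in a)

def second_stage(query, candidates):
    """Precision-oriented stage: re-rank candidates in the embedding space."""
    q = embed(query)
    return sorted(candidates, key=lambda d: cosine(q, embed(collection[d])), reverse=True)

query = "flights to rome"
print(second_stage(query, first_stage(query)))
```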
2 different relevance models for similarity
- term overlap: find documents that contain query words
- semantic similarity: find documents whose semantic representation is close to the query's
index time
- collect (new) documents
- pre-process the documents
- create the document representations
- store the documents in the index
- indexing can take time (sketch below)
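A minimal index-time sketch: pre-process each document and build an inverted index from terms to document ids. The documents and the very light normalization are illustrative:

```python
from collections import defaultdict

# Document ids, texts, and the (very light) normalization are illustrative.
def preprocess(text):
    return text.lower().split()   # lowercase + whitespace tokenization

documents = {
    "d1": "Cheap flights to Rome",
    "d2": "Rome travel guide",
}

inverted_index = defaultdict(set)         # term -> ids of documents containing it
for doc_id, text in documents.items():
    for term in preprocess(text):
        inverted_index[term].add(doc_id)

print(dict(inverted_index))  # {'cheap': {'d1'}, 'flights': {'d1'}, 'to': {'d1'}, ...}
```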
query time
- process the user query
- match the query to the index
- retrieve the documents that are potentially relevant
- rank the documents by relevance score
- retrieval must not take long (< 1 sec); sketch below
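A matching query-time sketch against a small hard-coded inverted index (the index contents are illustrative): look up the query terms, collect candidate documents, and rank them by the number of matching terms:

```python
# Hard-coded index contents are illustrative (cf. the index-time sketch above).
inverted_index = {
    "cheap":   {"d1"},
    "flights": {"d1"},
    "rome":    {"d1", "d2"},
    "travel":  {"d2"},
    "guide":   {"d2"},
}

query_terms = "flights to Rome".lower().split()

# Retrieve every document that contains at least one query term ...
candidates = set().union(*(inverted_index.get(t, set()) for t in query_terms))

# ... then rank the candidates by the number of query terms they contain.
ranked = sorted(
    candidates,
    key=lambda d: sum(d in inverted_index.get(t, set()) for t in query_terms),
    reverse=True,
)
print(ranked)  # ['d1', 'd2']
```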
term-based retrieval models
- split documents and queries into terms
- normalize terms to a common form (lowercase, remove diacritics and stop words, stemming/lemmatization, map similar terms together)
- the most widely used model is BM25 (scoring sketch below)
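BM25 sums, over the query terms, an IDF weight multiplied by a saturated, length-normalized term frequency. A compact sketch with the standard k1 and b parameters follows; the toy collection is an assumption:

```python
import math

# Toy collection; k1 and b are the usual BM25 parameters.
docs = {
    "d1": "cheap flights to rome rome".split(),
    "d2": "rome travel guide".split(),
    "d3": "python flight simulator".split(),
}
N = len(docs)
avgdl = sum(len(terms) for terms in docs.values()) / N   # average document length

def bm25(query_terms, doc_terms, k1=1.5, b=0.75):
    score = 0.0
    for t in query_terms:
        df = sum(1 for terms in docs.values() if t in terms)   # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(t)                                # term frequency
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

query = "flights to rome".split()
for doc_id, terms in docs.items():
    print(doc_id, round(bm25(query, terms), 3))
```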
problem with term-based retrieval
a vocabulary mismatch between query and document
solution: semantic matching, based on embedding representations of texts (sketch below)
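A sketch of semantic matching for a vocabulary-mismatch case; it assumes the sentence-transformers package and the all-MiniLM-L6-v2 model are available, and the query and document are made up:

```python
# Assumes the sentence-transformers package and this particular model are available.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "cheap automobile"
document = "affordable second-hand cars for sale"   # shares no term with the query

# Term overlap fails here, but the embeddings of query and document are close.
q_emb, d_emb = model.encode([query, document])
print(util.cos_sim(q_emb, d_emb))   # relatively high cosine similarity
```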
the role of users in IR
- user has an information need
- user defines what is relevant
- user interacts with search engine
challenges with user interaction
- the real user is hidden from the system
- user queries are short and ambiguous (what does the query refer to? what should be the mode of the answer? what type of information is the user interested in?)
- natural text is unstructured, noisy, redundant, infinite, sparse and multilingual