Information retrieval Flashcards
What is the task of IR systems
Finding results that are similar to the query
What is the difference between searching in IR vs. a database
A database query always gives you exact matches, whereas an IR system retrieves documents that are similar to the query
Different retrieval techniques, depending on the search query
Non-textual objects = meta description
Content - Bag of words
Semantic tagging = what is the meaning of the piece of text
Link analysis - indicates importance of document by incoming links (authority)
Which types of retrieval models are there
Boolean and vector space
Explain the boolean retrieval model
Each document is a bag of words. The user designs a Boolean query that tells the search system in more detail what to search for and how. The query contains Boolean operators: AND, OR, NOT. The model can only filter documents, not sort them
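A minimal sketch of Boolean filtering, assuming a tiny made-up collection and using Python sets for the AND, OR and NOT operators. The document ids, texts and the helper name `postings` are illustrative assumptions, not part of the flashcards.

```python
# Toy collection; document ids and texts are made up for illustration.
docs = {
    1: "cats chase mice",
    2: "dogs chase cats",
    3: "mice eat cheese",
}

def postings(term):
    """Return the set of document ids whose text contains the term."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

all_ids = set(docs)

# cats AND chase
and_result = postings("cats") & postings("chase")
# cats OR cheese
or_result = postings("cats") | postings("cheese")
# chase AND NOT dogs
not_result = postings("chase") & (all_ids - postings("dogs"))

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Each result is an unordered set of matching documents: a document either passes the filter or it does not, and nothing says which match is best.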
Explain the extended boolean retrieval model
The searcher has more control over the search process: the model considers text structure and the distance between words when it matches the query to a piece of text. No ranking, just sorting
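The "distance between words" idea can be sketched as a proximity check. The function name `near` and the `max_distance` parameter are assumptions made for this example; real systems expose proximity operators under various names.

```python
def near(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur within max_distance word positions."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)

sentence = "information retrieval systems rank documents"
print(near(sentence, "information", "retrieval", 1))   # adjacent words
print(near(sentence, "information", "documents", 2))   # four positions apart
```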
Explain vector space retrieval model
A vector is defined by its length and direction. The coordinates alone are enough to compute a vector's length, thanks to the Pythagorean theorem
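As a small illustration of the Pythagorean theorem applied to coordinates, here is a sketch that computes a vector's length in any number of dimensions. The function name `vector_length` is chosen for this example.

```python
import math

def vector_length(coords):
    """Length of a vector: square root of the sum of squared coordinates."""
    return math.sqrt(sum(c * c for c in coords))

print(vector_length([3, 4]))  # the classic 3-4-5 right triangle
```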
What is Bag of words
An unordered collection of the words in a document, where the frequency of each word is recorded
The structure of the text is lost
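A minimal sketch of a bag of words using `collections.Counter`, assuming simple whitespace tokenization. It shows that two sentences with different meanings can end up with the same bag, because word order is dropped.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a)
print(a == b)  # same words, same counts, different sentences
```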
common retrieval models
- similarity between document vectors
- term weight (measuring importance of word)
- evaluation of retrieval
What is the purpose of term vector similarity
finding similarity between document vectors, where the query is more than a few words.
Explain term vector similarity between documents
Term-document vector space:
Documents are represented as vectors in an n-dimensional space where each dimension/axis is a term/word and each vector coordinate is the weight of the term in the document. If a word is present in a document it gets the value 1 (if weights are binary), so the vector sits at position 1 on that axis. The direction is determined by the words in the text, and the length depends on the number of words in the document (not important)
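A sketch of the binary case, assuming a three-document toy collection. The vocabulary (one axis per distinct word) is built from the collection itself, and each document becomes a list of 0/1 coordinates.

```python
# Toy collection, made up for illustration.
docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]

# One dimension per distinct term, in a fixed (alphabetical) order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def binary_vector(text):
    """1 on an axis if the term occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

vectors = [binary_vector(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```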
How is similarity measured?
- common terms = straightforward: count the nr of terms that q and d have in common
- scalar product = multiply the corresponding coordinates of the two vectors and add up the products (x1x2 + y1y2). On its own it is not normalized, so it grows with the number of words in the document; dividing by the vector lengths normalizes it.
- Cosine similarity - similarity between 2 documents calculated as a function of the angle between the term vectors of these documents.
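The three measures can be compared side by side in a short sketch. The vectors below are made up: `d_long` points in the same direction as `d_short` but is three times as long, which shows how the scalar product depends on length while the cosine does not.

```python
import math

def common_terms(q, d):
    """Number of dimensions where both vectors have a non-zero weight."""
    return sum(1 for a, b in zip(q, d) if a > 0 and b > 0)

def dot(q, d):
    """Scalar product: sum of the products of matching coordinates."""
    return sum(a * b for a, b in zip(q, d))

def cosine(q, d):
    """Scalar product divided by both vector lengths (cosine of the angle)."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0

q = [1, 1, 0]
d_short = [1, 1, 0]
d_long = [3, 3, 0]   # same direction as d_short, three times the length

for d in (d_short, d_long):
    print(common_terms(q, d), dot(q, d), round(cosine(q, d), 6))
```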
how is term weight used
Measuring the importance of a word: instead of a binary value, a higher number means a more important term
term frequency - how often the term appears in a document
inverse document frequency - how unique the term is in the collection of documents. Low IDF = not unique, high IDF = unique. Total nr of docs / nr of docs containing the term (usually with a logarithm applied).
Term weight = term frequency * inverse document frequency
High if the term is frequent in a document and unique to a small subset of documents. Document and collection specific
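A minimal TF-IDF sketch under two assumptions: term frequency is the raw count in the document, and IDF is the natural logarithm of (total docs / docs containing the term). Other variants exist; the collection below is made up.

```python
import math
from collections import Counter

# Toy collection of already-tokenized documents, made up for illustration.
docs = [
    "cats chase mice".split(),
    "dogs chase cats".split(),
    "mice eat cheese".split(),
]

def idf(term):
    """log(total docs / docs containing the term); 0.0 if the term is unseen."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def tf_idf(term, doc):
    """Raw term count in the document multiplied by the term's IDF."""
    return Counter(doc)[term] * idf(term)

print(round(idf("cats"), 3), round(idf("cheese"), 3))
print(round(tf_idf("cheese", docs[2]), 3))
```

"cheese" occurs in one document and "cats" in two, so "cheese" gets the higher IDF, which matches the "high = unique" reading above.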
How is retrieval evaluated
Precision and recall
What is precision
Fraction of retrieved documents that are relevant. Out of all the documents we retrieved, how many are relevant?
What is recall
Fraction of relevant documents that are retrieved. Out of all the relevant documents, how many did we manage to retrieve?
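Both fractions can be computed from two sets of document ids. The ids below are made up, and the function name `precision_recall` is chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant in total, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```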
Precision and recall curve (interpolated)
It is like an averaged precision-recall curve. For each given/standard recall level we take the largest measured precision at any recall value equal to or larger than (further to the right of) that level, and plot it on the Y-axis (precision); this is the interpolated precision value. Many curves can be summarized this way. The interpolated curve never rises as recall increases
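The interpolation rule can be sketched as "take the maximum precision at this recall level or anything to the right of it". The measured (recall, precision) pairs and the recall levels below are made-up values for illustration.

```python
def interpolated_precision(points, recall_levels):
    """points: measured (recall, precision) pairs.
    For each level, return the largest precision measured at recall >= level."""
    result = []
    for level in recall_levels:
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates) if candidates else 0.0)
    return result

# Made-up measurements: precision dips to 0.5 and then recovers to 0.6.
measured = [(0.25, 1.0), (0.5, 0.5), (0.75, 0.6), (1.0, 0.4)]
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

print(interpolated_precision(measured, levels))
```

The dip at recall 0.5 is smoothed away because a higher precision (0.6) is measured further to the right.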
Why does high recall mean low precision and vice versa
Because if we want higher recall we need to retrieve more documents in order to collect the majority of the relevant documents; this hurts precision since we also get more junk (irrelevant documents). If we want to increase precision we need to retrieve fewer documents, but then we will miss more relevant documents, which hurts recall.
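The trade-off shows up when the cutoff k (how many results we retrieve) is varied over one ranked list. The ranking below is made up: `True` marks a relevant document at that rank.

```python
# Made-up ranked result list: True = relevant document at that rank.
ranking = [True, False, False, True, True, False, False, False, False, True]
total_relevant = sum(ranking)

def precision_recall_at(k):
    """Precision and recall when only the top k results are retrieved."""
    hits = sum(ranking[:k])
    return hits / k, hits / total_relevant

for k in (1, 5, 10):
    p, r = precision_recall_at(k)
    print(f"k={k:2d}  precision={p:.2f}  recall={r:.2f}")
```

In this example, retrieving more results raises recall from 0.25 to 1.00 while precision falls from 1.00 to 0.40.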
What is pooling and why do we do it?
Pooling is done in order to manually determine the relevance of documents and thus be able to evaluate retrieval. We create a small subset of documents that hopefully, but not necessarily, contains all the relevant documents, so that we can manually verify relevance and then define the total nr of relevant documents in order to calculate recall. (E.g. of the first 3000 docs, how many are relevant, and how many do we capture in the first 5 results p)
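One common way to build such a subset, assumed here and not stated in the card, is to take the top results from several systems and merge them. The run contents, document ids and the `depth` parameter are illustrative.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each ranked run."""
    pool = set()
    for ranking in runs:
        pool.update(ranking[:depth])
    return pool

# Two made-up ranked runs from two hypothetical systems.
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d3", "d5", "d1", "d6"],
]

pool = build_pool(runs, depth=2)
print(sorted(pool))  # only these documents are judged manually
```

Documents outside the pool (here d4 and d6) are never judged, which is why the pool only hopefully contains all the relevant documents.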
How do we do query expansion
We add synonyms of the terms to the original query and change the term weights in the original query, in order to retrieve more relevant documents: those that the original query missed.
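A sketch of synonym-based expansion with term weights. The synonym table and the choice to give synonyms half the weight of original terms are assumptions made for this example.

```python
# Hypothetical synonym table; a real system might use a thesaurus.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick"],
}

def expand_query(terms, synonym_weight=0.5):
    """Original terms get weight 1.0, added synonyms get a lower weight."""
    weights = {term: 1.0 for term in terms}
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            weights.setdefault(synonym, synonym_weight)
    return weights

expanded = expand_query(["fast", "car"])
print(expanded)
```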
How does relevance feedback work, and in particular how is explicit relevance feedback carried out
Relevance feedback is either manual, and then called explicit relevance feedback: the user indicates which retrieved documents are relevant to the query. The system modifies the original query to include the terms that appear in the documents marked as relevant, and the user submits the modified query to the system.
Otherwise it can be carried out implicitly, by analysing user behavior (links clicked, time spent)
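A deliberately simple sketch of the explicit variant: the most frequent new terms from the documents the user marked as relevant are appended to the query. The selection rule (raw frequency, two extra terms) is an assumption for illustration; real systems typically reweight terms as well.

```python
from collections import Counter

def feedback_query(query_terms, relevant_docs, extra_terms=2):
    """Append the most frequent unseen terms from the relevant documents."""
    counts = Counter(
        word
        for doc in relevant_docs
        for word in doc.split()
        if word not in query_terms
    )
    new_terms = [term for term, _ in counts.most_common(extra_terms)]
    return list(query_terms) + new_terms

# Documents the user marked as relevant (made up).
relevant = ["jaguar cat habitat", "jaguar cat diet", "big cat habitat"]

new_query = feedback_query(["jaguar"], relevant)
print(new_query)
```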
what is document indexing
Transforming unstructured data into structured data: an ordered list of words. It speeds up the search, because we search in structured data that represents the document collection, with information about which terms are present in which documents, how frequently, and where in the document. This is done offline by crawlers. When you query a search engine, it computes text similarity using the document index, where each word is a dimension.
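A minimal sketch of such an index, assuming whitespace tokenization: for each term it stores which documents contain it and at which word positions, so the frequency is the number of positions. The documents are made up.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

docs = {1: "cats chase mice", 2: "dogs chase cats and cats"}
index = build_index(docs)

print(dict(index["cats"]))            # documents and positions for "cats"
print(len(index["cats"][2]))          # frequency of "cats" in document 2
```

Looking up a term is then a dictionary access instead of a scan over every document.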
Why is link analysis important for search engines
A page that receives a lot of links from pages on the same topic is an authority on the topic. Authorities are important pages. A link is like a vote from some page: many links mean high importance, and it’s not easily faked.
What is PageRank
PageRank (PR) shows how important a page is in terms of the nr of incoming links and the importance of the linking pages.
Authority of a link count
Inbound links create weights –> outbound links transfer weights
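The weight-transfer idea can be sketched as a simplified PageRank iteration. This is an illustration under stated assumptions, not a production algorithm: the damping factor 0.85, the fixed 50 iterations and the three-page link graph are all chosen for the example, and every link target must also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its current weight evenly over its outbound links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outbound links spreads its weight over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Made-up graph: A and B both link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print({page: round(value, 3) for page, value in ranks.items()})
```

C ends up highest because it has two inbound links, A comes second because its single inbound link comes from the important page C, and B, with no inbound links, comes last.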