Information Retrieval Flashcards
(28 cards)
indexing
task of finding terms that describe the documents well
manual indexing
done by using a predefined set of index terms and fixed vocabularies
the indexing is done by humans
labour intensive
these are high-precision searches and work well for closed collections; however, searchers need to know the index terms to achieve precision, and labellers need to be trained in order to achieve consistency
text retrieval
find documents that are relevant to a user query, given a large, static document collection and an information need
automatic indexing
uses natural language as the indexing language; the indices are implemented via inverted files. It also involves term manipulation and term weighting
inverted file index
it can be used to record in which document a term occurs, how many occurrences, and the position of those occurrences
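A minimal sketch of such an index (assuming whitespace tokenization and lower-casing; the `build_inverted_index` name is illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}: which documents the term
    occurs in, how many occurrences, and where they occur."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "pease porridge hot", 2: "pease porridge in the pot"}
index = build_inverted_index(docs)
# index["porridge"] -> {1: [1], 2: [1]}  (doc id -> positions)
```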
bag-of-words approach
only records which terms are present and how often they occur; ignores the relationships between words, i.e. their ordering
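A quick illustration (a sketch assuming whitespace tokenization): two sentences with different word order collapse to the same bag of words.

```python
from collections import Counter

def bag_of_words(text):
    # Word order is discarded; only terms and their counts remain.
    return Counter(text.lower().split())

bag_of_words("the dog bit the man") == bag_of_words("the man bit the dog")
# -> True: ordering information is lost
```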
boolean model
a Boolean query constructs complex search commands by combining basic search terms with Boolean operators (AND, OR, NOT).
precise and simple: a logical basis for deciding whether any document should be returned, based on whether the basic terms of the query appear in the document and on the meaning of the logical operators.
vector space model
uses the bag-of-words approach where documents = points in high dimensional vector space
dimension = term in an index
frequencies of terms in documents = values
queries are represented as vectors
method to perform vector space model
- select the document(s) with the highest document-query similarity (similarity as a model for relevance => ranking)
- the further down the ranking a document appears, the less relevant it is, so users start at the top of the ranking and stop when satisfied
normalized correlation coefficient
the cosine of the angles between the vectors
- vector pointing in the same direction: 1
- orthogonal vector: 0
- vectors pointing in opposite directions: -1
This computes how well the occurrences of each term i correlate in the query and the document, then scales by the magnitudes of the overall vectors
term manipulation
the pre-processing of terms for generalisation
tokenization
process of splitting text into tokens (words), typically stripping punctuation
capitalisation
normalise all words to lower/upper case
lemmatisation
conflate different inflected forms of a word to their basic form (singular, present tense, 1st person)
stemming
conflate morphological variants by chopping their affix (connected, connection -> connect)
normalisation
heuristics to conflate variants caused by spelling, hyphenation, spaces, etc.
stop list
removes non-content words (the most frequent words, which are the least useful for retrieval)
bigram indexing
store each bigram as a term in index
e.g. pease porridge in the pot => pease porridge, porridge in, in the, the pot
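The example above can be reproduced with a short sketch (assuming whitespace tokenization; the `bigrams` name is illustrative):

```python
def bigrams(text):
    """Extract each adjacent pair of tokens as a single index term."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams("pease porridge in the pot")
# -> ['pease porridge', 'porridge in', 'in the', 'the pot']
```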
position indexing
identifies multi-word phrases during retrieval by storing the positions of terms in documents
term weighting
- document collection
- size of collection
- term frequency
- collection frequency
- document frequency
inverse document frequency
log(size of collection/documents containing the term)
tf.idf
common weighting method: the product of the term frequency and the inverse document frequency
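Combining the two definitions above (a sketch; the `tf_idf` name and natural-log base are assumptions, as the log base varies between formulations):

```python
import math

def tf_idf(term_freq, doc_freq, n_docs):
    """tf.idf weight: term frequency times
    log(size of collection / documents containing the term)."""
    return term_freq * math.log(n_docs / doc_freq)

# A term occurring 3 times in a document, appearing in 10 of 1000 documents:
tf_idf(3, 10, 1000)  # -> 3 * log(100), roughly 13.8
```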
PageRank Algorithm
exploits the link structure of the web
- link from page A to page B confers authority on B depending on the PageRank score of A and its number of outgoing links (recursively defined)
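The recursive definition can be sketched with power iteration (a simplified version: damping factor 0.85 is the conventional choice, and every page is assumed to have at least one outgoing link, so dangling pages are not handled):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to its list of outgoing links.
    Each page shares its score equally among its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            for q in outgoing:
                new[q] += damping * rank[p] / len(outgoing)
        rank = new
    return rank

# B is linked to by both A and C, so it ends up with the highest score.
links = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
```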
recall
the proportion of relevant documents returned
relevant & retrieved documents / all relevant documents
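As a one-line sketch of the formula above (document IDs stand in for documents; the `recall` helper name is illustrative):

```python
def recall(retrieved, relevant):
    """Proportion of the relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

# 2 of the 4 relevant documents were retrieved:
recall([1, 2, 3], [2, 3, 4, 5])  # -> 0.5
```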