Information Retrieval Flashcards
(28 cards)
indexing
task of finding terms that describe the documents well
manual indexing
done by using a predefined set of index terms and fixed vocabularies
the indexing is done by humans
labour intensive
these are high-precision searches and work well for closed collections; however, searchers need to know the index terms to achieve precision, and labellers need to be trained in order to achieve consistency
text retrieval
find documents that are relevant to a user query, given a large, static document collection and an information need
automatic indexing
uses natural language as the indexing language; the indices are implemented via inverted files. It also involves term manipulation and term weighting
inverted file index
it can be used to record in which document a term occurs, how many occurrences, and the position of those occurrences
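A minimal sketch of such an index (assuming whitespace tokenization and lower-casing; the `build_inverted_index` name is illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}: which documents the term
    occurs in, how many occurrences, and where they occur."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

docs = {1: "pease porridge hot", 2: "pease porridge in the pot"}
index = build_inverted_index(docs)
# index["porridge"] -> {1: [1], 2: [1]}  (doc id -> positions)
```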
bag-of-words approach
only records which terms are present and how often they occur; ignores the relationships between words, i.e. their ordering
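A quick illustration (a sketch assuming whitespace tokenization): two sentences with different word order collapse to the same bag of words.

```python
from collections import Counter

def bag_of_words(text):
    # Word order is discarded; only terms and their counts remain.
    return Counter(text.lower().split())

bag_of_words("the dog bit the man") == bag_of_words("the man bit the dog")
# -> True: ordering information is lost
```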
boolean model
a Boolean query constructs complex search commands by combining basic search terms with Boolean operators (AND, OR, NOT).
precise and simple: a logical basis for deciding whether any document should be returned, based on whether the basic terms of the query appear in the document and on the meaning of the logical operators.
vector space model
uses the bag-of-words approach where documents = points in high dimensional vector space
dimension = term in an index
frequencies of terms in documents = values
queries are represented as vectors
method to perform vector space model
- select the document(s) with the highest document-query similarity (similarity as a model for relevance => ranking)
- the further down the ranking a document appears, the less relevant it is, so users start at the top of the ranking and stop when satisfied
normalized correlation coefficient
the cosine of the angles between the vectors
- vector pointing in the same direction: 1
- orthogonal vector: 0
- vectors pointing in opposite directions: -1
This computes how well the occurrences of each term i correlate in the query and the document, then scales by the magnitudes of the overall vectors
term manipulation
the pre-processing of terms for generalisation
tokenization
process of splitting text into tokens (words), typically stripping punctuation
capitalisation
normalise all words to lower/upper case
lemmatisation
conflate different inflected forms of a word to their basic form (singular, present tense, 1st person)
stemming
conflate morphological variants by chopping their affix (connected, connection -> connect)
normalisation
heuristics to conflate variants caused by spelling, hyphenation, spaces, etc.
stop list
removes non-content words (the most frequent words, which are the least useful for retrieval)
bigram indexing
store each bigram as a term in index
e.g. pease porridge in the pot => pease porridge, porridge in, in the, the pot
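The example above can be reproduced with a short sketch (assuming whitespace tokenization; the `bigrams` name is illustrative):

```python
def bigrams(text):
    """Extract each adjacent pair of tokens as a single index term."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams("pease porridge in the pot")
# -> ['pease porridge', 'porridge in', 'in the', 'the pot']
```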
position indexing
identifies multi-word phrases during retrieval by storing the positions of terms in documents
term weighting
- document collection
- size of collection
- term frequency
- collection frequency
- document frequency
inverse document frequency
log(size of collection/documents containing the term)
tf.idf
common weighting method: the product of the term frequency and the inverse document frequency
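Combining the two definitions above (a sketch; the `tf_idf` name and natural-log base are assumptions, as the log base varies between formulations):

```python
import math

def tf_idf(term_freq, doc_freq, n_docs):
    """tf.idf weight: term frequency times
    log(size of collection / documents containing the term)."""
    return term_freq * math.log(n_docs / doc_freq)

# A term occurring 3 times in a document, appearing in 10 of 1000 documents:
tf_idf(3, 10, 1000)  # -> 3 * log(100), roughly 13.8
```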
PageRank Algorithm
exploits the link structure of the web
- link from page A to page B confers authority on B depending on the PageRank score of A and its number of outgoing links (recursively defined)
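The recursive definition can be sketched with power iteration (a simplified version: damping factor 0.85 is the conventional choice, and every page is assumed to have at least one outgoing link, so dangling pages are not handled):

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to its list of outgoing links.
    Each page shares its score equally among its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            for q in outgoing:
                new[q] += damping * rank[p] / len(outgoing)
        rank = new
    return rank

# B is linked to by both A and C, so it ends up with the highest score.
links = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
```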
recall
the proportion of relevant documents returned
relevant & retrieved documents / all relevant documents
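As a one-line sketch of the formula above (document IDs stand in for documents; the `recall` helper name is illustrative):

```python
def recall(retrieved, relevant):
    """Proportion of the relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

# 2 of the 4 relevant documents were retrieved:
recall([1, 2, 3], [2, 3, 4, 5])  # -> 0.5
```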