Chapter 6 Retrieval Model Flashcards

1
Q

Two main information retrieval models

A

The vector space retrieval model and the probabilistic retrieval model

2
Q

4 major retrieval models

A

Pivoted length normalization
Okapi BM25
Query likelihood with Jelinek-Mercer (JM) smoothing
PL2

3
Q

Common form of a retrieval function

A
First, these models are all based on the assumption of a bag-of-words representation of text.
Term frequency (TF), document length, and document frequency (DF) capture the main ideas used in pretty much all state-of-the-art retrieval models.
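
As a minimal sketch (my own illustration, not from the chapter), the common form can be written as a sum over matched query terms, where each term's weight combines TF, document length, and DF; the weighting below is a deliberately crude placeholder that models like BM25 refine:

import math
from collections import Counter

def score(query_terms, doc_terms, df, M, avg_dl):
    """Generic bag-of-words scoring skeleton (illustrative only).

    query_terms, doc_terms: lists of tokens
    df: dict mapping term -> document frequency
    M: total number of documents in the collection
    avg_dl: average document length in the collection
    """
    tf = Counter(doc_terms)   # term frequency in this document
    dl = len(doc_terms)       # document length
    total = 0.0
    for t in set(query_terms):
        if tf[t] == 0:
            continue  # only matched terms contribute
        # Crude placeholder weight: TF dampened by relative document
        # length, scaled by an IDF factor. State-of-the-art models
        # (BM25, pivoted normalization, PL2) refine each part.
        total += (tf[t] / (dl / avg_dl)) * math.log((M + 1) / df[t])
    return total
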
4
Q

vector space retrieval model

A

The VS model is a framework. In this framework, we make some assumptions. One assumption is that we represent each document and each query by a term vector. Here, a term can be any basic concept such as a word, a phrase, or any other feature.

Each term is assumed to define one dimension. Since we have |V| terms in our vocabulary, we define a |V|-dimensional space.

We place all the documents in our collection in this vector space, where they point in all kinds of directions. We can place our query in this space as another vector and then measure the similarity between the query vector and every document vector.
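
A minimal sketch of the framework view, assuming the vectors and the similarity function are supplied from elsewhere; the framework itself only says "rank documents by similarity to the query vector":

def rank(query_vec, doc_vecs, sim):
    """Rank documents by similarity to the query.

    query_vec: the query's term vector
    doc_vecs: dict mapping doc_id -> term vector
    sim: any similarity function taking two vectors
    """
    return sorted(doc_vecs,
                  key=lambda doc_id: sim(query_vec, doc_vecs[doc_id]),
                  reverse=True)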

5
Q

How can we instantiate the vector space model to get a specific ranking function?

A

In the simplest bag-of-words instantiation, we use each word in our vocabulary to define a dimension, with bit weights:
1: the word is present
0: the word is absent

Sim(q, d) = q · d = x1*y1 + x2*y2 + … + xN*yN, where N = |V|, xi is the bit weight of term i in the query, and yi is its bit weight in the document.

Now we can finally implement this ranking function in a programming language and rank the documents in our corpus for a given query.
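
A sketch of this bit-vector instantiation in Python (the example terms are hypothetical): with 0/1 weights, the dot product reduces to counting shared terms, so sets suffice:

def bit_vector_sim(query_terms, doc_terms):
    """Dot product of two bit vectors over the vocabulary.

    With 0/1 weights, only dimensions where both vectors are 1
    contribute, so the dot product is the number of shared terms.
    """
    return len(set(query_terms) & set(doc_terms))

# Hypothetical example: 3 of the 4 unique query words are matched.
print(bit_vector_sim("news about presidential campaign".split(),
                     "presidential campaign news coverage".split()))  # 3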

6
Q

Behavior of the Bit Vector Representation

A

The bit-vector scoring function counts the number of unique query terms matched in each document. If a document matches more unique query terms, it is assumed to be more relevant. The only problem is that three of the documents end up tied with the same score.

7
Q

improved instantiation

A

A natural thought is to count multiple occurrences of a term in a document, as opposed to the binary representation:
TF(w, d) = count(w, d)

Thus, the corresponding dimension is weighted by the term count (e.g., two instead of one), and the score for d4 is higher. This means that by using TF we can now rank d4 above d2 and d3, as we had hoped.
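
A sketch of the TF instantiation, with made-up documents standing in for d2/d3/d4: weighting dimensions by counts lets a repeated query term break the tie:

from collections import Counter

def tf_sim(query_terms, doc_terms):
    """Dot product with raw term-frequency weights on the document side."""
    q = Counter(query_terms)
    tf = Counter(doc_terms)
    return sum(q[t] * tf[t] for t in q)

query = "presidential campaign news".split()
d_once = "campaign news report".split()            # each query term once
d_twice = "campaign news campaign report".split()  # "campaign" twice
print(tf_sim(query, d_once))   # 2
print(tf_sim(query, d_twice))  # 3: the repeated term raises the score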

8
Q

stop word

A

A word like “about” doesn’t carry much content, so we should be able to ignore it. We call such a word a stop word.
Stop words are generally very frequent and occur everywhere, so matching them doesn’t carry any significance.
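
A tiny sketch of stop-word removal; the stop list is a made-up fragment, and in practice one would use a standard list or rely on IDF (next card) to down-weight such words:

STOP_WORDS = {"about", "the", "a", "an", "of", "in"}  # made-up fragment

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("news about the presidential campaign".split()))
# ['news', 'presidential', 'campaign']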

9
Q

inverse document frequency (IDF)

A

We want to reward a word that doesn’t occur in many documents.
We can penalize common words, which generally have a low IDF, and reward informative words, which have a higher IDF.

IDF(w) = log((M + 1) / df(w))

where M is the total number of documents in the collection and df(w) is the number of documents containing w.
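
A sketch of IDF over a toy corpus of made-up documents, showing that a common word gets a lower IDF than a rare one:

import math
from collections import Counter

docs = [
    "news about presidential campaign".split(),
    "about the campaign".split(),
    "presidential debate news".split(),
]
M = len(docs)
# df(w): number of documents containing w (count each document once)
df = Counter(w for d in docs for w in set(d))

def idf(w):
    return math.log((M + 1) / df[w])

print(round(idf("about"), 2))   # 0.69 -- common word, low IDF
print(round(idf("debate"), 2))  # 1.39 -- rare word, high IDF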
