W4 Vector Space Model Flashcards

1
Q
  1. What is Ranked Retrieval?
  2. Formal definition of Ranked Retrieval
A
  1. Also called Relevance Ranking. The system returns an ordering of the (top) documents in the collection for a query.
  2. For each query Qi and doc Dj, compute a score RSV(Qi, Dj)
    * RSV: Retrieval Status Value
    * RSV is theory-neutral: it can be distance based, probability based, etc.
    * Ordinal scale (supports >, <, =)
    * The value increases with relevance
2
Q

1. What is the set-based approach for Ranked Retrieval?

A

Query and document are each represented as a set of terms

3
Q

Give the calculation of Jaccard Coefficient

A

jaccard (A,B) = |A ∩ B| / |A ∪ B|

|A ∩ B|: Commonality
|A ∪ B|: Length Normalization
A and B don’t have to be the same size
It always assigns a number in [0, 1]

RSV(Q,D) = jaccard(terms(Q), terms(D))
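
A minimal sketch of this score in Python (function names are illustrative, not from the slides):

def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; always in [0, 1]."""
    if not a and not b:
        return 0.0  # convention: two empty sets get score 0
    return len(a & b) / len(a | b)

def rsv(query_terms: set, doc_terms: set) -> float:
    """Set-based RSV: Jaccard overlap of query and document term sets."""
    return jaccard(query_terms, doc_terms)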

4
Q

What is the query-document match score computed with the Jaccard coefficient for each of the 2 docs (common stop words are removed)?

Query: Ides of March -> {ide, march}
Doc 1: Caesar died in March -> {caesar, die, march}
Doc 2: the long march -> {long, march}

A

jaccard (A,B) = |A ∩ B| / |A ∪ B|

jaccard({ide, march}, {caesar, die, march}) = 1/4
jaccard({ide, march}, {long, march}) = 1/3
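
Reproducing these numbers with the jaccard() sketch from the earlier card (term sets copied from above):

query = {"ide", "march"}
doc1 = {"caesar", "die", "march"}   # Caesar died in March
doc2 = {"long", "march"}            # the long march

print(jaccard(query, doc1))   # 1/4 = 0.25
print(jaccard(query, doc2))   # 1/3 ≈ 0.33, so Doc 2 ranks higher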

5
Q

Give calculation of:
Jaccard Distance?
Dice coefficient?

A

Jaccard Distance = 1 - Jaccard Similarity

Dice(A, B) = 2|A ∩ B| / (|A| + |B|) = 2J / (J + 1)
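
A small Python sketch of both measures on term sets (names are illustrative):

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance = 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|), equivalent to 2J / (J + 1)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))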

6
Q

What is a metric?

A

A distance function d(x, y) satisfying:
* Identity of indiscernibles: d(x, y) = 0 <=> x = y
* Symmetry: d(x, y) = d(y, x)
* Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
* Non-negativity: d(x, y) >= 0

7
Q

What are the issues with Jaccard for scoring?

A
  1. We need to normalize for length (of the document or query)
  2. Jaccard does not consider term frequency
  3. Jaccard does not consider that rare terms in a collection are more informative than frequent terms
8
Q

2. What is the term-weighting approach for Ranked Retrieval?

A

Measure term importance in documents, e.g. with the Vector Space Model (VSM)

9
Q

What is bag of words model?

A

A multiset of words; the ordering of words within a document is ignored.
"john loves mary" and "mary loves john" have the same vectors.
A positional index can distinguish the two documents.
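
A quick illustration with Python's collections.Counter (whitespace tokenization is an assumed simplification):

from collections import Counter

bow1 = Counter("john loves mary".split())
bow2 = Counter("mary loves john".split())

print(bow1 == bow2)   # True: both documents have the same bag-of-words vector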

10
Q

What is term frequency (TF)?

A

The term frequency tf (t,d) of term t in document d is defined as
the number of times that t occurs in d.

Raw term frequency is not what we want, because relevance does not increase proportionally with term frequency, so we use log normalization:

The log-frequency weight of term t in d is:
w_(t, d) = 1 + log(tf_(t, d))   if tf_(t, d) > 0
w_(t, d) = 0                    otherwise

w_(t, d) is quantitative evidence that d is relevant when looking for t

To score a document for a query: sum w_(t, d) over the terms t that occur in both the query and the document
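
A minimal Python sketch of log-frequency weighting and the resulting score (the base-10 log and the function names are assumptions; follow the slides' convention):

import math
from collections import Counter

def log_tf_weight(tf: int) -> float:
    """w_(t, d) = 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def tf_score(query_terms, doc_tokens):
    """Score = sum of w_(t, d) over terms occurring in both query and document."""
    tf = Counter(doc_tokens)
    return sum(log_tf_weight(tf[t]) for t in query_terms if tf[t] > 0)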

11
Q

What is document frequency (DF)? How do we interpret it?

A

df(t) is the document frequency of t: the number of documents that contain t

Rare terms are more informative than frequent terms (compare a stop word with a rare word like arachnocentric). A document containing arachnocentric is more likely to be relevant to a query for it than a document that does not contain it.

12
Q
  1. What is Inverse Document Frequency (IDF)?
  2. How to calculate TF-IDF?
A
  1. idf(t) = log(N / df(t))

N: number of docs in the collection
idf quantifies how surprised we are to see term t in a document; as with tf, the logarithm dampens the effect

  2. TF-IDF = TF * IDF
    TF-IDF increases with the number of occurrences within the document, and with the rarity of the term in the collection.
    TF-IDF is evidence of d being relevant when looking for t
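
A Python sketch of idf and tf-idf over a toy collection, reusing the log-weighted tf from the previous card (the base-10 log and the helper names are assumptions):

import math
from collections import Counter

def idf(term, docs):
    """idf(t) = log10(N / df(t)); df(t) counts the documents containing t."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """tf-idf(t, d) = (1 + log10 tf) * idf(t); 0 if t does not occur in d."""
    tf = Counter(doc)[term]
    return (1.0 + math.log10(tf)) * idf(term, docs) if tf else 0.0

Here docs is a list of token lists, e.g. [["caesar", "die", "march"], ["long", "march"]].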
13
Q

Document ranking in VSM

A

Docs are vectors of term weights in the space

The vector space is |V|-dimensional (|V|: no. of terms in the vocabulary)
Terms are the axes of the space

Vectors are high-dimensional and sparse

14
Q

Queries are represented as vectors too; how do we rank?

A
  • Represent queries as vectors in the same space; rank documents according to their proximity to the query in the space, i.e. the similarity of the vectors

The angle between the vectors measures the similarity

15
Q

Give formula for cosine similarity (dot product) of two vectors

A

cos(q, d) = (q · d) / (|q| * |d|)
          = \sum_i (q_i * d_i) / ( \sqrt(\sum_i q_i^2) * \sqrt(\sum_i d_i^2) )

If q and d are length-normalized (||x|| = \sqrt(\sum_i x_i^2) = 1), this reduces to the dot product q · d
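
A plain-Python sketch of this formula on weight vectors aligned by term index (no NumPy assumed):

import math

def cosine(q, d):
    """cos(q, d) = sum(q_i * d_i) / (||q|| * ||d||)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

If q and d are already length-normalized, the denominator is 1 and only the dot product remains.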

16
Q

@#@ NOTE @#@

The calculations in this chapter are very important and will appear in the exam, so refer to the slides for examples and do the calculations yourself

A

tf, df, idf, TF-IDF, length normalization, dot product, cosine similarity

17
Q

Summary: Vector Space ranking with term weighting

A
  • Represent each document and the query as a weighted tf-idf vector
  • Compute the cosine similarity score between the query vector and each document vector
  • Rank the documents with respect to the query by score
  • Return the top K to the user (see the sketch below)
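
A compact end-to-end sketch of this pipeline (tokenization, the exact weighting variant, and all names are simplifying assumptions; the slides' scheme is what counts for the exam):

import math
from collections import Counter

def tfidf_vector(tokens, vocab, df, n_docs):
    """Log-tf * log-idf weight for every vocabulary term (0 if absent)."""
    tf = Counter(tokens)
    return [(1 + math.log10(tf[t])) * math.log10(n_docs / df[t])
            if tf[t] and df[t] else 0.0
            for t in vocab]

def rank(query_tokens, docs_tokens, k=10):
    """Rank documents by cosine similarity to the query; return the top K."""
    vocab = sorted({t for d in docs_tokens for t in d} | set(query_tokens))
    df = Counter(t for d in docs_tokens for t in set(d))
    n = len(docs_tokens)
    qv = tfidf_vector(query_tokens, vocab, df, n)

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(((cos(qv, tfidf_vector(d, vocab, df, n)), i)
                     for i, d in enumerate(docs_tokens)), reverse=True)
    return scored[:k]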
18
Q

scope vs. verbosity hypothesis

A

scope: a long document consists of a number of unrelated short documents concatenated together

verbosity: a long document covers a similar scope to a short document, but uses more words

19
Q

Criteria for comparing IR models/systems

A

effectiveness, efficiency, explainability, parsimony (Occam’s Razor)