W3 Boolean Retrieval & Indexing Compression Flashcards

Question 1

Q

What is this table called?

Word 1 1 0 1 1
Word 2 1 1 0 0
Word 3 0 1 1 1
Word 4 1 0 0 1

Answer

A

Term-Document Incidence Matrix

Question 2

Q

Consider the following table representing the relevance scores of documents in a search result:

1 | 3
2 | 2
3 | 1
4 | 2

Calculate the Discounted Cumulative Gain (DCG) for this ranking.

Answer

A

To calculate the DCG, we apply the DCG formula, which sums up the relevance scores of the ranked documents discounted by the logarithm of their positions. Let’s calculate the DCG for the given table:

DCG = rel_1 + sum(rel_i / log2(i+1)) for i in range(2, n)

Calculating the DCG:

DCG = 3 + (2 / log2(2+1)) + (1 / log2(3+1)) + (2 / log2(4+1))
    = 3 + (2 / 1.585) + (1 / 2) + (2 / 2.322)
    =  3.5021231561

Therefore, the Discounted Cumulative Gain (DCG) for the given ranking is approximately 3.5021231561.

Question 3

Q

Recommend a query processing order for the query below, given
the postings list sizes in the table:
(rocket OR launch) AND (sky OR universe) AND (sun OR orbit)

Term | Postings size
rocket |81300
launch | 109999
sky |43123
universe | 48491
sun | 62687
orbit | 577513

Answer

A

rocket OR launch: 81300 + 109999 = 191299
sky OR universe: 43123 + 48491 = 91614
sun OR orbit: 62687 + 577513 = 640200
Process in increasing order of their frequency (sum)

Question 4

Q

Consider these documents:
Doc 1
new recipe for cookies
Doc 2
cookie dough recipe
Doc 3
new chocolate dough
Doc 4
recipe for chocolate cookies
a)
Draw the term
document incidence matrix for this
document collection
b)
Draw the inverted index representation for this collection.
c)
What is the result set for the query: recipe AND NOT
(chocolate OR

Answer

A

a) The term-document incidence matrix for this document collection can be represented as follows:

| Terms     | Doc 1 | Doc 2 | Doc 3 | Doc 4 |
|-----------|-------|-------|-------|-------|
| recipe    |   1   |   1   |   0   |   1   |
| cookies   |   1   |   0   |   0   |   1   |
| dough     |   0   |   1   |   1   |   0   |
| chocolate |   0   |   0   |   1   |   1   |
| new       |   1   |   0   |   1   |   0   |

In this matrix, each row represents a term, and each column represents a document. The value in each cell indicates the presence (1) or absence (0) of the term in the corresponding document.

b) The inverted index representation for this collection is as follows:

| Terms     | Documents       |
| recipe    | Doc 1, Doc 2, Doc 4 |
| cookies   | Doc 1, Doc 4    |
| dough     | Doc 2, Doc 3    |
| chocolate | Doc 3, Doc 4    |
| new       | Doc 1, Doc 3    |

In this representation, each row represents a term, and the corresponding column lists the documents in which the term appears.

c) The result set for the query “recipe AND NOT (chocolate OR dough)” would include documents that contain the term “recipe” but do not contain either “chocolate” or “dough.” Based on the inverted index representation, the result set would include Doc 1, as it contains “recipe” but does not contain either “chocolate” or “dough.” The other documents (Doc 2, Doc 3, and Doc 4) either contain “chocolate” or “dough” and therefore are not included in the result set.

Question 5

Q

Name 4 methods of compression (relevant for this exam)

Answer

A

Zipf’s law
Huffman encoding
Entropy
Posting list compression

Question 6

Q

What’s Zipf’s law?

Answer

A

The k-th most frequent term
has frequency proportional to 1/k

Main idea: store many small numbers and few larger numbers.

Question 7

Q

What is the equation for collection frequency in the context of Zipf’s law?

Answer

A

TODO: check if correct

CF = (Total number of occurrences of a term in the collection) / (Total number of terms in the collection)

This equation calculates the frequency of a specific term within a collection of documents. It represents the ratio of the total number of times the term appears in the collection to the total number of terms present in the collection. The collection frequency is used to analyze the distribution of term frequencies and their ranks according to Zipf’s law, which states that the frequency of any term is inversely proportional to its rank in the frequency table.

Question 8

Q

What is Huffman coding?

Answer

A

A frequency sorted binary
tree sorted from the bottom up.

Must watch if not clear (1 minute): https://youtu.be/JsTptu56GM8?t=220

Question 9

Q

Consider the following postings list for a term “apple”:

Postings list: [10, 15, 18, 22, 25, 28, 32]

Perform postings compression on the given list using delta encoding and variable-length encoding techniques.

Answer

A

TODO: check AI-generated answer
Step 1: Delta Encoding
Apply delta encoding to the postings list by calculating the gaps (differences) between consecutive document identifiers:

Postings list: [10, 15, 18, 22, 25, 28, 32]
Gaps: [10, 5, 3, 4, 3, 3, 4]

Step 2: Variable-Length Encoding
Encode the gaps using variable-length encoding, where smaller values are represented using fewer bits.

Gaps: [10, 5, 3, 4, 3, 3, 4]
Encoded Gaps: [1010, 101, 11, 100, 11, 11, 100]

In this example, the variable-length encoded gaps are represented using a binary format, with each gap converted to its binary equivalent.

The final compressed postings list using delta encoding and variable-length encoding is:
Compressed Postings: [1010, 101, 11, 100, 11, 11, 100]

Brainscape's Knowledge GenomeTM

W3 Boolean Retrieval & Indexing Compression Flashcards

Brainscape's Knowledge Genome^TM