adv-db Flashcards

1
Q

Boolean retrieval model

A

Connectives: AND, OR, NOT, m of n
Fuzzy matching: “database” matches “databases”
Thesaurus expansion: expand the query with synonyms
Stop-word elimination: remove filler words like “the”, “is”, “with”
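
A minimal sketch of how the Boolean connectives can be evaluated with set operations over per-term posting sets (all terms and doc ids below are made up):

```python
from collections import Counter

# Toy posting sets: term -> set of doc ids containing the term.
postings = {
    "database": {1, 2, 4},
    "system":   {2, 3},
    "index":    {1, 3, 4},
}

print(postings["database"] & postings["system"])   # AND -> {2}
print(postings["database"] | postings["index"])    # OR  -> {1, 2, 3, 4}
print(postings["database"] - postings["index"])    # AND NOT -> {2}

# "m of n": documents matching at least m of the query terms.
m = 2
counts = Counter(d for docs in postings.values() for d in docs)
print({d for d, c in counts.items() if c >= m})    # {1, 2, 3, 4}
```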

2
Q

Zipf’s law

A

Zipf’s Law is a statistical distribution in certain data sets, such as words in a linguistic corpus, in which the frequencies of certain words are inversely proportional to their ranks.

The word at rank n appears 1/n^omega times as often as the most frequent word, where omega ~ 1.5 or 2.
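
A quick sketch of this rank-frequency relation (omega = 1.5 picked from the range above):

```python
# Relative frequency of the rank-n word under Zipf's law: 1 / n**omega.
def zipf_relative_frequency(rank: int, omega: float = 1.5) -> float:
    return 1.0 / rank ** omega

for rank in (1, 2, 10, 100):
    print(rank, round(zipf_relative_frequency(rank), 4))
# 1 -> 1.0, 2 -> 0.3536, 10 -> 0.0316, 100 -> 0.001
```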

3
Q

Precision

A

Fraction of retrieved docs that are relevant

4
Q

Recall

A

Fraction of relevant docs that are retrieved

5
Q

F1 measure

A

F1 = 2 * P * R / (P + R); the harmonic mean of precision and recall
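
A small sketch computing precision, recall and F1 from toy retrieved/relevant sets (doc ids made up):

```python
retrieved = {1, 2, 3, 4}   # docs the system returned
relevant  = {2, 4, 5}      # docs that are actually relevant

tp = len(retrieved & relevant)      # relevant docs we retrieved
precision = tp / len(retrieved)     # 2/4 = 0.5
recall    = tp / len(relevant)      # 2/3 ~= 0.667
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)        # 0.5 0.666... 0.571...
```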

6
Q

Jaccard coefficient

A

Measure of overlap between two sets
jaccard(A, B) = |A intersection B| / |A union B|; if A = B, then jaccard(A, B) = 1
Issues:
* Does not normalize well for length (long texts are penalized by their larger union)
* Does not account for term frequency
* Does not consider that rare terms may be more informative than frequent terms
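
A direct implementation over term sets (example terms made up):

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: |A & B| / |A | B|; 1.0 when A == B."""
    if not a and not b:
        return 1.0   # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

q = {"ides", "of", "march"}
d = {"march", "of", "dimes"}
print(jaccard(q, d))   # 2 shared terms / 4 total = 0.5
```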

7
Q

Binary term-document “Incidence Matrix”

A

Assign 0 or 1 if the term appears in the given document and create a matrix of form:

          Doc1   Doc2
term 1       1      0
term 2       0      1

8
Q

Term-document “Count Matrix”

A

Assign count of term (term frequency) in the document and create a matrix of form:

          Doc1   Doc2
term 1      73      3
term 2       0     44
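
A sketch building this count matrix from toy documents, and deriving the binary incidence matrix of the previous card from it (texts made up):

```python
docs = {"Doc1": "caesar caesar brutus", "Doc2": "brutus calpurnia"}

terms = sorted({t for text in docs.values() for t in text.split()})
count = {t: [text.split().count(t) for text in docs.values()] for t in terms}
incidence = {t: [1 if c > 0 else 0 for c in row] for t, row in count.items()}

print(count)      # {'brutus': [1, 1], 'caesar': [2, 0], 'calpurnia': [0, 1]}
print(incidence)  # {'brutus': [1, 1], 'caesar': [1, 0], 'calpurnia': [0, 1]}
```
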
9
Q

Bag of words model

A

The vector representation does not consider the ordering of words in a document. Such vectors are non-positional representations.

10
Q

Term frequency (tf)

A

Frequency of term t in document d, defined as the number of times that t occurs in d
- Relevance does not grow linearly with term frequency

11
Q

Log-term-frequency weighting

A

w_t,d = 1 + log_10(tf_t,d) if tf_t,d > 0, otherwise 0
The score for a query-document pair is then the sum of these weights over the query terms (terms absent from the document contribute 0)
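
A sketch of the weight and the resulting query-document score (toy counts):

```python
import math

def log_tf_weight(tf: int) -> float:
    # w_t,d = 1 + log10(tf) for tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tf):
    # Sum of weights of the query terms; absent terms contribute 0.
    return sum(log_tf_weight(doc_tf.get(t, 0)) for t in query_terms)

print(log_tf_weight(1), log_tf_weight(1000))                      # 1.0 4.0
print(score(["caesar", "brutus"], {"caesar": 10, "antony": 3}))   # 2.0
```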

12
Q

Inverse document frequency (idf)

A

df_t: number of documents in the database that contain term t
idf_t: inverse document frequency of t in the database -> idf_t = log_10(N/df_t); where N is the number of documents in the database

We use the log function to ‘dampen’ the effect of idf

Each idf is database-dependent: each term has one value for the whole DB

idf has no effect on the ranking for queries with only 1 term (it scales every matching document’s score by the same factor)
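
A quick look at how the log dampens idf as df grows (N and df values made up):

```python
import math

N = 1_000_000   # toy database size
for df in (1, 100, 10_000, 1_000_000):
    print(df, math.log10(N / df))   # 6.0, 4.0, 2.0, 0.0
```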

13
Q

Collection frequency

A

Number of occurrences of t in the collection, counting multiple occurrences within documents

14
Q

tf-idf weighting

A

tf-idf = tf * idf
tf-idf_t,d = (1 + log_10(tf_t,d)) * log_10(N / df_t); where N is the number of documents in the database

For a query, sum the tf-idf of each term
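
A sketch of the full weighting and the query score (tf/df numbers made up):

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # (1 + log10(tf)) * log10(N / df); 0 when the term is absent
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

N = 1_000_000
print(tf_idf(tf=10, df=100, n_docs=N))   # (1 + 1) * 4 = 8.0

query = {"caesar": (10, 100), "brutus": (2, 1000)}   # term -> (tf, df)
print(sum(tf_idf(tf, df, N) for tf, df in query.values()))
```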

15
Q

Vector length normalization

A

Divide the vector by its L2 norm
L2 norm –> ||x||_2 = sqrt(sum(x^2))
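
In code:

```python
import math

def l2_normalize(v):
    # Divide each component by sqrt(sum of squares).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

print(l2_normalize([3.0, 4.0]))   # [0.6, 0.8]
```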

16
Q

Cosine similarity

A

cos(q, d) = Sum(q_i * d_i) / sqrt( sum(q_i^2) * sum(d_i^2) )

Where:
q_i = tf-idf of term i in query q
d_i = tf-idf of term i in document d
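
A sketch over index-aligned tf-idf vectors (toy values):

```python
import math

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norms if norms else 0.0

print(cosine([1.0, 0.0, 1.0], [2.0, 1.0, 0.0]))  # 2 / (sqrt(2)*sqrt(5)) ~= 0.632
```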

17
Q

Rocchio’s algorithm

A

Given a query q, let q_0 be its initial vector representation.

q_t+1 = alpha * q_t + beta * sum(d_i for d_i in R) / |R| - gamma * sum(d_i for d_i in NR) / |NR|

Where:
d_i = vector representation of the document
R = relevant documents for q
NR = non-relevant documents for q

Essentially, we keep the query vector weighted by alpha (here alpha = 1, beta = 1, gamma = 1), add the average tf-idf of each term over the relevant documents (weighted by beta), and then subtract the average tf-idf of each term over the non-relevant documents (weighted by gamma). See the sketch below.
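
A sketch of one update step with numpy (alpha = beta = gamma = 1 as above; document vectors made up; clipping negative weights to 0 is a common extra step, not part of the formula itself):

```python
import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    q_next = alpha * q
    if len(rel):
        q_next = q_next + beta * rel.mean(axis=0)      # pull toward relevant docs
    if len(nonrel):
        q_next = q_next - gamma * nonrel.mean(axis=0)  # push away from non-relevant
    return np.maximum(q_next, 0.0)  # assumption: clip negative weights to 0

q0 = np.array([1.0, 0.0, 0.0])
R  = np.array([[0.5, 0.5, 0.0]])   # tf-idf vectors of relevant docs
NR = np.array([[0.0, 0.0, 1.0]])   # tf-idf vectors of non-relevant docs
print(rocchio(q0, R, NR))          # [1.5 0.5 0. ]
```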

18
Q

Inverted files

A

Term_1: d1, d2, d3
Use hash maps to store efficiently (i.e., key is term, value is list of documents)
- Keep the dictionary sorted by term
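
A minimal construction sketch (toy documents):

```python
from collections import defaultdict

docs = {1: "new home sales", 2: "home sales rise", 3: "rise in july"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Sort by term, and keep each posting list sorted.
inverted = {t: sorted(ids) for t, ids in sorted(index.items())}
print(inverted)
# {'home': [1, 2], 'in': [3], 'july': [3], 'new': [1], 'rise': [2, 3], 'sales': [1, 2]}
```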

19
Q

Champion list

A

Identify, for each term t, the r documents with the highest weights for t (offline step).
This set of r documents is the “champion list” for t.
At query time, score only the union of the champion lists of the query terms.
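
A sketch with made-up per-term weights and r = 2:

```python
# weights[t][d] = weight (e.g., tf-idf) of term t in doc d (toy numbers)
weights = {
    "caesar": {1: 8.0, 2: 3.0, 3: 0.5, 4: 6.0},
    "brutus": {2: 4.0, 3: 7.0},
}

r = 2  # champion list size
champions = {t: sorted(d, key=d.get, reverse=True)[:r] for t, d in weights.items()}
print(champions)   # {'caesar': [1, 4], 'brutus': [3, 2]}

# At query time, score only the union of the query terms' champion lists.
query = ["caesar", "brutus"]
print(set().union(*(champions[t] for t in query)))   # {1, 2, 3, 4}
```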

20
Q

Page rank

A

Transition probability matrix:
P_ij = alpha/N + (1 - alpha) * 1/L_i if there is a link from i to j; alpha/N otherwise

Where:
N = number of nodes
L_i = number of outgoing edges from node i
alpha = teleport probability

P_ij is the probability that j is the next state, given that the random surfer is currently in state i

Query independent -> static quality score
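
A power-iteration sketch of this matrix with numpy (3-node toy graph with no dangling nodes; alpha = 0.15 assumed for the teleport probability):

```python
import numpy as np

alpha = 0.15
adj = np.array([[0, 1, 1],    # made-up link graph: row i -> column j
                [1, 0, 0],
                [0, 1, 0]], dtype=float)
N = adj.shape[0]
L = adj.sum(axis=1, keepdims=True)          # out-degrees L_i
P = alpha / N + (1 - alpha) * adj / L       # P_ij as defined above

x = np.full(N, 1.0 / N)                     # start from the uniform distribution
for _ in range(100):
    x = x @ P                               # one random-surfer step
print(x)                                    # stationary PageRank scores
```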

21
Q

Hubs & authorities

A

Query-dependent alternative to PageRank

Authorities for query q are pages that have highly relevant content for q and are authoritative

Hubs for q are pages that are good directories of resources for q (i.e., pages that point to many good authorities for q)
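
A sketch of the mutually recursive hub/authority updates (HITS-style; toy link matrix made up; per-round L2 normalization assumed):

```python
import numpy as np

A = np.array([[0, 1, 1],    # made-up link graph: A[i][j] = 1 if i links to j
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(A.shape[0])     # start with uniform hub scores
for _ in range(50):
    a = A.T @ h             # authority: pointed to by good hubs
    a /= np.linalg.norm(a)
    h = A @ a               # hub: points to good authorities
    h /= np.linalg.norm(h)
print("authorities:", a)
print("hubs:", h)
```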

22
Q

Snowball

A