05 Term weighting Flashcards

1
Q

what is Zipf’s distribution

A

a few words occur frequently

a moderate number of words occur with medium frequency (these are the most useful and descriptive)

many words occur only once

2
Q

what are the assumptions of a co-ordinated search that the best-match approach contrasts with

A
  • whether a term is present in a document or not is treated as binary
  • it does not consider the degree of association (“relevancy”) between the term and the document
3
Q

what is the basis of term weighting

A

determine a word’s semantic utility based on its statistical properties

  • count the word’s frequency within a document and across all documents in the collection
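
a minimal Python sketch of this counting step, assuming a tiny hard-coded toy collection (purely illustrative):

```python
from collections import Counter

# toy collection (assumed for illustration)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog",
]

# word frequency within each document
per_doc_counts = [Counter(doc.split()) for doc in docs]

# word frequency across the whole collection
collection_counts = Counter()
for counts in per_doc_counts:
    collection_counts.update(counts)

print(per_doc_counts[0]["the"])   # 2 occurrences in the first document
print(collection_counts["the"])   # 4 occurrences across the collection
```
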
4
Q

what is Zipf’s law

A

rank (R) of a word * its frequency (f) is approximately a constant: R * f ≈ K

rank (R) * the probability of the word’s occurrence is approximately an empirical constant: R * P ≈ A

quite accurate except at very high or very low ranks
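
a rough Python sketch of checking Zipf’s law; corpus.txt is a placeholder for any reasonably large plain-text file, and the product rank * frequency should hover around a constant for the middle ranks:

```python
from collections import Counter

# corpus.txt is a placeholder for any reasonably large plain-text file
words = open("corpus.txt", encoding="utf-8").read().lower().split()
freqs = sorted(Counter(words).values(), reverse=True)

# rank * frequency should stay roughly constant except at the extremes
for rank, freq in enumerate(freqs[:20], start=1):
    print(rank, freq, rank * freq)
```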

5
Q

what are the consequences of Zipf’s law

A

some very frequent words are not good discriminators; these are called stop words
removing them reduces the storage cost of the inverted index
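
a minimal sketch of dropping stop words before indexing; the stop list here is a small hand-picked example, not a standard one:

```python
# small hand-picked stop list (illustrative only)
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def tokenize(doc: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stop words before indexing."""
    return [w for w in doc.lower().split() if w not in STOP_WORDS]

print(tokenize("The cat sat on the mat"))   # ['cat', 'sat', 'mat']
```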

6
Q

what is Heaps’ law

A

V = size of vocabulary (number of unique words)
n = number of words in the collection

V = K * n^beta

typically:
K ≈ 10 to 100
beta ≈ 0.5
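
a quick sketch plugging assumed typical values (K = 30, beta = 0.5) into Heaps’ law to see how the predicted vocabulary grows with collection size:

```python
# typical values are assumed here: K = 30, beta = 0.5
K, beta = 30, 0.5

for n in (10_000, 100_000, 1_000_000):
    V = K * n ** beta            # Heaps' law: V = K * n^beta
    print(f"n = {n:>9,}  predicted vocabulary ~ {V:,.0f}")
```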

7
Q

what is resolving power

A

a measure of relevancy, based on 2 critical factors:
- word frequency within a document
- word frequency across the collection

8
Q

what is term weighting

A

an effective approach to scoring documents by:
- how many query terms a document contains
- how discriminative those terms are

not all terms are equally useful, so we give each term a weight

output:
W = weight of the kth word in a document
inputs:
f = number of occurrences of the kth word in the document (term frequency)
N = number of documents in the collection
D = number of documents containing the kth word

inverse document frequency: idf = log(N / D)

TF-IDF weight: W = f * log(N / D)
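
a minimal Python sketch of this TF-IDF weighting on an assumed toy collection, using the card’s definitions (W = f * log(N / D)); the log base is arbitrary for ranking purposes:

```python
import math
from collections import Counter

# toy collection (assumed for illustration)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "information retrieval ranks documents",
]
tokenized = [doc.split() for doc in docs]

N = len(tokenized)          # number of documents in the collection
doc_freq = Counter()        # D: number of documents containing each word
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(doc_index: int, word: str) -> float:
    """W = f * log(N / D) for `word` in the document at `doc_index`."""
    f = tokenized[doc_index].count(word)        # term frequency
    if f == 0:
        return 0.0
    return f * math.log(N / doc_freq[word])     # tf * idf

print(tf_idf(0, "cat"))          # common word (in 2 of 3 docs) -> low idf
print(tf_idf(2, "retrieval"))    # rare word (in 1 of 3 docs) -> higher weight
```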
