03 Architecture of retrieval system Flashcards

1
Q

relevance

A

task statement
build a system that retrieves documents that users are likely to find relevant to their queries

2
Q

Saracevic's relevance

A

relevance is the measure of a correspondence existing between a document and a query as determined by the user

3
Q

relevance in practice

A
  • most models use statistical properties of the text rather than linguistic ones
  • focus on topical relevance
4
Q

text representation

A

bag of words
- treat all words in a document as index terms
- assign a weight to each term based on its importance
- disregard the structure of the text and the meaning of the words

assumptions
- term occurrence is independent
- document relevance is independent
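
A minimal sketch of the bag-of-words idea above, assuming whitespace splitting and raw term counts as the weights (both are simplifications chosen for illustration; the card does not name a weighting scheme). The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Map each term to its raw count, ignoring word order and structure."""
    terms = text.lower().split()  # assumption: whitespace tokenisation
    return Counter(terms)         # assumption: weight = raw term frequency

bow = bag_of_words("the cat sat on the mat")
# Reordering the words gives the same bag, because structure is disregarded.
reordered = bag_of_words("on the mat the cat sat")
print(bow["the"], bow == reordered)
```

Because only counts are kept, two documents with the same words in a different order become indistinguishable, which is what disregarding structure means in practice.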

5
Q

document acquisition

A

accumulate text, e.g. by crawling the web
convert HTML, PDF and similar formats to plain text
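
The HTML-to-plain-text step can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the class `TextExtractor` and the function `html_to_text` are hypothetical names, and skipping `<script>` and `<style>` content is an assumption about what counts as text worth indexing.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
plain = html_to_text(page)
print(plain)
```

PDF conversion needs a third-party library and is left out of this sketch.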

6
Q

lexical analysis (tokenisation)

A

the process of converting a stream of characters into a stream of words (tokens)
- identify words
- recognise spaces
- decide how to treat digits, hyphens, punctuation and letter case

eg.
1999 vs 510 B.C.
state-of-the-art
list.id
Bank vs bank
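
The choices listed above can be made explicit in a small regex tokeniser. This is a sketch under stated assumptions: the function `tokenise` and its `lowercase` and `keep_hyphens` switches are hypothetical, and treating every other punctuation character as a separator is just one possible policy.

```python
import re

def tokenise(text, lowercase=True, keep_hyphens=True):
    """Split text into tokens made of ASCII letters and digits."""
    if lowercase:
        text = text.lower()  # folds "Bank" and "bank" into one token
    if keep_hyphens:
        # Runs of letters/digits, optionally joined by single hyphens.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    else:
        pattern = r"[A-Za-z0-9]+"
    return re.findall(pattern, text)

kept = tokenise("State-of-the-art Bank, 1999")
split = tokenise("State-of-the-art Bank, 1999", keep_hyphens=False)
dotted = tokenise("list.id")
print(kept, split, dotted)
```

With this policy `list.id` breaks into two tokens, which illustrates why punctuation handling has to be decided per collection rather than fixed once.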

7
Q

elimination of stop words

A

words that occur too frequently across the documents in the collection are poor discriminators

they have very low discrimination values

but they can be important in combination
- e.g. "to be or not to be"

8
Q

strategies for stopword removal

A
  • list lookup: remove words found in a stop word list
  • usage of frequency: use frequency information from other documents
  • frequency analysis: remove terms occurring in 80% of documents

reduces size of indexing structure
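
Two of these strategies can be sketched as follows. The function names, the tiny stop word list and the toy collection are hypothetical, and reading the 80% figure as "at least 80% of the documents" is an assumption.

```python
from collections import Counter

def remove_by_list(tokens, stop_words):
    """List lookup: drop every token found in the stop word list."""
    return [t for t in tokens if t not in stop_words]

def frequent_terms(docs, threshold=0.8):
    """Frequency analysis: terms present in at least `threshold` of the docs."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    n = len(docs)
    return {t for t, count in df.items() if count / n >= threshold}

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "sang"],
    ["a", "fish", "swam"],
    ["the", "cat", "ran"],
]
frequent = frequent_terms(docs)

# A phrase made only of stop words disappears completely.
phrase = remove_by_list("to be or not to be".split(), {"to", "be", "or", "not"})
print(frequent, phrase)
```

The empty result for the phrase shows the cost of the smaller index: a query that consists only of stop words can no longer be matched.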

9
Q

conflation

A

the system is expected to be robust: word variants such as plural forms should not affect retrieval
conflation reduces word variants to a single form
stemming is a specific conflation technique

10
Q

stemming

A

reduces all words with the same root to a single root form

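example suffix rules (1 = the part of the word before the suffix, (AEIOU) = that part must contain a vowel):
SSES -> 1SS
(AEIOU)ED -> 1
(AEIOU)Y -> 1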

increases recall: more of the possibly relevant documents are retrieved
reduces index size
problem:
can blur differences in meaning between related words (gravitation vs gravity)
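
A rough sketch of rule-based suffix stripping in the spirit of the rules on this card. It assumes they correspond to Porter-style rules: SSES becomes SS, ED is dropped when the remaining stem contains a vowel, and a final Y after such a stem becomes I. The last mapping follows Porter's published step 1c and is an assumption, since the card's right-hand side for that rule is unclear. The names `has_vowel` and `simple_stem` are hypothetical, and this is not a full stemmer.

```python
VOWELS = set("aeiou")  # simplification: y is never treated as a vowel here

def has_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def simple_stem(word):
    """Apply the first matching suffix rule; leave the word unchanged otherwise."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]                 # caresses -> caress
    if word.endswith("ed") and has_vowel(word[:-2]):
        return word[:-2]                 # plastered -> plaster
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"           # happy -> happi (assumed Y -> I)
    return word

stems = [simple_stem(w) for w in ["caresses", "plastered", "bled", "happy", "sky"]]
print(stems)
```

The vowel condition is what keeps short words such as "bled" and "sky" intact: removing the suffix would leave a stem with no vowel.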
