Week 3 - Information Retrieval Flashcards

1
Q

What is information retrieval

A

It is finding material (usually documents) of an unstructured nature that satisfies an information need from within large collections

2
Q

What is the definition of a search

A

It is a conversation between user and search engine

3
Q

When your search result is not good, what does this tell you?

A

You have entered an ambiguous query or the query is irrelevant

4
Q

Why is user intent important

A

With billions of users, the same search term can mean something different to each person. Search engines use techniques such as localisation to get better at inferring INTENT

5
Q

How can we abstract the IR of a search problem?

A

Input -> Process -> Output
Input: a query, given as a string of characters
Output: a list of results (article name, link, article ID)

6
Q

Why is a search problem considered ad hoc retrieval

A

Because we cannot anticipate all the queries upfront

7
Q

Why is relevance also ad hoc?

A

Because relevance is tied to the information need, and each need is very specific

8
Q

Why is it called a TERM-DOCUMENT matrix

A

Because terms do not always correspond to words. A term can also be a number, symbol, abbreviation, etc.

9
Q

A passage from Hamlet has been converted into 1s and 0s. What sort of IR can you carry out?

A

Simple queries, like searching for the presence of a word

10
Q

Boolean retrieval

A

The query is posed as a Boolean expression of terms (terms combined with AND, OR and NOT)
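
A minimal sketch of Boolean retrieval over a term-document matrix of 1s and 0s. The three toy documents, their IDs and the helper name `boolean_and_not` are assumptions made up for illustration, not part of the course material.

```python
# Toy collection: doc IDs and texts are invented for illustration.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised calpurnia",
    3: "brutus and calpurnia met",
}

doc_ids = sorted(docs)
vocabulary = sorted({term for text in docs.values() for term in text.split()})

# Term-document incidence matrix: one row of 0/1 flags per term,
# one column per document (in doc_ids order).
incidence = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in vocabulary
}


def boolean_and_not(term_a, term_b):
    """Return doc IDs matching the query `term_a AND NOT term_b`."""
    row_a = incidence[term_a]
    row_b = incidence[term_b]
    return [d for d, a, b in zip(doc_ids, row_a, row_b) if a and not b]


result = boolean_and_not("brutus", "caesar")
print(result)
```

Each row is just a bit pattern, so AND, OR and NOT reduce to element-wise operations on the rows.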

11
Q

What is tokenization

A

Tokenization is the process of splitting text into tokens (for example, a sentence into words)

12
Q

What is normalisation

A

Mapping variant forms of a term to one standard form, so that people can search for plural or singular versions of a word, different spellings (US/UK), different capitalisation etc.
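
A minimal sketch of tokenization followed by normalisation. The regular expression and the tiny `SPELLING_VARIANTS` table are assumptions chosen for illustration; a real system would use far richer rules.

```python
import re

# Hypothetical UK -> US spelling map; real tables are much larger.
SPELLING_VARIANTS = {"colour": "color", "normalise": "normalize"}


def tokenize(text):
    """Split text into tokens: runs of letters, digits or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(token):
    """Lowercase the token, then map known spelling variants to one form."""
    token = token.lower()
    return SPELLING_VARIANTS.get(token, token)


tokens = [normalize(t) for t in tokenize("The Colour of the U2 album")]
print(tokens)
```

Note that "U2" survives as the token "u2", which is one reason the matrix is indexed by terms rather than by words.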

13
Q

What is stemming

A

Chopping off the end of a word: “jumping” -> “jump”
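
A deliberately crude sketch of suffix chopping. The suffix list and the minimum stem length are assumptions for illustration; real stemmers such as the Porter stemmer apply many more rules.

```python
# Illustrative suffix list only, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(crude_stem("jumping"))
print(crude_stem("jumped"))
print(crude_stem("ran"))
```

The irregular form "ran" is left unchanged, which is the kind of case a dictionary-based approach handles instead.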

14
Q

What is lemmatisation?

A

You maintain a dictionary and match each word to its base form: “ran” -> “run”
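
A minimal sketch of dictionary-based lemmatisation. The entries in `LEMMA_DICTIONARY` are a hypothetical hand-made sample; words missing from the dictionary are returned unchanged, which shows why the dictionary has to be kept up to date.

```python
# Hypothetical lemma dictionary: inflected form -> base form.
LEMMA_DICTIONARY = {
    "ran": "run",
    "running": "run",
    "mice": "mouse",
}


def lemmatize(word):
    """Look the lowercased word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_DICTIONARY.get(word, word)


print(lemmatize("ran"))
print(lemmatize("Mice"))
print(lemmatize("zebras"))
```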

15
Q

What is the purpose behind pre-processing?

A

To make searching easier for the user. Otherwise the user must search for the EXACT word form

16
Q

When are “double quotes” used?

A

When you are looking for a specific variant of a word, for example “RUN”

17
Q

Which is better? Stemming or lemmatisation?

A

Lemmatisation is often slower than stemming, because you need to maintain a dictionary. If new words come up, you need to update your lemmatisation dictionary

18
Q

What are stop words?

A

Stop words are commonly used words (a, an, by, or) that are filtered out. They go beyond conjunctions and cover words that carry no real meaning. They are removed from the query first.
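
A minimal sketch of stop-word removal on a query. The `STOP_WORDS` set is a small made-up sample, not a standard list.

```python
# Illustrative stop-word list; real lists contain a few hundred entries.
STOP_WORDS = {"a", "an", "by", "or", "the", "of"}


def remove_stop_words(query):
    """Lowercase the query, split on whitespace and drop stop words."""
    return [token for token in query.lower().split() if token not in STOP_WORDS]


filtered = remove_stop_words("The history of an empire")
print(filtered)
```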

19
Q

What does TFIDF mean

A

Term frequency - inverse document frequency

20
Q

How does TFIDF work?

A

It prioritises terms that are rare across the documents in the collection. A term that occurs frequently in a document but in few other documents gets a high TF-IDF score for that document
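
A minimal sketch of one common TF-IDF variant, assuming raw counts for term frequency and log(N / df) for inverse document frequency. Other weightings exist; the toy documents are invented for illustration.

```python
import math

# Toy collection of already-tokenized documents (assumed for illustration).
docs = {
    "d1": ["search", "engine", "index"],
    "d2": ["search", "query", "search"],
    "d3": ["query", "ranking"],
}


def tf_idf(term, doc_id):
    """Raw term count in the document times log(N / document frequency)."""
    tf = docs[doc_id].count(term)
    df = sum(1 for tokens in docs.values() if term in tokens)
    if df == 0:
        return 0.0  # term does not occur anywhere in the collection
    idf = math.log(len(docs) / df)
    return tf * idf


print(tf_idf("index", "d1"))   # rare term: appears in one document
print(tf_idf("search", "d1"))  # more common term: appears in two documents
```

In this toy collection "index" scores higher than "search" within d1, because it appears in fewer documents overall.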

21
Q

Vector space model, how is it used in IR?

A

Each document is represented as a vector, and the query is represented as a vector as well. Documents are then ranked by how similar their vectors are to the query vector.
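
A minimal sketch of ranking in a vector space model, assuming raw term counts as vector components and cosine similarity as the similarity measure (a common choice, though not the only one). The documents and query are made up.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for illustration.
docs = {
    "d1": "cheap flights to paris",
    "d2": "paris hotel deals",
    "d3": "python list sorting",
}

query_vec = Counter("flights paris".split())
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(query_vec, Counter(docs[d].split())),
    reverse=True,
)
print(ranked)
```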

22
Q

Precision definition

A

The fraction of the returned results that are relevant to the information need

23
Q

Recall definition

A

What fraction of the relevant documents in the collection were returned by the system
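
A minimal sketch that computes both precision and recall from a set of returned doc IDs and a set of relevant doc IDs. The IDs are arbitrary illustrative values.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for the given doc ID collections."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents returned, 5 relevant in the collection, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
print(precision, recall)
```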

24
Q

When should you focus on precision or recall?

A

Recall: if you want to do an exhaustive search, covering all relevant ground.
Precision: when you care more that the results returned are relevant

25
Q

There is always a trade off between precision and recall. What is the trade off?

A

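The trade off comes from the threshold (cut off) applied to the scores. A lower threshold returns more documents, which raises recall but lets some irrelevant documents slip through, lowering precision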
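
A minimal sketch of how moving a score threshold shifts precision and recall. The scores and relevance judgements are invented for illustration.

```python
# Hypothetical ranked output: (doc ID, relevance score from the engine).
scored = [("d1", 0.9), ("d2", 0.7), ("d3", 0.5), ("d4", 0.3)]
relevant = {"d1", "d3"}  # assumed ground-truth relevant documents


def evaluate(threshold):
    """Return (precision, recall) when only docs scoring >= threshold are returned."""
    returned = {doc for doc, score in scored if score >= threshold}
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant)
    return precision, recall


strict = evaluate(0.8)  # high cut off: few results
loose = evaluate(0.4)   # low cut off: more results
print(strict, loose)
```

With the strict cut off every returned document is relevant but half the relevant ones are missed; with the loose cut off all relevant documents are found but an irrelevant one is returned too.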

26
Q

When do you have perfect precision and recall

A

If you perfectly capture user intent

27
Q

Search engine data structures. What is an index?

A

An index has all your key terms and pages in which the terms occur

28
Q

What is an inverted index?

A

For each term t, store a list of the unique docIDs of the documents that contain t

29
Q

What is a term dictionary

A

The set of distinct terms in the index. The terms in the dictionary form the vocabulary

30
Q

What makes up an inverted index?

A

Dictionary + postings = inverted index
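
A minimal sketch of building an inverted index as a dictionary of terms, each pointing to a sorted postings list of unique doc IDs. The toy documents and the function name `build_inverted_index` are assumptions for illustration.

```python
from collections import defaultdict

# Toy collection, invented for illustration.
docs = {
    1: "new home sales",
    2: "home sales rise",
    3: "rise in new listings",
}


def build_inverted_index(documents):
    """Map each term (dictionary) to a sorted list of unique doc IDs (postings)."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(docs)
print(index["home"])

# An AND query intersects the two postings lists.
both = sorted(set(index["home"]) & set(index["rise"]))
print(both)
```

Compared with a full matrix of 1s and 0s, the postings lists store only the documents where a term actually occurs.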