Search Engines and Crawlers Flashcards

(47 cards)

1
Q

Vertical search

A

A specialised form of web search restricted to a particular topic or domain

2
Q

Enterprise search

A

Finding information in the files and documents scattered across a company's intranet

3
Q

Desktop search

A

The personal version of enterprise search: searching the files, email and other data stored on one computer

4
Q

Classification

A

Automatically assigning documents to a defined set of labels or classes

5
Q

Ad-hoc search

A

Finding documents relevant to an arbitrary, free-text query

6
Q

Relevance in information retrieval

A

o Topical relevance and user relevance
o Retrieval models – formal representations of how a query is matched against documents; they form the basis of ranking algorithms
o Ranking algorithms

7
Q

Evaluation in information retrieval

A

o Precision and recall – precision is the proportion of retrieved documents that are relevant; recall is the proportion of relevant documents that are retrieved
o Test collections
o Clickthrough and log data
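
Precision and recall for a single query can be computed from the set of retrieved documents and the set of documents judged relevant. The sketch below is a minimal illustration; the function name and document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents retrieved, 3 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Here precision is 2 out of 4 retrieved and recall is 2 out of 3 relevant.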

8
Q

Information needs

A

o Query suggestions (e.g. autofill)
o Query expansion – adding related terms to the query to improve matching
o Relevance feedback – refining the query using the user's feedback on which results are relevant

9
Q

What are the primary goals of a search engine?

A
  1. Effectiveness – retrieving the most relevant set of documents possible
  2. Efficiency – processing queries as quickly as possible
10
Q

Issues in search engines

A

o Performance - efficient searching and indexing
o Incorporating new data
o Scalability – growing with data and users (e.g. handling large amounts of traffic)
o Adaptability – tuning for applications (e.g. adapting for use on a variety of devices)

11
Q

Document statistics

A

Gathering and recording statistical information about words and documents

12
Q

How are document statistics used?

A

The gathered information is stored in lookup tables and used by ranking algorithms

13
Q

Lookup table

A

Data structure designed for quick retrieval

14
Q

Weighting

A

Calculating weight using document statistics and storing it in a lookup table

15
Q

tf.idf weighting

A

Giving high weights to terms that occur often in a document (tf) but appear in very few documents in the collection (idf)
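
As a rough sketch, one common variant multiplies the raw term frequency in a document by log(N / df), where N is the number of documents and df is the number of documents containing the term. The exact formula varies between systems, so the function and sample documents below are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using tf * log(N / df)."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(text.lower().split())
                   for doc_id, text in docs.items()}
    doc_freq = Counter()  # number of documents each term appears in
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term])
                 for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


docs = {
    "d1": "tropical fish tank fish",
    "d2": "fish and chips",
    "d3": "tank battle",
}
weights = tf_idf(docs)
```

In "d1", "tropical" appears in only one document, so it gets a higher weight than "tank", which appears in two.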

16
Q

True or false: Weight is calculated during the query process

A

False! It can be calculated as part of the query process, but calculating during indexing makes querying more efficient

17
Q

Inversion

A

Changing document-term info into term-document info
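
A minimal sketch of inversion, assuming each document has already been reduced to a list of terms. The names are illustrative, and a real index would also store positions and weights.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn {doc_id: [terms]} into {term: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


doc_terms = {1: ["fish", "tank"], 2: ["fish", "chips"], 3: ["tank"]}
inverted = invert(doc_terms)
```

Looking up `inverted["fish"]` now gives the documents containing "fish" without scanning every document.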

18
Q

Methods of query transformation

A

o Spell checking
o Query suggestion
o Suggesting additional terms via query expansion

19
Q

When search results are displayed, the results output step…

A

o Generates snippets to summarise retrieved documents
o Clusters output to identify related groups of documents
o Highlights important words and passages

20
Q

Document data store

A

A database that manages large numbers of documents and the structured data (usually metadata or links) associated with them

21
Q

True or false: A document data store can be stored in a relational database

A

True, but some applications use other storage systems for faster retrieval

22
Q

Scoring

A

Calculating scores for documents using a ranking algorithm

23
Q

Performance optimisation

A

Designing ranking algorithms to decrease response time and increase query throughput

24
Q

Methods of distributing query processing across machines

A

o Query broker – distributes each query to multiple machines and assembles their results

o Caching results of common search queries
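
Caching common queries can be sketched with Python's built-in `functools.lru_cache`. The `run_query` function is a hypothetical stand-in for the expensive ranking step, and the cache size is an arbitrary choice.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive step actually runs


def run_query(query):
    """Stand-in for an expensive ranking step (hypothetical)."""
    CALLS["count"] += 1
    return tuple(sorted(query.split()))


@lru_cache(maxsize=1024)
def cached_search(query):
    """Return cached results when the same query has been seen before."""
    return run_query(query)


cached_search("tropical fish")
cached_search("tropical fish")  # second call is served from the cache
```

After the two calls, `run_query` has only run once.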

25
Parser
Processes the sequence of text tokens (usually words) in a document to recognise structural elements such as titles, links and headings
26
Stopping
Removes common words (stopwords such as "the", "of" and "and") from the token stream to reduce repetition and index size
27
Stemming
Grouping words derived from a common stem (e.g. "fish", "fishes" and "fishing") and replacing them with one designated word
28
Why is stemming used?
Words in queries and documents are more likely to match
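
Stopping and stemming can be sketched together as a small text-transformation step. The stopword list and suffix rules below are toy assumptions, not a real stemmer such as Porter's.

```python
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "for"}
SUFFIXES = ("ing", "es", "s")  # toy suffix list, checked in this order


def stem(word):
    """Strip one suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, drop stopwords, then stem what is left."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOPWORDS]


print(index_terms("Fishing for the fish in fishes"))
```

All three surviving words reduce to "fish", so a query for "fish" would match a document that only mentions "fishing".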
29
Information extraction
Using syntactic analysis to identify complex index terms
30
Classifier
o Identify class-related metadata
o Assign labels to documents representing topic categories
o Group documents without pre-defined categories
31
Crawler
Follows links to web pages to discover and download new pages
32
Web crawling
* Client program connects to a domain name system (DNS) server
* DNS server translates the host name into an internet protocol (IP) address
* Program attempts to connect to the computer with that IP address
* Once the connection is established, the client program sends an HTTP request to the server
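
The steps above can be sketched with Python's standard `socket` module. This is a bare-bones illustration over plain HTTP on port 80; the function names are assumptions, and a real crawler would normally use an HTTP library and handle HTTPS, redirects and errors.

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch(host, path="/", port=80, timeout=10):
    """Resolve the host, connect, send the request, return the raw response."""
    ip_address = socket.gethostbyname(host)  # DNS: host name -> IP address
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(build_request(host, path))  # send the HTTP request
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("example.com")` would return the raw response bytes, headers included, if the host is reachable.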
33
Politeness policies
Standards that aim to reduce a crawler's impact on a web server's performance. These could include a politeness window (a delay between requests to the same server) or being prevented from accessing certain pages (e.g. via robots.txt)
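
Both ideas can be sketched with the standard library's `urllib.robotparser`. The robots.txt content, the "ExampleBot" agent name and the one-second fallback are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would normally download this.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="ExampleBot"):
    """Return True if robots.txt allows this agent to request the URL."""
    return parser.can_fetch(agent, url)


# Politeness window in seconds; fall back to 1 second if none is given.
politeness_window = parser.crawl_delay("ExampleBot") or 1
```

A crawler would check `may_fetch(url)` before each request and sleep for `politeness_window` seconds between requests to the same server.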
34
Focused crawling
Crawling only pages about a particular topic; relies on web pages tending to link to other pages on the same topic
35
How does focused crawling work?
* Can use popular pages as seeds
* Uses text classifiers to determine what a page is about
* If the page is on topic, keeps the page and uses its links to find related sites
36
How does a focused crawler decide which pages to visit next?
* Tracking the topicality of downloaded pages to determine whether to download similar pages
* Anchor text data and topicality data can be combined to determine which pages to visit next
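
One way to picture this is a priority queue (the crawl frontier) where each link's priority combines the topicality of the page it was found on with a score for its anchor text. The equal 0.5 weights, the class name and the URLs are all assumptions for illustration.

```python
import heapq


class Frontier:
    """Priority queue of URLs; a higher score is visited sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, parent_topicality, anchor_score):
        """Queue a URL once, scored from both signals (assumed equal weights)."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = 0.5 * parent_topicality + 0.5 * anchor_score
        heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        """Return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


frontier = Frontier()
frontier.add("http://example.com/fish-care", parent_topicality=0.9, anchor_score=0.8)
frontier.add("http://example.com/car-insurance", parent_topicality=0.9, anchor_score=0.1)
frontier.add("http://example.com/aquarium-pumps", parent_topicality=0.4, anchor_score=0.7)
```

The first URL popped is the one with both an on-topic parent page and on-topic anchor text.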
37
Deep web
Pages that are difficult for a crawler to find
38
Examples of deep web pages
* Sites that require an account
* Form results (e.g. flight timetables, product search)
* Scripted pages
39
How can web pages become easier to find?
Sitemaps
40
Issues in crawlers and how to solve them
* Text documents stored in incompatible formats - can be converted to a consistent tagged format (e.g. HTML or XML)
* Crawling can be expensive in terms of CPU and network load - it may be useful to store downloaded documents rather than crawl them again
41
BigTable
A distributed database system in which the table is split into small pieces (tablets), which are served by thousands of machines
42
How are changes recorded in BigTable?
Changes are recorded in a transaction log and stored in a shared file system
43
True or false: If a BigTable tablet server crashes, the whole table is inaccessible
False! Another server will immediately read the tablet data and transaction log and take over
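
The recovery idea can be sketched as replaying a log on top of the last saved tablet data. This is a toy model and not BigTable's actual API; the operation names, row keys and column names are invented for the example.

```python
def replay(tablet_snapshot, transaction_log):
    """Rebuild a tablet's state from saved data plus its transaction log."""
    state = {row: dict(columns) for row, columns in tablet_snapshot.items()}
    for operation, row, column, value in transaction_log:
        if operation == "set":
            state.setdefault(row, {})[column] = value
        elif operation == "delete":
            state.get(row, {}).pop(column, None)
    return state


snapshot = {"com.example/": {"title": "Example"}}
log = [
    ("set", "com.example/", "anchor:other.com", "click here"),
    ("set", "org.sample/", "title", "Sample"),
    ("delete", "com.example/", "title", None),
]
recovered = replay(snapshot, log)
```

Because both the saved data and the log live in the shared file system, any server can run this kind of replay to take over a tablet. The two rows in `recovered` also end up with different columns, which a relational table would not allow.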
44
True or false: BigTable is a relational database model
False! Unlike relational databases, not all rows have the same columns.
45
Feed
Real-time stream of documents (e.g. news or blog posts); one of the ways search engines acquire new documents
46
Push feed
Alerts subscribers to new documents
47
Pull feed
Requires the subscriber to check periodically