Search Engines and Crawlers Flashcards by Jess Casey

Vertical search

Search restricted to a particular topic or type of content, so it is narrower than general web search (e.g. only certain file formats will be shown)

Enterprise search

Searching for company documentation

Desktop search

Searching the files stored on a personal computer, including the data inside them

Classification

Identifying the relevant labels (categories) for documents

Ad-hoc search

Finding relevant documents in unstructured data for an arbitrary text query

Relevance in information retrieval

o Topical relevance and user relevance
o Retrieval models – formal descriptions of how a query is matched against documents; they form the basis of ranking algorithms
o Ranking algorithms

Evaluation in information retrieval

o Precision and recall – precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved
o Test collections
o Clickthrough and log data
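
A minimal sketch of how precision and recall could be computed for a single query, assuming the set of relevant documents is already known (for example, from a test collection). The function name and document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 documents retrieved, 3 relevant documents exist
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).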

Information needs

o Query suggestions (e.g. autofill)
o Query expansion – adding related terms to the query to retrieve other potentially relevant documents
o Relevance feedback – refining the query using the results that the user has indicated are relevant

What are the primary goals of a search engine?

Effectiveness – retrieving the most relevant set of documents possible
Efficiency – processing queries as quickly as possible

Issues in search engines

o Performance - efficient searching and indexing
o Incorporating new data
o Scalability – growing with data and users (e.g. handling large amounts of traffic)
o Adaptability – tuning for applications (e.g. adapting for use on a variety of devices)

Document statistics

Gathering and recording statistical information about words and documents

How are document statistics used?

The gathered information is stored in lookup tables and used by ranking algorithms

Lookup table

Data structure designed for quick retrieval

Weighting

Calculating weight using document statistics and storing it in a lookup table

tf.idf weighting

Giving high weights to terms that occur frequently in a document (tf) but appear in very few documents in the collection (idf)
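
A minimal sketch of one common tf.idf variant, assuming relative term frequency multiplied by log(N / df). Many other variants exist; the function name and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights

docs = {
    "d1": ["tropical", "fish", "tank"],
    "d2": ["fish", "and", "chips"],
    "d3": ["fish", "tank", "filter"],
}
w = tf_idf(docs)
```

In this toy collection "fish" appears in every document, so its weight is zero, while "tropical" (one document) outweighs "tank" (two documents) in d1.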

True or false: Weight is calculated during the query process

False! It can be calculated as part of the query process, but calculating during indexing makes querying more efficient

Inversion

Changing document-term info into term-document info
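
A minimal sketch of inversion, assuming the document-term information is held as a dictionary from document ID to its terms. The names are illustrative, not taken from any particular search engine.

```python
from collections import defaultdict

def invert(doc_terms):
    """Turn doc_id -> [terms] into term -> sorted list of doc_ids."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = invert({"d1": ["fish", "tank"], "d2": ["fish", "chips"]})
```

Looking up `index["fish"]` now gives the documents containing that term directly, which is what a query needs.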

Methods of query transformation

o Spell checking
o Query suggestion
o Suggesting additional terms via query expansion

When search results are displayed, snippets are generated to…

o Summarise retrieved documents
o Identify related groups of documents
o Highlight important words and passages

Document data store

A database that manages large numbers of documents and the structured data (usually metadata or links) associated with them

True or false: A document data store can be stored in a relational database

True, but some applications use other storage systems for faster retrieval

Scoring

Calculating scores for documents using a ranking algorithm

Performance optimisation

Designing ranking algorithms to decrease response time and increase query throughput

Methods of distributing ranked documents

o Query broker – distributes queries across machines and assembles the results
o Caching results of common search queries
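
A minimal sketch of caching the results of common queries, assuming an in-memory cache on a single machine. `rank` is a hypothetical stand-in for an expensive ranking step, and `CALLS` exists only to show when it actually runs.

```python
from functools import lru_cache

CALLS = []  # records which queries actually reached the ranker

def rank(query):
    """Hypothetical stand-in for an expensive ranking step."""
    CALLS.append(query)
    return tuple(sorted(query.split()))

@lru_cache(maxsize=1024)
def cached_search(query):
    return rank(query)

cached_search("tropical fish")
cached_search("tropical fish")  # served from the cache; rank() is not called again
```

A real deployment would more likely use a shared cache in front of the query broker, but the idea is the same: repeated queries skip the ranking work.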

Parser

Processes sequences of text tokens (usually words)

Stopping

Removes common words from text tokens to reduce repetition and index size

Stemming

Grouping words that are derived from a common stem (e.g. "fish" and "fishing") and replacing them with a single designated word

Why is stemming used?

Words in queries and documents are more likely to match
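
A minimal sketch of stopping followed by stemming. The stopword list is a tiny illustrative sample, and the suffix stripping is a deliberately crude assumption, not a real stemming algorithm such as Porter's.

```python
STOPWORDS = {"the", "a", "of", "and", "to", "in", "for"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "s")  # crude; not a real stemming algorithm

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(index_terms("Fishing for the fish"))  # → ['fish', 'fish']
```

Because both "Fishing" and "fish" reduce to the same term, a query containing one can match a document containing the other.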

Information extraction

Using syntactic analysis to identify complex index terms

Classifier

o Identify class-related metadata
o Assign labels to documents representing topic categories
o Group documents without pre-defined categories

Crawler

Follows links to web pages to discover and download new pages

Web crawling

* Client program connects to a domain name system (DNS) server
* DNS server translates the host name into an internet protocol (IP) address
* Program attempts to connect to the computer with that IP address
* Once the connection is established, the client program sends an HTTP request to the server
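
The steps above can be sketched with Python's standard `socket` module. This is a minimal illustration that assumes plain HTTP on port 80 with no TLS, redirects, or error handling; `build_request` and `fetch` are names chosen for this sketch.

```python
import socket

def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def fetch(host, path="/", port=80, timeout=10):
    """Resolve the host, connect, send the request, and return the raw response."""
    ip = socket.gethostbyname(host)  # DNS lookup: host name -> IP address
    with socket.create_connection((ip, port), timeout=timeout) as conn:
        conn.sendall(build_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("example.com")` needs network access and returns the response headers and body as raw bytes.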

Politeness policies

Standards that aim to reduce a crawler's impact on a web server's performance. These could include a politeness window (a minimum time between page requests to the same server) or being prevented from accessing certain pages (e.g. via robots.txt)
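
A minimal sketch of checking such rules with Python's standard `urllib.robotparser`. The robots.txt content and the `example-crawler` agent name are hypothetical; a real crawler would download the file from the site before crawling it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, agent="example-crawler"):
    """Return True if the rules allow this agent to request the URL."""
    return robots.can_fetch(agent, url)

delay = robots.crawl_delay("example-crawler")  # seconds to wait between requests
```

A crawler following these rules would skip anything under `/private/` and sleep for `delay` seconds between requests to this server.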

Focused crawling

Relies on web pages linking to other pages on the same topic

How does focused crawling work?

* Can use popular pages as seeds
* Uses text classifiers to determine what a page is about
* If a page is on topic, keeps the page and uses its links to find related sites

How does a focused crawler decide which pages to visit next?

* Tracking topicality of downloaded pages to determine whether to download similar pages
* Anchor text data and topicality data can be combined to determine which pages to visit next
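
One way to picture this is a priority queue of candidate URLs. The sketch below is an assumption-heavy illustration: `topicality` is a toy keyword count standing in for a real text classifier, and the equal weighting of anchor text and parent-page topicality is arbitrary.

```python
import heapq

TOPIC_WORDS = {"fish", "aquarium", "tank"}  # hypothetical topic vocabulary

def topicality(text):
    """Toy stand-in for a text classifier: fraction of words that are on topic."""
    words = text.lower().split()
    return sum(w in TOPIC_WORDS for w in words) / len(words) if words else 0.0

class Frontier:
    """Priority queue of URLs, highest combined score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, anchor_text, parent_score):
        if url in self._seen:  # never queue the same URL twice
            return
        self._seen.add(url)
        # Equal weighting of the two signals is an arbitrary choice for this sketch
        score = 0.5 * topicality(anchor_text) + 0.5 * parent_score
        heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def next_url(self):
        return heapq.heappop(self._heap)[1]

frontier = Frontier()
frontier.add("http://example.com/cars", "used cars", parent_score=0.2)
frontier.add("http://example.com/tanks", "fish tank guide", parent_score=0.8)
```

With these made-up inputs the crawler would visit the "fish tank guide" link first, because both its anchor text and the page that linked to it look on topic.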

Deep web

Pages that are difficult for a crawler to find

Examples of deep web pages

* Sites that require an account
* Form results (e.g. flight timetables, product search)
* Scripted pages

How can web pages become easier to find?

Sitemaps

Issues in crawlers and how to solve them

* Text documents are stored in incompatible formats - they can be converted to tagged formats
* Crawling can be expensive in terms of CPU and network load - it may be useful to store downloaded documents rather than fetch them again

BigTable

A distributed database system in which the table is split into small pieces (tablets), which are served by thousands of machines

How are changes recorded in BigTable?

Changes are recorded in a transaction log and stored in a shared file system

True or false: If a BigTable tablet server crashes, the whole table is inaccessible

False! Another server will immediately read the tablet data and transaction log and take over

True or false: BigTable is a relational database model

False! Unlike relational databases, not all rows have the same columns.
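
A small illustration of rows that do not share columns, using plain Python dictionaries. This is not BigTable's API; the row keys and column names are invented for the example.

```python
# Each row maps column names to values; rows need not share columns.
table = {
    "com.example.www": {"text": "<html>...</html>", "anchor:other.com": "Example"},
    "org.example.blog": {"text": "<html>...</html>", "title": "Blog", "language": "en"},
}

def get(row_key, column, default=None):
    """Look up one cell; a missing column is simply absent from the row."""
    return table.get(row_key, {}).get(column, default)
```

In a relational table every row would carry a `title` column, even if empty; here the first row simply has no such entry.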

Feed

Real-time stream of documents (e.g. news or blog posts); this is one way search engines acquire new documents

Push feed

Alerts subscribers to new documents

Pull feed

Requires the subscriber to check periodically

(47 cards)