Spiders and crawlers Flashcards

(31 cards)

1
Q

Information Retrieval

A

A field concerned with the structure, analysis, organisation, storage, and retrieval of information.

2
Q

Goals of Search Engines

A

Effectiveness - quality - retrieve the most relevant set of documents possible

Efficiency - speed - process users' queries and return results as quickly as possible

3
Q

Indexing Process

A

Text acquisition, transformation and index creation

4
Q

Text acquisition

A

Identify and acquire new documents, then store them in a document data store

5
Q

Text transformation

A

Remove duplicates from the store, classify information and organise data

6
Q

Index creation

A

Create an index to quickly locate information - stored in an index database that can be queried by the user
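A minimal sketch of the idea behind index creation - an inverted index mapping each term to the documents that contain it. The data and function name are illustrative, not part of any real engine:

```python
# Inverted-index sketch: term -> set of IDs of documents containing it.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "spiders crawl the web", 2: "the web is large"}
index = build_index(docs)
print(sorted(index["web"]))  # -> [1, 2]
```

Looking up a query term is then a single dictionary access instead of a scan over every document.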

7
Q

Query Process

A

Interaction, evaluation and log data

8
Q

Interaction

A

The user submits a query; a ranking algorithm is run over the index, and the document data store is used to display the most relevant results
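As an illustration of the ranking step, here is a toy scorer that ranks documents by how many query terms they contain - a crude stand-in for a real ranking algorithm:

```python
# Toy ranking sketch: score each document by the count of its tokens
# that appear in the query, then sort document IDs by score.
def rank(query, docs):
    q_terms = set(query.lower().split())
    scores = {doc_id: sum(t in q_terms for t in text.lower().split())
              for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "spiders crawl the web", 2: "cooking recipes"}
print(rank("web spiders", docs))  # -> [1, 2]
```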

9
Q

Evaluation

A

Find out how relevant the results are, e.g. ask the user to rate the relevance of the results so the algorithm can be updated
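One common evaluation measure is precision - the fraction of retrieved documents the user judged relevant. A small sketch with hypothetical data:

```python
# Precision sketch: relevant retrieved / total retrieved.
def precision(retrieved, relevant):
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

retrieved = [1, 2, 3, 4]
relevant = {1, 3}  # hypothetical user relevance judgements
print(precision(retrieved, relevant))  # -> 0.5
```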

10
Q

Log data

A

Log user data and update algorithm

11
Q

Types of text acquisition

A

Crawler and feed

12
Q

Crawler

A

Web crawler is the most common method of text acquisition: it opens a web page, extracts its links and follows them to discover new pages
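The link-extraction step can be sketched with the standard library alone. Fetching the page over the network (e.g. with urllib) is omitted so the example runs offline on a hard-coded page:

```python
# Extract the href targets of <a> tags from an HTML page.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="http://example.com">Ex</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/about', 'http://example.com']
```

A real crawler would resolve relative links against the page's URL and add the unseen ones to a frontier queue.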

13
Q

Challenges faced by crawler

A

Pages are constantly updated, so crawlers must be run very frequently to keep up; they may not be able to handle huge volumes of new pages

Can only operate on a single website

14
Q

Focused crawler

A

Classification technique to determine whether a page is relevant or not; will not access pages that are deemed irrelevant
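A toy sketch of the gating idea: real focused crawlers use a trained classifier, but a keyword-overlap rule (topic terms and threshold below are purely illustrative) shows the shape of the decision:

```python
# Focused-crawler sketch: only crawl pages classified as on-topic.
TOPIC_TERMS = {"retrieval", "index", "search", "crawler"}

def is_relevant(page_text, threshold=2):
    """Treat a page as relevant if enough topic terms appear in it."""
    terms = set(page_text.lower().split())
    return len(terms & TOPIC_TERMS) >= threshold

print(is_relevant("search engines build an index of the web"))  # -> True
print(is_relevant("a recipe for tomato soup"))                  # -> False
```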

15
Q

Feed

A

Real time stream of documents e.g. a news feed

Search engine acquires new documents simply by monitoring the feed

16
Q

Feed Conversion

A

Documents in feed are rarely plain text

Search engines require them to be converted into consistent text + metadata

17
Q

Document Data Store

A

Database to manage large numbers of documents and structured data (metadata) associated with them

18
Q

Types of text transformation

A

Parser, stopping and stemming

19
Q

Parser

A

Processes a sequence of text tokens

Uses knowledge of syntax to identify structure of text/information
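The first step of parsing can be sketched as a simple tokeniser that turns raw text into the word-token stream the rest of the pipeline works on:

```python
# Tokeniser sketch: lowercase the text and pull out alphanumeric runs.
import re

def tokenise(text):
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenise("Spiders & crawlers: 2 kinds of robot."))
# -> ['spiders', 'crawlers', '2', 'kinds', 'of', 'robot']
```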

20
Q

Stopping

A

Removes common words from the stream of tokens e.g. ‘the’, ‘of’, ‘to’, ‘for’

Reduces size of index and does not affect quality
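Stopping is a simple filter over the token stream. The stopword list below is a tiny illustrative sample, not a real engine's list:

```python
# Stopping sketch: drop common function words from the token stream.
STOPWORDS = {"the", "of", "to", "for", "a", "is"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "structure", "of", "the", "web"]))
# -> ['structure', 'web']
```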

21
Q

Stemming

A

Group words together that derive from a common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’

May not be effective for all languages
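A crude suffix-stripping rule is enough to show the idea on the 'fish' example; real stemmers (e.g. the Porter stemmer) use many more rules:

```python
# Stemming sketch: strip a few common suffixes, keeping a stem of
# at least three characters.
def stem(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["fish", "fishes", "fishing"]])
# -> ['fish', 'fish', 'fish']
```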

22
Q

User interaction

A

Query input, query transformation and results output

23
Q

Query input

A

A small number of keywords supplied by the user to search the collection, e.g. a web search query

24
Q

Query transformation

A

Tokenisation, stopping and stemming must be applied to the query so its terms can be compared with the indexed document terms
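The three steps can be sketched as one small pipeline; the stopword list and the one-rule stemmer are purely illustrative:

```python
# Query-transformation sketch: apply the same tokenise/stop/stem steps
# to the query that were applied to documents at index time.
import re

STOPWORDS = {"the", "of", "to", "for"}

def transform_query(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())            # tokenise
    tokens = [t for t in tokens if t not in STOPWORDS]          # stopping
    return [t[:-1] if t.endswith("s") else t for t in tokens]   # crude stemming

print(transform_query("The habits of spiders"))  # -> ['habit', 'spider']
```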

25
Q

Results output

A

Construct a display of the ranked documents - snippets of documents, important words/passages etc.
26
Q

Logging

A

One of the most valuable sources for tuning and improving search engines

Ranking analysis uses log data to compare the effectiveness of ranking algorithms, e.g. through simulations
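One simple use of log data is comparing two ranking algorithms by click-through rate. The log records below are hypothetical:

```python
# Logging sketch: click-through rate from (algorithm, clicked) records.
def ctr(log):
    """log: list of (algorithm, clicked) pairs for one algorithm."""
    clicks = sum(1 for _, clicked in log if clicked)
    return clicks / len(log) if log else 0.0

log_a = [("A", True), ("A", False), ("A", True), ("A", True)]
log_b = [("B", True), ("B", False), ("B", False), ("B", False)]
print(ctr(log_a), ctr(log_b))  # -> 0.75 0.25
```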
27
Q

Politeness Policy

A

Some crawlers limit themselves to one page every x seconds per site instead of crawling at maximum speed
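A minimal sketch of such a policy, assuming a fixed per-host delay (class name and delay value are illustrative); timestamps are passed in explicitly so the logic stays testable:

```python
# Politeness sketch: enforce a minimum delay between requests per host.
class PoliteScheduler:
    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_fetch = {}  # host -> timestamp of last request

    def wait_time(self, host, now):
        """Seconds to wait before this host may be fetched again."""
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record_fetch(self, host, now):
        self.last_fetch[host] = now

sched = PoliteScheduler(delay=2.0)
sched.record_fetch("example.com", now=10.0)
print(sched.wait_time("example.com", now=11.0))  # -> 1.0
print(sched.wait_time("other.org", now=11.0))    # -> 0.0
```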
28
Q

Rate limits

A

A limit on the number of times a single IP address is allowed to access a server
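From the server's side, this can be sketched as a fixed-window counter per IP address (limit and window values below are illustrative):

```python
# Rate-limit sketch: allow each IP at most `limit` requests per window.
from collections import defaultdict

class RateLimiter:
    def __init__(self, limit=100, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(list)  # ip -> request timestamps

    def allow(self, ip, now):
        """Keep only recent timestamps, then admit or reject the request."""
        recent = [t for t in self.hits[ip] if now - t < self.window]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False
        self.hits[ip].append(now)
        return True

rl = RateLimiter(limit=2, window=60.0)
print(rl.allow("1.2.3.4", now=0.0))  # -> True
print(rl.allow("1.2.3.4", now=1.0))  # -> True
print(rl.allow("1.2.3.4", now=2.0))  # -> False
```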
29
Q

Deep Web

A

Private sites (accounts), form results (bus/flight timetables) and scripted pages (JavaScript)

Difficult for a crawler to find
30
Q

Conversion Problem

A

Text documents are stored in incompatible formats: PDF, raw text, RTF, HTML, XML and others

Sometimes in PPT/Excel documents or obsolete formats
31
Q

Big Table

A

A distributed database used internally at Google, built for storing web pages

The big table is split into tablets served by thousands of machines; any changes are recorded in a transaction log

If a tablet server crashes, another server can read the data from the transaction log and take over
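The recovery idea behind the transaction log can be sketched in a few lines: every change is appended to a log, so a replacement server can rebuild a tablet's state by replaying it. This is grossly simplified and purely illustrative:

```python
# Log-replay sketch: rebuild key-value state from an append-only log.
def replay(log):
    """log: list of (key, value) changes in order. Returns final state."""
    state = {}
    for key, value in log:
        state[key] = value  # later entries overwrite earlier ones
    return state

log = [("row1", "v1"), ("row2", "v1"), ("row1", "v2")]
print(replay(log))  # -> {'row1': 'v2', 'row2': 'v1'}
```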