Spiders and crawlers Flashcards
(31 cards)
Information Retrieval
A field concerned with the structure, analysis, organisation, storage, and retrieval of information.
Goals of Search Engines
Effectiveness - quality - retrieve the most relevant set of documents possible
Efficiency - speed - process queries from users as quickly as possible
Indexing Process
Text acquisition, transformation and index creation
Text acquisition
Store parsed data into a document data store
Text transformation
Remove duplicates from the store, classify information and organise data
Index creation
Create an index to quickly locate information - stored in an index database that can be queried by user
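As a rough illustration of index creation, the sketch below builds a tiny inverted index that maps each term to the documents containing it. The function name build_inverted_index and the sample documents are made up for this example; a real search engine would also store positions and weights and would tokenise more carefully.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation: lowercase and split on whitespace (an assumption).
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents keyed by document id.
docs = {
    1: "tropical fish tank",
    2: "fish and chips",
    3: "tank maintenance",
}
index = build_inverted_index(docs)
print(index["fish"])  # [1, 2]
print(index["tank"])  # [1, 3]
```

Looking up a term is then a single dictionary access rather than a scan over every document.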
Query Process
Interaction, evaluation and log data
Interaction
User submits a query to the search engine, which runs a ranking algorithm over the index and displays the most relevant results from the document data store
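A minimal sketch of ranking from an index, assuming the simplest possible scoring rule: a document's score is the number of query terms it contains. The rank function and the sample index are hypothetical, and real ranking algorithms use far richer scoring.

```python
from collections import Counter


def rank(query, index):
    """Return document ids ordered by how many query terms they contain."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical inverted index: term -> ids of documents containing it.
index = {"fish": [1, 2], "tank": [1, 3], "chips": [2]}
results = rank("fish tank", index)
print(results)  # [1, 2, 3]
```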
Evaluation
Measure how relevant the results are, e.g. ask users to rate the results so the ranking algorithm can be updated
Log data
Log user data and update algorithm
Types of text acquisition
Crawler and feed
Crawler
A web crawler is the most common method of text acquisition: it opens a web page, looks for links and follows them to find further pages
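The open-a-page-and-look-for-links loop can be sketched as below. To keep the example runnable offline, pages come from a made-up in-memory dictionary (FAKE_WEB) instead of the network; the names LinkExtractor, extract_links and crawl are illustrative, not taken from any particular crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) returns HTML, or None if unavailable."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical three-page site standing in for real HTTP requests.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">home</a>',
}
order = crawl("https://example.com/", FAKE_WEB.get)
print(order)
```

The seen set is what stops the crawler from looping forever when pages link back to each other.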
Challenges faced by crawler
Pages are constantly updated, so crawlers must be run very frequently to keep up; may not be able to handle huge volumes of new pages
Can only operate on a single website
Focused crawler
Uses a classification technique to determine whether a page is relevant; does not access pages that are deemed irrelevant
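A rough sketch of the relevance decision, assuming a very crude stand-in for a trained classifier: a page counts as relevant when enough of its words belong to a topic vocabulary. The names is_relevant and focused_filter, the threshold and the sample pages are all hypothetical.

```python
def is_relevant(text, topic_terms, threshold=0.1):
    """Return True if the share of topic words in the text meets the threshold."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for word in words if word in topic_terms)
    return hits / len(words) >= threshold


def focused_filter(pages, topic_terms):
    """Keep only the URLs whose text is classified as relevant."""
    return [url for url, text in pages.items() if is_relevant(text, topic_terms)]


# Hypothetical pages (URL -> extracted text) and topic vocabulary.
pages = {
    "https://example.com/fish": "tropical fish need warm water",
    "https://example.com/stocks": "stock market news today",
}
topic = {"fish", "aquarium", "tropical"}
kept = focused_filter(pages, topic)
print(kept)  # ['https://example.com/fish']
```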
Feed
Real time stream of documents e.g. a news feed
Search engine acquires new documents simply by monitoring the feed
Feed Conversion
Documents in feed are rarely plain text
Search engines require them to be converted into consistent text + metadata
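As an illustration of feed conversion, the sketch below turns items from an RSS-style feed into plain text plus metadata. The feed content and the names convert_feed and strip_markup are invented for the example, and it assumes item descriptions hold escaped HTML, which is common but not guaranteed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(fragment):
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


def convert_feed(rss_xml):
    """Return one {'text': ..., 'metadata': {...}} record per feed item."""
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "text": strip_markup(item.findtext("description", default="")),
            "metadata": {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            },
        })
    return docs


# Hypothetical feed with a single item.
RSS = """<rss version="2.0"><channel>
<item>
  <title>Storm warning</title>
  <link>https://example.com/storm</link>
  <description>&lt;p&gt;Heavy &lt;b&gt;rain&lt;/b&gt; expected.&lt;/p&gt;</description>
</item>
</channel></rss>"""
docs = convert_feed(RSS)
print(docs[0]["text"])  # Heavy rain expected.
```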
Document Data Store
Database to manage large numbers of documents and structured data (metadata) associated with them
Types of text transformation
Parser, stopping and stemming
Parser
Processes a sequence of text tokens
Uses knowledge of syntax to identify structure of text/information
Stopping
Removes common words from the stream of tokens e.g. ‘the’, ‘of’, ‘to’, ‘for’
Reduces size of index and usually has little effect on quality
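Stopping can be sketched as a simple filter over the token stream. The stopword list here is a tiny illustrative sample, not a standard list, and remove_stopwords is a made-up name.

```python
# Small sample stopword list (an assumption for this example).
STOPWORDS = {"the", "of", "to", "for", "a", "and", "in"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "the history of the internet for beginners".split()
kept = remove_stopwords(tokens)
print(kept)  # ['history', 'internet', 'beginners']
```

Here seven tokens shrink to three, which is where the saving in index size comes from.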
Stemming
Group words together that derive from a common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’
May not be effective for all languages
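A deliberately naive suffix-stripping sketch shows the idea on the example above. This is not the Porter stemmer or any other published algorithm; the suffix list, the minimum stem length and the name naive_stem are assumptions chosen to keep the example short.

```python
# Suffixes are tried in order, longest first (illustrative list only).
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["fish", "fishes", "fishing"]]
print(stems)  # ['fish', 'fish', 'fish']
```

Rules this simple also make mistakes: naive_stem("news") returns "new", which merges two unrelated words.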
User interaction
Query input, query transformation and results output
Query input
A small number of keywords used to query the document collection, e.g. a web query
Query transformation
The same tokenisation, stopping and stemming applied to documents must be applied to the query so that query terms can be compared with document terms
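The sketch below shows why the query needs the same treatment as the documents: both pass through one transform function before terms are compared. The stopword list, the crude stemmer and the names transform and matches are illustrative assumptions, not a standard pipeline.

```python
import re

# Sample stopword list (an assumption for this example).
STOPWORDS = {"the", "of", "to", "for", "a", "in"}


def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def transform(text):
    """Tokenise, remove stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def matches(query, document):
    """True if every transformed query term occurs in the transformed document."""
    doc_terms = set(transform(document))
    return all(term in doc_terms for term in transform(query))


document = "A guide to keeping tropical fishes in the home"
query = "Fishing for tropical fish"
print(transform(query))          # ['fish', 'tropical', 'fish']
print(matches(query, document))  # True
```

Without the shared transformation, "Fishing" in the query would never match "fishes" in the document.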