Spiders and crawlers Flashcards

(31 cards)

1
Q

Information Retrieval

A

A field concerned with the structure, analysis, organisation, storage, and retrieval of information.

2
Q

Goals of Search Engines

A

Effectiveness - quality - retrieve the most relevant set of documents possible

Efficiency - speed - process users' queries and return results as quickly as possible

3
Q

Indexing Process

A

Text acquisition, transformation and index creation

4
Q

Text acquisition

A

Identify and acquire new documents, then store them in a document data store

5
Q

Text transformation

A

Remove duplicates from the store, classify information and organise data

6
Q

Index creation

A

Create an index to quickly locate information - stored in an index database that can be queried by the user
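A minimal sketch of the idea behind index creation - an inverted index mapping each term to the documents that contain it. The data and function name are illustrative, not part of any real engine:

```python
# Inverted-index sketch: term -> set of IDs of documents containing it.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "spiders crawl the web", 2: "the web is large"}
index = build_index(docs)
print(sorted(index["web"]))  # -> [1, 2]
```

Looking up a query term is then a single dictionary access instead of a scan over every document.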

7
Q

Query Process

A

Interaction, evaluation and log data

8
Q

Interaction

A

The user submits a query; a ranking algorithm is run over the index, and the document data store is used to display the most relevant results
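As an illustration of the ranking step, here is a toy scorer that ranks documents by how many query terms they contain - a crude stand-in for a real ranking algorithm:

```python
# Toy ranking sketch: score each document by the count of its tokens
# that appear in the query, then sort document IDs by score.
def rank(query, docs):
    q_terms = set(query.lower().split())
    scores = {doc_id: sum(t in q_terms for t in text.lower().split())
              for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "spiders crawl the web", 2: "cooking recipes"}
print(rank("web spiders", docs))  # -> [1, 2]
```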

9
Q

Evaluation

A

Find out how relevant the results are, e.g. ask the user to rate the relevance of the results so the algorithm can be updated
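One common evaluation measure is precision - the fraction of retrieved documents the user judged relevant. A small sketch with hypothetical data:

```python
# Precision sketch: relevant retrieved / total retrieved.
def precision(retrieved, relevant):
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

retrieved = [1, 2, 3, 4]
relevant = {1, 3}  # hypothetical user relevance judgements
print(precision(retrieved, relevant))  # -> 0.5
```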

10
Q

Log data

A

Log user data and update algorithm

11
Q

Types of text acquisition

A

Crawler and feed

12
Q

Crawler

A

Web crawler is the most common method of text acquisition: it opens a web page, extracts its links and follows them to discover new pages
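The link-extraction step can be sketched with the standard library alone. Fetching the page over the network (e.g. with urllib) is omitted so the example runs offline on a hard-coded page:

```python
# Extract the href targets of <a> tags from an HTML page.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="http://example.com">Ex</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/about', 'http://example.com']
```

A real crawler would resolve relative links against the page's URL and add the unseen ones to a frontier queue.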

13
Q

Challenges faced by crawler

A

Pages are constantly updated, so crawlers must be run very frequently to keep up; they may not be able to handle huge volumes of new pages

Can only operate on a single website

14
Q

Focused crawler

A

Classification technique to determine whether a page is relevant or not; will not access pages that are deemed irrelevant
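A toy sketch of the gating idea: real focused crawlers use a trained classifier, but a keyword-overlap rule (topic terms and threshold below are purely illustrative) shows the shape of the decision:

```python
# Focused-crawler sketch: only crawl pages classified as on-topic.
TOPIC_TERMS = {"retrieval", "index", "search", "crawler"}

def is_relevant(page_text, threshold=2):
    """Treat a page as relevant if enough topic terms appear in it."""
    terms = set(page_text.lower().split())
    return len(terms & TOPIC_TERMS) >= threshold

print(is_relevant("search engines build an index of the web"))  # -> True
print(is_relevant("a recipe for tomato soup"))                  # -> False
```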

15
Q

Feed

A

Real time stream of documents e.g. a news feed

Search engine acquires new documents simply by monitoring the feed

16
Q

Feed Conversion

A

Documents in feed are rarely plain text

Search engines require them to be converted into consistent text + metadata

17
Q

Document Data Store

A

Database to manage large numbers of documents and structured data (metadata) associated with them

18
Q

Types of text transformation

A

Parser, stopping and stemming

19
Q

Parser

A

Processes a sequence of text tokens

Uses knowledge of syntax to identify structure of text/information
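The first step of parsing can be sketched as a simple tokeniser that turns raw text into the word-token stream the rest of the pipeline works on:

```python
# Tokeniser sketch: lowercase the text and pull out alphanumeric runs.
import re

def tokenise(text):
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenise("Spiders & crawlers: 2 kinds of robot."))
# -> ['spiders', 'crawlers', '2', 'kinds', 'of', 'robot']
```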

20
Q

Stopping

A

Removes common words from the stream of tokens e.g. ‘the’, ‘of’, ‘to’, ‘for’

Reduces size of index and does not affect quality
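Stopping is a simple filter over the token stream. The stopword list below is a tiny illustrative sample, not a real engine's list:

```python
# Stopping sketch: drop common function words from the token stream.
STOPWORDS = {"the", "of", "to", "for", "a", "is"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "structure", "of", "the", "web"]))
# -> ['structure', 'web']
```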

21
Q

Stemming

A

Group words together that derive from a common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’

May not be effective for all languages
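A crude suffix-stripping rule is enough to show the idea on the 'fish' example; real stemmers (e.g. the Porter stemmer) use many more rules:

```python
# Stemming sketch: strip a few common suffixes, keeping a stem of
# at least three characters.
def stem(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["fish", "fishes", "fishing"]])
# -> ['fish', 'fish', 'fish']
```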

22
Q

User interaction

A

Query input, query transformation and results output

23
Q

Query input

A

A small number of keywords supplied by the user to search the collection, e.g. a web search query

24
Q

Query transformation

A

Tokenisation, stopping and stemming must be applied to the query so its terms can be compared with the indexed document terms
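The three steps can be sketched as one small pipeline; the stopword list and the one-rule stemmer are purely illustrative:

```python
# Query-transformation sketch: apply the same tokenise/stop/stem steps
# to the query that were applied to documents at index time.
import re

STOPWORDS = {"the", "of", "to", "for"}

def transform_query(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())            # tokenise
    tokens = [t for t in tokens if t not in STOPWORDS]          # stopping
    return [t[:-1] if t.endswith("s") else t for t in tokens]   # crude stemming

print(transform_query("The habits of spiders"))  # -> ['habit', 'spider']
```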

25
Q

Results output

A

Construct a display of the ranked documents - snippets of documents, important words/passages etc.
26
Q

Logging

A

One of the most valuable sources for tuning and improving search engines

Ranking analysis uses log data to compare the effectiveness of ranking algorithms, e.g. through simulations
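One simple use of log data is comparing two ranking algorithms by click-through rate. The log records below are hypothetical:

```python
# Logging sketch: click-through rate from (algorithm, clicked) records.
def ctr(log):
    """log: list of (algorithm, clicked) pairs for one algorithm."""
    clicks = sum(1 for _, clicked in log if clicked)
    return clicks / len(log) if log else 0.0

log_a = [("A", True), ("A", False), ("A", True), ("A", True)]
log_b = [("B", True), ("B", False), ("B", False), ("B", False)]
print(ctr(log_a), ctr(log_b))  # -> 0.75 0.25
```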
27
Q

Politeness Policy

A

Some crawlers limit themselves to one page every x seconds per site instead of crawling at maximum speed
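A minimal sketch of such a policy, assuming a fixed per-host delay (class name and delay value are illustrative); timestamps are passed in explicitly so the logic stays testable:

```python
# Politeness sketch: enforce a minimum delay between requests per host.
class PoliteScheduler:
    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_fetch = {}  # host -> timestamp of last request

    def wait_time(self, host, now):
        """Seconds to wait before this host may be fetched again."""
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record_fetch(self, host, now):
        self.last_fetch[host] = now

sched = PoliteScheduler(delay=2.0)
sched.record_fetch("example.com", now=10.0)
print(sched.wait_time("example.com", now=11.0))  # -> 1.0
print(sched.wait_time("other.org", now=11.0))    # -> 0.0
```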
28
Q

Rate limits

A

A limit on the number of times a single IP address is allowed to access a server
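From the server's side, this can be sketched as a fixed-window counter per IP address (limit and window values below are illustrative):

```python
# Rate-limit sketch: allow each IP at most `limit` requests per window.
from collections import defaultdict

class RateLimiter:
    def __init__(self, limit=100, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(list)  # ip -> request timestamps

    def allow(self, ip, now):
        """Keep only recent timestamps, then admit or reject the request."""
        recent = [t for t in self.hits[ip] if now - t < self.window]
        self.hits[ip] = recent
        if len(recent) >= self.limit:
            return False
        self.hits[ip].append(now)
        return True

rl = RateLimiter(limit=2, window=60.0)
print(rl.allow("1.2.3.4", now=0.0))  # -> True
print(rl.allow("1.2.3.4", now=1.0))  # -> True
print(rl.allow("1.2.3.4", now=2.0))  # -> False
```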
29
Q

Deep Web

A

Private sites (accounts), form results (bus/flight timetables) and scripted pages (JavaScript)

Difficult for a crawler to find
30
Q

Conversion Problem

A

Text documents are stored in incompatible formats: PDF, raw text, RTF, HTML, XML and others

Sometimes in PPT/Excel documents or obsolete formats
31
Q

Big Table

A

A distributed database used internally at Google, built for storing web pages

The big table is split into tablets served by thousands of machines; any changes are recorded in a transaction log

If a tablet server crashes, another server can read the data from the transaction log and take over
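The recovery idea behind the transaction log can be sketched in a few lines: every change is appended to a log, so a replacement server can rebuild a tablet's state by replaying it. This is grossly simplified and purely illustrative:

```python
# Log-replay sketch: rebuild key-value state from an append-only log.
def replay(log):
    """log: list of (key, value) changes in order. Returns final state."""
    state = {}
    for key, value in log:
        state[key] = value  # later entries overwrite earlier ones
    return state

log = [("row1", "v1"), ("row2", "v1"), ("row1", "v2")]
print(replay(log))  # -> {'row1': 'v2', 'row2': 'v1'}
```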