01 Web search crawler Flashcards

1
Q

what is web crawling

A

the process of locating, fetching, and storing pages available on the web

computer programs that perform this task are referred to as:
- crawler
- spider
- harvester
- robot

2
Q

what is a web crawler repository

A
  • a cache for the online content
  • provides quick access to physical copies of pages
  • speeds up the indexing process
3
Q

what is the fundamental assumption of web crawling

A
  • the web is well linked
    crawlers exploit the hyperlink structure to discover new pages
4
Q

basic web crawling process

A
  • initialise the URL download queue (URL frontier) with some seed URLs
  • repeat:
    - fetch the content of the next URL from the queue
    - store the fetched content in the repository
    - extract hyperlinks from the content
    - add newly discovered links to the download queue (see the sketch below)
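
A minimal sketch of this loop in Python; the fetching and link-extraction details (urlopen, a tiny HTMLParser subclass, the page limit) are illustrative choices, not prescribed by the card:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)      # URL frontier initialised with seed URLs
    seen = set(seeds)            # URLs already queued or downloaded
    repository = {}              # url -> fetched content
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()                 # fetch next URL from the queue
        try:
            content = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                             # skip unreachable or malformed pages
        repository[url] = content                # store fetched content in the repository
        parser = LinkParser()
        parser.feed(content)                     # extract hyperlinks from the content
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:             # add only newly discovered links
                seen.add(absolute)
                frontier.append(absolute)
    return repository
```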
5
Q

what are the crawler requirements

A

scalability
- distribute the crawl and increase the crawl rate by adding machines

robustness
- avoid spam and spider traps

selection
- cannot index everything, so how to select what to crawl

duplicates
- integrate duplication detection

politeness
- avoid overloading crawled sites

freshness
- refresh crawled content

6
Q

what are the crawling challenges

A
  1. how to distribute crawling
  2. how to make the best use of resources
  3. how deep should the site be crawled
  4. how often should we crawl
7
Q

during crawling, the pages of the web can be divided into 3 sets

A
  1. downloaded
  2. discovered (found but not yet downloaded)
  3. undiscovered
8
Q

what is robots.txt

A

explicit politeness: robots.txt advises web crawlers on which parts of the site are accessible

implicit politeness:
even without any specification, avoid hitting the same site too often
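
A small sketch of an explicit-politeness check using Python's standard urllib.robotparser; the host, user-agent name, and paths are placeholders:

```python
import time
from urllib.robotparser import RobotFileParser

# Parse the site's robots.txt and check whether a given URL may be fetched.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # example.com is a placeholder host
rp.read()

if rp.can_fetch("MyCrawler", "https://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")

# Implicit politeness: even when robots.txt allows a page,
# wait between requests to the same host.
time.sleep(1)
```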

9
Q

spider traps

A

crawlers need to avoid
- ill-formed HTML
- misleading / hostile sites
- spam

10
Q

solutions to spider traps

A
  1. no automatic technique can be foolproof
  2. check URL length (trap URLs are often extremely long)
  3. trap guards (see the sketch below)
    - prepare crawl statistics
    - add a blacklist to the guard module
    - eliminate URLs with non-textual data types
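
A possible trap-guard filter along these lines; the length threshold, blacklist, and extension list are illustrative assumptions:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 256                      # assumed threshold, tuned from crawl statistics
BLACKLISTED_HOSTS = {"spam.example.com"}  # hypothetical blacklist fed to the guard module
NON_TEXTUAL_EXTENSIONS = (".jpg", ".png", ".gif", ".zip", ".exe", ".mp4")

def passes_trap_guard(url: str) -> bool:
    """Return True if the URL should be allowed into the frontier."""
    if len(url) > MAX_URL_LENGTH:             # suspiciously long URLs often come from traps
        return False
    parsed = urlparse(url)
    if parsed.hostname in BLACKLISTED_HOSTS:  # blacklisted sites are skipped entirely
        return False
    if parsed.path.lower().endswith(NON_TEXTUAL_EXTENSIONS):  # drop non-textual content
        return False
    return True
```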
11
Q

duplication detection

A

if a page is already in the index, avoid wasting resources on it again

exact duplicates:
easy to eliminate using hashing

near duplicates:
more difficult to eliminate
identified using document fingerprints or shingles
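
One way to sketch both checks in Python; the shingle size (k = 4 words) and the 0.9 near-duplicate threshold mentioned in the comment are assumed values, not from the card:

```python
import hashlib

def exact_fingerprint(text: str) -> str:
    """Exact-duplicate check: identical pages share the same hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    """k-word shingles of a document (k = 4 is an assumed choice)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Near-duplicate check: overlap of the two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two pages could be treated as near duplicates when their shingle sets
# overlap above some threshold, e.g. jaccard(...) > 0.9 (assumed value).
```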

12
Q

key crawling components

A
  1. URL frontier: queue of URLs to be crawled
  2. seen URLs: set of links already crawled or queued
  3. fetcher: downloads the URL content
  4. parser: extracts outgoing links
  5. URL filtering: filters out URLs such as images
  6. content-seen filtering: eliminates duplicate pages (see the sketch below)
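
These components could be wired together roughly like this; all function names (fetch_page, extract_links, url_filter) are hypothetical placeholders standing in for the fetcher, parser, and URL filter described above:

```python
import hashlib
from collections import deque

def process_url(url, frontier: deque, seen_urls: set, seen_content: set,
                fetch_page, extract_links, url_filter):
    """One pass of the pipeline: fetch -> content-seen filter -> parse -> URL filter -> frontier."""
    page = fetch_page(url)                            # fetcher
    if page is None:
        return
    fingerprint = hashlib.sha256(page.encode("utf-8")).hexdigest()
    if fingerprint in seen_content:                   # content-seen filtering
        return
    seen_content.add(fingerprint)
    for link in extract_links(page):                  # parser
        if not url_filter(link):                      # URL filtering (e.g. drop images)
            continue
        if link not in seen_urls:                     # seen-URL check
            seen_urls.add(link)
            frontier.append(link)                     # URL frontier
```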
13
Q

URL prioritisation

A

2 queues for URLs (a prioritised discovery queue is sketched below)

1. discovery queue
- random
- breadth first
- in-degree (pages with more links pointing to them first)
- page rank

2. refreshing queue
- random
- age (older pages first)
- page rank
- user feedback
- longevity (how often the page is updated)
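
A minimal sketch of a prioritised discovery queue using a heap, assuming some priority score (e.g. in-degree or a PageRank estimate) is available for each URL; the URLs and scores here are made up:

```python
import heapq

class PriorityFrontier:
    """Discovery queue that pops the URL with the highest priority score first."""
    def __init__(self):
        self._heap = []
        self._counter = 0                  # tie-breaker keeps insertion order stable

    def push(self, url, score):
        # heapq is a min-heap, so negate the score to pop the largest first
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = PriorityFrontier()
frontier.push("https://example.com/a", score=3)    # e.g. in-degree 3
frontier.push("https://example.com/b", score=10)   # e.g. in-degree 10
print(frontier.pop())                              # -> https://example.com/b
```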
14
Q

breadth first search

A

append new URLs to the end of the queue

FIFO

requires memory of all the nodes on the previous level
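
As a sketch, a FIFO frontier with Python's deque (the URLs are placeholders):

```python
from collections import deque

frontier = deque(["https://example.com/seed"])
frontier.append("https://example.com/page1")   # new URLs go to the end of the queue
frontier.append("https://example.com/page2")
next_url = frontier.popleft()                  # FIFO: the oldest URL is crawled first
print(next_url)                                # -> https://example.com/seed
```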

15
Q

depth first search

A

append new URLs to the start of the queue

LIFO

requires memory of only depth × branching factor, but may go too deep
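
The same deque used as a LIFO stack gives depth-first order (again with placeholder URLs):

```python
from collections import deque

frontier = deque(["https://example.com/seed"])
frontier.appendleft("https://example.com/page1")  # new URLs go to the start of the queue
frontier.appendleft("https://example.com/page2")
next_url = frontier.popleft()                     # LIFO: the newest URL is crawled first
print(next_url)                                   # -> https://example.com/page2
```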

16
Q

what are the crawling metrics

A

Quality metrics
1. coverage
2. freshness
3. page importance

Performance metrics
1. throughput: content download rate

17
Q

mirror sites

A

replicas of existing sites
can lead to redundant crawling
can be detected using:
- url similarity
- link structure
- content similarity

18
Q

geographically distributed webcrawling

A
  1. higher crawling throughput
    - proximity to web servers
    - lower crawling latency
  2. improved politeness
    - less overhead on routers because of fewer hops
  3. better coverage
  4. increased availability
19
Q

why are data structures important

A

they determine the efficiency of the web crawler
the seen-URL table keeps growing as new URLs are discovered
this creates high space requirements
so frequently seen URLs are cached in memory
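
A possible shape for such a seen-URL table, keeping a bounded in-memory LRU cache of frequent URLs in front of a larger store; the plain Python set here is only a stand-in for a real on-disk structure:

```python
from collections import OrderedDict

class SeenUrlTable:
    """Seen-URL check with a small LRU cache of frequent URLs in front of a bigger store."""
    def __init__(self, cache_size=1000):
        self._cache = OrderedDict()   # in-memory LRU cache of frequently seen URLs
        self._cache_size = cache_size
        self._disk_store = set()      # stand-in for an on-disk seen-URL table

    def add(self, url):
        self._disk_store.add(url)
        self._touch(url)

    def contains(self, url):
        if url in self._cache:              # fast path: answered from memory
            self._touch(url)
            return True
        found = url in self._disk_store     # slow path: a disk lookup in practice
        if found:
            self._touch(url)
        return found

    def _touch(self, url):
        self._cache[url] = True
        self._cache.move_to_end(url)
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used URL
```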