sdi 5 Flashcards
tweet search api call
search(api_dev_key, search_terms, maximum_results_to_return, sort, page_token)
sort (number): Optional sort mode: Latest first (0 - default), Best matched (1), Most liked (2).
page_token (string): This token will specify a page in the result set that should be returned.
Returns: A JSON containing info about a list of tweets matching the search. Each result can have the user ID & name, tweet text, tweet ID, creation time, number of likes, etc.
tweet search high level design
we need to store all the statuses (tweets) in a database and also build an index that keeps track of which word appears in which tweet. This index will help us quickly find the tweets that users are searching for.
tweet search indexing
- the index should tell us which word occurs in which tweet object. 300K English words + 200K nouns, at an average of 5 characters per word, means 500K x 5 bytes = 2.5 MB of memory to store all the words.
- perhaps we only want to keep the index in memory for tweets from the past 2 years
- our index would be like a big distributed hash table, where the ‘key’ is the word and the ‘value’ is a list of TweetIDs of all the tweets that contain that word.
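A minimal single-node sketch of this hash-table index (class/method names and the whitespace tokenization are simplifications, not the actual design):

    from collections import defaultdict

    class InvertedIndex:
        def __init__(self):
            # 'key' = word, 'value' = set of TweetIDs containing that word
            self.index = defaultdict(set)

        def add_tweet(self, tweet_id, text):
            for word in text.lower().split():
                self.index[word].add(tweet_id)

        def search(self, word):
            return self.index.get(word.lower(), set())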
tweet search data sharding
- we can shard based on words (take a hash of each word); to find all tweets containing a specific word, we only have to query that one server. Issues: hot words (that server gets a lot of queries) and words that appear in very many tweets (non-uniform distribution of words across shards)
- alternatively, take a hash of the TweetID and index all the words of that tweet on the corresponding server. While querying for a word, we query all the servers, and each server returns a set of TweetIDs. A centralized server aggregates these results to return to the user.
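A rough sketch of the TweetID-based sharding with scatter-gather querying, reusing the InvertedIndex sketch above as a local stand-in for each remote index server (helper names are illustrative; a real system would use RPC calls):

    import hashlib

    def shard_for(tweet_id, num_shards):
        # stable hash of the TweetID picks the index server holding this tweet's words
        return int(hashlib.md5(str(tweet_id).encode()).hexdigest(), 16) % num_shards

    def index_tweet(shards, tweet_id, text):
        shards[shard_for(tweet_id, len(shards))].add_tweet(tweet_id, text)

    def search_word(shards, word):
        # scatter the query to every shard, then aggregate (union) the TweetIDs
        results = set()
        for shard in shards:
            results |= shard.search(word)
        return results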
tweet search - What if both primary and secondary servers die at the same time?
We have to allocate a new server and rebuild the same index on it.
Our Index-Builder server can hold a hash table where the ‘key’ is the index server number and the ‘value’ is a HashSet containing all the TweetIDs kept at that index server.
Whenever an index server has to rebuild itself, it can ask the Index-Builder server for all the tweets it needs to store and then fetch those tweets to build the index. This approach will surely be fast. We should also have a replica of the Index-Builder server for fault tolerance.
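A small sketch of the Index-Builder bookkeeping described above (class and method names are assumptions):

    from collections import defaultdict

    class IndexBuilder:
        def __init__(self):
            # 'key' = index server number, 'value' = set of TweetIDs on that server
            self.server_tweets = defaultdict(set)

        def record(self, server_id, tweet_id):
            self.server_tweets[server_id].add(tweet_id)

        def tweets_for(self, server_id):
            # a recovering index server asks for exactly the TweetIDs it must re-index
            return self.server_tweets[server_id]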
- avg size of an HTML page? 100 KB
- avg size of metadata for each page? 500 bytes
how do search engines use web crawlers?
Web crawlers systematically browse webpages to learn what each page on the website is about, so this information can be indexed, updated and retrieved when a user makes a search query.
web crawler high level design
it takes a list of seed URLs as its input, then (a loop sketch follows the list):
1. Pick a URL from the unvisited URL list.
2. Establish a connection to host to download the corresponding document.
3. Parse the document contents to look for new URLs.
4. Add the new URLs to the list of unvisited URLs.
5. Process the downloaded document, e.g., store it or index its contents, etc.
6. Repeat from step 1.
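A minimal single-threaded sketch of this loop using only the Python standard library (the regex link extraction is a crude stand-in for a real HTML parser, and a real crawler would distribute this across many workers):

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)          # unvisited URL list
        visited = set()
        pages = {}                           # url -> html, stand-in for a datastore
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()         # 1. pick a URL from the unvisited list
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")  # 2. download
            except Exception:
                continue
            for link in re.findall(r'href="(.*?)"', html):    # 3. parse for new URLs
                frontier.append(urljoin(url, link))           # 4. add them to the unvisited list
            pages[url] = html                # 5. process/store the downloaded document
        return pages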
web crawler BFS vs DFS
usually BFS. However, DFS is used sometimes; e.g., if the crawler has already established a connection with a website, it might just DFS all the URLs within that website to save some handshaking overhead.
Path-ascending crawling:
Path-ascending crawling can help discover isolated resources or resources for which no inbound link would have been found w/ regular crawling of a particular site. In this scheme, a crawler ascends to every path in each URL that it intends to crawl. Ex: when given a seed URL of http://foo.com/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.
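A small sketch of generating the ancestor paths for a URL (function name is illustrative):

    from urllib.parse import urlparse

    def ancestor_paths(url):
        parsed = urlparse(url)
        # directory components between the host and the page itself
        parts = parsed.path.rstrip('/').split('/')[1:-1]
        paths = []
        for i in range(len(parts), -1, -1):
            paths.append(f"{parsed.scheme}://{parsed.netloc}/" + "/".join(parts[:i]) + ("/" if i else ""))
        return paths

    # ancestor_paths("http://foo.com/a/b/page.html")
    # -> ['http://foo.com/a/b/', 'http://foo.com/a/', 'http://foo.com/']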
There are two important characteristics of the Web that make Web crawling a very difficult task:
- Large volume of Web pages: the crawler can only download a fraction of the web pages at any time, so it is critical that it is intelligent enough to prioritize its downloads.
- Web pages change very frequently: by the time the crawler is downloading the last page from a site, an earlier page may have changed, or a new page may have been added.
A bare minimum crawler needs at least these components:
- URL frontier: stores the list of URLs to download and prioritizes which URLs should be crawled first.
- HTTP Fetcher: To retrieve a web page from the server.
- Extractor: To extract links from HTML documents.
- Duplicate Eliminator: makes sure the same content is not extracted twice.
- Datastore: To store retrieved pages, URLs, and other metadata.
web crawler - URL frontier
contains all the URLs that remain to be downloaded.
Since we have a huge list of URLs to crawl, we can distribute our URL frontier onto multiple servers. Let’s assume:
- on each server we have multiple worker threads performing the crawling tasks.
- we have a hash function mapping each URL to the server responsible for crawling it.
keep in mind politeness requirements.
web crawler - politeness requirements
- Our crawler should not overload a server by downloading a lot of pages from it.
- We should not have multiple machines connecting to a web server.
our crawler can have a collection of distinct FIFO sub-queues on each server. Each worker thread has its own sub-queue. When a new URL needs to be added, the sub-queue in which it is placed will be determined by the URL’s canonical hostname, using a hash function.
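A sketch of the hostname-to-sub-queue assignment (the worker count and data structures are illustrative):

    import hashlib
    from collections import deque
    from urllib.parse import urlparse

    NUM_WORKERS = 16                                      # illustrative worker count
    sub_queues = [deque() for _ in range(NUM_WORKERS)]    # one FIFO per worker thread

    def enqueue(url):
        # hash the canonical hostname so every URL of a host lands in the same
        # sub-queue, keeping a single worker (and connection) per web server
        host = urlparse(url).netloc.lower()
        idx = int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_WORKERS
        sub_queues[idx].append(url)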
how do we handle the large size of our url frontier?
The size would be in the hundreds of millions of URLs. Hence,
we need to store our URLs on a disk.
- Enqueue buffer, once filled, will be dumped to the disk.
- Dequeue buffer will keep a cache of URLs that need to be visited; it can periodically read from the disk to refill itself (a rough sketch follows).
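Roughly, the buffering could look like this (a plain Python list stands in for the on-disk store to keep the sketch self-contained; buffer size is illustrative):

    class BufferedFrontier:
        def __init__(self, buffer_size=10000):
            self.disk = []                 # stand-in for URLs persisted on disk
            self.enqueue_buf = []          # URLs waiting to be flushed to disk
            self.dequeue_buf = []          # cached URLs waiting to be visited
            self.buffer_size = buffer_size

        def add(self, url):
            self.enqueue_buf.append(url)
            if len(self.enqueue_buf) >= self.buffer_size:
                self.disk.extend(self.enqueue_buf)         # dump the full buffer
                self.enqueue_buf = []

        def next(self):
            if not self.dequeue_buf:                       # periodically refill from disk
                self.dequeue_buf = self.disk[: self.buffer_size]
                del self.disk[: self.buffer_size]
            if not self.dequeue_buf and self.enqueue_buf:  # fall back to unflushed URLs
                self.dequeue_buf, self.enqueue_buf = self.enqueue_buf, []
            return self.dequeue_buf.pop(0) if self.dequeue_buf else None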
web crawler - fetcher
to download the document corresponding to a given URL using the appropriate network protocol like HTTP.
When dealing w/ Robots Exclusion: to avoid downloading robots.txt on every request, our fetcher module can maintain a cache mapping hostnames to their robots exclusion rules.
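A sketch of such a cache using Python's standard urllib.robotparser (the cache shape and function name are assumptions):

    from urllib import robotparser
    from urllib.parse import urlparse

    robots_cache = {}   # hostname -> parsed robots.txt rules

    def allowed(url, user_agent="*"):
        host = urlparse(url).netloc
        if host not in robots_cache:
            rp = robotparser.RobotFileParser(f"http://{host}/robots.txt")
            rp.read()                        # fetch robots.txt once per host
            robots_cache[host] = rp
        return robots_cache[host].can_fetch(user_agent, url)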
RobotsExclusion
Courteous Web crawlers implement the Robots Exclusion Protocol, which allows webmasters to declare parts of their sites off limits to crawlers. The protocol requires a Web crawler to fetch a special document called robots.txt, containing these declarations, from a Web site before downloading any real content from it.
web crawler - Document input stream
The same document is processed by multiple processing modules, so to avoid downloading a document multiple times, we cache the document locally using an abstraction called a Document Input Stream (DIS). It caches the entire contents of the document read from the internet and provides methods to re-read it. Each worker thread has an associated DIS, which it reuses from document to document. After extracting a URL from the frontier, the worker passes that URL to the relevant protocol module, which initializes the DIS from a network connection to contain the document’s contents. The worker then passes the DIS to all relevant processing modules.
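A minimal sketch of a DIS-like abstraction using an in-memory byte buffer (names are illustrative):

    import io

    class DocumentInputStream:
        def __init__(self, raw_bytes=b""):
            self.buffer = io.BytesIO(raw_bytes)

        def init_from(self, raw_bytes):
            # reused from document to document by the same worker thread
            self.buffer = io.BytesIO(raw_bytes)

        def reread(self):
            self.buffer.seek(0)              # rewind so the next module can re-read
            return self.buffer.read()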
web crawler Document Dedupe test
Many documents on the Web are available under multiple URLs.
This would cause the crawler to download the same document multiple times. To prevent processing a document more than once, we perform a dedupe test on each document. We can calculate a 64-bit checksum of every processed document (e.g., the lower 64 bits of an MD5 or SHA digest) and store it in a database (hash set). For every new document, we compare its checksum against that set.
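A sketch of the dedupe test with an in-memory set standing in for the checksum database:

    import hashlib

    seen_checksums = set()   # stand-in for the checksum database

    def is_duplicate(document_bytes):
        # truncate the MD5 digest to 8 bytes to get a 64-bit checksum
        checksum = hashlib.md5(document_bytes).digest()[:8]
        if checksum in seen_checksums:
            return True
        seen_checksums.add(checksum)
        return False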
web crawler - URL filtering mechanism
provides a customizable way to control the set of URLs
that are downloaded. This is used to blacklist websites so that our crawler can ignore them. Before
adding each URL to the frontier, the worker thread consults the user-supplied URL filter. We can define
filters to restrict URLs by domain, prefix, or protocol type.
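A sketch of such a filter (the blacklist entries are illustrative):

    from urllib.parse import urlparse

    BLACKLISTED_DOMAINS = {"spam.example"}   # illustrative blacklist entry
    ALLOWED_SCHEMES = {"http", "https"}      # restrict by protocol type

    def url_passes_filter(url):
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES:
            return False
        if parsed.netloc.lower() in BLACKLISTED_DOMAINS:
            return False
        return True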
web crawler - domain name cache
Before contacting a Web server, a crawler must use the Domain
Name Service (DNS) to map the Web server’s hostname into an IP address. DNS name resolution will
be a big bottleneck given the number of URLs we will be working with. To avoid repeated requests, we can cache DNS results by building a local DNS server.
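A sketch of a local DNS cache (no TTL or expiry handling, which a real resolver cache would need):

    import socket

    dns_cache = {}   # hostname -> IP address

    def resolve(hostname):
        # keep results locally so repeated lookups skip the DNS round trip
        if hostname not in dns_cache:
            dns_cache[hostname] = socket.gethostbyname(hostname)
        return dns_cache[hostname]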
web crawler - URL dedupe test
While extracting links, any Web crawler will encounter multiple links to the same document. The implementation is the same as the document dedupe test, but with checksums of URLs instead of document contents.
web crawler - checkpointing
A crawl of the entire Web takes weeks to complete. To guard against failures, our
crawler can write regular snapshots of its state to the disk. An interrupted or aborted crawl can easily be
restarted from the latest checkpoint.
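A sketch of writing and restoring such a snapshot (the file name and the shape of the saved state are assumptions):

    import pickle

    def write_checkpoint(frontier, visited, path="crawler_checkpoint.pkl"):
        # dump a snapshot of the crawler's state to disk
        with open(path, "wb") as f:
            pickle.dump({"frontier": list(frontier), "visited": visited}, f)

    def load_checkpoint(path="crawler_checkpoint.pkl"):
        # restart an interrupted or aborted crawl from the latest snapshot
        with open(path, "rb") as f:
            return pickle.load(f)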
web crawler - data sharding
Our crawler will be dealing with three kinds of data: 1) URLs to visit 2) URL checksums for dedupe 3)
Document checksums for dedupe.
Since we are distributing URLs based on the hostname, we can store all of this data on the same host.
Since we will be using consistent hashing, we can assume that URLs will be redistributed from overloaded hosts. Each host will perform checkpointing periodically and dump a snapshot of all the data it is holding onto a remote server. This ensures that if a server dies, another server can take its place using that snapshot.