sdi 5 Flashcards
tweet search api call
search(api_dev_key, search_terms, maximum_results_to_return, sort, page_token)
sort (number): Optional sort mode: Latest first (0 - default), Best matched (1), Most liked (2).
page_token (string): This token will specify a page in the result set that should be returned.
Returns: A JSON containing info about a list of tweets matching the search. Each result can have the user ID & name, tweet text, tweet ID, creation time, number of likes, etc.
tweet search high level design
we need to store all the statuses (tweets) in a database and also build an index that keeps track of which word appears in which tweet. This index will help us quickly find the tweets that users are searching for.
tweet search indexing
- the index should tell us which word occurs in which tweet object. 300K English words + 200K nouns, at an average of 5 characters per word, means 500K x 5 bytes = 2.5 MB of memory to store all the words.
- perhaps we only want to keep the index in memory for tweets from the past 2 years
- our index would be like a big distributed hash table, where the ‘key’ is the word and the ‘value’ is a list of TweetIDs of all the tweets that contain that word.
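A minimal single-node sketch of this hash-table index (class/method names and the whitespace tokenization are simplifications, not the actual design):

    from collections import defaultdict

    class InvertedIndex:
        def __init__(self):
            # 'key' = word, 'value' = set of TweetIDs containing that word
            self.index = defaultdict(set)

        def add_tweet(self, tweet_id, text):
            for word in text.lower().split():
                self.index[word].add(tweet_id)

        def search(self, word):
            return self.index.get(word.lower(), set())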
tweet search data sharding
- we can shard based on words (take a hash of each word); to find all tweets containing a specific word, we only have to query that one server. Issues: hot words (that server gets a lot of queries) and words that appear in very many tweets (non-uniform distribution of words across shards)
- alternatively, take a hash of the TweetID and index all the words of that tweet on the corresponding server. While querying for a word, we query all the servers, and each server returns a set of TweetIDs. A centralized server aggregates these results to return to the user.
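A rough sketch of the TweetID-based sharding with scatter-gather querying, reusing the InvertedIndex sketch above as a local stand-in for each remote index server (helper names are illustrative; a real system would use RPC calls):

    import hashlib

    def shard_for(tweet_id, num_shards):
        # stable hash of the TweetID picks the index server holding this tweet's words
        return int(hashlib.md5(str(tweet_id).encode()).hexdigest(), 16) % num_shards

    def index_tweet(shards, tweet_id, text):
        shards[shard_for(tweet_id, len(shards))].add_tweet(tweet_id, text)

    def search_word(shards, word):
        # scatter the query to every shard, then aggregate (union) the TweetIDs
        results = set()
        for shard in shards:
            results |= shard.search(word)
        return results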
tweet search - What if both primary and secondary servers die at the same time?
We have to allocate a new server and rebuild the same index on it.
Our Index-Builder server can hold a hash table where the ‘key’ is the index server number and the ‘value’ is a HashSet containing all the TweetIDs kept at that index server.
Whenever an index server has to rebuild itself, it can ask the Index-Builder server for all the tweets it needs to store and then fetch those tweets to build the index. This approach will surely be fast. We should also have a replica of the Index-Builder server for fault tolerance.
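A small sketch of the Index-Builder bookkeeping described above (class and method names are assumptions):

    from collections import defaultdict

    class IndexBuilder:
        def __init__(self):
            # 'key' = index server number, 'value' = set of TweetIDs on that server
            self.server_tweets = defaultdict(set)

        def record(self, server_id, tweet_id):
            self.server_tweets[server_id].add(tweet_id)

        def tweets_for(self, server_id):
            # a recovering index server asks for exactly the TweetIDs it must re-index
            return self.server_tweets[server_id]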
- avg size of an HTML page? 100 KB
- avg size of metadata for each page? 500 bytes
how do search engines use web crawlers?
Web crawlers systematically browse webpages to learn what each page on the website is about, so this information can be indexed, updated and retrieved when a user makes a search query.
web crawler high level design
it takes a list of seed URLs as its input, then (a loop sketch follows the list):
1. Pick a URL from the unvisited URL list.
2. Establish a connection to host to download the corresponding document.
3. Parse the document contents to look for new URLs.
4. Add the new URLs to the list of unvisited URLs.
5. Process the downloaded document, e.g., store it or index its contents, etc.
6. Repeat from step 1.
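A minimal single-threaded sketch of this loop using only the Python standard library (the regex link extraction is a crude stand-in for a real HTML parser, and a real crawler would distribute this across many workers):

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)          # unvisited URL list
        visited = set()
        pages = {}                           # url -> html, stand-in for a datastore
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()         # 1. pick a URL from the unvisited list
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")  # 2. download
            except Exception:
                continue
            for link in re.findall(r'href="(.*?)"', html):    # 3. parse for new URLs
                frontier.append(urljoin(url, link))           # 4. add them to the unvisited list
            pages[url] = html                # 5. process/store the downloaded document
        return pages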
web crawler BFS vs DFS
usually BFS. However, DFS is used sometimes; e.g., if the crawler has already established a connection with a website, it might just DFS all the URLs within that website to save some handshaking overhead.
Path-ascending crawling:
Path-ascending crawling can help discover isolated resources or resources for which no inbound link would have been found w/ regular crawling of a particular site. In this scheme, a crawler ascends to every path in each URL that it intends to crawl. Ex: when given a seed URL of http://foo.com/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.
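A small sketch of generating the ancestor paths for a URL (function name is illustrative):

    from urllib.parse import urlparse

    def ancestor_paths(url):
        parsed = urlparse(url)
        # directory components between the host and the page itself
        parts = parsed.path.rstrip('/').split('/')[1:-1]
        paths = []
        for i in range(len(parts), -1, -1):
            paths.append(f"{parsed.scheme}://{parsed.netloc}/" + "/".join(parts[:i]) + ("/" if i else ""))
        return paths

    # ancestor_paths("http://foo.com/a/b/page.html")
    # -> ['http://foo.com/a/b/', 'http://foo.com/a/', 'http://foo.com/']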
There are two important characteristics of the Web that make Web crawling a very difficult task:
- Large volume of Web pages: the crawler can only download a fraction of the web pages at any time, so it is critical that it is intelligent enough to prioritize its downloads.
- Web pages change very frequently: by the time the crawler is downloading the last page from a site, an earlier page may have changed, or a new page may have been added.
A bare minimum crawler needs at least these components:
- URL frontier: stores the list of URLs to download and prioritizes which URLs should be crawled first.
- HTTP Fetcher: To retrieve a web page from the server.
- Extractor: To extract links from HTML documents.
- Duplicate Eliminator: makes sure the same content is not extracted twice.
- Datastore: To store retrieved pages, URLs, and other metadata.
web crawler - URL frontier
contains all the URLs that remain to be downloaded.
Since we have a huge list of URLs to crawl, we can distribute our URL frontier onto multiple servers. Let’s assume:
- on each server we have multiple worker threads performing the crawling tasks.
- we have a hash function mapping each URL to the server responsible for crawling it.
keep in mind politeness requirements.
web crawler - politeness requirements
- Our crawler should not overload a server by downloading a lot of pages from it.
- We should not have multiple machines connecting to a web server.
our crawler can have a collection of distinct FIFO sub-queues on each server. Each worker thread has its own sub-queue. When a new URL needs to be added, the sub-queue in which it is placed will be determined by the URL’s canonical hostname, using a hash function.
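A sketch of the hostname-to-sub-queue assignment (the worker count and data structures are illustrative):

    import hashlib
    from collections import deque
    from urllib.parse import urlparse

    NUM_WORKERS = 16                                      # illustrative worker count
    sub_queues = [deque() for _ in range(NUM_WORKERS)]    # one FIFO per worker thread

    def enqueue(url):
        # hash the canonical hostname so every URL of a host lands in the same
        # sub-queue, keeping a single worker (and connection) per web server
        host = urlparse(url).netloc.lower()
        idx = int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_WORKERS
        sub_queues[idx].append(url)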
how do we handle the large size of our url frontier?
The size would be in the hundreds of millions of URLs. Hence,
we need to store our URLs on a disk.
- Enqueue buffer, once filled, will be dumped to the disk.
- Dequeue buffer will keep a cache of URLs that need to be visited; it can periodically read from the disk to refill itself (a rough sketch follows).
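Roughly, the buffering could look like this (a plain Python list stands in for the on-disk store to keep the sketch self-contained; buffer size is illustrative):

    class BufferedFrontier:
        def __init__(self, buffer_size=10000):
            self.disk = []                 # stand-in for URLs persisted on disk
            self.enqueue_buf = []          # URLs waiting to be flushed to disk
            self.dequeue_buf = []          # cached URLs waiting to be visited
            self.buffer_size = buffer_size

        def add(self, url):
            self.enqueue_buf.append(url)
            if len(self.enqueue_buf) >= self.buffer_size:
                self.disk.extend(self.enqueue_buf)         # dump the full buffer
                self.enqueue_buf = []

        def next(self):
            if not self.dequeue_buf:                       # periodically refill from disk
                self.dequeue_buf = self.disk[: self.buffer_size]
                del self.disk[: self.buffer_size]
            if not self.dequeue_buf and self.enqueue_buf:  # fall back to unflushed URLs
                self.dequeue_buf, self.enqueue_buf = self.enqueue_buf, []
            return self.dequeue_buf.pop(0) if self.dequeue_buf else None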
web crawler - fetcher
to download the document corresponding to a given URL using the appropriate network protocol like HTTP.
When dealing w/ Robots Exclusion: to avoid downloading robots.txt on every request, our fetcher module can maintain a cache mapping hostnames to their robots exclusion rules.
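A sketch of such a cache using Python's standard urllib.robotparser (the cache shape and function name are assumptions):

    from urllib import robotparser
    from urllib.parse import urlparse

    robots_cache = {}   # hostname -> parsed robots.txt rules

    def allowed(url, user_agent="*"):
        host = urlparse(url).netloc
        if host not in robots_cache:
            rp = robotparser.RobotFileParser(f"http://{host}/robots.txt")
            rp.read()                        # fetch robots.txt once per host
            robots_cache[host] = rp
        return robots_cache[host].can_fetch(user_agent, url)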
RobotsExclusion
Courteous Web crawlers implement the Robots Exclusion Protocol, which allows webmasters to declare parts of their sites off limits to crawlers. The protocol requires a Web crawler to fetch a special document called robots.txt, containing these declarations, from a Web site before downloading any real content from it.
web crawler - Document input stream
The same document is processed by multiple processing modules, so to avoid downloading a document multiple times, we cache the document locally using an abstraction called a Document Input Stream (DIS). It caches the entire contents of the document read from the internet and provides methods to re-read it. Each worker thread has an associated DIS, which it reuses from document to document. After extracting a URL from the frontier, the worker passes that URL to the relevant protocol module, which initializes the DIS from a network connection to contain the document’s contents. The worker then passes the DIS to all relevant processing modules.
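A minimal sketch of a DIS-like abstraction using an in-memory byte buffer (names are illustrative):

    import io

    class DocumentInputStream:
        def __init__(self, raw_bytes=b""):
            self.buffer = io.BytesIO(raw_bytes)

        def init_from(self, raw_bytes):
            # reused from document to document by the same worker thread
            self.buffer = io.BytesIO(raw_bytes)

        def reread(self):
            self.buffer.seek(0)              # rewind so the next module can re-read
            return self.buffer.read()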
web crawler Document Dedupe test
Many documents on the Web are available under multiple URLs.
This would cause the crawler to download the same document multiple times. To prevent processing a document more than once, we perform a dedupe test on each document. We can calculate a 64-bit checksum of every processed document (e.g., the lower 64 bits of an MD5 or SHA digest) and store it in a database (hash set). For every new document, we compare its checksum against that set.
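A sketch of the dedupe test with an in-memory set standing in for the checksum database:

    import hashlib

    seen_checksums = set()   # stand-in for the checksum database

    def is_duplicate(document_bytes):
        # truncate the MD5 digest to 8 bytes to get a 64-bit checksum
        checksum = hashlib.md5(document_bytes).digest()[:8]
        if checksum in seen_checksums:
            return True
        seen_checksums.add(checksum)
        return False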
web crawler - URL filtering mechanism
provides a customizable way to control the set of URLs
that are downloaded. This is used to blacklist websites so that our crawler can ignore them. Before
adding each URL to the frontier, the worker thread consults the user-supplied URL filter. We can define
filters to restrict URLs by domain, prefix, or protocol type.
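A sketch of such a filter (the blacklist entries are illustrative):

    from urllib.parse import urlparse

    BLACKLISTED_DOMAINS = {"spam.example"}   # illustrative blacklist entry
    ALLOWED_SCHEMES = {"http", "https"}      # restrict by protocol type

    def url_passes_filter(url):
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES:
            return False
        if parsed.netloc.lower() in BLACKLISTED_DOMAINS:
            return False
        return True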
web crawler - domain name cache
Before contacting a Web server, a crawler must use the Domain
Name Service (DNS) to map the Web server’s hostname into an IP address. DNS name resolution will
be a big bottleneck given the number of URLs we will be working with. To avoid repeated requests, we can cache DNS results by building a local DNS server.
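A sketch of a local DNS cache (no TTL or expiry handling, which a real resolver cache would need):

    import socket

    dns_cache = {}   # hostname -> IP address

    def resolve(hostname):
        # keep results locally so repeated lookups skip the DNS round trip
        if hostname not in dns_cache:
            dns_cache[hostname] = socket.gethostbyname(hostname)
        return dns_cache[hostname]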
web crawler - URL dedupe test
While extracting links, any Web crawler will encounter multiple links to the same document. The implementation is the same as the document dedupe test, but with checksums of URLs instead of document contents.
web crawler - checkpointing
A crawl of the entire Web takes weeks to complete. To guard against failures, our
crawler can write regular snapshots of its state to the disk. An interrupted or aborted crawl can easily be
restarted from the latest checkpoint.
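A sketch of writing and restoring such a snapshot (the file name and the shape of the saved state are assumptions):

    import pickle

    def write_checkpoint(frontier, visited, path="crawler_checkpoint.pkl"):
        # dump a snapshot of the crawler's state to disk
        with open(path, "wb") as f:
            pickle.dump({"frontier": list(frontier), "visited": visited}, f)

    def load_checkpoint(path="crawler_checkpoint.pkl"):
        # restart an interrupted or aborted crawl from the latest snapshot
        with open(path, "rb") as f:
            return pickle.load(f)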
web crawler - data sharding
Our crawler will be dealing with three kinds of data: 1) URLs to visit 2) URL checksums for dedupe 3)
Document checksums for dedupe.
Since we are distributing URLs based on the hostname, we can store all of this data on the same host.
Since we will be using consistent hashing, we can assume that URLs will be redistributed from overloaded hosts. Each host will perform checkpointing periodically and dump a snapshot of all the data it is holding onto a remote server. This ensures that if a server dies, another server can take its place using that snapshot.