sdi 4 Flashcards
youtube - what are the different api calls we can have?
uploadVideo, searchVideo, streamVideo
uploadVideo api call
uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language, recording_details, video_contents)
video_contents (stream): Video to be uploaded.
Returns: (string) - A successful upload returns HTTP 202 (request accepted); once video encoding is completed, the user is notified through email w/ a link to access the video.
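A minimal sketch (in Python; the queue and store names are stand-ins, not the real components) of how uploadVideo could accept the stream, enqueue it for processing, and return 202 immediately:

import queue
import uuid

processing_queue = queue.Queue()    # consumed by the encoder / thumbnail generator
raw_video_store = {}                # stand-in for distributed file storage (e.g., HDFS)

def upload_video(api_dev_key, video_title, video_description, tags, category_id,
                 default_language, recording_details, video_contents):
    # Accept the upload and return HTTP 202 right away; encoding happens asynchronously.
    video_id = str(uuid.uuid4())
    raw_video_store[video_id] = video_contents
    processing_queue.put(video_id)    # once encoding finishes, the user is emailed a link
    return 202, {"video_id": video_id}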
searchVideo api call
searchVideo(api_dev_key, search_query, user_location, maximum_videos_to_return, page_token)
page_token (string): specifies a page in the result set that should be returned.
Returns: (JSON)
A JSON containing info about the list of video resources matching the search query. Each video resource will have a title, thumbnail, creation date, view count.
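An illustrative response shape (field names and values are made up for illustration, not a defined schema):

example_search_response = {
    "next_page_token": "token-for-next-page",   # pass back as page_token to get the next page
    "videos": [
        {
            "title": "example video title",
            "thumbnail": "thumbs/abc.jpg",
            "creation_date": "2021-06-01",
            "view_count": 48210,
        },
    ],
}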
streamVideo api call
streamVideo(api_dev_key, video_id, offset, codec, resolution)
offset (number): time in seconds from beginning of video. If we support playing/pausing a video from multiple devices, we need to store the offset on the server.
codec (string) & resolution(string): needed to support play/pause from multiple devices. You’re watching a video on your TV’s
Netflix app, paused it, and started watching it on your phone’s Netflix app - these devices have a different res and use a different codec.
Returns: (STREAM)
A media stream (a video chunk) from the given offset.
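A sketch of serving from an offset, assuming chunks are pre-encoded per codec/resolution (chunk length and the store are assumptions):

CHUNK_SECONDS = 10          # assumed length of each pre-encoded chunk

encoded_chunk_store = {}    # (video_id, codec, resolution, chunk_index) -> chunk bytes

def stream_video(api_dev_key, video_id, offset, codec, resolution):
    # Switching devices just changes the codec/resolution part of the key;
    # the server-stored offset picks the right chunk to resume from.
    chunk_index = int(offset) // CHUNK_SECONDS
    return encoded_chunk_store[(video_id, codec, resolution, chunk_index)]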
youtube high level design
- Processing Queue: Each uploaded video will be pushed to a queue for encoding, thumbnail generation, and storage.
- Encoder: encodes uploaded videos into multiple formats.
- Thumbnail generator
- Video and Thumbnail storage
- User Database (MySQL)
- Video metadata storage (also MySQL): title, file path in the system, uploading user, total views, likes, dislikes, video comments.
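A sketch of the processing-queue consumer that ties these components together (encode, generate_thumbnail, and update_metadata are hypothetical stand-ins):

def encode(video_id, fmt): ...                  # stand-in for the real encoder
def generate_thumbnail(video_id): ...           # stand-in for the thumbnail generator
def update_metadata(video_id, **fields): ...    # stand-in for the MySQL metadata update

def process_uploads(processing_queue, formats=("240p", "480p", "1080p")):
    # Worker loop: encode each uploaded video into multiple formats,
    # generate its thumbnail, then mark it ready (and trigger the email notification).
    while True:
        video_id = processing_queue.get()
        for fmt in formats:
            encode(video_id, fmt)
        generate_thumbnail(video_id)
        update_metadata(video_id, status="ready")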
youtube r/w ratio?
Read-heavy (~200:1 read/write ratio) - we'll focus on building a system that can retrieve videos quickly.
youtube - where do we store vids?
Videos can be stored in a distributed file storage system like HDFS or GlusterFS.
youtube - where do we store thumbnails?
- Thumbnails are small files, max 5KB each.
- Read traffic for thumbnails will be huge compared to videos. Users watch 1 video at a time, but they might be looking at a page w/ 20 thumbnails.
If we store all the thumbnails on disk, we have to perform a lot of seeks to different locations on the disk to read these files - inefficient, and it will cause high latencies.
Bigtable (a key-value, wide-column store, ideal for fast access to very large amounts of data) can be a reasonable choice here, as it combines multiple files into one block to store on disk. Keeping hot thumbnails in a cache will also help improve latencies and, b/c thumbnail files are small, we can easily cache many of them.
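A small sketch of keeping hot thumbnails in an in-memory cache (the cache size and the store are assumptions):

from functools import lru_cache

thumbnail_store = {}    # stand-in for the Bigtable-backed thumbnail storage

@lru_cache(maxsize=200_000)    # thumbnails are <=5KB, so even a large cache fits in memory
def get_thumbnail(thumbnail_id):
    return thumbnail_store[thumbnail_id]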
youtube - How should we efficiently manage read traffic?
Segregate read traffic from write. Since we’ll have multiple copies of each video, we can distribute read traffic on different servers.
For metadata, we can have master-slave configurations where writes go to the master first and then get applied at the slaves. This can cause some staleness in data: when a new video is added, its metadata is inserted into the master first, and until it gets applied at the slaves, they won't be able to see it; stale results would be returned to the user.
This staleness might be acceptable as it would be very short-lived and the user would be able to see the new videos after a few ms.
youtube - sharding necessity
Since we have a huge number of new videos every day and our read load is extremely high, we need to distribute our data onto multiple machines so that we can perform read/write operations efficiently.
youtube - vid duplication
Huge number of users uploading a huge amount of video data - we must deal w/ widespread video duplication. Duplicate videos often differ in aspect ratios or encodings, can contain overlays or additional borders, or can be excerpts from a longer original video. The proliferation of duplicate videos:
1. wastes storage space
2. degrades cache efficiency
3. wastes energy
For the end user, these inefficiencies will be realized in the form of duplicate search results, longer video startup times, and interrupted streaming.
youtube deduplication
For our service, deduplication makes the most sense inline, not post-process: it saves the resources that would be used to encode, transfer, and store the duplicate copy. As soon as a user starts uploading a video, we run video matching algorithms (e.g., Block Matching, Phase Correlation) to find duplicates. If a duplicate is detected, we can either stop the upload and use the existing copy, OR continue the upload and use the new copy if it is of higher quality. If the newly uploaded video is a subpart of an existing video or vice versa, we can divide the video into smaller chunks so we only upload the missing chunks.
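A toy inline check to make the idea concrete - an exact hash of the first chunk stands in for a real matching algorithm (Block Matching / Phase Correlation), which would use perceptual fingerprints that survive re-encoding and overlays:

import hashlib

fingerprint_index = {}    # fingerprint -> video_id of the existing copy

def check_duplicate_on_upload(first_chunk_bytes, new_video_id):
    # An exact hash only catches byte-identical copies; shown here only to illustrate
    # where the inline check sits in the upload path.
    fp = hashlib.sha256(first_chunk_bytes).hexdigest()
    if fp in fingerprint_index:
        return fingerprint_index[fp]    # duplicate found: stop the upload or keep the higher-quality copy
    fingerprint_index[fp] = new_video_id
    return None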
youtube load balancing - come back to this
youtube CDNs
Our service can move popular videos to CDNs:
* CDNs replicate content in multiple places. There’s a better chance of videos being closer to the
user and, with fewer hops, videos will stream from a friendlier network.
* CDN machines make heavy use of caching and can mostly serve videos out of memory.
Less popular videos (1-20 views per day) that are not cached by CDNs can be served by our servers in various data centers.
youtube fault tolerance
We should use Consistent Hashing for distribution among database servers. Consistent hashing will not only help in replacing a dead server, but also help in distributing load among servers.
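A minimal consistent-hashing ring sketch (the virtual-node count and hash function are assumptions); removing a dead server only remaps the keys that hashed to that server's points:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []    # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.vnodes):    # virtual nodes spread each server around the ring
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def remove(self, server):
        # Only the keys that mapped to this server move; everything else stays put.
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

e.g., ConsistentHashRing(["db1", "db2", "db3"]).get_server("video:123") returns the server responsible for that key.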
typeahead suggestion - high level design
We have a lot of 'strings' that we need to store in such a way that users can search with any prefix. We must store our data efficiently so it can be queried quickly - use a trie.
typeahead suggestion - how to find top suggestion?
Store the count of searches that terminated at each node. Then, given a prefix, we can traverse the sub-tree under it to find the top suggestions.
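A minimal trie sketch (in Python; the structure is an assumption) that stores a count at each terminal node and finds the top suggestions for a prefix by traversing its sub-tree:

import heapq

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.count = 0        # searches that terminated at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def top_suggestions(self, prefix, k=10):
        # Walk down to the prefix node, then collect (count, term) pairs from its sub-tree.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        return heapq.nlargest(k, results)

    def _collect(self, node, term, results):
        if node.count:
            results.append((node.count, term))
        for ch, child in node.children.items():
            self._collect(child, term + ch, results)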
typeahead suggestion - how can we speed up our searches?
We can store the top 10 suggestions at each node. We have to bear a big increase in storage to achieve the required efficiency.
We can optimize storage by keeping only references to the terminal nodes rather than the entire phrase. To find the entire phrase, we traverse back using the parent reference from the terminal node. We'll also store the frequency along with each reference.
typeahead suggestion - how do we build the trie?
We can efficiently build our trie bottom up. Each parent node recursively calls its children to calculate their top suggestions and counts, then combines the children's top suggestions to determine its own.
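A sketch of the bottom-up pass (reusing TrieNode from the sketch above; the merge logic is an assumption) - children are computed first, and each parent keeps only the top k of its own count plus its children's lists:

import heapq

def compute_top_suggestions(node, prefix="", k=10):
    merged = [(node.count, prefix)] if node.count else []
    for ch, child in node.children.items():
        merged.extend(compute_top_suggestions(child, prefix + ch, k))    # recurse into children first
    node.top = heapq.nlargest(k, merged)    # store the precomputed top-k at this node
    return node.top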
typeahead suggestion - What could be different ranking criteria for suggestions?
freshness, user location, language, demographics, personal history etc.
typeahead suggestion - how to update the trie?
We can either log every query or sample and log every 1000th query. If we don't want to show a term searched fewer than 1000 times, it's safe to log only every 1000th searched term.
We can have a Map-Reduce (MR) set-up to process all the logging data periodically, say, every hour. Once we have the frequencies of all terms searched in the past hour, we can update the trie. Two options:
1. We can make a copy of the trie on each server and update it offline. Once done, we switch to using the copy and discard the old one.
2. We can have a master-slave configuration for each trie server. We can update the slave while the master is serving traffic. Once the update is complete, we make the slave our new master.
How can we update the frequencies of typeahead suggestions?
We can update only the differences in frequencies rather than recounting all search terms from scratch. Say we're keeping the count of all terms searched in the last 10 days: subtract the counts from the time period no longer included and add the counts for the new time period. We can add and subtract frequencies based on an Exponential Moving Average (EMA) of each term, which gives more weight to the latest data.
After inserting a new term in the trie, we go to the terminal node of the phrase and increase its frequency. If the new frequency puts the term into a node's top 10, we go up the tree and update the top-10 list at each parent as necessary.
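A one-line sketch of the EMA update mentioned above (the smoothing factor alpha is an arbitrary assumption):

def update_ema(prev_ema, latest_count, alpha=0.3):
    # Weight alpha goes to the latest period's count, (1 - alpha) to the running average.
    return alpha * latest_count + (1 - alpha) * prev_ema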
How do we store the trie in a file so that we can rebuild it easily - needed when a machine restarts?
We can take a snapshot of our trie periodically and store it in a file. This will enable us to rebuild a trie if the server goes down. Start with the root node and save the trie level-by-level. Each node is represented by char + num of children.
We are not storing top suggestions and their counts - it is hard to store this information: since the trie is stored top-down, child nodes are not created before their parents, so there is no easy way to store references to them. We have to recalculate all the top terms and counts while rebuilding the trie.
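A sketch of the level-by-level snapshot (reusing TrieNode from the earlier sketch; the exact encoding is an assumption). Matching the scheme above, only char + number of children is stored, so counts and top suggestions must be recomputed after rebuilding:

from collections import deque

def snapshot(root):
    # Level-order walk; each node becomes a (char, num_children) pair.
    pairs, q = [], deque([("", root)])
    while q:
        ch, node = q.popleft()
        pairs.append((ch, len(node.children)))
        q.extend(node.children.items())
    return pairs

def rebuild(pairs):
    # Consume the pairs in the same level order to reattach children to their parents.
    it = iter(pairs)
    _, n = next(it)
    root = TrieNode()
    q = deque([(root, n)])
    while q:
        node, num_children = q.popleft()
        for _ in range(num_children):
            ch, child_n = next(it)
            child = TrieNode()
            node.children[ch] = child
            q.append((child, child_n))
    return root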
typeahead suggestion - scale estimation
Since there will be a lot of duplicates in 5 billion queries, we can assume that only 20% of these will be unique, and we only need to index the top 50% of search terms.
Let's assume we will have 100 million unique terms for which we want to build an index. On average, each query is 3 words of 5 characters each.
We should also remove terms that aren't searched anymore. Assume we get 2% new queries every day and maintain our index for the last 1 year.
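A rough worked estimate under these assumptions (2 bytes per character is an added assumption): 3 words x 5 chars = 15 chars per query, i.e. ~30 bytes per query, so 100 million x 30 bytes ≈ 3 GB for the initial index. With 2% new queries per day kept for a year: 3 GB + (0.02 x 3 GB x 365) ≈ 25 GB.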