sdi 4 Flashcards
youtube - what are the different api calls we can have?
uploadVideo, searchVideo, streamVideo
uploadVideo api call
uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language, recording_details, video_contents)
video_contents (stream): Video to be uploaded.
Returns: (string) - A successful upload returns HTTP 202 (request accepted); once video encoding is completed, the user is notified through email w/ a link to access the video.
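A minimal sketch (in Python; the queue and store names are stand-ins, not the real components) of how uploadVideo could accept the stream, enqueue it for processing, and return 202 immediately:

import queue
import uuid

processing_queue = queue.Queue()    # consumed by the encoder / thumbnail generator
raw_video_store = {}                # stand-in for distributed file storage (e.g., HDFS)

def upload_video(api_dev_key, video_title, video_description, tags, category_id,
                 default_language, recording_details, video_contents):
    # Accept the upload and return HTTP 202 right away; encoding happens asynchronously.
    video_id = str(uuid.uuid4())
    raw_video_store[video_id] = video_contents
    processing_queue.put(video_id)    # once encoding finishes, the user is emailed a link
    return 202, {"video_id": video_id}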
searchVideo api call
searchVideo(api_dev_key, search_query, user_location, maximum_videos_to_return, page_token)
page_token (string): specifies a page in the result set that should be returned.
Returns: (JSON)
A JSON containing info about the list of video resources matching the search query. Each video resource will have a title, thumbnail, creation date, view count.
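An illustrative response shape (field names and values are made up for illustration, not a defined schema):

example_search_response = {
    "next_page_token": "token-for-next-page",   # pass back as page_token to get the next page
    "videos": [
        {
            "title": "example video title",
            "thumbnail": "thumbs/abc.jpg",
            "creation_date": "2021-06-01",
            "view_count": 48210,
        },
    ],
}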
streamVideo api call
streamVideo(api_dev_key, video_id, offset, codec, resolution)
offset (number): time in seconds from beginning of video. If we support playing/pausing a video from multiple devices, we need to store the offset on the server.
codec (string) & resolution(string): needed to support play/pause from multiple devices. You’re watching a video on your TV’s
Netflix app, paused it, and started watching it on your phone’s Netflix app - these devices have a different res and use a different codec.
Returns: (STREAM)
A media stream (a video chunk) from the given offset.
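A sketch of serving from an offset, assuming chunks are pre-encoded per codec/resolution (chunk length and the store are assumptions):

CHUNK_SECONDS = 10          # assumed length of each pre-encoded chunk

encoded_chunk_store = {}    # (video_id, codec, resolution, chunk_index) -> chunk bytes

def stream_video(api_dev_key, video_id, offset, codec, resolution):
    # Switching devices just changes the codec/resolution part of the key;
    # the server-stored offset picks the right chunk to resume from.
    chunk_index = int(offset) // CHUNK_SECONDS
    return encoded_chunk_store[(video_id, codec, resolution, chunk_index)]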
youtube high level design
- Processing Queue: Each uploaded video will be pushed to a queue for encoding, thumbnail generation, and storage.
- Encoder: encodes uploaded videos into multiple formats.
- Thumbnail generator
- Video and Thumbnail storage
- User Database (MySQL)
- Video metadata storage (also MySQL): title, file path in the system, uploading user, total views, likes, dislikes, video comments.
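A sketch of the processing-queue consumer that ties these components together (encode, generate_thumbnail, and update_metadata are hypothetical stand-ins):

def encode(video_id, fmt): ...                  # stand-in for the real encoder
def generate_thumbnail(video_id): ...           # stand-in for the thumbnail generator
def update_metadata(video_id, **fields): ...    # stand-in for the MySQL metadata update

def process_uploads(processing_queue, formats=("240p", "480p", "1080p")):
    # Worker loop: encode each uploaded video into multiple formats,
    # generate its thumbnail, then mark it ready (and trigger the email notification).
    while True:
        video_id = processing_queue.get()
        for fmt in formats:
            encode(video_id, fmt)
        generate_thumbnail(video_id)
        update_metadata(video_id, status="ready")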
youtube r/w ratio?
Read-heavy (~200:1 read/write ratio) - we'll focus on building a system that can retrieve videos quickly.
youtube - where do we store vids?
Videos can be stored in a distributed file storage system like HDFS or GlusterFS.
youtube - where do we store thumbnails?
- Thumbnails are small files, max 5KB each.
- Read traffic for thumbnails will be huge compared to videos. Users watch 1 video at a time, but they might be looking at a page w/ 20 thumbnails.
If we store all the thumbnails on disk, we have to perform a lot of seeks to different locations on the disk to read these files - inefficient, and it will cause high latencies.
Bigtable (a key-value, wide-column store, ideal for fast access to very large amounts of data) can be a reasonable choice here, as it combines multiple files into one block to store on disk. Keeping hot thumbnails in a cache will also help improve latencies and, b/c thumbnail files are small, we can easily cache many of them.
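A small sketch of keeping hot thumbnails in an in-memory cache (the cache size and the store are assumptions):

from functools import lru_cache

thumbnail_store = {}    # stand-in for the Bigtable-backed thumbnail storage

@lru_cache(maxsize=200_000)    # thumbnails are <=5KB, so even a large cache fits in memory
def get_thumbnail(thumbnail_id):
    return thumbnail_store[thumbnail_id]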
youtube - How should we efficiently manage read traffic?
Segregate read traffic from write. Since we’ll have multiple copies of each video, we can distribute read traffic on different servers.
For metadata, we can have master-slave configurations where writes go to the master first and then get applied at the slaves. This can cause some staleness in data: when a new video is added, its metadata is inserted into the master first, and until it gets applied at the slaves, they won't be able to see it; stale results would be returned to the user.
This staleness might be acceptable as it would be very short-lived and the user would be able to see the new videos after a few ms.
youtube - sharding necessity
Since we have a huge number of new videos every day and our read load is extremely high, we need to distribute our data onto multiple machines so that we can perform read/write operations efficiently.
youtube - vid duplication
Huge number of users uploading a huge amount of video data - we must deal w/ widespread video duplication. Duplicate videos often differ in aspect ratios or encodings, can contain overlays or additional borders, or can be excerpts from a longer original video. The proliferation of duplicate videos:
1. wastes storage space
2. degrades cache efficiency
3. wastes energy
For the end user, these inefficiencies will be realized in the form of duplicate search results, longer video startup times, and interrupted streaming.
youtube deduplication
For our service, deduplication makes the most sense inline, not post-process: it saves the resources that would be used to encode, transfer, and store the duplicate copy. As soon as a user starts uploading a video, we run video matching algorithms (e.g., Block Matching, Phase Correlation) to find duplicates. If a duplicate is detected, we can either stop the upload and use the existing copy, OR continue the upload and use the new copy if it is of higher quality. If the newly uploaded video is a subpart of an existing video or vice versa, we can divide the video into smaller chunks so we only upload the missing chunks.
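A toy inline check to make the idea concrete - an exact hash of the first chunk stands in for a real matching algorithm (Block Matching / Phase Correlation), which would use perceptual fingerprints that survive re-encoding and overlays:

import hashlib

fingerprint_index = {}    # fingerprint -> video_id of the existing copy

def check_duplicate_on_upload(first_chunk_bytes, new_video_id):
    # An exact hash only catches byte-identical copies; shown here only to illustrate
    # where the inline check sits in the upload path.
    fp = hashlib.sha256(first_chunk_bytes).hexdigest()
    if fp in fingerprint_index:
        return fingerprint_index[fp]    # duplicate found: stop the upload or keep the higher-quality copy
    fingerprint_index[fp] = new_video_id
    return None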
youtube load balancing - come back to this
youtube CDNs
Our service can move popular videos to CDNs:
* CDNs replicate content in multiple places. There’s a better chance of videos being closer to the
user and, with fewer hops, videos will stream from a friendlier network.
* CDN machines make heavy use of caching and can mostly serve videos out of memory.
Less popular videos (1-20 views per day) that are not cached by CDNs can be served by our servers in various data centers.
youtube fault tolerance
We should use Consistent Hashing for distribution among database servers. Consistent hashing will not only help in replacing a dead server, but also help in distributing load among servers.
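A minimal consistent-hashing ring sketch (the virtual-node count and hash function are assumptions); removing a dead server only remaps the keys that hashed to that server's points:

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []    # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.vnodes):    # virtual nodes spread each server around the ring
            self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    def remove(self, server):
        # Only the keys that mapped to this server move; everything else stays put.
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def get_server(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

e.g., ConsistentHashRing(["db1", "db2", "db3"]).get_server("video:123") returns the server responsible for that key.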
typeahead suggestion - high level design
We have a lot of 'strings' that we need to store in such a way that users can search with any prefix. We must store our data efficiently so it can be queried quickly - use a trie.
typeahead suggestion - how to find top suggestion?
Store the count of searches that terminated at each node. Then, given a prefix, we can traverse the sub-tree under it to find the top suggestions.
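A minimal trie sketch (in Python; the structure is an assumption) that stores a count at each terminal node and finds the top suggestions for a prefix by traversing its sub-tree:

import heapq

class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.count = 0        # searches that terminated at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def top_suggestions(self, prefix, k=10):
        # Walk down to the prefix node, then collect (count, term) pairs from its sub-tree.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        return heapq.nlargest(k, results)

    def _collect(self, node, term, results):
        if node.count:
            results.append((node.count, term))
        for ch, child in node.children.items():
            self._collect(child, term + ch, results)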
typeahead suggestion - how can we speed up our searches?
We can store the top 10 suggestions at each node. We have to bear a big increase in storage to achieve the required efficiency.
We can optimize storage by keeping only references to the terminal nodes rather than the entire phrase. To find the entire phrase, we traverse back using the parent reference from the terminal node. We'll also store the frequency along with each reference.
typeahead suggestion - how do we build the trie?
We can efficiently build our trie bottom up. Each parent node recursively calls its children to calculate their top suggestions and counts, then combines the children's top suggestions to determine its own.
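A sketch of the bottom-up pass (reusing TrieNode from the sketch above; the merge logic is an assumption) - children are computed first, and each parent keeps only the top k of its own count plus its children's lists:

import heapq

def compute_top_suggestions(node, prefix="", k=10):
    merged = [(node.count, prefix)] if node.count else []
    for ch, child in node.children.items():
        merged.extend(compute_top_suggestions(child, prefix + ch, k))    # recurse into children first
    node.top = heapq.nlargest(k, merged)    # store the precomputed top-k at this node
    return node.top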
typeahead suggestion - What could be different ranking criteria for suggestions?
freshness, user location, language, demographics, personal history etc.
typeahead suggestion - how to update the trie?
We can either log every query or sample and log every 1000th query. If we don't want to show a term searched fewer than 1000 times, it's safe to log only every 1000th searched term.
We can have a Map-Reduce (MR) set-up to process all the logging data periodically, say, every hour. Once we have the frequencies of all terms searched in the past hour, we can update the trie. Two options:
1. We can make a copy of the trie on each server and update it offline. Once done, we switch to using the copy and discard the old one.
2. We can have a master-slave configuration for each trie server. We can update the slave while the master is serving traffic. Once the update is complete, we make the slave our new master.
How can we update the frequencies of typeahead suggestions?
We can update only the differences in frequencies rather than recounting all search terms from scratch. Say we're keeping the count of all terms searched in the last 10 days: subtract the counts from the time period no longer included and add the counts for the new time period. We can add and subtract frequencies based on an Exponential Moving Average (EMA) of each term, which gives more weight to the latest data.
After inserting a new term in the trie, we go to the terminal node of the phrase and increase its frequency. If the new frequency puts the term into a node's top 10, we go up the tree and update the top-10 list at each parent as necessary.
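A one-line sketch of the EMA update mentioned above (the smoothing factor alpha is an arbitrary assumption):

def update_ema(prev_ema, latest_count, alpha=0.3):
    # Weight alpha goes to the latest period's count, (1 - alpha) to the running average.
    return alpha * latest_count + (1 - alpha) * prev_ema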
How do we store the trie in a file so that we can rebuild it easily - needed when a machine restarts?
We can take a snapshot of our trie periodically and store it in a file. This will enable us to rebuild a trie if the server goes down. Start with the root node and save the trie level-by-level. Each node is represented by char + num of children.
We are not storing top suggestions and their counts - it is hard to store this information: since the trie is stored top-down, child nodes are not created before their parents, so there is no easy way to store references to them. We have to recalculate all the top terms and counts while rebuilding the trie.
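A sketch of the level-by-level snapshot (reusing TrieNode from the earlier sketch; the exact encoding is an assumption). Matching the scheme above, only char + number of children is stored, so counts and top suggestions must be recomputed after rebuilding:

from collections import deque

def snapshot(root):
    # Level-order walk; each node becomes a (char, num_children) pair.
    pairs, q = [], deque([("", root)])
    while q:
        ch, node = q.popleft()
        pairs.append((ch, len(node.children)))
        q.extend(node.children.items())
    return pairs

def rebuild(pairs):
    # Consume the pairs in the same level order to reattach children to their parents.
    it = iter(pairs)
    _, n = next(it)
    root = TrieNode()
    q = deque([(root, n)])
    while q:
        node, num_children = q.popleft()
        for _ in range(num_children):
            ch, child_n = next(it)
            child = TrieNode()
            node.children[ch] = child
            q.append((child, child_n))
    return root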
typeahead suggestion - scale estimation
Since there will be a lot of duplicates in 5 billion queries, we can assume that only 20% of these will be unique, and we only need to index the top 50% of search terms.
Let's assume we will have 100 million unique terms for which we want to build an index. On average, each query is 3 words of 5 characters each.
We should also remove terms that aren't searched anymore. Assume we get 2% new queries every day and maintain our index for the last 1 year.
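A rough worked estimate under these assumptions (2 bytes per character is an added assumption): 3 words x 5 chars = 15 chars per query, i.e. ~30 bytes per query, so 100 million x 30 bytes ≈ 3 GB for the initial index. With 2% new queries per day kept for a year: 3 GB + (0.02 x 3 GB x 365) ≈ 25 GB.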