Systems Design Flashcards

Question

What are the four most common data partitioning criteria?

Answer 1

1. Key or hash-based partitioning 2. List partitioning 3. Round-robin partitioning 4. Composite partitioning
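
The four criteria above can be sketched in a few lines of Python. This is only an illustration: the function names, the MD5-based hash, and the country-to-partition mapping are assumptions made for the example, not part of the card.

```python
import hashlib
from itertools import count


def hash_partition(key: str, num_partitions: int) -> int:
    """Key/hash-based: a stable hash of the key picks the partition."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


# List partitioning: each partition owns an explicit list of values.
# This mapping is hypothetical and exists only for the example.
COUNTRY_TO_PARTITION = {"NO": 0, "SE": 0, "DK": 0, "US": 1, "CA": 1}


def list_partition(country_code: str) -> int:
    return COUNTRY_TO_PARTITION[country_code]


def make_round_robin(num_partitions: int):
    """Round-robin: the i-th record goes to partition i mod n."""
    counter = count()

    def next_partition() -> int:
        return next(counter) % num_partitions

    return next_partition


def composite_partition(country_code: str, key: str, per_group: int):
    """Composite: list partitioning first, then hashing inside that group."""
    return list_partition(country_code), hash_partition(key, per_group)
```

Hash partitioning spreads keys evenly but scatters range scans; list partitioning keeps related rows together at the cost of possible hot partitions.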

Answer 2

1. Joins and denormalization 2. Referential integrity 3. Rebalancing

Answer 3

1. Filtering requests 2. Logging requests 3. Transforming requests (adding/removing headers, encrypting/decrypting, compressing) 4. Serving the request from its cache (ideally shared across users)

Answer 4

Maps a request from one client to one/many servers that serve that request. The resources appear to the client as if they originated from the proxy itself.

Answer 5

1. Key-value stores 2. Document databases 3. Wide-column databases 4. Graph databases

Answer 6

A NoSQL database where data is stored in documents which are grouped together in collections (like tables). Unlike rows in SQL, documents can have entirely different schemas.

Answer 7

MongoDB, CouchDB

Answer 8

A NoSQL database where data is stored in dynamically generated columns and where each row can have any set of columns. Like a spreadsheet.

Answer 9

Analyzing large datasets, since each column can be analyzed independently and they fit better into pages, caches, etc.

Answer 10

HBase and Cassandra

Answer 11

A database where relationships are best represented in a graph. There are nodes (entities), metadata about those nodes, and edges that connect those nodes.

Answer 12

Neo4j and InfiniteGraph

Answer 13

Atomicity, Consistency, Isolation, Durability

Answer 14

Consistency, Availability, Partition Tolerance

Answer 15

A theorem that states that it is impossible for a system to simultaneously satisfy more than two of the three CAP guarantees.

Answer 16

The guarantee that all nodes have and return the same data.

Answer 17

The guarantee that every request gets a successful response.

Answer 18

The guarantee that if there are network faults between nodes, the system will continue to work.

Answer 19

By waiting until writes have been committed to all nodes before allowing subsequent reads.

Answer 20

By replicating data across nodes.

Answer 21

By replicating data across nodes and allowing the partitioned networks to continue functioning independently.

Answer 22

Keys are mapped to nodes by hashing them and then moving clockwise on a hash function "ring" until a node is found. The ring can be updated and a region remapped by adding/removing nodes to locations on the ring.

Answer 23

We need to add/remove shards to a database to handle changes in load and usage patterns. Without consistent hashing, we'd have to completely remap all keys anytime we add/remove a shard to/from a distributed hash table. With consistent hashing, we only need to remap the keys for the node being added/removed.
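
To see why plain modulo hashing forces a near-complete remap, the sketch below counts how many keys change shards when a fifth shard is added. The key names and the MD5-based hash are assumptions made for the example. For uniformly distributed hashes, roughly four in five keys should move, because a key stays put only when its hash gives the same remainder modulo 4 and modulo 5.

```python
import hashlib


def stable_hash(key: str) -> int:
    """A hash that is stable across runs (unlike Python's built-in hash())."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes under hash(key) % n."""
    moved = sum(
        1 for k in keys if stable_hash(k) % old_n != stable_hash(k) % new_n
    )
    return moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]  # hypothetical keys
fraction = moved_fraction(keys, old_n=4, new_n=5)
print(f"{fraction:.0%} of keys change shards when going from 4 to 5 shards")
```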

Answer 24

By adding virtual replicas, i.e., adding the same node multiple times to the same hash function ring.
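
A minimal sketch of a hash ring with virtual replicas follows. The class name `HashRing`, the `replicas` parameter, and the `node#i` naming of virtual replicas are assumptions made for illustration; real systems differ in hash function and bookkeeping.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, replicas: int = 100):
        self.replicas = replicas  # virtual replicas per physical node
        self._points = []         # sorted positions on the ring
        self._owner = {}          # ring position -> node name

    def add_node(self, node: str) -> None:
        # Place the same node at many positions to even out the load.
        for i in range(self.replicas):
            point = stable_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position at or after the key's hash ("moving clockwise").
        idx = bisect.bisect_left(self._points, stable_hash(key))
        # Wrap around to the start of the ring if we ran off the end.
        return self._owner[self._points[idx % len(self._points)]]
```

With this layout, adding a node only takes over keys that now fall just before one of its positions, and removing it hands those keys back to their previous owners.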

Answer 25

1. Ajax polling 2. HTTP long-polling 3. WebSockets 4. Server-Sent Events (SSE)

Answer 26

Easy to implement, but leads to lots of unnecessary empty requests and responses, which create HTTP overhead.

Answer 27

(Advantages?) Requires persistent open connections to the server.

Answer 28

Only mechanism that enables bi-directional communication, and can be easier to work with for persistent connections. Drawbacks are that it's more resource-intensive and harder to load balance.

Answer 29

1. The client opens a web page from the server 2. The rendered web page opens a connection to the server 3. The server streams events/data in an infinite loop to the client
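
Step 3 above can be sketched as a generator that frames each payload in the Server-Sent Events wire format: `data:` lines, optionally preceded by an `event:` line, terminated by a blank line. The helper names `format_sse` and `stream` are assumptions made for the example, and the HTTP server that would write these frames to an open response is left out.

```python
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one event; multi-line payloads get one 'data:' line each."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line marks the end of the event.
    return "\n".join(lines) + "\n\n"


def stream(payloads: Iterable[str]) -> Iterator[str]:
    """Yield framed events one at a time, as a server loop would."""
    for payload in payloads:
        yield format_sse(payload)
```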

Answer 30

1. Requirements clarification 2. Back-of-the-envelope estimation 3. System interface definition 4. Data modeling 5. High-level design 6. Detailed design 7. Bottleneck analysis

Answer 31

Functional requirements specify what the system should do, while non-functional requirements specify how the system should do it.

Answer 32

1. What are the functional (user-facing) requirements? 2. What are the non-functional (behavioral) requirements like availability, consistency, etc.? 3. What parts are we building?

Answer 33

1. Notifications 2. Analytics 3. Timelines 4. Media 5. Expiry

Answer 34

- High write throughput with small random reads. Achieves throughput by storing data in memory and flushing to disk periodically. - Very large data (backed by HDFS). (Note: I don't see why MongoDB or any LSM-tree based solution wouldn't work well, too.)

Answer 35

In addition to everything that NoSQL offers, Cassandra is particularly good at: - High write throughput. - Analytics workloads. - Very good availability semantics (quorum). It is weak at: - Read latency. Put caches in front (like EVCache) to resolve.

Answer 36

An orchestration system that allows both containers and non-containers. Closest analogy is K8s but with non-containers too.

Answer 37

A data warehousing layer that goes on top of Hadoop or Spark. Supports HiveQL, a SQL-like query language.

Answer 38

Between: - User and web layers. - Web and application (or cache) layers. - Application (or cache) and database layers.

Answer 39

Content delivery network. A kind of cache for large sizes or amounts of static media. Sits between the client and the back-end and serves the static resource if available or acquires it from the back-end if the cache misses.

Answer 40

100:1 or 1000:1

Answer 41

Instead of querying a single node, a set of nodes is queried simultaneously, and the most up-to-date data among them is taken as the response.
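
A minimal sketch of the selection step, assuming each replica tags its value with a version number where higher means more recent. The `Versioned` type and the `quorum` parameter are hypothetical, and the replies are passed in as a list rather than gathered over the network in parallel.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: str
    version: int  # assumption: higher means more recent


def quorum_read(replies, quorum: int) -> Versioned:
    """Return the freshest reply, provided enough replicas answered."""
    if len(replies) < quorum:
        raise RuntimeError("not enough replicas responded")
    return max(replies, key=lambda reply: reply.version)


replies = [Versioned("old", 3), Versioned("new", 5), Versioned("old", 3)]
latest = quorum_read(replies, quorum=2)
```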

Answer 42

Between CP and AP. CA never happens because single points of failure are big red flags for distributed systems.

Answer 43

Advantages: - Scales effectively because more workers can be added to process the jobs. - Adds fault tolerance and durability to requests because failed jobs can be retried; far more reliable than client-side error checking. Disadvantages: - Work is done asynchronously so needs more UI to tell the user what's happening. - Adds complexity at small scale.
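
The retry behavior described above can be sketched with an in-memory queue. This is a single-process illustration under stated assumptions: `run_jobs`, `handler`, and `max_attempts` are hypothetical names, and a real deployment would use a durable broker with several workers.

```python
from collections import deque


def run_jobs(jobs, handler, max_attempts: int = 3):
    """Process jobs in order; re-queue a failed job until attempts run out."""
    queue = deque((job, 1) for job in jobs)
    done, failed = [], []
    while queue:
        job, attempt = queue.popleft()
        try:
            done.append(handler(job))
        except Exception:
            if attempt < max_attempts:
                # Put the job back at the end of the queue for another try.
                queue.append((job, attempt + 1))
            else:
                failed.append(job)
    return done, failed
```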

Answer 44

1. Read-after-write consistency: Once a user writes changes, they should always see them. 2. Monotonic reads: Once a user has seen one version of data, they should never see an older version of the same data. 3. Consistent prefix reads: Reads always happen in an order that makes causal sense, e.g., never seeing a reply before a question.

Answer 45

You can suffix each of these with B, e.g. K -> KB.
1. K = 10^3 = 1,000
2. M = 10^6 = 1,000,000
3. G = 10^9 = 1,000,000,000
4. T = 10^12 = 1,000,000,000,000
5. P = 10^15 = 1,000,000,000,000,000
The number of zeros in the unrolled number is equal to the exponent of ten. Add up the exponents when multiplying.
Examples of multiplication:
1. K * K = 10^3 * 10^3 = 10^6 = M
2. K * M = 10^3 * 10^6 = 10^9 = G
3. K * G = 10^3 * 10^9 = 10^12 = T
4. K * T = 10^3 * 10^12 = 10^15 = P
5. M * M = 10^6 * 10^6 = 10^12 = T
6. M * G = 10^6 * 10^9 = 10^15 = P
...
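
The add-the-exponents rule can be checked mechanically. The sketch below is only an illustration; the `multiply` helper and the two lookup tables are names made up for the example.

```python
EXPONENT = {"K": 3, "M": 6, "G": 9, "T": 12, "P": 15}
PREFIX = {exponent: name for name, exponent in EXPONENT.items()}


def multiply(a: str, b: str) -> str:
    """Multiply two prefixes by adding their exponents of ten."""
    return PREFIX[EXPONENT[a] + EXPONENT[b]]


print(multiply("K", "M"))  # G, e.g. a thousand records of 1 MB each is 1 GB
print(multiply("M", "G"))  # P
```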

Answer 46

About 50K.

Answer 47

About 4 TB.

Answer 48

The number of concurrent users on a service changes throughout the day. Multiply the average by a fudge factor of 10 to account for the peaks.

Answer 49

60 * 60 * 24 = 86,400 ~= 100K
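
Rounding the seconds in a day to 100K makes requests-per-second estimates easy to do in your head. The sketch below assumes a hypothetical workload of 500 million requests per day, and the `peak_factor` default of 10 is a rough fudge factor, not a measured value.

```python
SECONDS_PER_DAY = 60 * 60 * 24   # exactly 86,400
ROUNDED_SECONDS_PER_DAY = 100_000  # the back-of-the-envelope value


def average_qps(requests_per_day: int) -> float:
    """Average requests per second using the rounded day length."""
    return requests_per_day / ROUNDED_SECONDS_PER_DAY


def peak_qps(requests_per_day: int, peak_factor: int = 10) -> float:
    """Scale the average by a fudge factor to allow for daily peaks."""
    return average_qps(requests_per_day) * peak_factor


print(average_qps(500_000_000))  # 5000.0
print(peak_qps(500_000_000))     # 50000.0
```

Because the rounded divisor is larger than 86,400, the estimate runs slightly low, which the peak factor more than absorbs.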

Answer 50

The guarantee that a system will self-heal and eventually get back into a consistent state.

Answer 51

Clustering is decentralized, automatic, and is done by the nodes themselves. They all talk to each other to persist each other's data. Sharding is centralized and is done by application or service code. Data is mapped to a shard before the shard is ever accessed, and shards are unaware of each other.

Answer 52

Between CP and AP. CA never happens because single points of failure are big red flags for distributed systems.

Answer 53

Usually using an index, or a distributed index. A distributed index is an index that itself is partitioned across nodes. Only needed for secondary indexes because the primary key is already what's typically used to partition the data. If you don’t have an index on the data, then you would have to make a request to all nodes simultaneously and they would all have to perform full scans.

Answer 54

1. Between the two layers. If the cache misses, then the cache itself is responsible for fetching the resource from the next layer.
   - This option is more common because it prevents a flood of requests from all going onto the next layer.
2. Next to the two layers. The first layer is responsible for querying the cache, and that layer fetches the resource itself from the second layer if the cache misses.
   - Better if the cache is being used for very large files with low hit percentages, because otherwise with the first option, the cache would be overwhelmed catching up.
   - Also better for files that are stored statically in the cache (no eviction).
   - Application understands usage patterns better than the cache.
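
The two placements can be contrasted in code. This is a sketch under stated assumptions: `ReadThroughCache`, `cache_aside_get`, and `fetch_from_next_layer` are hypothetical names, the cache is a plain in-memory dict, and eviction is left out.

```python
class ReadThroughCache:
    """Option 1: the cache sits between the layers and fetches on a miss."""

    def __init__(self, fetch_from_next_layer):
        self._fetch = fetch_from_next_layer
        self._store = {}

    def get(self, key):
        if key not in self._store:
            # The cache itself goes to the next layer.
            self._store[key] = self._fetch(key)
        return self._store[key]


def cache_aside_get(key, cache: dict, fetch_from_next_layer):
    """Option 2: the caller checks the cache and fetches on a miss itself."""
    if key in cache:
        return cache[key]
    value = fetch_from_next_layer(key)
    # The caller decides whether this value is worth caching.
    cache[key] = value
    return value
```

In the second form the calling layer can skip the final write for very large or rarely requested items, which is where its knowledge of usage patterns comes in.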

Answer 55

SQL:
- Better for consistent schemas (incl. foreign key constraints) and requirements (including volume) that are not changing quickly.
- Need ACID compliance for strict durability, e.g. in e-commerce and finance.
- Most SQL databases use B-trees for indexes, which are optimized for reads.
NoSQL:
- More scalable because of easier sharding and because nodes can be run on commodity hardware.
- Rapid development; migrations are more fluid.
- Easier to store large amounts of data that have little to no structure.
- Most NoSQL databases use LSM trees for indexes, which are optimized for writes.

Answer 56

- Number of users - Daily/monthly [main function] (posts, tweets, views, etc.) - Utilization of traffic, storage, bandwidth, memory

Answer 57

- Walk through every step of the workflows and make sure they would work with what you described - Standby replicas - Partitioning scheme

(83 cards)