Systems Design Flashcards

1
Q

What are the five key characteristics of distributed systems?

A
  1. Scalability
  2. Reliability
  3. Availability
  4. Efficiency
  5. Serviceability or Maintainability
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is scalability?

A

The ability of a system to elastically handle more demand without loss of data or performance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is horizontal vs vertical scaling?

A

Horizontal scaling: Adding more nodes to parallelize operations.
Vertical scaling: Adding more resources to existing nodes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What database systems are best for scaling vertically?

A

Most SQL solutions like MySQL, PostgreSQL, etc., because SQL transactions do not lend themselves well to distribution.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What database systems are best for scaling horizontally?

A

Most NoSQL solutions like Cassandra and MongoDB because they lend themselves well to sharding and tend to avoid the integrity constraints of SQL.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is reliability?

A

The probability that a system will fail during a given time period.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

When is a system considered reliable?

A

A system is reliable if it keeps delivering its services even when one or more of its components fails.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is availability?

A

The time a system remains operational to perform its required functions, expressed as a percentage.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How do availability and reliability relate as metrics?

A

Reliability is availability over time with consideration to all possible real-world conditions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Is a reliable system available? What about the converse?

A

If a system is reliable, it is available, but the converse is not necessarily true. A system can be available without being reliable by minimizing downtime, having spare parts and nodes available, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is efficiency?

A

A measure of system performance using one of two metrics: latency, or throughput.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is serviceability or maintainability?

A

The speed with which a system can be repaired or maintained. Failing systems lead to lower availability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are six of the most common load balancing algorithms?

A
  1. Least Connection Method
  2. Least Response Time Method
  3. Least Bandwidth Method
  4. Round Robin Method
  5. Weighted Round Robin Method
  6. IP Hash
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How do you handle load balancer failures?

A

Add a redundant load balancer that takes over when the other one fails. Use multiple A records so that browsers auto-resolve to the first working LB.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are three schemes for cache coherence and cache invalidation?

A
  1. Write-through caching
  2. Write-around caching
  3. Write-back caching
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is write-through caching?

A

Writes go to the database and cache at the same time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What are the advantages and drawbacks of write-through caching?

A

Cache coherence is guaranteed, and no data is lost in any faults, but write latency is higher because of the need for two writes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is write-around caching?

A

Writes go first to the database, then the cache is invalidated and the next request leads to a fetch.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What are the advantages and drawbacks of write-around caching?

A

Reduces latency and the risk of the cache being flooded with writes, but leads to more cache misses which slows down reads.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is write-back caching?

A

Writes go first to the cache, are immediately confirmed with the client, then later (perhaps in batches) to the database.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What are the advantages and drawbacks of write-back caching?

A

Very fast (perceived) reads and writes for high throughput, but leads to possible cache incoherence, especially if the database ultimately rejects the write.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What are the six most common cache eviction policies?

A
  1. First In First Out (FIFO)
  2. Last In First Out (LIFO)
  3. Least Recently Used (LRU)
  4. Most Recently Used (MRU)
  5. Least Frequently Used (LFU)
  6. Random Replacement (RR)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What are the three most popular data partitioning schemes?

A
  1. Horizontal partitioning
  2. Vertical partitioning
  3. Directory-based partitioning
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is directory-based partitioning?

A

Combining horizontal and vertical partitioning by using a lookup server that can be dynamically changed to point a DB key/tuple to a new database instance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What are the four most common data partitioning criteria?

A
  1. Key or hash-based partitioning
  2. List partitioning
  3. Round-robin partitioning
  4. Composite partitioning
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What are the three most common problems with data partioning?

A
  1. Joins and denormalization
  2. Referential integrity
  3. Rebalancing
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What are four use cases for proxies in handling requests?

A
  1. Filtering requests
  2. Logging requests
  3. Transforming requests (adding/removing headers, encrypting/decrypting, compressing)
  4. Serving the request from its cache (ideally shared across users)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is a reverse proxy?

A

Maps a request from one client to one/many servers that serve that request. The resources appear to the client as if they originated from the proxy itself.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What are the four most common types of NoSQL databases?

A
  1. Key-value stores
  2. Document databases
  3. Wide-column databases
  4. Graph databases
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

What is a document database?

A

A NoSQL database where data is stored in documents which are grouped together in collections (like tables). Unlike rows in SQL, documents can have entirely different schemas.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What are two commonly used document databases?

A

MongoDB, CouchDB

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

What is a wide-column database?

A

A NoSQL database where data is stored in dynamically generated columns and where each row can have any set of columns. Like a spreadsheet.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

What are wide-column and columnar databases best for?

A

Analyzing large datasets, since each column can be analyzed independently and they fit better into pages, caches, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

What are two commonly used wide-column databases?

A

HBase and Cassandra

35
Q

What is a graph database?

A

A database where relationships are best represented in a graph. There are nodes (entities), metadata about those nodes, and edges that connect those nodes.

36
Q

What are two commonly used graph databases?

A

Neo4j and InfiniteGraph

37
Q

What is ACID short for?

A

Atomicity, Consistency, Isolation, Durability

38
Q

What are the three guarantees in CAP theorem?

A

Consistency, Availability, Partition Tolerance

39
Q

What is CAP theorem?

A

A theorem that states that it is impossible for a system to simultaneously satisfy more than two of the three CAP guarantees.

40
Q

What is consistency in CAP theorem?

A

The guarantee that all nodes have and return the same data.

41
Q

What is availability in CAP theorem?

A

The guarantee that every request gets a success a response.

42
Q

What is partition tolerance in CAP theorem?

A

The guarantee that if there are network faults between nodes, the system will continue to work.

43
Q

How is consistency in CAP theorem achieved?

A

By waiting until writes have been committed to all nodes before allowing subsequent reads.

44
Q

How is availability in CAP theorem achieved?

A

By replicating data across nodes.

45
Q

How is partition tolerance in CAP theorem achieved?

A

By replicating data across nodes and allowing the partitioned networks to continue functioning independently.

46
Q

How does consistent hashing work?

A

Keys are mapped to nodes by hashing them and then moving clockwise on a hash function “ring” until a node is found. The ring can be updated and a region remapped by adding/removing nodes to locations on the ring.

47
Q

Why do we need consistent hashing?

A

We need to add/remove shards to a database to handle changes in load and usage patterns. Without consistent hashing, we’d have to completely remap all keys anytime we add/remove a shard to/from a distributed hash table. With consistent hashing, we only need to remap the keys for the node being added/removed.

48
Q

How do we load balance with consistent hashing?

A

By adding virtual replicas, i.e., adding the same node multiple times to the same hash function ring.

49
Q

What are the four most common ways to implement notifications using the web?

A
  1. Ajax polling
  2. HTTP long-polling
  3. WebSockets
  4. Server-Sent Events (SSE)
50
Q

What are the advantages and drawbacks of Ajax?

A

Easy to implement, but leads to lots of unnecessary empty requests and responses, which create HTTP overhead.

51
Q

What are the advantages and disadvantages of HTTP long-polling?

A

(Advantages?) Requires persistent open connections to the server.

52
Q

What are the advantages and disadvantages of WebSockets?

A

Only mechanism that enables bi-directional communication, and can be easier to work with for persistent connections. Drawbacks are that it’s more resource-intensive and harder to load balance.

53
Q

How do Server-Sent Events work?

A
  1. The client opens a web page from the server
  2. The rendered web page opens a connection to the server
  3. The server streams events/data in an infinite loop to the client
54
Q

What are the seven steps of a systems design interview?

A
  1. Requirements clarification
  2. Back-of-the-envelope estimation
  3. System interface definition
  4. Data modeling
  5. High-level design
  6. Detailed design
  7. Bottleneck analysis
55
Q

What is the difference between functional and non-functional requirements?

A

Functional requirements specify what the system should do, while non-functional requirements specify how the system should do it.

56
Q

What are the three main topics to cover during the requirements clarification?

A
  1. What are the functional (user-facing) requirements?
  2. What are the non-functional (behavioral) requirements like availability, consistency, etc.?
  3. What parts are we building?
57
Q

What are five areas of functional requirements to remember to explore?

A
  1. Notifications
  2. Analytics
  3. Timelines
  4. Media
  5. Expiry
58
Q

What are some use cases for HBase?

A
  • High write throughput with small random reads. Achieves throughput by storing data in memory and flushing to disk periodically.
  • Very large data (backed by HDFS).

(Note: I don’t see why MongoDB or any LSM-tree based solution wouldn’t work well, too.)

59
Q

What are some use cases for Cassandra?

A

In addition to everything that NoSQL offers, Cassandra is particularly good at:

  • High write throughput.
  • Analytics workloads.
  • Very good availability semantics (quorum).

It is weak at:
- Read latency. Put caches in front (like EVCache) to resolve.

60
Q

What is Mesos?

A

An orchestration system that allows both containers and non-containers. Closest analogy is K8s but with non-containers too.

61
Q

What is Hive?

A

A data warehousing layer that goes on top of Hadoop or Spark. Supports HiveSQL which is a flavor of SQL.

62
Q

In what three places can an LB be inserted?

A

Between:

  • User and web layers.
  • Web and application (or cache) layers.
  • Application (or cache) and database layers.
63
Q

What is a CDN and how does it work?

A

Content delivery network. A kind of cache for large sizes or amounts of static media. Sits between the client and the back-end and serves the static resource if available or acquires it from the back-end if the cache misses.

64
Q

What is a regular internet service read/write ratio?

A

100:1 or 1000:1

65
Q

What is a database quorum?

A

Instead of querying a single node, a set of nodes are all queried simultaneously, and the most up-to-date data among them is taken as the response.

66
Q

In CAP, what are the usual choices?

A

Between CP and AP. CA never happens because single points of failures are big red flags for distributed systems.

67
Q

What are the advantages and disadvantages of a job queue vs a synchronous operation?

A

Advantages:
- Scales effectively because more workers can be added to process the jobs.
- Adds fault tolerance and durability to requests because failed jobs can be retried; far more reliable than client-side error checking.
Disadvantages:
- Work is done asynchronously so needs more UI to tell the user what’s happening.
- Adds complexity at small scale.

68
Q

What are the three main guarantees a database can provide with respect to the consistency of a user’s reads and writes?

A
  1. Read-after-write consistency: Once a user writes changes, they should always see them.
  2. Monotonic reads: Once a user has seen one version of data, they should never see an older version of the same data.
  3. Consistent prefix reads: Reads always happen in an order that makes causal sense, e.g., never seeing a reply before a question.
69
Q

What are the exponents of each of the five denominations of sizes? What are some examples of multiplying them together?

A

You can suffix each of these with B, e.g. K -> KB.

  1. K = 10^3 = 1,000
  2. M = 10^6 = 1,000,000
  3. G = 10^9 = 1,000,000,000
  4. T = 10^12 = 1,000,000,000,000
  5. P = 10^15 = 1,000,000,000,000,000

The number of zeros in the unrolled number is equal to the exponent of ten. Add up the exponents when multiplying.

Examples of multiplication:

  1. K * K = 10^3 * 10^3 = 10^6 = M
  2. K * M = 10^3 * 10^6 = 10^9 = G
  3. K * G = 10^3 * 10^9 = 10^12 = T
  4. K * T = 10^3 * 10^12 = 10^15 = P
  5. M * M = 10^6 * 10^6 = 10^12 = T
  6. M * G = 10^6 * 10^9 = 10^15 = P
70
Q

How many concurrent connections can a modern server handle?

A

About 50K.

71
Q

How much data can a typical DB shard store?

A

About 4 TB.

72
Q

What is a classic trick for systems design with regards to number of concurrent users?

A

The number of concurrents on a service changes throughout the day. Add a *10 fudge factor to account for the peaks.

73
Q

What is the cost per gigabyte of storage?

A

$0.1

74
Q

How many seconds are in a year?

A

31M

75
Q

How many seconds are there in a day?

A

606024 ~= 100K

76
Q

What is eventual consistency?

A

The guarantee that a system will self-heal and eventually get back into a consistent state.

77
Q

What is the difference between clustering and sharding?

A

Clustering is decentralized, automatic, and is done by the nodes themselves. They all talk to each other to persist each other’s data.

Sharding is centralized and is done by application or service code. Data is mapped to a shard before the shard is ever accessed, and shards are unaware of each other.

78
Q

When deciding between the two factors of CAP theorem to hold, what are the usual choices?

A

Between CP and AP. CA never happens because single points of failures are big red flags for distributed systems.

79
Q

How can we find one small piece of data among a very large distributed dataset?

A

Usually using an index, or a distributed index. A distributed index is an index that itself is partitioned across nodes. Only needed for secondary indexes because the primary key is already what’s typically used to partition the data. If you don’t have an index on the data, then you would have to make a request to all nodes simultaneously and they would all have to perform full scans.

80
Q

In what two ways can a global cache be placed between two layers?

A
  1. Between the two layers. If the cache misses, then the cache itself is responsible for fetching the resource from the next layer.
    - This option is more common because it prevents a flood of requests from all going onto the next layer.
  2. Next to the two layers. The first layer is responsible for querying the cache, and that layer fetches the resource itself from the second layer if the cache misses.
    - Better if the cache is being used for very large files with low hit percentages, because otherwise with the first option, the cache would be overwhelmed catching up.
    - Also better for files that are stored statically in the cache (no eviction).
    - Application understands usage patterns better than the cache.
81
Q

Why would you use each of a SQL database and a NoSQL database?

A

SQL:

  • Better for consistent schemas (incl. foreign key constraints) and requirements (including volume) that are not changing quickly.
  • Need ACID compliance for strict durability, e.g. in e-commerce and finance.
  • Most SQL databases use B-trees for indexes, which are optimized for reads.

NoSQL:

  • More scalable because of easier sharding and because nodes can be run on commodity hardware.
  • Rapid development; migrations are more fluid.
  • Easier to store large amounts of data that have little to no structure.
  • Most NoSQL databases use LSM trees for indexes, which are optimized for writes.
82
Q

What should you cover in step two, estimations?

A
  • Number of users
  • Daily/monthly [main function] (posts, tweets, views, etc.)
  • Utilization of traffic, storage, bandwidth, memory
83
Q

What should you cover in step six, detailed design?

A
  • Walk through every step of the workflows and make sure they would work with what you described
  • Standby replicas
  • Partitioning scheme