Foundational Distributed System Concepts -- Availability and Reliability Flashcards
Study concepts (38 cards)
What are the types of replication?
Master-slave, multi-master, quorum-based, leaderless
These replication types refer to how data is copied and synchronized across different nodes in a distributed system.
What is the difference between synchronous and asynchronous replication?
Synchronous replication waits for replicas to acknowledge a write before confirming it to the client; asynchronous replication confirms the write first and propagates it to replicas afterwards
Synchronous replication ensures immediate consistency, whereas asynchronous may lead to temporary inconsistencies.
What does the CAP Theorem stand for?
Consistency, Availability, Partition tolerance
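The CAP Theorem states that a distributed data store can guarantee at most two of the three properties at the same time. In practice, when a network partition occurs, the system must choose between consistency and availability.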
What are the implications of the CAP Theorem?
Trade-offs between consistency, availability, and partition tolerance in distributed systems
Understanding the CAP Theorem helps in designing systems based on their requirements for consistency, availability, or tolerance to network partitions.
Give an example of a system that prioritizes consistency.
ACID databases
ACID (Atomicity, Consistency, Isolation, Durability) databases focus on ensuring reliable transactions.
Give an example of a system that prioritizes availability.
Eventually consistent NoSQL databases
These databases prioritize being available for reads and writes, even if they are not immediately consistent.
What is fault tolerance?
The ability of a system to continue operating in the event of a failure
Fault tolerance is crucial for maintaining service availability and reliability.
List some strategies for achieving fault tolerance.
- Redundancy
- Failover
- Self-healing
- Circuit breakers
- Retries
These strategies help systems handle failures gracefully and maintain operational stability.
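Two of the strategies above, retries and circuit breakers, can be sketched in a few lines. This is a minimal illustration, not a production implementation: the names `retry`, `CircuitBreaker`, and `CircuitOpenError`, and the default thresholds and delays, are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open (or re-open) after a failed trial call or too many failures.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Retries handle short, transient failures; the breaker stops callers from hammering a dependency that keeps failing.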
Why is data replication fundamental in distributed systems? What problems does it solve?
High Availability (HA): If one node fails, others can continue serving requests.
Fault Tolerance: The system can withstand node or network failures.
Read Scalability: Distributing read requests across multiple replicas.
Lower Latency: Placing data geographically closer to users.
Disaster Recovery: Protecting data against regional outages.
What dictates the selection of a replication strategy?
CAP theorem. You generally have to choose two out of three, or more practically, decide on the degree of each.
Describe how Leader-Follower Replication (Primary-Replica/Master-Slave) works
Most common and often simplest. How it works:
1. Writes: Clients send all write requests to the leader.
2. Replication Log: The leader records changes (e.g., in a write-ahead log, binary log, or sequence of operations).
3. Propagation: The leader sends this log of changes to its followers.
4. Application: Followers apply these changes in the same order as the leader.
5. Reads: Reads can be served by the leader or any of the followers.
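The five steps above can be sketched as a toy in-memory model. The `Leader` and `Follower` classes and the `replicate` method are hypothetical names for this example; a real system would ship the log over the network and persist it to disk.

```python
class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def apply(self, entries):
        # Step 4: apply changes in the same order as the leader's log.
        for key, value in entries:
            self.data[key] = value
            self.applied += 1


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # ordered list of (key, value) changes
        self.followers = followers

    def write(self, key, value):
        # Steps 1-2: all writes go to the leader, which records them in its log.
        self.log.append((key, value))
        self.data[key] = value

    def replicate(self):
        # Step 3: send each follower the part of the log it has not applied yet.
        for follower in self.followers:
            follower.apply(self.log[follower.applied:])


f1, f2 = Follower(), Follower()
leader = Leader([f1, f2])
leader.write("x", 1)
leader.write("x", 2)
leader.replicate()
# Step 5: reads can be served by the leader or by any follower.
print(leader.data["x"], f1.data["x"], f2.data["x"])
```

Until `replicate` runs, a read from a follower returns older data than a read from the leader, which is the replication lag described in later cards.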
What are the types of leader-follower replication?
Asynchronous, synchronous, and semi-synchronous replication.
- Asynchronous Replication: The leader writes to its local storage, acknowledges the client, and then replicates changes to followers in the background.
- Synchronous Replication: The leader waits for at least one (or a configured number) of followers to acknowledge receipt of the write before confirming success to the client.
- Semi-Synchronous Replication: A hybrid approach. The leader waits for at least one follower to acknowledge receipt (but not necessarily commit) of the write, and then acknowledges the client. It provides a good balance between performance and durability.
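The three modes differ in what has happened on the followers by the time the client is acknowledged. The sketch below models that under stated assumptions: the leader waits on the first follower only, "sync" is taken to mean that follower has both received and committed the write, and "semi-sync" means it has only received it. The class and method names are made up for this example.

```python
class Follower:
    def __init__(self):
        self.received = []   # writes delivered to this follower
        self.committed = []  # writes applied to this follower's storage

    def receive(self, entry):
        self.received.append(entry)

    def commit(self, entry):
        self.committed.append(entry)


class Leader:
    MODES = ("async", "semi-sync", "sync")

    def __init__(self, followers, mode):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.followers = followers
        self.mode = mode
        self.storage = []
        self.background = []  # (action, entry) pairs left to run after the ack

    def write(self, entry):
        self.storage.append(entry)
        for index, follower in enumerate(self.followers):
            waits_for_this_one = index == 0 and self.mode != "async"
            if waits_for_this_one:
                follower.receive(entry)      # leader blocks until receipt
                if self.mode == "sync":
                    follower.commit(entry)   # and, in sync mode, until commit
                else:
                    self.background.append((follower.commit, entry))
            else:
                self.background.append((follower.receive, entry))
                self.background.append((follower.commit, entry))
        return "ack"  # the client is acknowledged at this point

    def run_background(self):
        # Everything the leader did not wait for happens here, after the ack.
        for action, entry in self.background:
            action(entry)
        self.background.clear()
```

If the leader crashes before `run_background` runs, an async write exists only on the leader, while a semi-sync or sync write has already reached at least one follower.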
What are the Pros, Cons, and Use Cases of Asynchronous Leader-Follower Replication?
Pros: Low write latency for the client, high write throughput for the leader.
Cons: Potential for data loss if the leader crashes before changes are replicated to followers (RPO > 0). Reads from followers might be stale (eventual consistency).
Use Cases: Most common, suitable for many web applications where slight staleness is acceptable (e.g., social media feeds, most e-commerce product catalogs).
What are the Pros, Cons, and Use Cases of Synchronous Leader-Follower Replication?
Pros: Stronger consistency guarantees (lower RPO, can guarantee linearizability if reads go to the leader and a quorum is used). Less chance of data loss.
Cons: Higher write latency, reduced write throughput (leader blocks until acknowledgment). If a follower fails, it can block writes.
Use Cases: Financial transactions, critical data where data loss is unacceptable, systems requiring strong consistency (e.g., databases for banking ledgers).
What are the Pros of Leader-Follower Replication?
Simplicity: Easier to reason about consistency because there’s a single source of truth for writes.
Strong Consistency (with synchronous replication/leader reads): By directing all reads to the leader, or by using synchronous replication, strong consistency can be achieved.
Read Scalability: Easy to scale reads by adding more followers.
Good for Read-Heavy Workloads: Can offload read traffic from the leader.
Conflict Avoidance: No write conflicts, as only the leader writes.
What are the Cons of Leader-Follower Replication?
Single Point of Failure for Writes (Leader): If the leader fails, writes are blocked until a new leader is elected. This involves leader election, which itself often uses a consensus algorithm (like Raft or Paxos) to ensure all nodes agree on the new leader.
Replication Lag: Asynchronous replication introduces a delay (lag) between the leader and followers, leading to stale reads if clients read from followers.
Leader Bottleneck: All writes must go through the leader, which can become a bottleneck for write-heavy workloads or very high write throughput requirements.
Failover Complexity: Manual or automatic failover mechanisms are required to promote a new leader, which adds operational complexity and can lead to downtime during election.
What are the Tradeoffs of Leader-Follower Replication?
Consistency vs. Availability: Synchronous offers stronger consistency but lower availability and higher write latency; asynchronous offers higher availability and lower write latency but weaker consistency.
Read Scalability vs. Write Throughput: Excellent read scalability, but write throughput is limited by the single leader.
When do you use Leader-Follower Replication?
Most common for traditional RDBMS (MySQL, PostgreSQL, Oracle Data Guard).
Read-heavy workloads where writes need strong consistency and follower reads can be eventually consistent (reads can go to the leader when stronger consistency is needed).
Applications with clear transactional boundaries.
Simpler systems where operational complexity needs to be minimized.
Describe how Multi-Master (Active-Active) Replication works
Writes: Clients can send writes to any of the designated master nodes.
Inter-Master Replication: Each master replicates its changes to all other masters. This can be synchronous or asynchronous.
Conflict Resolution: This is the most significant challenge. If the same data is modified concurrently on different masters, conflicts arise and must be resolved.
What are the conflict resolution strategies used in Multi-Master Replication?
Last Write Wins (LWW): The write with the most recent timestamp (or version number) wins. Simple, but the losing concurrent write is silently discarded, and timestamp ordering depends on reasonably synchronized clocks.
Merge Operations: For certain data types (e.g., sets, lists), changes can be merged.
Application-Specific Logic: The application provides custom logic to resolve conflicts.
Conflict Avoidance: Design the system to ensure that different masters rarely (or never) write to the exact same data items simultaneously (e.g., partition data by geographic region, user ID). This is often the most practical approach.
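Three of these strategies are small enough to sketch directly. The function names, the `(timestamp, node_id, value)` version layout, and the modulo routing rule are assumptions made for this example, not a reference to any particular database.

```python
def last_write_wins(a, b):
    """Pick the newer of two versions shaped (timestamp, node_id, value).

    The node_id breaks ties so every master picks the same winner.
    The losing value is dropped, which is where LWW can lose data.
    """
    return max(a, b, key=lambda version: (version[0], version[1]))


def merge_sets(a, b):
    """Merge concurrent additions to a set by taking the union."""
    return a | b


def home_master(user_id, masters):
    """Conflict avoidance: route every write for a user to one fixed master."""
    return masters[user_id % len(masters)]


print(last_write_wins((10, "a", "draft"), (12, "b", "final")))
print(merge_sets({"milk"}, {"eggs"}))
print(home_master(7, ["us", "eu", "ap"]))
```

A union merge only handles additions; supporting removals as well needs a richer data type than a plain set.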
What are the Pros of Multi-Master Replication?
High Write Availability: No single point of failure for writes; if one master goes down, others can continue.
Low Write Latency (Geo-distributed): Clients can write to the nearest master, reducing latency in global deployments.
Improved Write Throughput: Writes can be distributed across multiple masters.
Disaster Recovery: If an entire data center fails, other data centers can continue operating.
What are the Cons of Multi-Master Replication?
Complex Conflict Resolution: The biggest challenge. Designing, implementing, and debugging conflict resolution logic is hard and error-prone.
Data Inconsistency Risk: Unless strong synchronous replication and coordination are used (which negates many benefits), there’s a higher risk of temporary data inconsistencies due to concurrent writes and replication lag.
Increased Operational Complexity: More difficult to set up, monitor, and troubleshoot than single-leader systems.
Circular Replication Problems: In some topologies, changes can loop back or cause endless replication.
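One common way to avoid the circular replication problem is to tag each change with the master that originated it and to drop the change once it arrives back there. The sketch below assumes a ring topology (a to b to c to a); the `Master` class and its methods are hypothetical names for this example.

```python
class Master:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.next = None         # next master in the ring
        self.applied_remote = 0  # changes applied that originated elsewhere

    def local_write(self, key, value):
        # A client write: apply locally, then forward tagged with our name.
        self.data[key] = value
        self.next.replicate(self.name, key, value)

    def replicate(self, origin, key, value):
        if origin == self.name:
            return  # the change has gone full circle; stop forwarding
        self.data[key] = value
        self.applied_remote += 1
        self.next.replicate(origin, key, value)


a, b, c = Master("a"), Master("b"), Master("c")
a.next, b.next, c.next = b, c, a
a.local_write("k", 1)
print(a.data, b.data, c.data)
```

Without the origin check, the change would keep circulating around the ring indefinitely.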
What are the Tradeoffs in Multi-Master Replication?
Consistency vs. Availability/Performance: Prioritizes availability and potentially write performance over strong consistency. Conflict resolution lets the masters converge on the same state over time, which gives eventual rather than strong consistency.
Complexity: Significantly higher complexity in design and operation.
When do you use Multi-Master Replication?
Global applications requiring low write latency for users in different geographical regions (e.g., online collaborative editing, distributed social media where users mostly modify their own data).
When high write availability across multiple sites is paramount, and the application can tolerate or effectively handle eventual consistency and conflicts.
Not ideal for systems requiring strict ACID transactions across distributed masters without significant additional coordination (like distributed transactions, which add complexity and latency).