Chapter 2 Flashcards
(43 cards)
When does a failure occur?
A failure occurs when a process behaves incorrectly. Our unit of failure is the entire process, meaning all of its components fail together.
Crash
The process stops completely. This is called a crash fault; the crash-stop abstraction assumes no recovery after failure.
Omissions
An omission fault occurs when a process fails to send or receive a message that it should have sent or received.
Crash-recovery abstraction: how does it work?
Processes can crash and later recover.
Each process may use stable storage, which survives crashes.
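A minimal sketch of the crash-recovery idea, assuming stable storage is just a local file; the file name and state layout are illustrative, not part of any standard API:

import json
import os

STATE_FILE = "process_state.json"   # illustrative stable-storage location

def store(state: dict) -> None:
    # Persist state so it survives a crash of the process.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def retrieve() -> dict:
    # On recovery, reload whatever was last stored (empty state if nothing was).
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

# A recovering process first restores its state, then resumes the algorithm.
state = retrieve()
state["restarts"] = state.get("restarts", 0) + 1
store(state)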
When do eavesdropping faults happen, and are they easy to detect?
An eavesdropping fault happens when a process leaks internal information to an outsider. The attacker can correlate leaked data from multiple processes.
The process still behaves correctly, so this fault is invisible from within the system.
However, attackers must usually run malicious code on the host machine, which may be detected.
What are arbitrary faults (Byzantine faults), and when do they happen?
An arbitrary fault is when a process behaves in any unexpected or incorrect way.
Most general and challenging type of failure.
In practice, arbitrary faults may come from:
Malicious attacks
Software bugs
Faulty compilers or hardware
Q: What are the key characteristics of an asynchronous distributed system?
A:
No timing guarantees on processing or communication
No physical clocks
Uses logical clocks (Lamport clocks) to track causality: increment on events and message sends; on receiving a message, clock := max(local, received) + 1 (see the sketch after this card)
Defines the happened-before relation (→)
Consensus is impossible if even one process can crash
Simple Explanation:
In an asynchronous system, you can’t rely on any sense of time—things might take a short or long time, and you never know. To track the order of events, processes use counters instead of real clocks. But this lack of timing makes it impossible to always reach agreement if one process might fail.
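A minimal sketch of the Lamport-clock rules listed above; the class and method names are illustrative:

class LamportClock:
    # Logical clock: increment on local events and sends, merge on receive.
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event or message send.
        self.time += 1
        return self.time

    def receive(self, received_time: int) -> int:
        # On receiving a message: clock := max(local, received) + 1.
        self.time = max(self.time, received_time) + 1
        return self.time

# The send gets a smaller timestamp than its receive, reflecting happened-before (→).
p, q = LamportClock(), LamportClock()
ts = p.tick()      # p sends a message carrying timestamp ts
q.receive(ts)      # q's clock jumps past ts when the message is delivered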
Q: What defines a synchronous distributed system, and what are its advantages?
A:
A synchronous distributed system is one where the following are known and bounded:
The time it takes for a process to perform a step.
The time it takes for a message to travel between processes.
The difference between any two clocks (bounded by a value δ).
Timed failure detection:
If a process crashes, others can detect it within a known amount of time, for example by using a heartbeat system where processes regularly send signals. If a signal is missed, the process is considered crashed.
Message delay estimation:
Since message delays are bounded, we can estimate how long it takes for a message to reach another process, which helps identify slower or more distant links.
Time-based coordination (e.g., leases):
A process can get exclusive rights to perform an action (like writing a file) for a fixed period. After that, the right expires automatically, which is useful for safe coordination (see the lease sketch after this card).
Worst-case performance guarantees:
Because everything has upper time bounds, you can predict the maximum time an algorithm takes to complete, even if some messages are lost or delayed (as long as the number is within limits).
Clock synchronization:
All clocks in the system can be kept close to each other — within δ time units — which makes it possible to coordinate actions and use accurate timestamps.
Limitation: these timing bounds are hard to guarantee in large, real-world networks like the Internet.
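As referenced under time-based coordination above, here is a minimal lease sketch. It assumes the synchronous model's clock bound δ; the class name, the safety margin, and the use of a single local clock are illustrative simplifications:

import time

class Lease:
    # Exclusive right to act that expires automatically after a fixed duration.
    def __init__(self, duration: float, clock_skew: float = 0.1):
        self.duration = duration
        self.clock_skew = clock_skew   # assumed bound (δ) on clock differences
        self.expires_at = 0.0

    def acquire(self) -> None:
        self.expires_at = time.monotonic() + self.duration

    def still_valid(self) -> bool:
        # Stop acting a little early (by the skew bound) so two processes never
        # both believe they hold the lease at the same moment.
        return time.monotonic() < self.expires_at - self.clock_skew

lease = Lease(duration=5.0)
lease.acquire()
if lease.still_valid():
    pass   # safe to perform the protected action, e.g. writing the file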
Q: What is a partially synchronous system and why is it practical?
A:
Realistic model: mostly synchronous, but can behave asynchronously at times
Two forms: the system becomes synchronous after some unknown time, or it has intermittent synchronous periods
Reflects real-world issues (e.g., network overload, GC pauses)
Algorithms can use these synchronous windows to make progress or terminate
Simple Explanation:
Partial synchrony is a middle ground. Most of the time, things work fast and reliably, but sometimes delays or problems happen. If your algorithm is smart, it can wait for the “good times” to finish its job. This model fits most real systems.
What problem do failure detectors address, and how do they help?
Problem: Detecting crashed processes in distributed systems is tricky. Systems can be:
Asynchronous: no timing guarantees (hard to detect failures).
Synchronous: strict timing (easy detection but impractical).
Partially synchronous: eventually stabilizes to synchronous behavior (practical balance).
Solution: Failure detectors abstract timing assumptions. They:
Report suspected crashed processes.
Avoid complex timing code in processes/links.
Are safer (no direct time references) and simpler.
Q: What is a failure detector in distributed systems and why is it useful?
A:
A failure detector provides (possibly inaccurate) information about which processes have crashed or are correct.
It captures the timing assumptions of synchronous or partially synchronous systems.
Keeps process and link abstractions simple by not adding timing directly.
Allows reasoning about failures without relying on exact physical time.
Works well for crash faults by checking whether processes stop performing expected actions.
Not reliable for Byzantine faults, because malicious processes can hide failures.
Byzantine failure detectors are difficult to implement and less useful.
Simple Explanation:
A failure detector guesses which processes have crashed by monitoring their actions, helping manage failures without complex timing rules. It works best when processes just crash, not when they behave maliciously.
Q: What is a perfect failure detector and how does it work in synchronous systems?
A:
Detects crashed processes without false positives.
Eventually detects every crashed process permanently.
Uses timeouts based on known worst-case delays.
Sends heartbeat messages and expects timely replies.
Declares a process crashed if no reply arrives before the timeout.
Assumes reliable communication and known timing bounds.
Simple Explanation:
A perfect failure detector reliably tells when a process crashes by waiting for timely responses. If no reply arrives within a set time, it assumes the process crashed. It never makes mistakes but depends on strict timing guarantees.
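A minimal sketch of the timeout logic behind a perfect failure detector. Real implementations exchange heartbeat messages over the network; here heartbeat delivery is abstracted into a method call, and max_delay stands for the known worst-case round-trip plus processing time:

import time

class PerfectFailureDetector:
    # Declares a process crashed if it misses its heartbeat deadline.
    # Correct only under synchronous assumptions: reliable links and a known
    # worst-case delay, so a missed deadline really does mean a crash.
    def __init__(self, processes, max_delay: float):
        self.max_delay = max_delay
        self.last_heartbeat = {p: time.monotonic() for p in processes}
        self.detected = set()

    def on_heartbeat(self, p) -> None:
        self.last_heartbeat[p] = time.monotonic()

    def check(self) -> set:
        now = time.monotonic()
        for p, last in self.last_heartbeat.items():
            if p not in self.detected and now - last > self.max_delay:
                self.detected.add(p)   # never undone: crash-stop, no recovery
        return self.detected

fd = PerfectFailureDetector(["p1", "p2"], max_delay=2.0)
fd.on_heartbeat("p1")
crashed = fd.check()   # empty until some process exceeds the deadline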
Q: What is leader election in distributed systems and how does it work?
A:
Leader election chooses one correct process as a unique leader.
If the current leader crashes, a new leader is eventually elected.
Ensures at most one leader at a time (accuracy).
Useful in systems like primary-backup replication.
Can be implemented using a perfect failure detector and process ranking.
A process becomes leader only once every higher-ranked process has crashed.
Simple Explanation:
Leader election picks one process to coordinate the group. If that leader fails, the next highest-ranked process takes over. This ensures the system always has a single trusted leader.
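A minimal sketch of rank-based leader election driven by a perfect failure detector; the process names and the list ordering (highest rank first) are illustrative:

class MonarchicalLeaderElection:
    # The leader is always the highest-ranked process not yet detected as
    # crashed, so a process takes over only when everyone ranked above it fails.
    def __init__(self, ranked_processes):
        self.ranked = list(ranked_processes)   # highest rank first, e.g. ["p1", "p2", "p3"]
        self.detected = set()

    def on_crash(self, p) -> None:
        # Notification from the perfect failure detector.
        self.detected.add(p)

    def leader(self):
        for p in self.ranked:
            if p not in self.detected:
                return p
        return None   # every process has crashed

election = MonarchicalLeaderElection(["p1", "p2", "p3"])
assert election.leader() == "p1"
election.on_crash("p1")
assert election.leader() == "p2"   # the next highest-ranked process takes over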
Q: What is an eventually perfect failure detector and how does it work?
A:
Detects crashed processes accurately only after some unknown time.
May suspect correct processes temporarily (false suspicions) before that time.
Uses timeouts and heartbeat messages like perfect failure detectors.
When a suspected process responds, it is restored and the timeout is increased.
Guarantees eventual strong accuracy: after some point, no correct process is suspected.
Ensures strong completeness: all crashed processes are eventually suspected permanently.
Designed for partially synchronous systems with unknown timing bounds.
Simple Explanation:
An eventually perfect failure detector may make mistakes at first by wrongly suspecting healthy processes. But over time, it learns to adjust its timing and stops false alarms, reliably detecting actual crashes in systems with unpredictable delays.
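A minimal sketch of the increasing-timeout idea described above; heartbeat delivery is abstracted into a method call, and doubling the timeout is just one possible back-off policy:

class EventuallyPerfectFailureDetector:
    # Suspects silent processes, but restores them and increases the timeout
    # whenever a suspicion turns out to be false.
    def __init__(self, processes, initial_timeout: float):
        self.processes = set(processes)
        self.timeout = initial_timeout
        self.alive = set(processes)   # processes heard from in the current period
        self.suspected = set()

    def on_heartbeat(self, p) -> None:
        self.alive.add(p)

    def on_timeout_period(self) -> None:
        if self.alive & self.suspected:
            # A suspected process replied: it was a false suspicion.
            self.timeout *= 2          # back off so the mistake stops repeating
            self.suspected -= self.alive
        # Anyone silent for the whole period becomes suspected.
        self.suspected |= self.processes - self.alive
        self.alive = set()             # start the next monitoring period

fd = EventuallyPerfectFailureDetector(["p1", "p2"], initial_timeout=1.0)
fd.on_timeout_period()          # first period: everyone started out alive
fd.on_heartbeat("p1")           # only p1 is heard from in the next period
fd.on_timeout_period()
assert fd.suspected == {"p2"}   # suspected until p2 is heard from again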
Q: What is an eventual leader detector (Ω) and what are its key properties?
A:
The eventual leader detector (Ω) ensures that eventually all correct processes agree on the same correct leader.
Leaders may change arbitrarily for some time before stabilizing.
Two key properties:
Eventual accuracy: after some time, every correct process trusts some correct leader.
Eventual agreement: after some time, no two correct processes trust different leaders.
Useful in consensus algorithms when perfect leader election is impossible.
Implementations include:
Monarchical Eventual Leader Detection: uses the eventually perfect failure detector (◇P) and trusts the highest-ranked non-suspected process.
Elect Lower Epoch: uses epoch numbers to pick the leader that has crashed the least; works under the crash-recovery model.
Simple Explanation:
An eventual leader detector helps all non-faulty processes pick one leader after some time, even if leadership changes unpredictably at first. It’s a practical way to coordinate in unreliable systems.
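A minimal sketch of Monarchical Eventual Leader Detection, assuming suspicions and restores arrive from an eventually perfect failure detector (◇P); the names and ordering are illustrative:

class MonarchicalEventualLeaderDetector:
    # Omega: trust the highest-ranked process that is currently not suspected.
    # While ◇P still makes mistakes the trusted leader may change; once its
    # suspicions stabilize, all correct processes trust the same leader.
    def __init__(self, ranked_processes):
        self.ranked = list(ranked_processes)   # highest rank first
        self.suspected = set()

    def on_suspect(self, p) -> None:
        self.suspected.add(p)

    def on_restore(self, p) -> None:
        self.suspected.discard(p)

    def trusted(self):
        for p in self.ranked:
            if p not in self.suspected:
                return p
        return None

omega = MonarchicalEventualLeaderDetector(["p1", "p2", "p3"])
omega.on_suspect("p1")           # possibly a false suspicion
assert omega.trusted() == "p2"
omega.on_restore("p1")           # ◇P corrects itself
assert omega.trusted() == "p1"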
Q: What is the Byzantine eventual leader-detector abstraction and how does it work?
A:
It “trusts but verifies” leaders: a newly elected leader is given time to perform its task.
Other processes monitor the leader’s actions and can complain if it fails or is too slow.
If more than f correct processes complain about a leader, a new leader is elected.
Leaders get progressively more time to complete their tasks in an eventually synchronous system.
Requires input from the higher-level algorithm to judge the leader’s performance.
Key properties:
Eventual succession: faulty leaders are eventually replaced after enough complaints.
Putsch resistance: leaders are only removed if at least one correct process complains.
Eventual agreement: eventually all correct processes agree on the same leader.
Implemented by rotating leaders deterministically in rounds (leader = process with rank = round mod N).
Complaints are broadcast and collected; once more than 2f complaints are received, the system moves to the next leader.
Simple Explanation:
This leader detector picks leaders in turn and watches if they do their job well. If many honest processes complain about a leader, it gets replaced. This way, the system can handle malicious leaders but still eventually agree on a good one.
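A minimal sketch of the rotating, complaint-driven scheme described above, assuming N > 3f so that more than 2f complaints must include at least f + 1 correct processes; the class and method names are illustrative:

class RotatingByzantineLeaderDetector:
    # Leader of round r is the process with rank r mod N; once more than 2f
    # distinct complaints arrive, the round (and hence the leader) advances.
    def __init__(self, processes, f: int):
        self.processes = list(processes)   # indexed by rank 0..N-1
        self.f = f                         # assumed bound on Byzantine processes
        self.round = 0
        self.complained = set()            # who complained in the current round

    def leader(self):
        return self.processes[self.round % len(self.processes)]

    def on_complaint(self, sender) -> None:
        self.complained.add(sender)        # each process counted at most once
        if len(self.complained) > 2 * self.f:
            self.round += 1                # replace the leader
            self.complained = set()

detector = RotatingByzantineLeaderDetector(["p0", "p1", "p2", "p3"], f=1)
assert detector.leader() == "p0"
for sender in ["p1", "p2", "p3"]:          # more than 2f = 2 complaints
    detector.on_complaint(sender)
assert detector.leader() == "p1"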
Q: How is each process modeled in a distributed system?
A: As a state machine (automaton) that independently executes the same algorithm, performing steps such as receiving messages, local computation, and sending messages.
Simple Explanation:
Each process is like a little robot following the same instructions but working on its own, reacting to messages and doing work step-by-step.
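A minimal sketch of a process modeled as an event-driven automaton: each step consumes an event, updates local state, and may produce messages to send. The event names and state layout are illustrative:

class Process:
    # A process as a state machine: react to an event, compute locally,
    # and queue any messages to be sent as a result.
    def __init__(self, pid: str):
        self.pid = pid
        self.delivered = []   # local state
        self.outbox = []      # messages produced by local steps

    def step(self, event: tuple) -> None:
        kind, payload = event
        if kind == "deliver":
            self.delivered.append(payload)                   # local computation
            self.outbox.append(("send", self.pid, payload))  # message to send

p = Process("p1")
p.step(("deliver", "hello"))   # one step: receive, compute, send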
Q: What is the timing model assumed in distributed systems?
A: A global clock ticks regularly, with each process performing exactly one step per tick, and steps are sequenced even if they happen simultaneously in real time.
Simple Explanation:
Imagine all processes have synchronized watches and take turns acting one step at a time, even if multiple things happen “at once.”
Q: What are the two types of events in distributed timing, and how do they differ?
Local events: Fast, happen inside one process.
Communication events: Slower, involve network delays and dominate performance.
Simple Explanation:
Doing stuff inside your computer is quick; talking to other computers takes longer and slows things down.
Q: What does safety mean in distributed algorithms?
A: “Nothing bad happens” — the system never performs an incorrect action (e.g., no fake or duplicated messages, messages arrive in order).
Simple Explanation:
The system should never mess up or do anything wrong that breaks the rules.
Q: What does liveness mean in distributed algorithms?
A: “Something good eventually happens” — the system continues to make progress (e.g., messages eventually arrive, requests eventually get responses).
Simple Explanation:
The system keeps working and making progress, it doesn’t get stuck forever.
Q: Why must distributed systems guarantee both safety and liveness?
A: Either property alone is trivial to satisfy: a system that does nothing never violates safety, and a system that does arbitrary things can trivially make progress. Only together do they describe a system that makes progress without ever doing anything wrong.
Simple Explanation:
You want a system that never breaks (safety) but also keeps moving forward (liveness).
Q: What safety and liveness properties does a reliable message link guarantee?
A: Safety: No duplicates and correct order.
Liveness: Messages are eventually delivered.
Simple Explanation:
Messages get to the right place, only once, and eventually.
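A minimal sketch of the receiving side of a reliable link, showing the safety guarantees from the answer above: duplicates are dropped and messages are handed to the application in sequence order. Eventual delivery (liveness) would additionally need sender retransmissions, which are omitted here; all names are illustrative:

class ReliableLinkReceiver:
    # Filters duplicates and releases messages in order of sequence number.
    def __init__(self):
        self.next_seq = 0     # sequence number expected next
        self.buffer = {}      # out-of-order messages held back
        self.delivered = []

    def on_receive(self, seq: int, msg) -> None:
        if seq < self.next_seq or seq in self.buffer:
            return            # duplicate: already delivered or already buffered
        self.buffer[seq] = msg
        while self.next_seq in self.buffer:   # deliver in order, with no gaps
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

rx = ReliableLinkReceiver()
rx.on_receive(1, "b")         # arrives early, held back
rx.on_receive(0, "a")
rx.on_receive(0, "a")         # duplicate, ignored
assert rx.delivered == ["a", "b"]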
Q: What three abstractions combine to form a distributed-system model?
A:
Process abstraction: how processes behave and fail (e.g., crash, recover, or act maliciously).
Link abstraction: how messages are delivered between processes, including guarantees like reliability, ordering, or loss.
Failure-detector abstraction (optional): how processes detect or suspect failures of others, often imperfect or delayed.
Simple Explanation:
A distributed-system model sets the rules for how processes work, how they communicate, and how they detect failures. These assumptions help design and analyze distributed algorithms.