Chapter 2 Flashcards

(43 cards)

1
Q

When does a failure Occur?

A

Failure occurs when a process behaves incorrectly- our unit of failure is the entire process, meaning all components fail together.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Crash

A

Process stops completely. Called a crash-fault, and the crash-stop abstraction assumes no recovery after failure.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Omissions

A

When a process fails to send or receive a message it should

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Crash-Recovery Abstraction, how does it work

A

Processes can crash and later recover.

Each process may use stable storage, which survives crashes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

When does eavesdropping faults happen, is this easy to see?

A

Eavesdropping fault happens when a process leaks internal information to an outsider. The attacker can correlate leaked data from multiple processes.

The process still behaves correctly, so this fault is invisible from within the system.

However, attackers must usually run malicious code on the host machine, which may be detected.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What and when Arbritrary faults (Byzantine Faults) happen

A

A arbitrary fault is when a process behaves in any unexpected or incorrect way.
Most general and challenging type of failure.

In practice, arbitrary faults may come from:

Malicious attacks

Software bugs

Faulty compilers or hardware
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Q: What are the key characteristics of an asynchronous distributed system?

A

Flashcard 1: Asynchronous Systems
Q: What are the key characteristics of an asynchronous distributed system?
A:

No timing guarantees on processing or communication

No physical clocks

Uses logical clocks (Lamport clocks) to track causality

    Increment on events and message sends

    On receiving a message: clock := max(local, received) + 1

Defines the happened-before relation (→)

Consensus is impossible if even one process can crash

Simple Explanation:
In an asynchronous system, you can’t rely on any sense of time—things might take a short or long time, and you never know. To track the order of events, processes use counters instead of real clocks. But this lack of timing makes it impossible to always reach agreement if one process might fail.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Q: What defines a synchronous distributed system, and what are its advantages?

A

Q: What defines a synchronous distributed system, and what are its advantages?
A:
A synchronous distributed system is one where the following are known and bounded:

The time it takes for a process to perform a step.

The time it takes for a message to travel between processes.

The difference between any two clocks (bounded by a value δ).

Timed failure detection:
If a process crashes, others can detect it within a known amount of time. For example, using a heartbeat system where processes regularly send signals. If a signal is missed, the process is considered crashed.

Message delay estimation:
Since message delays are bounded, we can estimate how long it takes for a message to reach another process, which helps identify slower or more distant links.

Time-based coordination (e.g., leases):
A process can get exclusive rights to perform an action (like writing a file) for a fixed period. After that, the right expires automatically — useful for safe coordination.

Worst-case performance guarantees:
Because everything has upper time bounds, you can predict the maximum time an algorithm takes to complete, even if some messages are lost or delayed (as long as the number is within limits).

Clock synchronization:
All clocks in the system can be kept close to each other — within δ time units — which makes it possible to coordinate actions and use accurate timestamps.

Limitation: Hard to ensure in large, real-world networks like the Internet
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Q: What is a partially synchronous system and why is it practical?

A

A:

Realistic model: Mostly synchronous, but can behave asynchronously at times

Two forms:

    System becomes synchronous after some unknown time

    System has intermittent synchronous periods

Reflects real-world issues (e.g., network overload, GC pauses)

Algorithms can use these synchronous windows to make progress or terminate

Simple Explanation:
Partial synchrony is a middle ground. Most of the time, things work fast and reliably, but sometimes delays or problems happen. If your algorithm is smart, it can wait for the “good times” to finish its job. This model fits most real systems.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What issue do systems without failure detectors have, and how do they help?

A

Problem: Detecting crashed processes in distributed systems is tricky. Systems can be:

Asynchronous: No timing guarantees (hard to detect failures).

Synchronous: Strict timing (easy detection but impractical).

Partially Synchronous: Eventually stabilizes to synchronous behavior (practical balance).

Solution: Failure detectors abstract timing assumptions. They:

Report suspected crashed processes.

Avoid complex timing code in processes/links.

Are safer (no direct time references) and simpler.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Q: What is a failure detector in distributed systems and why is it useful?

A

A:

A failure detector provides (possibly inaccurate) info about which processes have crashed or are correct.

It captures timing assumptions from synchronous or partially synchronous systems.

Keeps process and link abstractions simple by not adding timing directly.

Allows reasoning about failures without relying on exact physical time.

Works well for crash faults by checking if processes stop performing expected actions.

Not reliable for Byzantine faults because malicious processes can hide failures.

Byzantine failure detectors are difficult to implement and less useful.

Simple Explanation:
A failure detector guesses which processes have crashed by monitoring their actions, helping manage failures without complex timing rules. It works best when processes just crash, not when they behave maliciously.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Q: What is a perfect failure detector and how does it work in synchronous systems?

A

A:

Detects crashed processes without false positives.

Eventually detects every crashed process permanently.

Uses timeouts based on known worst-case delays.

Sends heartbeat messages and expects timely replies.

Declares a process crashed if no reply arrives before timeout.

Assumes reliable communication and known timing bounds.

Simple Explanation:
A perfect failure detector reliably tells when a process crashes by waiting for timely responses. If no reply arrives within a set time, it assumes the process crashed. It never makes mistakes but depends on strict timing guarantees.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Q: What is leader election in distributed systems and how does it work?

A

A:

Leader election chooses one correct process as a unique leader.

If the current leader crashes, a new leader is eventually elected.

Ensures at most one leader at a time (accuracy).

Useful in systems like primary-backup replication.

Can be implemented using a perfect failure detector and process ranking.

Higher-ranked processes become leader only if all higher ranks have crashed.

Simple Explanation:
Leader election picks one process to coordinate the group. If that leader fails, the next highest-ranked process takes over. This ensures the system always has a single trusted leader.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Q: What is an eventually perfect failure detector and how does it work?

A

A:

Detects crashed processes accurately only after some unknown time.

May suspect correct processes temporarily (false suspicions) before that time.

Uses timeouts and heartbeat messages like perfect failure detectors.

When a suspected process responds, it is restored and the timeout is increased.

Guarantees eventual strong accuracy: after some point, no correct process is suspected.

Ensures strong completeness: all crashed processes are eventually suspected permanently.

Designed for partially synchronous systems with unknown timing bounds.

Simple Explanation:
An eventually perfect failure detector may make mistakes at first by wrongly suspecting healthy processes. But over time, it learns to adjust its timing and stops false alarms, reliably detecting actual crashes in systems with unpredictable delays.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Q: What is an eventual leader detector (Ω) and what are its key properties?

A

A:

Eventual leader detector (Ω) ensures that eventually all correct processes agree on the same correct leader.

Leaders may change arbitrarily for some time before stabilizing.

Two key properties:

    Eventual accuracy: After some time, every correct process trusts some correct leader.

    Eventual agreement: After some time, no two correct processes trust different leaders.

Useful in consensus algorithms when perfect leader election is impossible.

Implementations include:

    Monarchical Eventual Leader Detection: uses failure detector QP and trusts highest-ranked non-suspected process.

    Elect Lower Epoch: uses epoch numbers to pick the leader that has crashed the least, works under crash-recovery.

Simple Explanation:
An eventual leader detector helps all non-faulty processes pick one leader after some time, even if leadership changes unpredictably at first. It’s a practical way to coordinate in unreliable systems.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Q: What is the Byzantine eventual leader-detector abstraction and how does it work?

A

A:

It “trusts but verifies” leaders: a newly elected leader is given time to perform its task.

Other processes monitor the leader’s actions and can complain if it fails or is too slow.

If more than f correct processes complain about a leader, a new leader is elected.

Leaders get progressively more time to complete their tasks in an eventually synchronous system.

Requires input from the higher-level algorithm to judge leader’s performance.

Key properties:

    Eventual succession: faulty leaders are eventually replaced after enough complaints.

    Putsch resistance: leaders are only removed if at least one correct process complains.

    Eventual agreement: eventually all correct processes agree on the same leader.

Implemented by rotating leaders deterministically in rounds (leader = process with rank = round mod N).

Complaints are broadcast and collected; once >2f complaints are received, the system moves to the next leader.

Simple Explanation:
This leader detector picks leaders in turn and watches if they do their job well. If many honest processes complain about a leader, it gets replaced. This way, the system can handle malicious leaders but still eventually agree on a good one.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Q: How is each process modeled in a distributed system?

A

A: As a state machine (automaton) that independently executes the same algorithm, performing steps such as receiving messages, local computation, and sending messages.

Simple Explanation:
Each process is like a little robot following the same instructions but working on its own, reacting to messages and doing work step-by-step.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Q: What is the timing model assumed in distributed systems?

A

A: A global clock ticks regularly, with each process performing exactly one step per tick, and steps are sequenced even if they happen simultaneously in real time.

Simple Explanation:
Imagine all processes have synchronized watches and take turns acting one step at a time, even if multiple things happen “at once.”

19
Q

Q: What are the two types of events in distributed timing, and how do they differ?

A

Local events: Fast, happen inside one process.

Communication events: Slower, involve network delays and dominate performance.

Simple Explanation:
Doing stuff inside your computer is quick; talking to other computers takes longer and slows things down.

20
Q

Q: What does safety mean in distributed algorithms?

A

A: “Nothing bad happens” — the system never performs an incorrect action (e.g., no fake or duplicated messages, messages arrive in order).

Simple Explanation:
The system should never mess up or do anything wrong that breaks the rules.

21
Q

Q: What does liveness mean in distributed algorithms?

A

A: “Something good eventually happens” — the system continues to make progress (e.g., messages eventually arrive, requests eventually get responses).

Simple Explanation:
The system keeps working and making progress, it doesn’t get stuck forever.

22
Q

Q: Why must distributed systems guarantee both safety and liveness?

A

Simple Explanation:
You want a system that never breaks (safety) but also keeps moving forward (liveness).

23
Q

Q: What safety and liveness properties does a reliable message link guarantee?

A

A: Safety: No duplicates and correct order.
Liveness: Messages are eventually delivered.

Simple Explanation:
Messages get to the right place, only once, and eventually.

24
Q

Q: What three abstractions combine to form a distributed-system model?

A

A:

Process abstraction: How processes behave and fail (e.g., crash, recover, or act maliciously).

Link abstraction: How messages are delivered between processes, including guarantees like reliability, ordering, or loss.

Failure-detector abstraction (optional): How processes detect or suspect failures of others, often imperfect or delayed.

Simple Explanation:
A distributed-system model sets the rules for how processes work, how they communicate, and how they detect failures. These assumptions help design and analyze distributed algorithms.

25
Q: What characterizes the Fail-Stop distributed system model?
A: Processes crash and halt permanently. Perfect links guarantee reliable message delivery, and perfect failure detectors detect failures immediately and accurately. Simple Explanation: Processes either work correctly or stop forever, and the system instantly knows if a process has failed, making it easy to handle failures.
26
Q: How does the Fail-Noisy model differ from Fail-Stop?
A: It uses an eventually perfect failure detector, which may make mistakes initially by suspecting healthy processes but eventually correctly identifies all crashed processes. This models real systems with network delays and temporary false suspicions. Also eventual leader detection Paxos Simple Explanation: The system might mistakenly think a process is down when it's just slow, but over time, it figures out who really crashed.
27
Q: What is the Fail-Silent model?
A: Processes crash-stop but there is no failure detection mechanism. Algorithms must work without knowing if other processes are slow or crashed, making it harder to distinguish failures from delays. No timing assumptions Simple Explanation: Processes could be down or just slow, and the system has no way to tell which is true.
28
Q: What defines the Fail-Recovery model?
A: Processes can crash and later recover. Stubborn links guarantee eventual delivery of messages if both sender and receiver are up. The system must handle recovery, including lost in-memory state and possible message duplication. Simple Explanation: Processes can reboot after crashing, and messages keep trying to get through as long as both sides are running.
29
Q: What are the key challenges in the Fail-Arbitrary (Byzantine) model?
A: Processes may behave arbitrarily, including sending false or conflicting messages. Secure authenticated links with cryptographic methods are required to detect tampering. Designing algorithms is complex due to malicious behavior. No timing assumptions Simple Explanation: Some processes might lie or cheat, so the system uses cryptographic tools to detect and prevent bad behavior.
30
Q: What is the purpose of the Randomized model in distributed systems?
A: Uses randomness (e.g., coin flips) to solve problems that deterministic algorithms cannot solve under certain conditions. Guarantees progress with high probability rather than certainty. Simple Explanation: The system sometimes makes random choices to work around impossible problems and usually makes progress.
31
Q: What properties do perfect links guarantee?
A: Messages are reliably delivered if sender and receiver are correct, no duplicates occur, and no fake messages are created. Simple Explanation: Messages arrive safely, only once, and only if they were actually sent.
32
Q: How are perfect links implemented?
A: Built on stubborn links that keep retransmitting messages until confirmation. Each process records delivered messages to avoid duplicates. Simple Explanation: Keep sending a message until the other side confirms receipt, and remember what’s already been delivered to avoid duplicates.
33
Q: If neither a→ba→b nor b→ab→a, then how are events aa and bb related?
Simple Explanation: If neither event happens before the other, they are happening independently at the same time (concurrently). Why not the others? A. Parallel – Refers to physical execution on separate processors, not necessarily related to event ordering. B. Sequential – Would require a defined order like a→ba→b or b→ab→a. D. Linear – Implies a total order of events; concurrency violates this. E. Non-uniform – Not related to event ordering; more relevant to consistency or delivery properties in distributed systems.
34
Q: What is pipelining in distributed systems?
A: Pipelining is an optimization technique where data is processed in stages connected in series. Each stage can work in parallel, with the output of one stage becoming the input of the next. This improves performance by taking advantage of network parallelism. Simple explanation: It's like an assembly line—each part of the system works on a piece of the job at the same time, speeding things up.
35
Q: What are the advantages and disadvantages of pipelining in Multi-Paxos?
A: Advantage: Multiple instances can be proposed and decided in parallel, improving throughput. Disadvantage: Starting too many instances at once can overload the system, so the number of concurrent instances must be limited (using a parameter α). Simple explanation: It helps do more tasks at once, but if you do too many, the system can slow down. So you need to limit how much is done at the same time.
36
Q: What are Stubborn Links and their key properties?
Definition: Implements persistent message retransmission. Properties: Stubborn Delivery: Messages sent by correct processes are delivered infinitely often. No Creation: Messages delivered must have been sent. Used in: Crash-stop models. Algorithm: "Retransmit Forever" (resends messages periodically). Practical Use: Not efficient; used for teaching. Simple Explanation: Stubborn Links keep resending messages forever to ensure they eventually get through, even if the network is unreliable. It’s like shouting the same message repeatedly until someone confirms they heard it. Not practical for real systems because it wastes resources, but helps explain retransmission basics.
37
Q: How does the "Retransmit Forever" algorithm work?
A: Mechanism: Maintains sent set of all messages. Periodically retransmits every message in sent. Uses fair-loss links (fll) for actual sending. Guarantees: If sender is correct, receiver eventually gets the message. Limitation: Inefficient due to infinite retransmissions. Simple Explanation: This algorithm works like a broken record: it keeps sending every message over and over. Even if some sends fail, the receiver will eventually get the message. But it’s like mailing a letter every day forever—works but uses too much effort.
38
Q: What are Perfect Links and their properties?
A: Definition: Ensures messages are delivered exactly once. Properties: Reliable Delivery: Correct sender → correct receiver gets message. No Duplication: No repeated deliveries. No Creation. Algorithm: "Eliminate Duplicates" (checks delivered set). Use Case: Standard in crash-stop systems. Simple Explanation: Perfect Links guarantee no lost or duplicate messages. Think of it as a postal service that never loses a package and never delivers the same one twice. Used in systems where accuracy matters more than speed.
39
Q: How does "Eliminate Duplicates" ensure no duplication?
A: Steps: Uses stubborn links (sl) to send messages. Keeps a delivered set of received messages. Delivers a message only if not in delivered. Correctness: Duplicates are filtered via the set check. Issue: delivered grows indefinitely (solved with timestamps/acks in practice). Simple Explanation: This method uses a checklist of received messages. If a message is already on the list, it’s ignored. Like crossing off items on a to-do list—once done, you don’t redo them. Downside: the list gets huge over time.
40
Q: How do Logged Perfect Links handle crash-recovery?
A: Mechanism: Stores delivered in stable storage. On recovery, retrieves delivered from storage. Logs messages to avoid duplicates after crashes. Event: Triggers Deliver with log variable name. Guarantee: Survives process crashes and restarts. Simple Explanation: Logged Perfect Links save received messages to a "hard drive" instead of memory. If the system crashes and restarts, it checks the saved list to avoid re-delivering old messages. Like writing down tasks in a notebook so you don’t forget after a power outage.
41
Q: What problem do Authenticated Perfect Links solve?
A: Problem: Prevents message forgery in Byzantine systems. Solution: Uses MACs (Message Authentication Codes). Algorithm: "Authenticate and Filter": Sender adds MAC to message. Receiver verifies MAC before delivery. Guarantees: Authenticity (only correct senders can forge messages). Simple Explanation: Authenticated Links stop hackers from faking messages. They use a secret code (like a wax seal on a letter) to prove the sender’s identity. If the seal is broken or missing, the message is rejected.
42
Q: What practical issues affect link abstractions?
A: Network Topology: LAN vs. WAN latency differences. Flow Control: Adjust sending rate to receiver capacity. Heterogeneity: Resource disparities (CPU, bandwidth). State Growth: Managing ever-growing delivered sets. Note: Lower layers (e.g., TCP) often handle retransmission. Simple Explanation: Real-world networks aren’t perfect. Fast local networks behave differently than slow internet links. Systems also need to avoid overwhelming receivers, handle uneven resources, and manage memory. Luckily, tools like TCP handle some issues automatically.
43
Q: Why use MACs in Authenticated Perfect Links?
A: Purpose: Ensure authenticity and integrity. Process: Sender computes MAC using shared key. Receiver verifies MAC matches sender and message. Result: Byzantine nodes can’t forge messages between correct nodes. Simple Explanation: MACs act like a signature to prove a message’s sender and that it wasn’t tampered with. Imagine a tamper-proof seal on a food package—if broken, you know it’s unsafe.