Module 10a - Fault Tolerance Flashcards
Describe the concept of Availability for fault tolerance in distributed computing
The system should operate correctly at any given instant in time.
Ex: a real system may be 99% available
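As a rough worked example of what a figure like 99% means in practice, the sketch below converts an availability fraction into expected downtime per year. The function name and the 365-day year are assumptions made for illustration, not part of the card.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability fraction in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# 99% available still allows roughly 87.6 hours (about 3.65 days) of downtime a year.
print(round(downtime_hours_per_year(0.99), 1))
```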
Describe the concept of Reliability in distributed computing
The system should run continuously without interruption.
Ex: A real system may have a mean time between failures (MTBF) of one month
Describe the concept of Safety in distributed computing
Failure of the system should not have catastrophic consequences.
Ex: your car can still come to a complete stop if the ABS fails
Describe the concept of Maintainability in distributed computing
A failed system should be easy to repair.
Ex: disks can be replaced easily in a RAID
Define the term “error” in distributed computing
Error: A part of a system’s state that might lead to a failure.
Ex: dropped or damaged network packet
A ____ may lead to an _____ which may lead to a _____
fault
error
failure
Define the term “fault” in distributed computing
Fault: The cause of an error
Ex: a person talking on the phone walks into an elevator and loses signal
What are the 3 types of faults?
Transient faults
Intermittent faults
Permanent faults
What is a transient fault?
Transient faults occur once and then disappear.
Ex: a bird flies in front of a microwave receiver
What is an intermittent fault?
Intermittent faults occur, vanish, then reappear. They are difficult to debug.
What is a permanent fault?
Permanent faults occur and will continue to exist until a faulty component is replaced.
Ex: burnt out power supply in a server
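One practical consequence of the three fault types: a transient fault can often be masked by simply repeating the operation, while a permanent fault cannot. The sketch below illustrates that idea only; `TransientError`, `call_with_retry`, and the default of 3 attempts are hypothetical names and values chosen for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical exception standing in for a fault that may go away on its own."""


def call_with_retry(operation, attempts=3, delay_s=0.0):
    """Call operation() up to `attempts` times, retrying only on TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as err:
            last_error = err
            time.sleep(delay_s)  # give the fault a chance to clear
    # Every attempt failed, so the fault behaves like a permanent one.
    raise last_error
```

If the operation fails on every attempt, the last error is re-raised, which is the point at which replacing the faulty component becomes the only remedy.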
What are the 5 types of failures in distributed systems?
Crash failure
Omission failure
Timing failure
Response failure
Arbitrary failure
What is a Crash failure?
A server halts, but is working correctly until it halts
What is an Omission failure?
A server fails to respond to incoming requests
What is a Timing failure?
A server’s response lies outside the specified time interval
What is a Response failure?
A server’s response is incorrect
What is an Arbitrary failure?
A server may produce arbitrary responses at arbitrary times
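To make the definitions concrete, here is a minimal sketch of how a client might label a single request using them. The `classify` function and its parameters are hypothetical. It covers only omission, timing, and response failures, on the assumption that a single observation cannot tell a crash from an omission, or an arbitrary failure from any of the others.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(reply: Optional[str], elapsed_s: Optional[float], expected: str,
             earliest_s: float, latest_s: float) -> Optional[Failure]:
    """Label one request/response observation, or return None if it looks correct."""
    if reply is None:
        return Failure.OMISSION   # no response to the request at all
    if not earliest_s <= elapsed_s <= latest_s:
        return Failure.TIMING     # response arrived outside the specified interval
    if reply != expected:
        return Failure.RESPONSE   # response arrived in time but is incorrect
    return None
```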
In Distributed Systems, we mask failures using _______. One example of this is _______ computing units
redundancy
replicating
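One common way replicated computing units mask a failure is to run the same computation on every replica and take a majority vote over the results. The sketch below assumes that approach; `majority_vote` is a hypothetical helper, and the card itself does not say how the replicas are combined.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value reported by a strict majority of replicas."""
    if not replies:
        raise ValueError("no replies to vote on")
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise RuntimeError("no strict majority among replicas")
    return value


# Three replicas, one of which returns a wrong answer: the fault is masked.
print(majority_vote([42, 42, 7]))  # 42
```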
A software technique for providing redundancy is to create a group of redundant identical processes. They can be classified as a Flat group, or a Hierarchical group.
Describe both of these paradigms.
Flat Group:
All processes behave in the same way. They are simply replicas of each other.
Hierarchical Group:
There is 1 coordinator process and numerous worker processes.
What are Flat groups? How can you implement them?
All processes play an equal role, there is no concept of a primary, or a backup.
Implemented using quorum-based replication.
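A minimal in-memory sketch of the quorum idea, assuming N replicas with a read quorum R and a write quorum W chosen so that R + W > N. The `QuorumStore` class is hypothetical, and it ignores concurrency, networking, and replica failures. It only shows why overlapping quorums let any read see the latest write even though no replica is special.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    version: int = 0
    value: object = None


class QuorumStore:
    def __init__(self, n: int, r: int, w: int):
        if not (1 <= r <= n and 1 <= w <= n):
            raise ValueError("quorum sizes must be between 1 and N")
        if r + w <= n:
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.r, self.w = r, w

    @staticmethod
    def _latest(replicas):
        return max(replicas, key=lambda rep: rep.version)

    def write(self, value):
        # Learn the newest version from a read quorum, then update a write quorum.
        version = self._latest(random.sample(self.replicas, self.r)).version + 1
        for rep in random.sample(self.replicas, self.w):
            rep.version, rep.value = version, value

    def read(self):
        # Any R replicas include at least one member of the last write quorum.
        return self._latest(random.sample(self.replicas, self.r)).value
```

With N = 5 and R = W = 3, every replica plays the same role and any three of them are enough to serve a request.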
What are Hierarchical groups?
There is a distinguished primary/coordinator node, which coordinates the actions of the other nodes which are backup/worker nodes.
Implemented using primary-backup replication, which requires solving the consensus problem.
What is the consensus problem?
- We assume that each process has a procedure propose(val) and a procedure decide()
- First, each process proposes a value by calling propose(val) exactly once, passing its initial value
- Next, each process learns the value agreed upon by calling the decide() function
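The interface above can be sketched as a toy single-machine object. This is only an illustration of the propose/decide shape: `ToyConsensus` and its "first proposal wins" rule are assumptions, and a real protocol must also cope with message delays and process failures, which this ignores.

```python
class ToyConsensus:
    """Single-machine stand-in: the first proposed value becomes the decision."""

    def __init__(self):
        self._proposals = []

    def propose(self, val):
        # Each process is expected to call this once with its own value.
        self._proposals.append(val)

    def decide(self):
        # Every caller learns the same value, and it is one that was proposed.
        if not self._proposals:
            raise RuntimeError("nothing has been proposed yet")
        return self._proposals[0]


group = ToyConsensus()
for val in ["movie", "restaurant", "home"]:
    group.propose(val)
print(group.decide())  # movie
```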
3 friends are trying to figure out what to do on a Friday night. Their proposals are all different activities:
Friend #1: proposes to go see a movie
Friend #2: proposes to go to a restaurant
Friend #3: proposes to sit at home
Next, the 3 friends all decide and go to the movie.
This is an analogy of what type of problem?
The Consensus Problem
In the consensus problem, there are 2 safety properties: Agreement, and Validity. Describe them.
Agreement: Two calls to decide() never return different values
Validity: If a call to decide() returns x, then some process must have called propose(x)
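Both properties can be checked mechanically on a recorded run. The sketch below assumes the proposed and decided values have been collected into plain lists; the function names are hypothetical.

```python
def satisfies_agreement(decisions):
    """Agreement: no two decide() calls returned different values."""
    return len(set(decisions)) <= 1


def satisfies_validity(proposals, decisions):
    """Validity: every decided value was proposed by some process."""
    return all(d in proposals for d in decisions)


# Hypothetical run: three processes propose different values, all decide "movie".
proposals = ["movie", "restaurant", "home"]
decisions = ["movie", "movie", "movie"]
print(satisfies_agreement(decisions), satisfies_validity(proposals, decisions))  # True True
```

A run where one process decided "bowling" would fail both checks: the decisions differ, and nobody proposed bowling.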