Module 10a - Fault Tolerance Flashcards by Alex Jabbour

Describe the concept of Availability for fault tolerance in distributed computing

The system should operate correctly at any given instant in time.
Ex: a real system may be 99% available
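
To make the 99% figure concrete, here is a minimal sketch that converts an availability fraction into expected downtime per year. The function name and the 365-day year are assumptions made for illustration.

```python
def downtime_per_year_hours(availability: float) -> float:
    """Expected downtime in hours over a 365-day year (assumed year length)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    hours_per_year = 365 * 24
    return (1.0 - availability) * hours_per_year


# A 99% available system is down roughly 87.6 hours per year.
print(round(downtime_per_year_hours(0.99), 1))
```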

Describe the concept of Reliability in distributed computing

The system should run continuously without interruption.

Ex: A real system may have a mean time between failures (MTBF) of one month
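
Reliability (MTBF) and availability are related but not the same. A common steady-state approximation is availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. MTTR does not appear on these cards; the 8-hour repair time below is an assumed value for illustration only.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# MTBF of one month (taken as 30 days = 720 hours), assumed repair time of 8 hours.
availability = steady_state_availability(720.0, 8.0)
print(f"{availability:.3f}")
```

A system can fail only once a month yet still have low availability if each repair takes a long time.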

Describe the concept of Safety in distributed computing

Failure of the system should not have catastrophic consequences.
Ex: your car can still come to a complete stop if the ABS fails

Describe the concept of Maintainability in distributed computing

A failed system should be easy to repair.

Ex: disks can be replaced easily in a RAID

Define the term “error” in distributed computing

Error: A part of a system’s state that might lead to a failure.
Ex: dropped or damaged network packet

A ____ may lead to an _____ which may lead to a _____

fault
error
failure

Define the term “fault” in distributed computing

Fault: The cause of an error
Ex: When a person talking on the phone walks into an elevator

What are the 3 types of faults?

Transient faults
Intermittent faults
Permanent faults

What is a transient fault?

Transient faults occur once and then disappear.

Ex: a bird flies in front of a microwave receiver
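
Because a transient fault disappears on its own, simply repeating the operation is often enough to mask it. The sketch below assumes a hypothetical `flaky_receive` operation that fails once and then succeeds; the helper name and parameters are illustrative.

```python
import time


def call_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Call `operation` until it succeeds or `attempts` tries are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay_seconds)
    raise last_error


calls = {"count": 0}


def flaky_receive():
    """Hypothetical receiver: the first call fails, later calls succeed."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("signal briefly blocked")
    return "payload"


result = call_with_retry(flaky_receive)
print(result, calls["count"])
```

Retrying does not help with a permanent fault, where every attempt fails until the component is replaced.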

What is an intermittent fault?

Intermittent faults occur, vanish, then reappear. They are difficult to debug.

What is a permanent fault?

Permanent faults occur and will continue to exist until a faulty component is replaced.

Ex: burnt out power supply in a server

What are the 5 types of failures in distributed systems?

Crash failure
Omission failure
Timing failure
Response failure
Arbitrary failure

What is a Crash failure?

A server halts, but is working correctly until it halts

What is an Omission failure?

A server fails to respond to incoming requests

What is a Timing failure?

A server’s response lies outside the specified time interval

What is a Response failure?

A server’s response is incorrect

What is an Arbitrary Failure?

A server may produce arbitrary responses at arbitrary times
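
As a rough illustration of how a client might tell some of these failure types apart, the sketch below classifies a single observed reply. The function and its parameters are hypothetical. Note that from the client's side a crashed server looks the same as an omission, and an arbitrary failure cannot be reliably detected by a check like this.

```python
from typing import Optional


def classify_reply(reply: Optional[str], elapsed: float,
                   expected: str, deadline: float) -> str:
    """Classify one observation of a server from the client's point of view."""
    if reply is None:
        # No response at all: omission (or a crash, which looks identical here).
        return "omission failure"
    if elapsed > deadline:
        # A response arrived, but outside the specified time interval.
        return "timing failure"
    if reply != expected:
        # A response arrived on time, but its value is wrong.
        return "response failure"
    return "ok"


print(classify_reply(None, 0.2, "42", 1.0))
print(classify_reply("42", 2.5, "42", 1.0))
print(classify_reply("41", 0.2, "42", 1.0))
```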

In Distributed Systems, we mask failures using _______. One example of this is _______ computing units

redundancy

replicating
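
One way replicated computing units can mask a failure is by majority voting over their outputs. This is a minimal sketch under the assumption that fewer than half of the replicas return a wrong value; the function name is illustrative.

```python
from collections import Counter


def majority_vote(replies):
    """Return the value reported by a strict majority of replicas."""
    if not replies:
        raise ValueError("no replies to vote on")
    value, count = Counter(replies).most_common(1)[0]
    if count <= len(replies) // 2:
        raise RuntimeError("no strict majority among replies")
    return value


# Three replicas, one of which returns an incorrect value.
print(majority_vote([42, 42, 7]))
```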

A software technique for providing redundancy is to create a group of redundant identical processes. They can be classified as a Flat group, or a Hierarchical group.

Describe both of these paradigms.

Flat Group:
All processes behave in the same way. They are simply replicas of each other.

Hierarchical Group:
There is 1 coordinator process and numerous worker processes.

What are Flat groups? How can you implement them?

All processes play an equal role; there is no concept of a primary or a backup.

Implemented using quorum-based replication.
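
A minimal in-memory sketch of the quorum idea, assuming N replicas, a write quorum W and a read quorum R chosen so that R + W > N (every read overlaps the latest write) and 2W > N (any two writes overlap). The class, the version numbers, and the `reachable` parameter are illustrative assumptions, not a description of any particular system.

```python
class QuorumStore:
    """Toy quorum-replicated register; each replica holds a (version, value) pair."""

    def __init__(self, n: int, w: int, r: int):
        if r + w <= n:
            raise ValueError("need r + w > n so reads overlap the latest write")
        if 2 * w <= n:
            raise ValueError("need 2 * w > n so any two write quorums overlap")
        self.replicas = [(0, None) for _ in range(n)]
        self.w = w
        self.r = r

    def write(self, value, reachable):
        """Write to the replicas whose indices are listed in `reachable`."""
        if len(reachable) < self.w:
            raise RuntimeError("write quorum not reached")
        version = max(self.replicas[i][0] for i in reachable) + 1
        for i in reachable:
            self.replicas[i] = (version, value)

    def read(self, reachable):
        """Read from the reachable replicas and return the newest value seen."""
        if len(reachable) < self.r:
            raise RuntimeError("read quorum not reached")
        _, value = max((self.replicas[i] for i in reachable),
                       key=lambda pair: pair[0])
        return value


store = QuorumStore(n=3, w=2, r=2)
store.write("a", reachable=[0, 1])
store.write("b", reachable=[1, 2])   # replica 0 is unreachable and stays stale
latest = store.read(reachable=[0, 1])
print(latest)
```

Replica 0 still holds the old value, but any read quorum of two replicas includes at least one replica with the newest version.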

What are Hierarchical groups?

There is a distinguished primary/coordinator node, which coordinates the actions of the other nodes, which are backup/worker nodes.

Implemented using a primary-backup scheme. The consensus problem must be solved for this.

What is the consensus problem?

We assume that each process has a procedure propose(val) and a procedure decide()
First, each process proposes a value by calling the propose function once and specifying a value at the initial state of the process
Next, each process learns the value agreed upon by calling the decide() function
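
The propose/decide interface can be sketched as a toy in-memory object. This version assumes a single machine with no crashes and no network, and it resolves the choice by letting the first proposal win; real distributed consensus protocols are far more involved. The class name is hypothetical.

```python
import threading


class FirstProposalWins:
    """Toy consensus object: the first proposed value becomes the decision."""

    def __init__(self):
        self._lock = threading.Lock()
        self._decided = False
        self._value = None

    def propose(self, value):
        # Only the first proposal is recorded; later ones are ignored.
        with self._lock:
            if not self._decided:
                self._value = value
                self._decided = True

    def decide(self):
        # Every caller learns the same recorded value.
        with self._lock:
            if not self._decided:
                raise RuntimeError("nothing has been proposed yet")
            return self._value


consensus = FirstProposalWins()
for proposal in ["movie", "restaurant", "home"]:
    consensus.propose(proposal)
decisions = [consensus.decide() for _ in range(3)]
print(decisions)
```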

3 friends are trying to figure out what to do on a Friday night. Their proposals are all different activities:
Friend #1: proposes to go see a movie
Friend #2: proposes to go to a restaurant
Friend #3: proposes to sit at home

Next, the 3 friends all decide and go to the movie.

This is an analogy of what type of problem?

The Consensus Problem

In the consensus problem, there are 2 safety properties: Agreement, and Validity. Describe them.

Agreement: Two calls to decide() never return different values

Validity: If a process calls decide() with response x, then some process must have invoked a call to propose(x)
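
The two safety properties can be expressed as simple checks over a recorded run. This is a sketch; the function names and the example values are chosen for illustration.

```python
def check_agreement(decided):
    """Agreement: all decide() results are the same value."""
    return len(set(decided)) <= 1


def check_validity(proposed, decided):
    """Validity: every decided value was proposed by some process."""
    return all(value in proposed for value in decided)


proposed = ["movie", "restaurant", "home"]
print(check_agreement(["movie", "movie", "movie"]))   # True
print(check_validity(proposed, ["movie", "movie"]))   # True
print(check_agreement(["movie", "home"]))             # False
print(check_validity(proposed, ["bowling"]))          # False
```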

In the consensus problem, what is the liveness property?

Liveness: if a process calls propose(x) or decide() and does NOT fail, then that call must eventually terminate

What are the names of the 3 properties of the consensus problem?

Safety: Agreement
Safety: Validity
Liveness

Solving consensus in a failure-prone distributed environment depends on 4 design factors. What are they? Note: these 4 factors influence whether a consensus problem is solvable

1. Synchronous vs. asynchronous processes (is the time for an execution bounded?)
2. Communication delays (is there a bound on network delay for message delivery?)
3. Message delivery order (ordered, e.g. FIFO, vs. unordered)
4. Unicast vs. multicast

RPC systems may exhibit 5 classes of failure scenarios. What are they?

1. The client is unable to locate the server (the URL doesn't resolve to a network address, or the network address doesn't give a connection)
2. The request message from the client to the server is lost
3. The server crashes after receiving a request
4. The reply message from the server to the client is lost
5. The client crashes after sending a request

What does it mean for a request to be "idempotent"?

Repeated executions have the same effect as one execution.
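
A small sketch contrasting an idempotent operation with a non-idempotent one. The account dictionary and function names are hypothetical.

```python
def set_balance(account, amount):
    """Idempotent: running it twice leaves the same state as running it once."""
    account["balance"] = amount


def deposit(account, amount):
    """Not idempotent: every extra execution changes the state again."""
    account["balance"] += amount


account_a = {"balance": 0}
set_balance(account_a, 100)
set_balance(account_a, 100)   # repeated request, same result

account_b = {"balance": 0}
deposit(account_b, 100)
deposit(account_b, 100)       # repeated request, money deposited twice

print(account_a["balance"], account_b["balance"])
```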

When an RPC server crashes while handling a request, the request can often be reissued. What are the effects of this? In what case is this fine?

The request may be processed multiple times by the service handler. The client may not know how much of the execution transpired. This is fine if the execution is idempotent.

Sometimes when an RPC server crashes upon request, a strategy can be to ______ and report a failure. There is no ______ that the request has been processed.

give up | guarantee

Suppose an RPC server crashes while handling a request. What do the following terms correspond to: 1. at-least-once semantics 2. at-most-once semantics 3. exactly-once semantics

1. At-least-once semantics: the request is reissued
2. At-most-once semantics: the client gives up
3. Exactly-once semantics: the client determines whether the request was processed and reissues accordingly
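
The difference between reissuing and giving up can be simulated with a toy server that does the work and then crashes before replying. All class and function names are hypothetical, and the crash is modelled as a `TimeoutError` seen by the client.

```python
class CrashAfterExecuteServer:
    """Toy server: executes the request, then 'crashes' before replying."""

    def __init__(self, crashes_before_reply=1):
        self.executions = 0
        self._crashes_left = crashes_before_reply

    def handle(self, request):
        self.executions += 1                 # the work is done...
        if self._crashes_left > 0:
            self._crashes_left -= 1
            raise TimeoutError("no reply")   # ...but the reply never arrives
        return "done"


def reissue_until_reply(server, request, max_attempts=5):
    """Reissue strategy: keep retrying until a reply arrives."""
    for _ in range(max_attempts):
        try:
            return server.handle(request)
        except TimeoutError:
            continue
    raise TimeoutError("gave up after retries")


def give_up_on_failure(server, request):
    """Give-up strategy: try once and report failure if no reply arrives."""
    try:
        return server.handle(request)
    except TimeoutError:
        return None


retry_server = CrashAfterExecuteServer()
retry_reply = reissue_until_reply(retry_server, "print report")

single_server = CrashAfterExecuteServer()
single_reply = give_up_on_failure(single_server, "print report")

print(retry_reply, retry_server.executions)    # reply received, executed twice
print(single_reply, single_server.executions)  # no reply, executed once
```

With the give-up strategy the request here was in fact executed once, but the client has no way to know that.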

What does "exactly-once semantics" mean in the context of an RPC server crashing upon receiving a request? Why is this scheme difficult to implement?

The client determines whether the request was processed and reissues it accordingly. It is difficult to implement because the server may not have a way of knowing whether a particular action has been completed.

Suppose we have a client-server RPC setup. The client follows an always-reissue strategy if it does not receive a response. In what case would this lead to duplicated computation from the server?

If the server executes and completes the request and then crashes right before sending the response, the client reissues the request and the server computes it twice.

What is the definition of fault tolerance?

The characteristic by which a system can mask the occurrence of, and recovery from, failures.

(35 cards)