Chapter 7 Flashcards

(41 cards)

1
Q

What is dependability of components?

A

A component C depends on C∗ if the correctness of C’s behavior depends on the correctness of C∗’s behavior.

2
Q

What are the requirements for dependability? List and describe them.

A

Availability - readiness to be used
Reliability - Continuity of service delivery
Safety - Very low probability of catastrophes
Maintainability - How easily a failed system can be repaired

3
Q

what is Reliability R(t)?

A

probability that a component has been up and running continuously in the time interval [0,t)

4
Q

What are the traditional metrics to measure reliability?

A
  • Mean Time To Failure (MTTF): Average time until a component fails
  • Mean Time To Repair (MTTR): Average time it takes to repair a failed component.
  • Mean Time Between Failures (MTBF): MTTF + MTTR
5
Q

what is Availability A(t)?

A

Average fraction of time that a component has been up and running in the interval [0,t)

6
Q

how can we calculate Availability A(t)?

A

A = MTTF / MTBF = MTTF / (MTTF + MTTR)
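
As a worked example of the formula, here is a minimal sketch in Python. The function name `availability` and the sample numbers (1000 hours to failure, 2 hours to repair) are made up for illustration and do not come from the card.

```python
def availability(mttf: float, mttr: float) -> float:
    """Long-run availability A = MTTF / (MTTF + MTTR)."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    mtbf = mttf + mttr  # Mean Time Between Failures
    return mttf / mtbf


# Hypothetical component: fails on average every 1000 h, takes 2 h to repair.
a = availability(1000.0, 2.0)
print(f"availability = {a:.4f}")
```

Note that a short repair time matters as much as a long time to failure: halving MTTR has the same effect on A as doubling MTTF.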

7
Q

Describe failure and give an example.

A
  • Occurs when a component is not living up to its specifications.
    – A crashed program
8
Q

describe error and give example

A
  • Part of a component that may lead to a failure
    – A programming bug
9
Q

describe fault and give example

A
  • The cause of an error
    – A sloppy programmer
10
Q

describe fault prevention and give example

A
  • Prevent the occurrence of a fault
    – Don’t hire sloppy programmers
11
Q

describe fault tolerance and give example

A
  • Build a component that can mask the occurrence of a fault
    – Have each component built by two independent programmers
12
Q

describe fault removal and give example

A
  • Reduce the presence, number, or seriousness of a fault
    – Get rid of sloppy programmers
13
Q

describe Fault forecasting and give example

A
  • Estimate current presence, future incidence, and consequences of faults
    – Estimate how a recruiter is doing when it comes to hiring sloppy programmers
14
Q

what is a Crash failure?

A

Component halts, but behaves correctly before halting

15
Q

what is an Omission failure?

A
  • Failure in sending or receiving messages
    – Receive omissions: messages that were sent are not received
    – Send omissions: messages that should have been sent are not sent
16
Q

what is a Timing failure?

A
  • The output (response) is correct, but lies outside a specified time interval.
    – Performance failures: the component is too slow
17
Q

what is a Response failure?

A
  • The component’s response is incorrect
    – Value failure: The value of the response is wrong
    – State transition failure: The server deviates from the correct flow of control and ends up in a wrong state
18
Q

what is an Arbitrary failure?

A

A component may produce arbitrary output and be subject to arbitrary timing failures

19
Q

what is a Commission failure?

A

A component takes an action that it should not have taken

20
Q

What is a deliberate failure?

A

Deliberate failures can be omission or commission failures; they extend into the field of security.

21
Q

Describe, if possible, how we can distinguish between a crash and an omission/timing failure.

A
  1. Asynchronous system: no assumptions about process execution speeds or message delivery times → cannot reliably detect crash failures.
  2. Synchronous system: process execution speeds and message delivery times are bounded → we can reliably detect omission and timing failures.
  3. Partially synchronous systems: most of the time, we can assume the system to be synchronous, yet there is no bound on the time that a system is asynchronous → can normally reliably detect crash failures.
22
Q

what assumptions can we make about crash failures?

A
  • Fail-stop: The component exhibits crash failures, but its failure can be detected (either through announcement or timeouts)
  • Fail-noisy: Crash failures, eventually reliably detectable
  • Fail-silent: The component exhibits omission or crash failures; clients cannot tell what went wrong
  • Fail-safe: The component exhibits arbitrary, but benign failures (they can’t do any harm)
  • Fail-arbitrary: Arbitrary, with malicious failures
23
Q

what is the Basic approach to Process Resilience?

A

Replicate a process and organize the replicas into a group; if a process in the group fails, the others take over.

24
Q

what are the 2 techniques to achieve process resilience?

A

flat groups
hierarchical groups

25
Q

Describe a flat group.

A
  • All processes are equal
  • Good for fault tolerance, since information exchange immediately occurs with all group members
  • Imposes higher overhead, as control is completely distributed
  • Hard to implement
26
Q

Describe a hierarchical group.

A
  • All communication goes through a single coordinator
  • Loss of the coordinator brings the entire group to a halt
  • Not really fault tolerant or scalable
  • Easier to implement
27
Q

What is a k-fault tolerant group?

A

A group that can mask any k concurrent member failures (k is called the degree of fault tolerance).
28
Q

How large does a k-fault tolerant group need to be?

A
  • Assume crash/performance failure semantics ⇒ a total of k + 1 members are needed to survive k member failures.
  • Assume arbitrary failure semantics, and group output defined by voting ⇒ a total of 2k + 1 members are needed to survive k member failures.
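
The two group sizes can be illustrated with a small Python sketch. The helper names `min_group_size` and `vote` are hypothetical, and the voting rule used here (strict majority over the members’ outputs) is an assumption about how "group output defined by voting" is realized.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def min_group_size(k: int, arbitrary: bool) -> int:
    """Members needed to mask k faulty ones.

    k + 1 under crash/performance failures, 2k + 1 under arbitrary
    failures when the group output is decided by voting.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if arbitrary else k + 1


def vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority, or None if there is none."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# k = 1 arbitrary failure needs 3 members: the two correct ones outvote the faulty one.
print(min_group_size(1, arbitrary=True))
print(vote([42, 42, 99]))
# With only 2 members, a single faulty output can no longer be masked.
print(vote([42, 99]))
```

With 2k + 1 members, the k + 1 correct ones always form a strict majority, which is why one fewer member is not enough.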
29
Q

Describe the assumptions and basic idea of flooding-based consensus.

A
  • Assume:
    – Fail-stop semantics: when a process crashes, this can be reliably detected
    – Reliable failure detection: a process P can indeed reliably detect that Q crashed
    – Unreliable communication
  • Basic idea:
    – A process group P = {P1,...,Pn}
    – A client contacts a Pi, requesting it to execute a command
    – Every Pi maintains a list of proposed commands
    – In round r, Pi multicasts its known set of commands C to all other processes
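
A single failure-free round of this idea can be sketched in Python as follows. The card only describes the multicast step; the merge step and the shared deterministic `select` function (here simply the smallest command) are assumptions added to show how every process can end up choosing the same command. Crashes and message loss are not simulated.

```python
from typing import Dict, Set


def flooding_round(known: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """One round without failures.

    `known` maps a process id to the set of commands that process knows.
    Every process multicasts its set, and each process merges everything
    it receives with what it already had.
    """
    merged: Set[str] = set().union(*known.values())
    return {pid: set(merged) for pid in known}


def select(commands: Set[str]) -> str:
    """Deterministic choice shared by all processes (assumed: smallest command)."""
    return min(commands)


# Hypothetical group of three processes; two clients have submitted commands.
known = {"P1": {"write x"}, "P2": {"read y"}, "P3": set()}
after = flooding_round(known)
decisions = {pid: select(cmds) for pid, cmds in after.items()}
print(decisions)
```

Because every process ends the round with the same set and applies the same function, all decisions agree. When a process crashes halfway through a multicast, the sets can differ, which is the case the assumptions on failure detection are meant to handle.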
30
Q

What are the assumptions made by Paxos consensus?

A
  • An asynchronous system
  • Communication may be unreliable (meaning that messages may be lost, duplicated, or reordered)
  • Corrupted messages are detectable (and can thus be discarded)
  • All operations are deterministic (once an execution is started, it is known exactly what it will do)
  • Processes may exhibit crash failures, but not arbitrary failures, nor do they collude
31
Q

What are the essentials of Paxos consensus?

A
  1. Client: a thread that requests to have an operation performed
  2. Proposer: a thread that takes a client’s request and attempts to have the requested operation accepted for execution
  3. Acceptor: a thread that operates in a quorum to vote for the execution of an operation
  4. Learner: a thread that eventually performs an operation
32
Q

What are the guarantees of Paxos consensus?

A
  1. Safety (nothing bad will happen):
    – Only proposed operations will be learned
    – At most one operation will be learned (and subsequently executed before a next operation is learned)
  2. Liveness (something good will eventually happen):
    – If sufficient processes remain non-faulty, then a proposed operation will eventually be learned
33
Q

How can we reliably detect that a process has actually crashed, using the general model?

A
  • Each process is equipped with a failure detection module
  • A process p probes another process q for a reaction:
    – q reacts → q is alive
    – q does not react within t time units → q is suspected to have crashed
  • Note: in a synchronous system, a suspected crash is a known crash; this is referred to as a perfect failure detector
34
Q

What do we call a perfect failure detector in practice, and what are its two important properties?

A
  • The eventually perfect failure detector
    1. Strong completeness: every crashed process is eventually suspected to have crashed by every correct process.
    2. Eventual strong accuracy: eventually, no correct process is suspected by any other correct process to have crashed.
35
Q

What is the implementation of the eventually perfect failure detector?

A
  • If p did not receive a heartbeat from q within time t → p suspects q
  • If q later sends a message (received by p):
    – p stops suspecting q
    – p increases the timeout value t
  • Note: if q does crash, p will keep suspecting q
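
The rules above can be sketched as a small Python class representing p’s view of one monitored process q. The class name `HeartbeatDetector`, the fixed `increment`, and passing the current time in explicitly (instead of reading a clock) are assumptions made to keep the example deterministic.

```python
class HeartbeatDetector:
    """Process p's view of one monitored process q."""

    def __init__(self, timeout: float, increment: float = 1.0, now: float = 0.0):
        self.timeout = timeout        # the timeout value t
        self.increment = increment    # how much t grows after a false suspicion
        self.last_heartbeat = now
        self.suspected = False

    def on_heartbeat(self, now: float) -> None:
        """Called when p receives a message from q."""
        if self.suspected:
            # The suspicion was wrong: q was only slow, so be more patient.
            self.suspected = False
            self.timeout += self.increment
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Return True if p currently suspects q."""
        if now - self.last_heartbeat > self.timeout:
            self.suspected = True
        return self.suspected


d = HeartbeatDetector(timeout=2.0)
print(d.check(1.0))    # within the timeout
print(d.check(3.5))    # no heartbeat for 3.5 time units
d.on_heartbeat(4.0)    # q was just slow: stop suspecting, raise the timeout
print(d.check(6.5), d.timeout)
```

Raising the timeout after each false suspicion is what makes the detector eventually accurate: once t exceeds the real (unknown) delay bound, a correct q is no longer suspected, while a crashed q never sends another heartbeat and stays suspected.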
36
Q

What can go wrong during RPC communication?

A
  1. Client cannot locate server
  2. Client request is lost
  3. Server crashes
  4. Server response is lost
  5. Client crashes
37
Q

What are the solutions to RPC communication problems 1 and 2?

A
  1. Report back to the client
  2. Just resend the message
38
Q

What is the solution to RPC communication problem 3?

A
  • We need to decide what we expect from the server:
    A. At-least-once semantics: the server guarantees it will carry out an operation at least once, no matter what (e.g., a read)
    B. At-most-once semantics: the server guarantees it will carry out an operation at most once (e.g., a write, such as transferring 10k)
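
One common way to approximate at-most-once behavior when a client retransmits is to tag each request with a unique id and let the server remember the replies it has already produced. The sketch below assumes this technique; the class `AtMostOnceServer`, its `transfer` method, and the request ids are hypothetical. The reply cache lives in memory only, so it does not by itself cover a server crash between executing an operation and replying.

```python
from typing import Dict


class AtMostOnceServer:
    """Executes each request id at most once; duplicates get the cached reply."""

    def __init__(self) -> None:
        self.balance = 0
        self._replies: Dict[str, int] = {}  # request id -> reply already sent

    def transfer(self, request_id: str, amount: int) -> int:
        if request_id in self._replies:
            # Retransmission of a request that was already carried out.
            return self._replies[request_id]
        self.balance += amount
        self._replies[request_id] = self.balance
        return self.balance


server = AtMostOnceServer()
first = server.transfer("req-1", 10_000)
retry = server.transfer("req-1", 10_000)  # client resends after a timeout
print(first, retry, server.balance)
```

For an operation like a read, repeating it is harmless, so the simpler at-least-once approach (just keep retrying) is usually acceptable.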
39
Q

What is the solution to RPC communication problem 4?

A
  • Detecting lost replies can be hard, because it may also be that the server has crashed. You don’t know whether the server has carried out the operation.
  • Solution: none, except that you can try to make your operations idempotent: repeatable without any harm done if the operation happened to be carried out before.
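
To make the notion of idempotence concrete, the following Python sketch contrasts an idempotent operation with one that is not. The account dictionary and the function names are invented for illustration.

```python
def set_balance(account: dict, value: int) -> None:
    """Idempotent: repeating the call leaves the same state."""
    account["balance"] = value


def deposit(account: dict, amount: int) -> None:
    """Not idempotent: every repetition changes the state again."""
    account["balance"] += amount


a = {"balance": 100}
set_balance(a, 150)
set_balance(a, 150)   # resent after a lost reply: no harm done
print(a["balance"])

b = {"balance": 100}
deposit(b, 50)
deposit(b, 50)        # resent after a lost reply: money is added twice
print(b["balance"])
```

If the client cannot tell a lost reply from a lost request, resending `set_balance` is safe, whereas resending `deposit` is not.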
40
Q

What is an orphan computation?

A

A computation that a server is still carrying out, holding resources for nothing, after the client that requested it has crashed.
41
Q

What are the solutions to RPC communication problem 5?

A
  • The orphan is killed (or rolled back) by the client when it reboots
  • The client broadcasts a new epoch number when recovering ⇒ servers kill the client’s orphans
  • Computations are required to complete within T time units; old ones are simply removed
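
The epoch-number approach can be sketched in Python as follows. The `Server` class, its method names, and the way computations are represented as (client id, epoch, name) tuples are assumptions for illustration; a real server would also have to release the resources held by each killed computation.

```python
from typing import List, Tuple

Computation = Tuple[str, int, str]  # (client id, epoch, name)


class Server:
    def __init__(self) -> None:
        self.computations: List[Computation] = []

    def start(self, client_id: str, epoch: int, name: str) -> None:
        """Begin a computation on behalf of a client in a given epoch."""
        self.computations.append((client_id, epoch, name))

    def on_new_epoch(self, client_id: str, epoch: int) -> List[Computation]:
        """Handle a recovering client's broadcast: kill its older computations."""
        orphans = [c for c in self.computations
                   if c[0] == client_id and c[1] < epoch]
        self.computations = [c for c in self.computations if c not in orphans]
        return orphans


s = Server()
s.start("c1", 1, "report")
s.start("c2", 1, "backup")
killed = s.on_new_epoch("c1", 2)   # c1 crashed, rebooted, and announces epoch 2
print(killed)
print(s.computations)
```

Only computations started by the recovering client in an earlier epoch are removed; work for other clients is left untouched.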