7 - Fault Tolerance Flashcards

1
Q

Dependability

A

The correctness of one component depends on the correctness of the components it relies on

2
Q

Dependability Example

A

Web server W requires database D.
D fails, so does W.

3
Q

Dependability requires

A

Availability
Reliability
Safety
Maintainability

4
Q

Availability

A

Readiness for usage
A(t) = probability that the component is up (immediately usable) at time t

5
Q

Reliability

A

Continuity of service delivery
R(t) = probability that the component works correctly throughout the interval [0, t]

6
Q

Safety

A

Low risk that a component failure leads to catastrophic system failure.

7
Q

Maintainability

A

Easy and quick to repair

8
Q

When is a system more available but less reliable

A

More available means the system is up almost all of the time; less reliable means it still fails frequently, in short scattered bursts. Example: a system that goes down for 1 ms every hour is over 99.9999% available, yet fails thousands of times a year.

9
Q

MTTF

A

Mean Time to Failure
Average time from start of operation until failure

10
Q

MTTR

A

Mean Time to Repair
Average time needed to repair a component

11
Q

MTBF

A

Mean Time between Failures
MTBF = MTTF + MTTR

12
Q

Availability (expression in relation to MTTF MTBF etc)

A

available time / total time
A = MTTF / MTBF = MTTF / (MTTF + MTTR)

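The ratio above can be checked with a short sketch (Python; the MTTF/MTTR numbers are illustrative assumptions):

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)

# Illustrative assumption: mean 999 h to failure, 1 h to repair.
print(availability(999.0, 1.0))  # 0.999
```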
13
Q

Failure Rate z(t) of component C

A

z(t) is the conditional probability (rate) that C fails at time t, given that it has not failed before t.

14
Q

Reliability R(t) of component C

A

Declines exponentially when z(t) is a constant z:
R(t) = e^(-z*t)

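For a constant failure rate, the curve can be evaluated directly (Python sketch; the rate z is an illustrative assumption):

```python
import math

def reliability(z: float, t: float) -> float:
    """R(t) = e^(-z*t) for a constant failure rate z."""
    return math.exp(-z * t)

# Illustrative assumption: z = 0.001 failures per hour.
# Probability of surviving the first 1000 hours:
print(round(reliability(0.001, 1000.0), 4))  # 0.3679
```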
15
Q

Density Function

A

f(t)
Probability density of failing at time t: f(t) = dF(t)/dt = -dR(t)/dt

16
Q

CDF

A

Cumulative Distribution Function
F(t) = probability of failing by time t; F(t) = 1 - R(t)

17
Q

Fault classifications

A

Permanent: exists until repaired
Intermittent: appears, vanishes, and reappears
Transient: occurs just once and disappears

18
Q

Fault

A

Cause of error

19
Q

Error

A

Part of the component's state that may lead to failure

20
Q

Failure

A

The component deviates from its specification (does not deliver its intended service)

21
Q

Dealing with faults

A

Fault tolerance (the system masks faults)
Fault removal
Fault forecasting

22
Q

Are parallel programs fault tolerant?

A

Often not fault-tolerant:
the program crashes as soon as one component fails, wasting all work done so far

23
Q

Checkpointing

A

Save the program state at regular intervals and resume from the last checkpoint after a crash.

Avoid unnecessary checkpoints by predicting crashes.

24
Q

Failure Types

(one word names)

A

Halting
Omission
Timing
Response
Arbitrary/Byzantine

25
Halting failure
Component stops
26
Omission failure
Failure to respond/receive/send messages
27
Timing failure
Response outside of specified time interval
28
Response failures
Incorrect response. Value or state-transition
29
Arbitrary/byzantine failure
Arbitrary responses at arbitrary times
30
Async or sync: Reliable crash failure detection
Asynchronous: cannot be reliably detected; a timeout gives NO certainty.
Synchronous: can be reliably detected (by timeout).
31
Halt fail types
Fail-stop: reliably detectable
Fail-noisy: eventually reliably detectable
Fail-silent: the client cannot distinguish crash from omission
Fail-safe: failures are benign
Fail-arbitrary: unobservable and harmful
32
Information Redundancy
Add extra error-detection bits, e.g. a checksum
33
Time redundancy
Designed to repeat action if required
34
Physical Redundancy
Add replicas of components
35
Triple Modular Redundancy
Every device is replicated three times. Each replica should produce the same result; the majority value is used.
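A minimal sketch of the majority vote (Python; the replica values are illustrative assumptions):

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the majority value of three replica outputs;
    fail if all three disagree."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all replicas disagree")
    return value

# One faulty replica (returning 7) is masked by the other two:
print(tmr_vote(42, 42, 7))  # 42
```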
36
Process Groups
Dynamic group of (mostly) identical processes that achieves fault tolerance via physical redundancy. All members receive all messages.
37
Process Group examples
Replicated data store
Master-worker scheme
38
Hierarchical Group
One member is coordinator. If coordinator fails, new one must be elected.
39
Flat Group
Members coordinate together
Majority decisions
Robust against crashes
40
Process Groups: Centralised membership
A group server manages membership.
+ Efficient; simple removal; easier message synchronisation
- Single point of failure; the group must be reconstructed if the server fails
41
Process Groups: Distributed membership
Processes inform all members when they join/leave.
+ More fault-tolerant; many processes must fail before a rebuild is needed
- Less efficient; crashed processes must be detected; message synchronisation is harder
42
k-fault-tolerant group
A group that can mask any k concurrent member failures; k is called the degree of fault tolerance.
43
k-fault-tolerant group needs to be how large for halting failures...?
k + 1, because one surviving (correct) member is enough
44
k-fault-tolerant groups needs to be how large for arbitrary/byzantine failures
2k + 1 are needed, because a majority vote over the correct members is required
45
Consensus
All processes in a process group agree
46
Flooding based consensus: Algorithm for each process
1. Pi sends its set of commands to all other processes.
2. For each pair (Pi, Pj), Pi detects either a message from Pj or a crash of Pj.
3. Pi merges all received command sets.
4. If Pi detected a crash: Pi does not execute a command and runs an error routine in the next round. Else: Pi selects the next command and executes it.
47
Flooding based consensus: Algorithm after failure
If Pi received all messages, but others did not:
- Pi sends its messages and decision from the previous round to the other processes.
Else if Pi did not receive all messages:
- Pi signals that it did not receive all messages.
- Pi runs the command from the last round.
48
Byzantine Consensus: Context
Some generals in the Byzantine Empire may be traitors (faulty processes) and deliberately send false messages
49
Byzantine Consensus: Attack or retreat?
If all (honest) retreat, it is fine. If all (honest) attack, it will succeed. If a subset attacks, it fails.
50
Generally, Byzantine Consensus can be reached when...
If k processes are faulty, then at least n = 3k + 1 processes are required in total.
51
Failure Detection Types/Methods
Passive
Active
Gossiping
52
How is a byzantine failure caused?
By a faulty/malicious process producing a false value
53
How many byzantine failures can be tolerated in Triple Module Redundancy (3 voting devices)?
One. With 3 devices, if 1 gives false results, the other 2 still agree and form a majority.
54
How many crash failures can Triple Modular Redundancy tolerate? and why?
Two. Crashed devices return no results at all, so even a single functioning voter forms a majority of the responses.
55
How many byzantine failures can a system of n voters tolerate?
floor((n - 1) / 2) failures, so that a correct majority remains
56
How many crash failures can be tolerated by a system of n voters?
n - 1 crashes: crashed voters return nothing, so one surviving voter suffices
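Under the assumptions above (Byzantine voters need a correct majority among all n results; crashed voters return nothing), the two bounds can be sketched as:

```python
def max_byzantine_failures(n: int) -> int:
    # A correct majority of the n returned values must remain.
    return (n - 1) // 2

def max_crash_failures(n: int) -> int:
    # Crashed voters produce no result, so a single surviving
    # voter is a majority of the answers actually given.
    return n - 1

# Triple Modular Redundancy (n = 3):
print(max_byzantine_failures(3), max_crash_failures(3))  # 1 2
```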