7 - Fault Tolerance Flashcards

1
Q

Dependability

A

Correctness of one component depends on another component

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Dependability Example

A

Web server W requires database D.
D fails, so does W.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Dependability requires

A

Availability
Reliability
Safety
Maintainability

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Availability

A

Readiness for usage
A(t) = probability that component is immediately usable

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Reliability

A

Continuity of service delivery
R(t) = probability component works over time period [0,t]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Safety

A

Low risk of component failure leading to system failure.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Maintainability

A

Easy and quick to repair

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

When is a system more available but less reliable

A

more available meaning generally online but less reliable meaning scattered downtime

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

MTTF

A

Mean Time to Failure
Average time between start and failure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

MTTR

A

Mean Time to Repair
Average time of repair

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

MTBF

A

Mean Time between Failures
MTBF = MTTF + MTTR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Availability (expression in relation to MTTF MTBF etc)

A

time available/whole time
MTTF/MTBF

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Failure Rate z(t) of component C

A

z(t) is the (conditional) probability that C initially fails at t.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Reliability R(t) of component C

A

Declines exponentially even if z(t) is constant.
R(t) = e^(-z*t)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Density Function

A

f(t)
Probability of fail at

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

CDF

A

Cumulative Distribution Function
Probability to fail by t

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Fault classifications

A

Permanent: exists until repaired
Intermittent: reappearing
Transient: occurs just once and disappears

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Fault

A

Cause of error

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Error

A

Part of component that can lead to failure

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Failure

A

Component is not running as expected

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Dealing with faults

A

Tolerant system
Good fault removal
Fault forecasting

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Are parallel programs fault tolerant?

A

Often not fault-tolerant.
Wastes resources
Crashes as soon as one component fails

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Checkpointing

A

Save program state in regular intervals and pick up from last one after a crash.

Avoid unneccessary checkpoints - predict crash

24
Q

Failure Types

(one word names)

A

Halting
Omission
Timing
Response
Arbitrary/byzantine

25
Q

Halting failure

A

Component stops

26
Q

Omission failure

A

Failure to respond/receive/send messages

27
Q

Timing failure

A

Response outside of specified time interval

28
Q

Response failures

A

Incorrect response. Value or state-transition

29
Q

Arbitrary/byzantine failure

A

Arbitrary responses at arbitrary times

30
Q

Async or sync: Reliable crash failure detection

A

Async - cannot be reliably detected
Sync - can be reliably detected, detect by timeout but NO CERTAINTY

31
Q

Halt fail types

A

stop - reliably detectable
noisy - eventually reliably detectable
silent - client cannot distinguish crash/omission
safe - benign
arbitrary - unobservable and harmful

32
Q

Information Redundancy

A

Add extra error detection bits, eg checksum

33
Q

Time redundancy

A

Designed to repeat action if required

34
Q

Physical Redundancy

A

Add replicas of components

35
Q

Triple Modular Redundancy

A

Every device replicated 3 times
Each replicate should produce same result.
Majority value is used

36
Q

Process Groups

A

Dynamic group of (mostly) identical processes to achieve fault tolerance via physical redundancy.

All members receive all messages

37
Q

Process Group examples

A

Replicated data store
Master-worker scheme

38
Q

Hierarchical Group

A

One member is coordinator.

If coordinator fails, new one must be elected.

39
Q

Flat Group

A

Members coordinate together

Majority decisions
Robust against crashes

40
Q

Process Groups: Centralised membership

A

Group server manages.
- Efficient, Simple removal, Message sync easier
- Single point of failure, group must be reconstructed if server ceases

41
Q

Process Groups: Distributed membership

A

Processes inform all of join/leave

  • More fault tolerant, Many processes must fail before rebuild
  • Less efficient, crashed processes must be detected, message sync harder
42
Q

k-fault-tolerant group

A

A group can mask any k concurrent member failures

k is called degree of fault tolerance

43
Q

k-fault-tolerant group needs to be how large for halting failures…?

A

k+1 because one member is enough

44
Q

k-fault-tolerant groups needs to be how large for arbitrary/byzantine failures

A

2*k +1 are needed because a majority vote is required

45
Q

Consensus

A

All processes in a process group agree

46
Q

Flooding based consensus: Algorithm for each process

A

Pi sends its set of commands to all others.
For each pair (pi,pj), Pi detects either a Pj message or crash.
Pi merges all received commands
If Pi detects a crash.
- Pi does not exec a command.
- Pi runs error routine in next round
Else:
- Pi selects next command and exec

47
Q

Flooding based consensus: Algorithm after failure

A

If Pi received all messages, but others did not:
- Pi sends msgs and decision from previous round to other processes
Elseif Pi did not:
- Pi signlas that it did not receive all
- Pi runs command from last round

48
Q

Byzantine Consensus: Context

A

Some generals in the Byzantine Empire may be traitors (faulty processes) and deliberately send false messages

49
Q

Byzantine Consensus: Attack or retreat?

A

If all (honest) retreat, it is fine.
If all (honest) attack, it will succeed.
If a subset attacks, it fails.

50
Q

Generally, Byzantine Consensus can be reached when…

A

if k processes are faulty, then at least n = 3*k+1 are required

51
Q

Failure Detection Types/Methods

A

Passive
Active
Gossiping

52
Q

How is a byzantine failure caused?

A

By a faulty/malicious process producing a false value

53
Q

How many byzantine failures can be tolerated in Triple Module Redundancy (3 voting devices)?

A

1.

If there’s 3 devices and 1 gives false results, the other 2 will still agree and create a majority

54
Q

How many crash failures can Triple Modular Redundancy tolerate? and why?

A

It can tolerate 2.

Crashed devices don’t return ANY results so it leaves 1 functioning voter and hence, a majority.

55
Q

How many byzantine failures can a system of n voters tolerate?

A

(n-1)/2 failures

56
Q

How many crash failures can be tolerated by a system of n voters?

A

n-1 crashes