Lesson 7: Fault Tolerance Flashcards

1
Q

What is a fault-tolerant system?

A

A system where we can detect a fault, remove its effect, and proceed normally.

2
Q

What is the rollback-recovery technique?

A

When a failure is detected, we roll back to a previous state (a consistent cut) that we know is correct, then continue from there.

3
Q

What is the checkpointing mechanism?

A

Save the state of the process (or entire node) to persistent storage. If there is a failure, the checkpoint can be used to rebuild the state of the system before the failure.

+ restart is fast: the state is loaded directly from the checkpoint, with no log to replay
- lots of I/O on each checkpoint (can be reduced by saving only the deltas)
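
To make the trade-off concrete, here is a minimal single-process sketch, assuming the state is a JSON-serializable dict. The names save_checkpoint, load_checkpoint, and delta are hypothetical helpers made up for this card, not part of any library.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write the full state to disk: temp file first, then an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to persistent storage
    # Swap the new file in, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Rebuild the state as it was when the checkpoint was taken."""
    with open(path) as f:
        return json.load(f)


def delta(old: dict, new: dict) -> dict:
    """Keys added or changed since the previous checkpoint (deletions ignored)."""
    return {k: v for k, v in new.items() if k not in old or old[k] != v}
```

Saving only delta(previous, current) instead of the full state is one way to cut the I/O cost, at the price of having to apply the deltas in order on restart.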

4
Q

What is the logging mechanism?

A

Log information about the operations performed. Record the original value (so we can UNDO) or the new value (so we can REDO).

+ smaller amount of I/O to write to disk
- recovery takes longer
- regular operations may take longer (search in log)
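
A minimal sketch of UNDO and REDO records, assuming an in-memory list stands in for the on-disk log and that None means "key was absent". LoggedStore and redo are hypothetical names used only for illustration.

```python
class LoggedStore:
    def __init__(self):
        self.data = {}
        self.log = []  # stand-in for an append-only log on persistent storage

    def write(self, key, new_value):
        old_value = self.data.get(key)  # None means the key did not exist
        # Log first, then apply: the record holds both the old value (for UNDO)
        # and the new value (for REDO).
        self.log.append((key, old_value, new_value))
        self.data[key] = new_value

    def undo_last(self):
        """UNDO: restore the original value recorded for the last operation."""
        key, old_value, _new_value = self.log.pop()
        if old_value is None:
            self.data.pop(key, None)
        else:
            self.data[key] = old_value


def redo(log):
    """REDO: rebuild the state from scratch by replaying every new value."""
    state = {}
    for key, _old_value, new_value in log:
        state[key] = new_value
    return state
```

Each write appends one small record rather than the whole state, which is where the I/O saving comes from. Recovery has to walk the log, which is why it takes longer than loading a checkpoint.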

5
Q

What is the checkpointing + logging mechanism?

A

Combines both checkpointing and logging mechanisms: checkpoint to move the recovery line to a more recent consistent cut. Log from that point on.

+ limit duration of recovery
+ limit space needed to store log
- must detect stable consistent cut
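
A single-process sketch of the combination, assuming in-memory attributes stand in for persistent storage. It deliberately skips the hard part named above (detecting a stable consistent cut across processes). Recoverable is a hypothetical class used only for illustration.

```python
class Recoverable:
    def __init__(self):
        self.state = {}
        self.checkpoint = {}  # stand-in for the checkpoint on persistent storage
        self.log = []         # REDO records written since the last checkpoint

    def write(self, key, value):
        self.log.append((key, value))
        self.state[key] = value

    def take_checkpoint(self):
        # Move the recovery line forward, then drop the log entries it covers.
        self.checkpoint = dict(self.state)
        self.log.clear()

    def recover(self):
        # Start from the checkpoint and replay only the log written after it.
        state = dict(self.checkpoint)
        for key, value in self.log:
            state[key] = value
        self.state = state
        return state
```

Because the log is truncated at every checkpoint, both the replay time and the log size are bounded by the amount of work done since the last checkpoint.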

6
Q

What is uncoordinated checkpointing?

A

Processes take checkpoints independently. On failure, we need to construct a consistent cut.

Problems
- Domino effect: could lose all your work
- Useless checkpoints: checkpoints that can never form a globally consistent state may be taken
- Multiple checkpoints per process: may need to keep more than just the most recent checkpoint
- Garbage collection: needed to identify obsolete checkpoints
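
The domino effect can be sketched as a rollback-propagation loop. This is a simplified model, not a full recovery algorithm: checkpoint 0 is the initial state, "interval i" means the events between checkpoint i and checkpoint i+1, and each message is a hypothetical tuple (sender, send_interval, receiver, receive_interval).

```python
def recovery_line(latest, messages):
    """Roll processes back until no message is received-but-never-sent."""
    line = dict(latest)  # start from every process's latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, send_iv, receiver, recv_iv in messages:
            sent_before_cut = send_iv < line[sender]
            received_before_cut = recv_iv < line[receiver]
            if received_before_cut and not sent_before_cut:
                # Orphan message: roll the receiver back to the checkpoint
                # taken just before it received the message.
                line[receiver] = recv_iv
                changed = True
    return line


# Checkpoints interleaved with messages so that each rollback forces another.
domino = [
    ("P", 2, "Q", 1),
    ("Q", 1, "P", 1),
    ("P", 1, "Q", 0),
    ("Q", 0, "P", 0),
]
print(recovery_line({"P": 2, "Q": 2}, domino))  # {'P': 0, 'Q': 0}
```

In this example both processes end up back at their initial state, so every checkpoint they took was useless. It also shows why older checkpoints must be kept around until garbage collection proves they are obsolete.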

7
Q

What is coordinated checkpointing?

A

Processes coordinate their checkpoints so that, together, they capture a consistent state.

Pros:
+ recovery no longer requires a dependency graph to calculate a recovery line; the latest checkpoint can be used
+ no domino effect; the coordination guarantees that the checkpoints taken are part of a consistent cut
+ single checkpoint per process
+ no garbage collection

Challenges
- how to coordinate?
- no guarantee of synchronized clocks
- message delivery reliable and in bounded time?
- are all checkpoints needed?
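
One possible answer to "how to coordinate?" is a blocking two-phase scheme, sketched below. This is an assumption for illustration, not necessarily the protocol the lesson has in mind: it runs in one Python process, so it sidesteps the clock and message-delivery challenges listed above. Process and coordinated_checkpoint are hypothetical names.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.tentative = None  # checkpoint taken in phase 1, not yet usable
        self.committed = None  # latest checkpoint that recovery may use

    def prepare(self):
        # Phase 1: take a tentative checkpoint. In a real protocol the process
        # would also stop sending application messages until phase 2.
        self.tentative = self.state
        return True

    def commit(self):
        # Phase 2: the tentative checkpoint becomes the single kept checkpoint.
        self.committed = self.tentative
        self.tentative = None

    def abort(self):
        self.tentative = None


def coordinated_checkpoint(processes):
    """Commit only if every process managed to take a tentative checkpoint."""
    if all(p.prepare() for p in processes):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a checkpoint is only committed once every process has one, each process needs to keep just its latest committed checkpoint, and no garbage collection or dependency graph is required.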

8
Q

What is communication-induced checkpointing?

A

We use the global snapshot algorithm, but rather than sending separate marker messages (which requires FIFO channels), we piggyback the marker information on normal application messages.

Nodes that aren’t communicating with other nodes can take periodic snapshots.
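
A minimal sketch of the piggybacking idea, assuming the marker is reduced to a checkpoint counter (called epoch here) attached to every message. Node and its methods are hypothetical names, and the message format is made up for illustration.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.state = []             # payloads delivered so far
        self.epoch = 0              # index of this node's latest checkpoint
        self.checkpoints = {0: []}  # epoch -> saved copy of the state

    def take_checkpoint(self, epoch=None):
        # Periodic checkpoint (epoch=None) or one forced by an incoming message.
        self.epoch = self.epoch + 1 if epoch is None else epoch
        self.checkpoints[self.epoch] = list(self.state)

    def send(self, payload):
        # The "marker" rides along on the normal message.
        return {"epoch": self.epoch, "payload": payload}

    def receive(self, message):
        if message["epoch"] > self.epoch:
            # The sender has already checkpointed: checkpoint BEFORE delivering,
            # so this message is not part of our checkpoint for that epoch.
            self.take_checkpoint(message["epoch"])
        self.state.append(message["payload"])


a, b = Node("A"), Node("B")
a.take_checkpoint()            # A checkpoints on its own timer
b.receive(a.send("hello"))     # B is forced to checkpoint before delivery
print(b.epoch, b.checkpoints[1], b.state)  # 1 [] ['hello']
```

Since the receiver checkpoints before delivering, a message sent after the sender's checkpoint is never recorded as received in the matching checkpoint on the other side, regardless of delivery order.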
