Availability/Reliability Flashcards

Question 1

Q

The most important aspect of Availability

Answer

A

to avoid a single point of failure

Question 2

Q

What two methods are used to avoid a single point of failure

Answer

A

redundancy and diversity

Question 3

Q

What is SW redundancy

Answer

A

have multiple versions of the system running at once

Question 4

Q

What is SW diversity

Answer

A

having different ways of doing the same work

Question 5

Q

4 system architectures for ensuring availability

Answer

A

hot backup (running multiple versions)
triple voting system (run 3 versions of the system at once, compare their outputs, and accept the result that at least 2 agree on)
module monitoring code (code that monitors the running of a module)
dynamic post-test monitor (self-testing of hardware)
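As an illustration of the triple voting idea, here is a minimal sketch, assuming three independently written versions of the same computation. The names `majority_vote`, `version_a`, `version_b`, `version_c`, and `tmr_square` are hypothetical and not from the source; `version_c` is deliberately faulty so the vote has something to mask.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value that a strict majority of versions agree on."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    raise RuntimeError("no majority among versions")


# Three hypothetical versions of the same computation.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x * x + 1  # deliberately faulty version


def tmr_square(x):
    """Run all three versions and accept the majority result."""
    return majority_vote([v(x) for v in (version_a, version_b, version_c)])


result = tmr_square(4)  # the faulty output from version_c is outvoted
```

With only three versions, the vote can mask one faulty version at a time; if all three disagree, this sketch raises instead of guessing.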

Question 6

Q

2 characteristics of a dependable SW development process

Answer

A

1) process is explicitly defined (aka well-documented)

2) process is repeatable and not subjective (does not depend on individual interpretation or judgment)

Question 7

Q

5 characteristics of a well-documented process

Answer

A

auditable (outsiders can check it)
diverse (multiple ways to check it)
documentable (defines process activities and resulting documentation products)
robust (can recover from failures)
standardized

Question 8

Q

3 types of formal methods

Answer

A

consistency proof
refinement
model checking

Question 9

Q

what is FM consistency proof

Answer

A

develop math/logic model for developed SW program

- prove the program is consistent with the model

Question 10

Q

what is FM refinement

Answer

A

the generation of a program from a math/logic specification using trusted correctness-preserving transformations

Question 11

Q

what is FM model checking

Answer

A

develop a math/logic (state) model of the program, then use a model checker to show that safety constraints/requirements/invariants hold in the states the model can reach

Question 12

Q

6 cons of using FM

Answer

A

domain experts cannot check formal specs because they don’t know FM
hard to estimate cost/effort savings from FMs
few programmers have FM skills
can’t use for large systems (because it doesn’t scale)
few FM tools available
incompatible with Agile

Question 13

Q

system fault vs system error vs system failure

Answer

A

fault - a defect in the code that can lead to an error when it is executed

error - bad system state

failure - user sees the bad system state
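The three terms can be seen in one small example. This is a sketch under assumed names (`Inventory`, `add`, `report` are hypothetical): the missing check in `add` is the fault, the negative count is the error, and the report shown to the user is the failure.

```python
class Inventory:
    """Hypothetical stock counter used to illustrate fault, error, failure."""

    def __init__(self):
        self.count = 0

    def add(self, n):
        # Fault: the code has a defect, it never rejects a negative n.
        self.count += n

    def report(self):
        return f"{self.count} items"


inv = Inventory()
inv.add(3)               # faulty code runs, but the state is still valid
inv.add(-5)              # error: the internal state is now invalid (-2)
message = inv.report()   # failure: the user sees the bad state
```

Note that the fault exists from the start but only produces an error for some inputs, which is why a fault does not always lead to a failure.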

Question 14

Q

3 approaches to make systems more reliable

Answer

A

fault avoidance (write better code)
fault detection and correction (monitor the system and correct faults when they arise)
fault tolerance (accept that faults will occur but have workarounds for when they arise)

Question 15

Q

what is POFOD and what does it mean?

Answer

A

Probability of Failure on Demand

the probability that the system fails when a request for service is made (for example, when a certain user action is taken)
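POFOD can be estimated from test or field data as failed demands divided by total demands. A minimal sketch, assuming those two counts are available; the function name `pofod` and the numbers are illustrative only.

```python
def pofod(failed_demands, total_demands):
    """Estimate probability of failure on demand from observed counts."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    return failed_demands / total_demands


# Illustrative numbers: 2 failures observed in 1000 service requests.
estimate = pofod(2, 1000)
```

Here `estimate` comes out as 0.002, meaning roughly 1 demand in 500 is expected to fail.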

Question 16

Q

what is ROCOF and what does it mean?

Answer

A

rate of occurrence of failure

how often, on average, the system fails (across all actions) per unit of time or per number of executions
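ROCOF can be estimated as the number of failures divided by the observation period. A minimal sketch, assuming operating time is measured in hours; the function name `rocof` and the numbers are illustrative only.

```python
def rocof(failures, operating_hours):
    """Estimate rate of occurrence of failures, in failures per hour."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


# Illustrative numbers: 4 failures observed over 2000 operating hours.
rate = rocof(4, 2000)
```

Here `rate` is 0.002 failures per hour, or about 2 failures per 1000 hours of operation.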

Question 17

Q

Which types of reliability metric should be tracked for rare, common and short, and common and long operations respectively

Answer

A

POFOD for rare

ROCOF for common and short

MTTF for common and long
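MTTF (mean time to failure) can be estimated as the average of the observed running times between failures. A minimal sketch, assuming those times are recorded in hours; the function name `mttf` and the sample values are illustrative only.

```python
from statistics import mean


def mttf(times_to_failure_hours):
    """Estimate mean time to failure from observed times between failures."""
    if not times_to_failure_hours:
        raise ValueError("need at least one observation")
    return mean(times_to_failure_hours)


# Illustrative observations: the system ran 120, 80, and 100 hours
# before each of three failures.
estimate_hours = mttf([120.0, 80.0, 100.0])
```

For these sample values the estimate is 100 hours, which tells you whether a long operation is likely to finish before the next failure.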

Question 18

Q

3 categories of fault-tolerant architectures

Answer

A

validation filter = validate all possible input and make sure it won’t break the system

recovery from failure = have a way for the system to continue on in the face of failure

redundancy = preventing a single point of failure from crashing the system
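The first two categories can be sketched together: a validation filter rejects bad input before it reaches the system, and a simple recovery path keeps the system running with the last good value. The names `validate_percentage` and `set_valve_opening` are hypothetical, and the sketch assumes the raw input arrives as text.

```python
def validate_percentage(raw):
    """Validation filter: accept only numeric text in the range 0 to 100."""
    value = float(raw)  # raises ValueError for non-numeric text
    if not 0.0 <= value <= 100.0:  # also rejects NaN
        raise ValueError(f"out of range: {raw!r}")
    return value


def set_valve_opening(raw, last_good=0.0):
    """Recovery: fall back to the last known good value on invalid input."""
    try:
        return validate_percentage(raw)
    except ValueError:
        return last_good


accepted = set_valve_opening("42.5")
recovered = set_valve_opening("250", last_good=10.0)
```

Falling back to the last good value is only one possible recovery policy; a real system might instead retry, alert an operator, or switch to a safe state.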

Question 19

Q

What is a protection system

Answer

A

a normal software system with some monitoring and correction software added on separately to the system

will shut down, reboot, and recover to the last good state in the case of failure
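A minimal sketch of the protection system idea, assuming the monitored system exposes its state as a dictionary. `Controller`, `ProtectionSystem`, and the safety rule are hypothetical; restoring a saved checkpoint stands in for the shut down, reboot, and recover sequence.

```python
import copy


class Controller:
    """Hypothetical normal system with no fault handling of its own."""

    def __init__(self):
        self.state = {"temperature": 20}

    def step(self, delta):
        self.state["temperature"] += delta


class ProtectionSystem:
    """Separate monitor that restores the last good state on a bad one."""

    def __init__(self, system, is_safe):
        self.system = system
        self.is_safe = is_safe
        self.checkpoint = copy.deepcopy(system.state)
        self.recoveries = 0

    def step(self, delta):
        self.system.step(delta)
        if self.is_safe(self.system.state):
            # State is good: remember it as the new recovery point.
            self.checkpoint = copy.deepcopy(self.system.state)
        else:
            # Stand-in for shut down, reboot, and recover.
            self.system.state = copy.deepcopy(self.checkpoint)
            self.recoveries += 1


protected = ProtectionSystem(
    Controller(), lambda s: 0 <= s["temperature"] <= 100
)
protected.step(30)   # 50 is safe, checkpoint updated
protected.step(500)  # 550 is unsafe, rolled back to 50
```

The monitor lives outside `Controller`, so the normal system's code does not need to change.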

Question 20

Q

What is a self-monitoring architecture

Answer

A

Copy the input into several streams. Each stream performs the same operation in a different way, and the outputs of the streams are then compared with each other. Any discrepancy triggers a reboot and recovery.
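A minimal sketch of a self-monitoring architecture, assuming two streams that compute a sum in different ways. `channel_a`, `channel_b`, `ChannelMismatch`, and `run_self_monitoring` are hypothetical names; raising an exception stands in for the reboot and recover step.

```python
import math


class ChannelMismatch(Exception):
    """Raised when the streams disagree; stands in for reboot and recover."""


def channel_a(values):
    return sum(values)  # built-in summation


def channel_b(values):
    total = 0
    for v in values:  # explicit loop: same operation, different method
        total += v
    return total


def run_self_monitoring(values, channels=(channel_a, channel_b)):
    """Give each stream its own copy of the input and compare outputs."""
    outputs = [channel(list(values)) for channel in channels]
    first = outputs[0]
    for other in outputs[1:]:
        if not math.isclose(first, other, rel_tol=1e-9, abs_tol=1e-12):
            raise ChannelMismatch(f"outputs differ: {outputs}")
    return first


total = run_self_monitoring([1, 2, 3])
```

Unlike a voting system, this sketch only detects a disagreement; it does not try to decide which stream is right.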