Availability/Reliability Flashcards
The most important aspect of Availability
to avoid a single point of failure
What two methods are used to avoid a single point of failure
redundancy and diversity
What is SW redundancy
keeping more than one copy of a critical component available so that a backup can take over if one fails
What is SW diversity
having different ways of doing the same work
4 system architectures for ensuring availability
- hot backup (running multiple versions)
- triple voting system (run 3 versions of the system in parallel and accept the output that at least 2 of them agree on)
- module monitoring code (code that monitors the running of a module)
- dynamic post-test monitor (self-testing of hardware)
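A minimal sketch of the voting step in a triple voting system, assuming each version returns a hashable value. The names `majority_vote` and `replicas` are hypothetical, and the three lambdas only stand in for independently built versions of the same computation.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the versions, or raise if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among the outputs")
    return value

# Three hypothetical versions of the same computation; the third one is faulty.
replicas = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
result = majority_vote([version(4) for version in replicas])
print(result)  # 16: the two agreeing versions outvote the faulty one
```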
2 characteristics of a dependable SW development process
1) process is explicitly defined (aka well-documented)
2) process is repeatable and not subjective (gives the same result regardless of who carries it out)
5 characteristics of a well-documented process
- auditable (outsiders can check it)
- diverse (multiple ways to check it)
- documentable (defines process activities and resulting documentation products)
- robust (can recover from failures)
- standardized
3 types of formal methods
- consistency proof
- refinement
- model checking
what is FM consistency proof
- develop math/logic model for developed SW program
- prove the program is consistent with the model
what is FM refinement
the generation of a program from a math/logic specification using trusted correctness-preserving transformations
what is FM model checking
develop a math/logic (state) model of the program, then use a model checker to explore its reachable states and show that the safety constraints/requirements/invariants hold in every one of them
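A toy illustration of model checking, assuming the model is small enough to enumerate. The helper `check_invariant` and the two-process lock model are hypothetical and written only for this card; real model checkers use far more compact state representations.

```python
from collections import deque

def check_invariant(initial, successors, invariant):
    """Breadth-first search over every reachable state; return a violating state or None."""
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

# Hypothetical model: two processes sharing one lock. State = (pc0, pc1, lock_free).
def successors(state):
    pcs, lock_free = list(state[:2]), state[2]
    for i in (0, 1):
        new = pcs.copy()
        if pcs[i] == "idle" and lock_free:
            new[i] = "critical"          # acquire the lock and enter
            yield (new[0], new[1], False)
        elif pcs[i] == "critical":
            new[i] = "idle"              # leave and release the lock
            yield (new[0], new[1], True)

def mutual_exclusion(state):
    """Safety invariant: the two processes are never both in the critical section."""
    return not (state[0] == "critical" and state[1] == "critical")

violation = check_invariant(("idle", "idle", True), successors, mutual_exclusion)
print(violation)  # None: the invariant holds in every reachable state
```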
6 cons of using FM
- domain experts cannot check formal specs because they don’t know FM
- hard to estimate cost/effort savings from FMs
- few programmers have FM skills
- can’t use for large systems (because it doesn’t scale)
- few FM tools available
- incompatible with Agile
system fault vs system error vs system failure
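fault - a defect in the code that can produce an error when it is executed
error - bad system state caused by a fault
failure - the system does not deliver the expected service (the user sees the effect of the bad state)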
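One way to see the three terms side by side, using a deliberately broken, hypothetical `average` function. The comments mark where the fault, the error, and the failure occur.

```python
def average(values):
    # Fault: the divisor should be len(values). The defect is dormant until this line runs.
    return sum(values) / (len(values) + 1)

# Error: the stored state is now wrong (15.0 instead of 20.0).
cached_mean = average([10, 20, 30])

# Failure: the wrong value reaches the user.
report = f"Mean reading: {cached_mean}"
print(report)  # Mean reading: 15.0
```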
3 approaches to make systems more reliable
- fault avoidance (write better code)
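- fault detection and correction (find and remove faults through verification and validation before the system is deployed)
- fault tolerance (accept that faults remain, and design the system to detect and handle them at run time before they cause a failure)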
what is POFOD and what does it mean?
Probability of Failure on Demand
the probability that the system fails when a request for service (a demand) is made
what is ROCOF and what does it mean?
rate of occurrence of failure
how often the system fails on average, expressed as failures per unit of time or per number of executions
Which reliability metric should be tracked for rare, common and short, and common and long operations, respectively?
POFOD for rare
ROCOF for common and short
MTTF for common and long
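A rough sketch of how the three metrics could be computed from logged counts. The function names and the figures are made up for illustration, and MTTF is approximated here as operating time divided by the number of failures.

```python
def pofod(failed_demands, total_demands):
    """Probability of failure on demand: fraction of service requests that failed."""
    return failed_demands / total_demands

def rocof(failures, operating_hours):
    """Rate of occurrence of failure: failures per hour of operation."""
    return failures / operating_hours

def mttf(operating_hours, failures):
    """Mean time to failure: average hours of operation between failures."""
    return operating_hours / failures

# Hypothetical figures: 2 failed demands out of 1000, and 4 failures in 1000 hours.
print(pofod(2, 1000))   # 0.002
print(rocof(4, 1000))   # 0.004 failures per hour
print(mttf(1000, 4))    # 250.0 hours
```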
3 categories of fault-tolerant architectures
validation filter = validate every input before it is processed so that bad input cannot break the system
recovery from failure = have a way for the system to continue on in the face of failure
redundancy = preventing a single point of failure from crashing the system
What is a protection system
a normal software system with some monitoring and correction software added on separately to the system
will shut down, reboot, and recover to the last good state in the case of failure
What is a self-monitoring architecture
copy the input into several streams (channels). Each stream performs the same operation in a different way, and the outputs of the streams are then compared. Any discrepancy triggers a reboot and recovery.
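A small sketch of the compare step, assuming two channels that compute the same distance in different ways. All names are hypothetical, and the sketch raises an exception on a mismatch where a real system would start its recovery procedure.

```python
import math

class ChannelMismatch(Exception):
    """Raised when the diverse channels disagree."""

def hypot_channel_a(x, y):
    # Channel A: explicit square root of the sum of squares.
    return math.sqrt(x * x + y * y)

def hypot_channel_b(x, y):
    # Channel B: a different implementation of the same operation.
    return math.hypot(x, y)

def self_monitored(x, y, tolerance=1e-9):
    """Run both channels on the same input and accept the result only if they agree."""
    a, b = hypot_channel_a(x, y), hypot_channel_b(x, y)
    if not math.isclose(a, b, rel_tol=tolerance):
        raise ChannelMismatch(f"channel outputs differ: {a} vs {b}")
    return a

distance = self_monitored(3.0, 4.0)
print(distance)  # 5.0
```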
How to create software diversity from the SW process
having different teams code duplicate units independently, without communicating with each other about the work
4 ways to achieve SW diversity
- use different software design styles
- use different programming langs
- use different development tools
- use different algorithms
8 good programming practices to ensure reliability in code
- control data visibility
- validate all inputs
- provide handlers for all thrown exceptions
- minimize the use of error-prone constructs
- make SW restartable
- check array bounds
- include timeouts for RPC
- name numeric constants and state their units in the source code
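A brief sketch of three of these practices together (validating inputs, checking array bounds, and carrying units in names). The constant `MAX_DOSE_MG` and both functions are hypothetical examples, not part of any particular system.

```python
MAX_DOSE_MG = 50.0  # the unit is carried in the name, not left implicit

def read_sensor(readings, index):
    """Return readings[index], rejecting out-of-range indices explicitly."""
    # Python would silently accept a negative index, so the bounds are checked by hand.
    if not 0 <= index < len(readings):
        raise IndexError(f"sensor index {index} outside 0..{len(readings) - 1}")
    return readings[index]

def validate_dose_mg(raw):
    """Parse and range-check a dose supplied as text."""
    try:
        dose_mg = float(raw)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"dose is not a number: {raw!r}") from exc
    if not 0.0 < dose_mg <= MAX_DOSE_MG:
        raise ValueError(f"dose {dose_mg} mg outside (0, {MAX_DOSE_MG}] mg")
    return dose_mg

print(validate_dose_mg("12.5"))        # 12.5
print(read_sensor([0.1, 0.2, 0.3], 2)) # 0.3
```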