18 Flashcards

Question 1

Q

Define a fault-tolerant system.

Answer

A

A fault-tolerant system is one that continues to operate acceptably in the presence of faults.

To understand the role of fault tolerance in distributed systems we first need to take a closer look at what it actually means for a distributed system to tolerate faults. Being fault tolerant is strongly related to what are called dependable systems. Dependability is a term that covers a number of useful requirements for distributed systems including the following [Kopetz and Verissimo, 1993]:

Availability
Reliability
Safety
Maintainability

Availability is defined as the property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.

Reliability refers to the property that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval instead of an instant in time. A highly reliable system is one that will most likely continue to work without interruption during a relatively long period of time. This is a subtle but important difference when compared to availability. If a system goes down on average for one, seemingly random millisecond every hour, it has an availability of more than 99.9999 percent, but is still unreliable. Similarly, a system that never crashes but is shut down for two specific weeks every August has high reliability but only 96 percent availability. The two are not the same.
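The two figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming availability is measured simply as the fraction of time the system is up and that a year has 365 days; the function name `availability` is illustrative, not taken from the text.

```python
MS_PER_HOUR = 60 * 60 * 1000


def availability(downtime: float, period: float) -> float:
    """Fraction of `period` during which the system is up (same units for both)."""
    return 1.0 - downtime / period


# Case 1: down for one millisecond in every hour (available, but unreliable).
flaky = availability(downtime=1, period=MS_PER_HOUR)

# Case 2: never crashes, but is shut down 14 days per 365-day year.
august = availability(downtime=14, period=365)

print(f"{flaky:.7%}")   # 99.9999722%
print(f"{august:.1%}")  # 96.2%
```

Note that neither number says anything about reliability: that depends on how long the system runs between interruptions, not on the total fraction of uptime.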

Safety refers to the situation that when a system temporarily fails to operate correctly, no catastrophic event happens. For example, many process-control systems, such as those used for controlling nuclear power plants or sending people into space, are required to provide a high degree of safety. If such control systems temporarily fail for only a very brief moment, the effects could be disastrous. Many examples from the past (and probably many more yet to come) show how hard it is to build safe systems.

Finally, maintainability refers to how easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. However, as we shall see later in this chapter, automatically recovering from failures is easier said than done.

Question 2

Q

Classification of faults

Answer

A

On the one hand, faults can be classified as transient (they occur once and then disappear; if the operation is repeated, the fault goes away), intermittent (they come and go, which makes them hard to diagnose), and permanent (they persist until the faulty component is replaced). It is very important to distinguish between improbable faults and impossible ones.

On the other hand, failures can be classified as crash (the service halts), timing (the response falls outside the acceptable time bounds), omission (the service fails to respond to incoming requests), response (the response is incorrect, either because the value is wrong or because the flow of control deviates), and arbitrary or Byzantine (arbitrary in both timing and response; different consumers may be given different information).

Faults are generally classified as transient, intermittent, or permanent. Transient faults occur once and then disappear. If the operation is repeated, the fault goes away. A bird flying through the beam of a microwave transmitter may cause lost bits on some network (not to mention a roasted bird). If the transmission times out and is retried, it will probably work the second time.
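Retrying after a transient fault can be sketched as follows. The names `TransientError`, `retry`, and `send` are hypothetical and exist only for this illustration; the simulated link loses the first transmission and succeeds on the second, as in the example above.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a fault expected to disappear on retry."""


def retry(operation, attempts=3, delay=0.0):
    """Call `operation` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # still failing: the fault may not be transient after all
            time.sleep(delay)


# Simulated link that loses only the first transmission.
calls = {"count": 0}


def send():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("bits lost in transit")
    return "delivered"


result = retry(send)
print(result, "after", calls["count"], "attempts")
```

Bounding the number of attempts matters: a fault that survives every retry is more likely intermittent or permanent, and retrying forever would only hide it.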

An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on. A loose contact on a connector will often cause an intermittent fault. Intermittent faults cause a great deal of aggravation because they are difficult to diagnose. Typically, when the fault doctor shows up, the system works fine.

A permanent fault is one that continues to exist until the faulty component is replaced. Burnt-out chips, software bugs, and disk-head crashes are examples of permanent faults.

Question 3

Q

Levels of fault tolerance

Answer

A

To define a system's level of fault tolerance, one must state the conditions under which it operates. These can be classified as environmental conditions (the physical environment of the hardware: temperature, resistance to vibration and dust, location, interference, noise, and clock drift) or operational conditions (specifications, limit values and response times, bandwidth, latency, and accepted protocols).

Question 4

Q

Classification of error detection

Answer

A

Error detection can be classified into fault removal (removing errors before they occur; e.g., ECC memory), fault forecasting (estimating the probability that a component will fail), fault prevention/avoidance (avoiding the conditions that give rise to errors), and fault tolerance (handling errors within the system as they occur, rather than preventing them from happening).

Question 5

Q

Resilience

Answer

A

Resilience is the ability to maintain an acceptable level of service in the presence of faults and of challenges to normal operation (such as configuration or operational errors, natural disasters, political factors, or malicious attacks).

Question 6

Q

Graceful degradation

Answer

A

Graceful degradation is when, in the presence of faults, the system's behavior differs from its fault-free behavior but remains acceptable.

Faults can be tolerated through physical redundancy (replication), information redundancy, or time redundancy (retries).
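Two of these forms of redundancy can be sketched in a few lines. This is an illustrative example only, assuming three replicas with majority voting for physical redundancy and a single even-parity bit for information redundancy; the function names are hypothetical.

```python
from collections import Counter


def majority_vote(replies):
    """Physical redundancy: return the value reported by a strict majority of replicas."""
    value, count = Counter(replies).most_common(1)[0]
    if count * 2 <= len(replies):
        raise ValueError("no majority among replicas")
    return value


def add_parity(bits):
    """Information redundancy: append an even-parity bit to the data bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word (data plus parity bit) holds an even number of 1s."""
    return sum(word) % 2 == 0


# One of three replicas returns a wrong value; the vote masks it.
voted = majority_vote([42, 42, 7])

# A single flipped bit is detected (though not corrected) by the parity check.
word = add_parity([1, 0, 1, 1])
corrupted = word.copy()
corrupted[0] ^= 1

print(voted, parity_ok(word), parity_ok(corrupted))
```

A single parity bit only detects an odd number of flipped bits; correcting errors requires more redundant bits than this sketch uses.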

Question 7

Q

Types of replication

Answer

A

Passive (one primary replica and several secondary, or backup, replicas)
Active (multiple replicas of the same state machine that execute the same operations in the same order, which therefore requires total ordering)
Semi-active (all replicas execute the commands, but only one, the leader, makes the nondeterministic decisions)
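The need for total ordering in active replication can be illustrated with a toy state machine. This is a minimal sketch, assuming a deterministic replica whose commands (`"add"` and `"mul"`, both invented for the example) do not commute; it does not model how the order is agreed upon.

```python
class Replica:
    """A deterministic state machine: same commands in the same order give the same state."""

    def __init__(self):
        self.value = 0

    def apply(self, command):
        op, operand = command
        if op == "add":
            self.value += operand
        elif op == "mul":
            self.value *= operand
        else:
            raise ValueError(f"unknown operation: {op}")


log = [("add", 5), ("mul", 3), ("add", 1)]

# Active replication: every replica applies the whole log in the same order.
replicas = [Replica() for _ in range(3)]
for command in log:
    for replica in replicas:
        replica.apply(command)
states = [replica.value for replica in replicas]

# A replica that sees the same commands in a different order diverges.
stray = Replica()
for command in reversed(log):
    stray.apply(command)

print(states, stray.value)  # [16, 16, 16] 8
```

Because the commands do not commute, agreeing on the set of commands is not enough; the replicas must also agree on their order.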

Question 8

Q

Dependability measures

Answer

A

The dependability of a system is measured in terms of:

Availability (the probability that the system is operating correctly)
Reliability (the system's ability to deliver correct service continuously)
Safety (nothing catastrophic happens in the presence of faults, and the system must be recoverable, automatically or manually)
Maintainability (the amount of time required to update or repair the system)

(8 cards)