What can affect availability

System maintenance, software updates, infrastructure issues, malicious attacks, system load, and failures in dependencies. In the cloud, network latency and provider issues also play a part.


How is availability measured

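Availability is typically expressed as a percentage of uptime, committed to in an SLA (Service Level Agreement) and described by its number of 9s. For example, five 9s means 99.999% uptime, which allows roughly 5.26 minutes of downtime per year.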
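The arithmetic behind the 9s can be sketched in a few lines. The function name `downtime_budget_minutes` and the 365-day year are assumptions made for illustration; a real SLA defines its own measurement period.

```python
def downtime_budget_minutes(nines: int, period_days: float = 365.0) -> float:
    """Return the minutes of downtime allowed per period for a given number of 9s."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return period_days * 24 * 60 * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10.0 ** (-n))
    print(f"{n} nines ({availability:.3f}%): {downtime_budget_minutes(n):.2f} min/year")
```

Each extra 9 cuts the allowed downtime by a factor of ten, which is why every additional 9 is much harder to achieve than the last.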


How do you monitor availability

Create a health check endpoint and have a monitoring tool or load balancer poll it at regular intervals.


What should a health check endpoint monitor

The subsystems the service depends on, such as storage, databases, and third-party dependencies.


What should a health check endpoint return and should you secure a health check endpoint

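It should return a status code (for example, 200 when healthy) along with content describing the state of each checked subsystem. Yes, it should be secured, since the response can reveal internal details.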
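A minimal sketch of the logic behind such an endpoint, kept free of any web framework. The names `health_check` and `EXPECTED_TOKEN`, the token-based protection, and the 200/503/401 status codes are assumptions chosen for illustration; other schemes (IP allow-lists, obscured paths, mutual TLS) are equally plausible.

```python
import hmac
import json
from typing import Callable, Dict, Tuple

# Hypothetical shared secret; in practice it would come from configuration.
EXPECTED_TOKEN = "change-me"


def health_check(token: str, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each subsystem check and return (HTTP status code, JSON body)."""
    # Reject callers without the token so internal details are not exposed.
    if not hmac.compare_digest(token.encode(), EXPECTED_TOKEN.encode()):
        return 401, json.dumps({"error": "unauthorized"})

    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception:
            # A check that raises counts as a failing subsystem.
            results[name] = "failing"

    healthy = all(state == "ok" for state in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), json.dumps(body)


# Stand-in checks; real ones would ping storage, the database, and so on.
checks = {"storage": lambda: True, "database": lambda: True, "payments_api": lambda: False}
print(health_check("change-me", checks))
```

A web handler would call `health_check` and copy the two return values into the HTTP response.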


What are some methods that can be employed to ensure high availability

Queues/streams and throttling.


How can throttling be employed

Set a limit on each user's access, monitor usage metrics, and reject requests once the limit is exceeded.

Disable or degrade nonessential services so that critical services can keep functioning. For example, a video call can switch to audio only when bandwidth is constrained.

Prioritize certain users so that the requirements of high-impact customers are still met.
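The per-user limit can be sketched as a fixed-window counter. The class name `FixedWindowThrottle` and its parameters are hypothetical, and a fixed window is only one of several possible algorithms (token bucket and sliding window are common alternatives).

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowThrottle:
    """Allow each user at most `limit` requests per `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # user -> (window start time, requests seen in that window)
        self._usage: Dict[str, Tuple[float, int]] = {}

    def allow(self, user: str) -> bool:
        now = self.clock()
        start, count = self._usage.get(user, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # a new window begins
        if count >= self.limit:
            self._usage[user] = (start, count)
            return False  # over the limit: reject
        self._usage[user] = (start, count + 1)
        return True


throttle = FixedWindowThrottle(limit=3, window_seconds=60)
print([throttle.allow("alice") for _ in range(5)])
```

Prioritizing users could be layered on top by giving high-impact customers their own throttle with a larger `limit`.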


How can a queue be employed

Introduce a queue between the tasks and the service.
Tasks are placed in the queue, and the service pulls them off and processes them at its own pace.

In more advanced implementations, the service can be autoscaled based on queue size.

If a response is expected, the service must provide a suitable mechanism for returning it. However, this pattern isn’t suitable when low-latency responses are required.
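The idea can be sketched with an in-process queue and one worker thread. This is an illustration only: a real deployment would normally use a message broker or cloud queue service, and the `None` sentinel and the doubling "work" are stand-ins.

```python
import queue
import threading
import time

tasks = queue.Queue()
results = []


def service_worker() -> None:
    """Pull tasks off the queue and process them at the service's own pace."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work
            tasks.task_done()
            break
        time.sleep(0.01)  # stand-in for real processing
        results.append(item * 2)
        tasks.task_done()


worker = threading.Thread(target=service_worker)
worker.start()

# A burst of tasks arrives faster than the service can process them.
for n in range(10):
    tasks.put(n)
print("queued:", tasks.qsize())  # an autoscaler could watch this number

tasks.put(None)
tasks.join()  # wait until every task has been processed
worker.join()
print("processed:", results)
```

The burst is absorbed by the queue rather than by the service, which is the point of the pattern; the price is that callers wait longer for results.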


What are some resiliency patterns

Bulkhead, Circuit Breaker, Compensating Transaction, Retry, Leader Election, Scheduler Agent Supervisor. If on AWS: Multi-Server pattern, Multi-Datacenter pattern, Floating IP pattern.


What is the bulkhead resiliency pattern

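Partition services into groups and limit each service's resources to its own group, so that a failure in one group cannot exhaust the resources of the others. Define the partitions according to business and technical requirements; for example, high-priority customers get more resources. Leverage frameworks like Polly or Hystrix to create the bulkheads; containers can also be used to cap each group's resources.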
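A minimal sketch of the partitioning idea, using a semaphore to cap concurrent calls per group. The `Bulkhead` class and the partition names are hypothetical, and this is not how Polly or Hystrix are implemented; it only illustrates the principle.

```python
import threading
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a partition has no free slots."""


class Bulkhead:
    """Cap the number of concurrent calls allowed in one partition."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func: Callable[[], T]) -> T:
        # Fail fast instead of waiting, so one partition cannot tie up callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("partition is at capacity")
        try:
            return func()
        finally:
            self._slots.release()


# Hypothetical partitions: high-priority customers get more capacity.
bulkheads: Dict[str, Bulkhead] = {"premium": Bulkhead(10), "standard": Bulkhead(2)}
print(bulkheads["standard"].call(lambda: "handled"))
```

If the "standard" partition is saturated, its callers are rejected while "premium" callers are unaffected.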


What is the circuit breaker resiliency pattern

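If continuing to call a failing service would negatively affect the application, calls to it are cut off. After repeated failures the circuit "opens" and requests fail immediately; after a timeout, a trial request is let through to check whether the service has recovered.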
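One common way to realize this is a breaker that counts consecutive failures, blocks calls for a cool-down period, and then lets a trial call through. The sketch below assumes that design; the class name, `failure_threshold`, and `reset_timeout` are illustrative, and it is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when calls are blocked because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; service presumed down")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
print(breaker.call(lambda: "service response"))
```

While the circuit is open the caller gets an immediate `CircuitOpenError` instead of waiting on a service that is unlikely to answer.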


What is the compensating transaction resiliency pattern

Record every step of a workflow and, if a later step fails, undo the steps that have already completed.
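A minimal sketch of the idea: each step is paired with an undo action, and completed steps are undone in reverse order when a later one fails. The `run_workflow` helper and the booking steps are hypothetical examples.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, undo)


def run_workflow(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            completed.append(step)
        except Exception as exc:
            print(f"step {name!r} failed ({exc}); compensating")
            for done_name, _, undo in reversed(completed):
                undo()
                print(f"undid {done_name!r}")
            return False
    return True


log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("card declined")


# Hypothetical booking workflow; the step names are illustrative only.
ok = run_workflow([
    ("reserve_seat", lambda: log.append("seat reserved"), lambda: log.append("seat released")),
    ("book_hotel", lambda: log.append("hotel booked"), lambda: log.append("hotel cancelled")),
    ("charge_card", fail_payment, lambda: log.append("charge refunded")),
])
print(ok, log)
```

In a distributed system the undo actions can themselves fail, so they are usually designed to be idempotent and safe to retry.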


What is the retry resiliency pattern

Intelligently attempt to reestablish contact with a failing service, for example by retrying a limited number of times with an increasing delay between attempts.
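One reading of "intelligently" is a bounded number of attempts with exponential backoff and jitter. The sketch below assumes that strategy; the `retry` helper and its default values are illustrative, and production code would normally retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], attempts: int = 4, base_delay: float = 0.5,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying with exponential backoff and jitter on failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: give up and surface the error
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky() -> str:
    """Fail twice, then succeed, to imitate a service that recovers."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "connected"


print(retry(flaky, sleep=lambda seconds: None))  # skip real waiting in the demo
```

The growing delay gives the failing service room to recover, and the jitter keeps many clients from retrying at the same instant.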


What is the leader election resiliency pattern

A single task instance is elected as leader. The leader coordinates the actions of the other, subordinate instances.
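One simple way to elect a leader is for instances to compete for a lease that expires unless renewed. The sketch below assumes that approach and uses an in-memory `LeaseLock` as a stand-in for shared storage that every instance can reach; the class and instance names are hypothetical.

```python
import threading
import time
from typing import Callable, Optional


class LeaseLock:
    """In-memory stand-in for a shared lease that all instances can see."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._holder: Optional[str] = None
        self._expires_at = 0.0
        self._mutex = threading.Lock()

    def try_acquire(self, instance_id: str) -> bool:
        """Acquire or renew the lease; return True if instance_id is now leader."""
        with self._mutex:
            now = self.clock()
            free = self._holder is None or now >= self._expires_at
            if free or self._holder == instance_id:
                self._holder = instance_id
                self._expires_at = now + self.ttl
                return True
            return False


lease = LeaseLock(ttl=15.0)
for instance in ("worker-1", "worker-2", "worker-3"):
    role = "leader" if lease.try_acquire(instance) else "subordinate"
    print(instance, role)
```

The leader must keep renewing the lease; if it crashes, the lease expires and another instance takes over on its next attempt.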