Pillar Five - Operational Excellence Flashcards Preview

Flashcards in Pillar Five - Operational Excellence Deck (14):

Operational Excellence

  1. Includes operational practices and procedures to manage production workloads
  2. Includes how planned changes are executed, and responses to unexpected events
  3. Recommendation:
    1. Change execution and responses should be automated
    2. All processes and procedures should be documented, tested, and regularly reviewed


Design principles

  1. Perform operations with code
  2. Align operations processes to business objectives (e.g., how are operations meeting business needs?)
  3. Make regular, small, incremental changes
  4. Test for responses to unexpected events
  5. Learn from operational events and failures
  6. Keep operations procedures current (documentation, runbooks, playbooks, procedures, etc.)


There are three best practice areas for operational excellence in the cloud

  1. Preparation
  2. Operation
  3. Response


Best Practices: Preparation (part 1)

  1. Preparation drives operational excellence
  2. Checklists ensure that workloads are ready for production and help prevent mistakes
  3. Workloads should have
    1. Runbooks - which offer guidance that operations teams can refer to for normal tasks
    2. Playbooks - which offer guidance for responding to unexpected events (response plans, escalation paths, and stakeholder notification)
  4. AWS CloudFormation
    1. Can be used to ensure environments contain all required resources, and that the configurations are correct based on tested best practices
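
The idea of checking that an environment contains all required resources can be sketched in Python with a CloudFormation-style template. This is a minimal illustration, not a definitive implementation: the resource names (`AppBucket`, `AppQueue`), the `REQUIRED_TYPES` checklist, and the `missing_resource_types` helper are all hypothetical. Only the top-level template keys and the resource type identifiers follow the CloudFormation template format.

```python
import json

# Hypothetical checklist of resource types this workload must contain.
REQUIRED_TYPES = {"AWS::S3::Bucket", "AWS::SQS::Queue"}

# A minimal CloudFormation-style template expressed as a Python dict.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative workload environment",
    "Resources": {
        "AppBucket": {"Type": "AWS::S3::Bucket"},
        "AppQueue": {"Type": "AWS::SQS::Queue"},
    },
}


def missing_resource_types(template, required_types):
    """Return the required resource types that the template does not declare."""
    declared = {res["Type"] for res in template.get("Resources", {}).values()}
    return sorted(set(required_types) - declared)


if __name__ == "__main__":
    missing = missing_resource_types(TEMPLATE, REQUIRED_TYPES)
    if missing:
        print("Template is missing:", ", ".join(missing))
    else:
        # The JSON text is what would be kept in source control and deployed.
        print(json.dumps(TEMPLATE, indent=2))
```

Keeping the template in code means the same reviewed definition can be reused for every environment instead of being rebuilt by hand.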


Best Practices: Preparation (part 2)

  1. Auto Scaling
    1. Provides automated scaling mechanisms to respond to business-related events that affect operational needs
  2. Tagging
    1. To make sure all resources in a workload can be easily identified when needed during responses
  3. Accurate Documentation
    1. Information can become stale and needs to be updated regularly and tested
    2. Should include:
      1. Application designs
      2. Environment configurations
      3. Resource configurations
      4. Response plans
      5. Mitigation plans
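
The tagging practice above can be checked automatically. The following is a rough Python sketch under stated assumptions: the required tag keys, the inventory records, and the `find_untagged` helper are made up for illustration and do not come from any AWS API.

```python
# Hypothetical tag keys every resource in the workload is expected to carry.
REQUIRED_TAGS = ("workload", "environment", "owner")


def find_untagged(resources, required_tags=REQUIRED_TAGS):
    """Map each resource id to the required tag keys it is missing."""
    problems = {}
    for resource in resources:
        tags = resource.get("tags", {})
        missing = [key for key in required_tags if not tags.get(key)]
        if missing:
            problems[resource["id"]] = missing
    return problems


# Made-up inventory records for illustration only.
INVENTORY = [
    {"id": "i-0001", "tags": {"workload": "shop", "environment": "prod", "owner": "ops"}},
    {"id": "i-0002", "tags": {"workload": "shop"}},
    {"id": "vol-0003", "tags": {}},
]

if __name__ == "__main__":
    for resource_id, missing in find_untagged(INVENTORY).items():
        print(f"{resource_id} is missing tags: {', '.join(missing)}")
```

A check like this could run on a schedule so that untagged resources are found before they are needed during a response.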


Best Practices: Preparation (part 3)

  1. Deployments
    1. CI / CD pipelines (e.g. source code repository, build systems, deployment and testing automation)
    2. Release management - small changes, tested, incremental, & tracked
    3. Rollback - revert a change without introducing operational issues or causing operational impact
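
The rollback practice above can be sketched as a small Python function. This is only an illustration under assumptions: `deploy_with_rollback`, `activate`, and `health_check` are hypothetical names, and a real pipeline would rely on its deployment tooling rather than these stand-ins.

```python
def deploy_with_rollback(current_version, new_version, activate, health_check):
    """Activate new_version; revert to current_version if the health check fails.

    Returns the version left running.
    """
    activate(new_version)
    if health_check(new_version):
        return new_version
    # Roll back to the last known-good release.
    activate(current_version)
    return current_version


if __name__ == "__main__":
    history = []

    def activate(version):
        # Stand-in for the real activation step; it only records the call.
        history.append(version)

    def health_check(version):
        # Stand-in check: pretend release "1.4.1" is broken.
        return version != "1.4.1"

    running = deploy_with_rollback("1.4.0", "1.4.1", activate, health_check)
    print("running:", running, "activation history:", history)
```

Because each change is small and the previous version is kept, reverting is a single automated step rather than a manual recovery.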


Best Practices: Operation

  1. Operations should be standardized and manageable on a routine basis
  2. Automation, small changes, regular quality assurance testing
  3. Mechanisms to track, audit, roll back, and review changes
  4. Changes should not be large, infrequent, or manual, and should not need scheduled downtime
  5. KPIs should be collected and reviewed
  6. Automate responses to failures
  7. Avoid manual processes for deployments, release management, changes, rollbacks
  8. Align monitoring to business needs
  9. Avoid ad hoc and non-centralized monitoring


Best Practices: Response

  1. Responses should be automated (mitigation, remediation, rollback, and recovery)
  2. Alerts should be timely, and invoke escalations when automated responses are not enough
  3. QA mechanisms should be in place to automatically roll back failed deployments
  4. Responses should follow a pre-defined playbook
  5. Escalation paths should be defined and include both functional and hierarchical escalation paths
  6. Hierarchical escalations should be automated
  7. Escalated priority should result in stakeholder notifications
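
Automated hierarchical escalation, as described above, might look like the following Python sketch. The escalation path, the time thresholds, the stakeholder rule, and the `escalate` helper are all assumptions made for illustration. Functional escalation (handing off to a different team) is not modeled here.

```python
from dataclasses import dataclass

# Hypothetical hierarchical path: (minutes unacknowledged, who gets paged).
ESCALATION_PATH = [
    (0, "on-call engineer"),
    (15, "team lead"),
    (30, "operations manager"),
]

# Hypothetical rule: reaching this level also notifies stakeholders.
STAKEHOLDER_LEVEL = 2


@dataclass
class Escalation:
    level: int
    contact: str
    notify_stakeholders: bool


def escalate(minutes_unacknowledged, path=ESCALATION_PATH):
    """Pick the highest escalation level whose time threshold has passed."""
    index = 0
    for position, (threshold, _contact) in enumerate(path):
        if minutes_unacknowledged >= threshold:
            index = position
    level = index + 1  # levels are reported starting at 1
    return Escalation(
        level=level,
        contact=path[index][1],
        notify_stakeholders=level >= STAKEHOLDER_LEVEL,
    )


if __name__ == "__main__":
    for minutes in (5, 20, 45):
        print(minutes, "min ->", escalate(minutes))
```

Encoding the path as data keeps the playbook and the automation in agreement: changing who is paged is a reviewed change to one table.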


AWS Key Services: Preparation

  1. Preparation
    1. AWS Config - provides detailed inventory of your AWS resources, configurations, and continuously records configuration changes
    2. Service Catalog - helps to create a standardized set of service offerings that are aligned to best practices
    3. Design workloads to use automation with services such as Auto Scaling and Amazon SQS


AWS Key Services: Operation

  1. Tools to manage and automate code changes to AWS workloads
    1. AWS CodeCommit
    2. AWS CodeDeploy
    3. AWS CodePipeline
  2. Use AWS SDKs to automate operational changes
  3. Use AWS CloudTrail to audit and track changes made to AWS environments


AWS Key Services: Response

  1. Response
    1. CloudWatch - use its features for effective, automated responses
    2. CloudWatch alarms - to set thresholds for alerting and notification
    3. CloudWatch Events - to trigger automated responses
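
As an illustration of CloudWatch alerting, the Python sketch below builds the parameters for an alarm on EC2 CPU utilization. The helper name `build_cpu_alarm_params`, the alarm name pattern, the threshold, and the placeholder ARN are assumptions. The keys match the parameter names of the CloudWatch `PutMetricAlarm` operation, but the block itself makes no AWS calls.

```python
def build_cpu_alarm_params(instance_id, topic_arn, threshold_percent=80.0):
    """Build keyword arguments for a CloudWatch alarm on EC2 CPU utilization."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                # seconds per evaluation period
        "EvaluationPeriods": 2,       # consecutive breaching periods required
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic that notifies on-call
    }


if __name__ == "__main__":
    params = build_cpu_alarm_params(
        "i-0123456789abcdef0",
        "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder ARN
    )
    print(params)
    # With boto3 installed and credentials configured, this could be applied with:
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Defining alarms in code keeps thresholds reviewable and repeatable across environments.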


Questions: Preparation

  1. What best practices for cloud operations are you using?
  2. How are you doing configuration management for your workload?


Questions: Operations

  1. How are you evolving your workload while minimizing the impact of change?
  2. How are you monitoring your workload to ensure it is operating as expected?


Questions: Response

  1. How do you respond to unplanned operational events?
  2. How is escalation managed when responding to unplanned operational events?