Site Reliability Engineering Flashcards
(21 cards)
How does SRE implement the Culture principle of DevOps?
- A separate SRE team or sometimes consultants in development teams
- It includes the blameless outlook and Organisational Learning.
What does the blameless outlook mean in SRE?
- A postmortem does not single out individuals or teams for bad behavior
- Avoids finger-pointing to encourage open communication
What is Organisational Learning in SRE?
- Circulating postmortem reports among engineers
- An opportunity to improve weaknesses and make the enterprise more resilient as a team
How does SRE implement the Automation principle of DevOps?
- Through building software/tooling to eliminate toil, allowing focus on engineering work
- SRE teams should spend significant time on engineering (at least 50%)
What is toil?
Operations work necessary to run a service that is manual, repetitive, automatable, tactical, and lacks long-term value
What is the ideal number of incidents for an SRE to deal with in a shift? Why?
No more than two incidents per 8-12 hour shift to avoid rushed investigations and reporting.
How does SRE implement the Lean principle of DevOps?
- An Error Budget to limit work in progress: when the budget is positive we can release more features; when it is negative we must focus on the resilience of the current system
- Reduce handoffs (unnecessary back and forth between development and staging)
What is an error budget?
- How much failure can be afforded to avoid breaching the SLO
- If the budget is positive, you can continue to introduce new features
- If the budget is exhausted, you must work only on reliability
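A minimal sketch of the arithmetic behind this card, assuming the SLO is expressed as a success ratio over a window of countable events (e.g., requests). The function name `error_budget` and the sample figures are illustrative assumptions, not taken from the cards.

```python
def error_budget(slo_target: float, total_events: int, failed_events: int) -> dict:
    """Return the allowed and remaining failures for one SLO window.

    Assumption: the SLO is a success ratio (e.g., 0.999) over countable events.
    """
    # Failures the SLO tolerates in this window.
    allowed = (1.0 - slo_target) * total_events
    # What is left after the failures already observed.
    remaining = allowed - failed_events
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        # Positive budget: new features may ship. Otherwise: reliability work only.
        "can_release_features": remaining > 0,
    }


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
budget = error_budget(slo_target=0.999, total_events=1_000_000, failed_events=400)
print(budget["can_release_features"])
```

With these sample numbers the SLO tolerates roughly 1,000 failures, so about 600 remain and features can still ship.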
How does SRE implement the Measurement principle of DevOps?
- Obsessively measure SLIs, SLOs, SLAs, and KPIs, covering both system metrics (e.g., reliability) and human metrics (e.g., toil)
- Often compared against a benchmark or other organisations
What is a Service Level Indicator (SLI)?
A quantitative measure of some aspect of the level of service provided
Essentially: is a specific aspect of the service “up” or “down”?
What is a Service Level Objective (SLO)?
A target value or range of values for a Service Level Indicator (SLI) within a timeframe.
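To make the SLI/SLO relationship concrete, here is a small sketch that assumes an availability SLI defined as good requests divided by total requests. The names `availability_sli` and `meets_slo` and the numbers are hypothetical; real SLIs may measure latency, throughput, or other aspects instead.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured SLI at or above the target for the window?"""
    return sli >= slo_target


# Hypothetical 30-day window: 99,950 good requests out of 100,000.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
ok = meets_slo(sli, slo_target=0.999)
print(sli, ok)
```

The SLI is the measurement; the SLO is the threshold it is compared against over the chosen timeframe.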
What is a Service Level Agreement (SLA)?
A contract (with a customer) that sets out the consequences of meeting or missing an SLO.
Usually monetary compensation
How does SRE implement the Sharing principle of DevOps?
- Sharing of knowledge, tools, and techniques between development and operations
- Use the same toolkit across the board
What makes a good alert? Why might a Site Reliability Engineer (SRE) care?
- Actionable
- For something that cannot be fixed without a human being (if automated remediation is possible, try that first)
- An SRE cares because they lose sleep over alerts
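One way to picture "try automated remediation first" is the sketch below. It is an illustration under assumptions, not a real alerting API: `handle_alert`, `remediate`, and `page_human` are hypothetical names, and the remediation callback is assumed to return True on success.

```python
from typing import Callable, Optional


def handle_alert(
    name: str,
    remediate: Optional[Callable[[], bool]],
    page_human: Callable[[str], None],
) -> str:
    """Try an automated fix first; page a human only if that is missing or fails."""
    if remediate is not None:
        try:
            if remediate():
                return "auto-remediated"
        except Exception:
            # A crashing remediation counts as a failed one; fall through to paging.
            pass
    page_human(name)
    return "paged"


# Hypothetical usage: collect pages in a list instead of waking anyone up.
pages: list = []
print(handle_alert("disk-full", lambda: True, pages.append))
print(handle_alert("db-down", None, pages.append))
```

Only the second alert reaches the on-call engineer, which is the property the card describes.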
What is a reliability theatre? Why might a Site Reliability Engineer care?
- A traditional Network Operations Centre (NOC) or War Room is seen as a reliability theatre that impresses only the general public.
- An SRE cares because it may limit the effectiveness of incident response
What is a snowflake? Why might a Site Reliability Engineer care?
- A production server that is kept running through regular manual configuration tweaks made via the command line.
- An SRE cares because they are hard to reproduce and debug.
What are pets, cattle, and poultry? Why might a Site Reliability Engineer care?
- Pets are virtual (snowflake) servers with names that need individual attention
- Cattle are virtual servers with numbers that need group attention
- Poultry are virtual containers with numbers that need group attention
- An SRE cares because of their administrative cost.
Why is autonomous better than automated? Why might a Site Reliability Engineer care?
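- Because it is less work: an autonomous system acts on its own, whereas an automated one still needs a human to trigger or supervise it.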
- An SRE cares because autonomous systems can take away a world of pain from the on-call rotation.
What advantages are there to embedding a Site Reliability Engineer in a development team?
- It builds trust between SRE and development
- SRE gets an input into system design from the very beginning.
What is the right number of nines?
- The right number of nines is a decision made on the basis of how much downtime the business can tolerate.
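The trade-off can be made tangible by converting nines into tolerated downtime. This sketch assumes a 365-day year and treats availability as 1 minus 10 to the power of minus the number of nines; `allowed_downtime_minutes` is an illustrative name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime tolerated at a given number of nines (e.g., 3 means 99.9%)."""
    unavailability = 10 ** -nines  # 3 nines gives 0.001
    return unavailability * MINUTES_PER_YEAR


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Each extra nine cuts the tolerated downtime by a factor of ten: three nines allows about 525.6 minutes per year, four nines about 52.56.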
Why is it dangerous to improve a system without revising its Service Level Agreement (SLA)?
Because customers will consider the delivered level of reliability to be the agreed level.