Site Reliability Engineering Flashcards

(21 cards)

1
Q

How does SRE implement the Culture principle of DevOps?

A
  • Through a separate SRE team, or sometimes SRE consultants embedded in development teams
  • The culture includes the blameless outlook and Organisational Learning.
2
Q

What does the blameless outlook mean in SRE?

A
  • A postmortem does not single out individuals or teams for bad behavior
  • Avoids finger-pointing to encourage open communication
3
Q

What is Organisational Learning in SRE?

A
  • Circulating postmortem reports among engineers
  • An opportunity to improve weaknesses and make the enterprise more resilient as a team
4
Q

How does SRE implement the Automation principle of DevOps?

A
  • Through building software/tooling to eliminate toil, allowing focus on engineering work
  • SRE teams should spend significant time on engineering (at least 50%)
5
Q

What is toil?

A

Operations work necessary to run a service that is manual, repetitive, automatable, tactical, and lacks long-term value

6
Q

What is the ideal number of incidents for an SRE to deal with in a shift? Why?

A

No more than two incidents per 8 to 12 hour shift, so that each one can be investigated and reported on without rushing.

7
Q

How does SRE implement the Lean principle of DevOps?

A
  • An Error Budget to limit work in progress: while the budget is positive we can release more features; once it is negative we must focus on the resilience of the current system
  • Reduce handoffs (unnecessary back and forth between development and staging)
8
Q

What is an error budget?

A
  • How much failure can be afforded to avoid breaching the SLO
  • If the budget is positive, you can continue to introduce new features
  • If the budget is exhausted, you must work only on reliability
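
The arithmetic behind this card can be sketched in a few lines. This is only an illustration for a request-based SLO; the function name `error_budget`, its parameters, and the example figures are assumptions, not part of the original deck.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return the allowed and remaining failures for a request-based SLO.

    slo_target is the fraction of requests that must succeed (e.g. 0.999).
    All names and numbers here are hypothetical, chosen for illustration.
    """
    # Failures the SLO tolerates over the measured period.
    allowed_failures = (1.0 - slo_target) * total_requests
    # What is left after the failures already observed.
    remaining_failures = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        # Positive budget: keep shipping features. Otherwise: reliability work only.
        "can_ship_features": remaining_failures > 0,
    }


healthy = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
print(healthy["can_ship_features"], exhausted["can_ship_features"])
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated, so 400 failures leaves budget to spare while 1,500 does not.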
9
Q

How does SRE implement the Measurement principle of DevOps?

A
  • Obsessively measure metrics, both system (e.g., reliability) and human (e.g., toil): SLIs, SLOs, SLAs, and KPIs
  • Often compared against a benchmark or other organisations
10
Q

What is a Service Level Indicator (SLI)?

A

A quantitative measure of some aspect of the level of service provided

In the simplest case: is a specific aspect of the service "up" or "down"?
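
As a rough illustration of a quantitative SLI, availability can be expressed as the fraction of requests that succeeded. The function name `availability_sli` and the rule that only 5xx responses count as failures are assumptions made for this sketch.

```python
def availability_sli(status_codes: list[int]) -> float:
    """Fraction of requests that did not fail with a server error.

    Assumption for this sketch: only 5xx responses count as "down";
    client errors such as 404 are treated as the service working.
    """
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Hypothetical sample of ten responses, two of them server errors.
sample = [200, 200, 503, 200, 404, 200, 500, 200, 200, 200]
sli = availability_sli(sample)
print(sli)
```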

11
Q

What is a Service Level Objective (SLO)?

A

A target value or range of values for a Service Level Indicator (SLI) within a timeframe.
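
One way to picture "a target value or range within a timeframe" is a small record holding the bounds and the window. The class `SLO`, its fields, and the example targets below are hypothetical and only meant as a sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target range for one SLI over a rolling window (illustrative only)."""
    name: str
    minimum: float    # lowest acceptable SLI value
    maximum: float    # highest acceptable SLI value
    window_days: int  # timeframe the SLI is measured over

    def is_met(self, sli_value: float) -> bool:
        # The SLO holds when the measured SLI falls inside the target range.
        return self.minimum <= sli_value <= self.maximum


# Hypothetical targets: at least 99.9% availability, p99 latency at most 0.3 s.
availability = SLO("availability", minimum=0.999, maximum=1.0, window_days=30)
latency = SLO("p99 latency (seconds)", minimum=0.0, maximum=0.3, window_days=30)

print(availability.is_met(0.9995), latency.is_met(0.45))
```

A single target value is the special case where only one bound matters, as with the availability example.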

12
Q

What is a Service Level Agreement (SLA)?

A

A contract (with a customer) that sets out the consequences of meeting or missing an SLO.

Usually monetary compensation

13
Q

How does SRE implement the Sharing principle of DevOps?

A
  • Sharing of knowledge, tools, and techniques between development and operations
  • Use the same toolkit across the board
14
Q

What makes a good alert? Why might a Site Reliability Engineer (SRE) care?

A
  1. Actionable
  2. About something that cannot be fixed without a human being (if automated remediation is possible, at least try that first)
  • An SRE cares because they lose sleep over alerts
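
The "try automation first, page a human only if it fails" idea might look like the following sketch. The function `handle_alert` and its callback parameters are hypothetical; no real alerting tool's API is implied.

```python
from typing import Callable


def handle_alert(
    description: str,
    remediate: Callable[[], bool],
    page_human: Callable[[str], None],
) -> str:
    """Attempt automated remediation and page a human only if it fails.

    remediate returns True when the automated fix worked.
    Both callbacks are placeholders supplied by the caller.
    """
    try:
        fixed = remediate()
    except Exception:
        # A crashing remediation counts as a failed one.
        fixed = False
    if fixed:
        return "auto-remediated"
    page_human(description)
    return "paged"


pages: list[str] = []
first = handle_alert("disk 90% full", remediate=lambda: True, page_human=pages.append)
second = handle_alert("primary database down", remediate=lambda: False, page_human=pages.append)
print(first, second, pages)
```

Only the second alert reaches the on-call engineer, and it arrives with a description they can act on.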
15
Q

What is a reliability theatre? Why might a Site Reliability Engineer care?

A
  • A traditional Network Operations Centre (NOC) or War Room is seen as a reliability theatre: it impresses only the general public.
  • An SRE cares because it may limit the effectiveness of incident response
16
Q

What is a snowflake? Why might a Site Reliability Engineer care?

A
  • A production server that is kept running through regular manual configuration tweaks made via the command line.
  • An SRE cares because they are hard to reproduce and debug.
17
Q

What are pets, cattle, and poultry? Why might a Site Reliability Engineer care?

A
  • Pets are virtual (snowflake) servers with names that need individual attention
  • Cattle are virtual servers with numbers that need group attention
  • Poultry are virtual containers with numbers that need group attention

An SRE cares because of their administrative cost.

18
Q

Why is autonomous better than automated? Why might a Site Reliability Engineer care?

A
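  • Because it is less work: an autonomous system acts on its own, while an automated one still needs someone to trigger and supervise it.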
  • An SRE cares because autonomous systems can take away a world of pain from the on-call rotation.
19
Q

What advantages are there to embedding a Site Reliability Engineer in a development team?

A
  • It builds trust between SRE and development
  • SRE gets an input into system design from the very beginning.
20
Q

What is the right number of nines?

A
  • The right number of nines is a decision made on the basis of how much downtime the business can tolerate.
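
To make the trade-off concrete, each extra nine cuts the tolerated downtime by a factor of ten. The helper `allowed_downtime_minutes` below is a hypothetical name, and the 365-day year is an assumption of this sketch.

```python
def allowed_downtime_minutes(nines: int, period_days: int = 365) -> float:
    """Downtime tolerated per period for an availability of N nines.

    Two nines means 99%, three nines 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return unavailability * period_days * 24 * 60


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Three nines allows roughly 8.8 hours of downtime a year, while five nines allows only a little over five minutes, which is why the choice is a business decision rather than a purely technical one.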
21
Q

Why is it dangerous to improve a system without revising its Service Level Agreement (SLA)?

A

Because customers will consider the delivered level of reliability to be the agreed level.