Site Reliability Engineering Flashcards

Question 1

Q

How does SRE implement the Culture principle of DevOps?

Answer

A

Through a separate SRE team, or sometimes SREs acting as consultants within development teams
The culture includes the blameless outlook and Organisational Learning.

Question 2

Q

What does the blameless outlook mean in SRE?

Answer

A

A postmortem does not single out individuals or teams for bad behavior
Avoids finger-pointing to encourage open communication

Question 3

Q

What is Organisational Learning in SRE?

Answer

A

Circulating postmortem reports among engineers
An opportunity to improve weaknesses and make the enterprise more resilient as a team

Question 4

Q

How does SRE implement the Automation principle of DevOps?

Answer

A

Through building software/tooling to eliminate toil, allowing focus on engineering work
SRE teams should spend significant time on engineering (at least 50%)
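
As a rough illustration of the 50% guideline above, the share of time going to toil can be tracked from logged hours. This is a minimal sketch: the function names and the "toil" category label are hypothetical, not part of any standard tool.

```python
from typing import Mapping


def toil_fraction(hours_by_category: Mapping[str, float]) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    # "toil" is an assumed category label for manual, repetitive operations work.
    return hours_by_category.get("toil", 0.0) / total


def within_toil_cap(hours_by_category: Mapping[str, float], cap: float = 0.5) -> bool:
    """True if toil stays at or below the cap, leaving the rest for engineering."""
    return toil_fraction(hours_by_category) <= cap
```

For example, a week with 10 hours of toil and 30 hours of engineering work gives a toil fraction of 0.25, which is within the cap.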

Question 5

Q

What is toil?

Answer

A

Operations work necessary to run a service that is manual, repetitive, automatable, tactical, and lacks long-term value

Question 6

Q

What is the ideal number of incidents for an SRE to deal with in a shift? Why?

Answer

A

No more than two incidents per 8-12 hour shift to avoid rushed investigations and reporting.

Question 7

Q

How does SRE implement the Lean principle of DevOps?

Answer

A

An Error Budget to limit work in progress: when the budget is positive we can release more features; when it is negative we must focus on the resilience of the current system
Reduce handoffs (unnecessary back and forth between development and staging)

Question 8

Q

What is an error budget?

Answer

A

The amount of failure that can be tolerated without breaching the SLO
If the budget is positive, you can continue to introduce new features
If the budget is exhausted, you must only work on reliability
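
The arithmetic behind an error budget can be sketched as follows, assuming a request-based SLO where the budget is the number of failed requests allowed in a period. The function names are illustrative assumptions, not taken from the cards.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Failed requests allowed in the period before the SLO is breached."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Positive while there is budget left, negative once it is overspent."""
    return error_budget(slo_target, total_requests) - failed_requests


def can_release_features(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Feature releases continue only while the remaining budget is positive."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0
```

With a 99.9% SLO over 1,000,000 requests, the budget is about 1,000 failed requests: 400 failures leave room for new features, while 1,500 failures mean reliability work only.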

Question 9

Q

How does SRE implement the Measurement principle of DevOps?

Answer

A

Obsessively measure metrics: SLIs, SLOs, SLAs and KPIs, covering both system (e.g., reliability) and human (e.g., toil) factors
Often compared against a benchmark or other organisations

Question 10

Q

What is a Service Level Indicator (SLI)?

Answer

A

A quantitative measure of some aspect of the level of service provided

Essentially: is a specific aspect of the service “up” or “down”?
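
One common way to quantify an SLI is as the ratio of good events to all events. The sketch below assumes an availability-style indicator based on request counts; the function name is hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were served successfully (between 0.0 and 1.0)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive to compute an SLI")
    return good_events / total_events
```

For example, 9,990 successful requests out of 10,000 gives an SLI of 0.999.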

Question 11

Q

What is a Service Level Objective (SLO)?

Answer

A

A target value or range of values for a Service Level Indicator (SLI) within a timeframe.
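
A target plus a timeframe can be represented and checked as in this minimal sketch. It assumes daily good/total request counts and a rolling window measured in days; the `SLO` class and `slo_met` function are illustrative names only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SLO:
    target: float     # e.g. 0.99 means 99% of requests must be good
    window_days: int  # timeframe over which the SLI is evaluated


def slo_met(slo: SLO, daily_good: Sequence[int], daily_total: Sequence[int]) -> bool:
    """Check the SLI over the most recent window_days against the target."""
    good = sum(daily_good[-slo.window_days:])
    total = sum(daily_total[-slo.window_days:])
    if total == 0:
        # Assumption: with no traffic in the window, nothing has been violated.
        return True
    return good / total >= slo.target
```

The same counts can pass or fail depending on the window: a bad day that has aged out of a 3-day window still counts against a 4-day one.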

Question 12

Q

What is a Service Level Agreement (SLA)?

Answer

A

A contract (with a customer) that sets out the consequences of meeting or missing an SLO.

Usually monetary compensation

Question 13

Q

How does SRE implement the Sharing principle of DevOps?

Answer

A

Sharing of knowledge, tools, and techniques between development and operations
Use the same toolkit across the board

Question 14

Q

What makes a good alert? Why might a Site Reliability Engineer (SRE) care?

Answer

A

Actionable
Raised only for something that cannot be fixed without a human being (if automated remediation is possible, at least try that first)

An SRE cares because they lose sleep over alerts
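
The "try automation first, page a human only if it fails" idea can be sketched as below. Both callables are hypothetical placeholders for whatever remediation script and paging mechanism a team actually uses.

```python
from typing import Callable


def handle_alert(remediate: Callable[[], bool], page_human: Callable[[], None]) -> str:
    """Attempt automated remediation; page the on-call engineer only if it fails."""
    try:
        fixed = remediate()
    except Exception:
        # A crashing remediation counts as a failed remediation.
        fixed = False
    if fixed:
        return "auto-remediated"
    page_human()
    return "paged"
```

Under this design the on-call engineer is only woken for problems that automation could not resolve.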

Question 15

Q

What is a reliability theatre? Why might a Site Reliability Engineer care?

Answer

A

A traditional Network Operations Centre (NOC) or War Room is seen as a reliability theatre that impresses only the general public.
An SRE cares because it may limit the effectiveness of incident response

Question 16

Q

What is a snowflake? Why might a Site Reliability Engineer care?

Answer

A

A production server that is kept running through regular manual configuration tweaks made via the command line.
An SRE cares because they are hard to reproduce and debug.

Question 17

Q

What are pets, cattle, and poultry? Why might a Site Reliability Engineer care?

Answer

A

Pets are virtual (snowflake) servers with names that need individual attention
Cattle are virtual servers with numbers that need group attention
Poultry are virtual containers with numbers that need group attention

An SRE cares because of their administrative cost.

Question 18

Q

Why is autonomous better than automated? Why might a Site Reliability Engineer care?

Answer

A

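Because it is less work: an autonomous system acts on its own, whereas an automated one still needs a human to trigger or supervise it.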
An SRE cares because autonomous systems can take away a world of pain from the on-call rotation.

Question 19

Q

What advantages are there to embedding a Site Reliability Engineer in a development team?

Answer

A

It builds trust between SRE and development
SRE gets an input into system design from the very beginning.

Question 20

Q

What is the right number of nines?

Answer

A

The right number of nines is a decision made on the basis of how much downtime the business can tolerate.
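
To make the trade-off concrete, the downtime allowed by a given number of nines can be computed directly. This is a small sketch assuming a 365-day year; the function name is illustrative.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted per period at the given number of nines.

    Two nines means 99% availability, three nines means 99.9%, and so on.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * period_days * 24 * 60
```

Three nines allow roughly 525.6 minutes (about 8.76 hours) of downtime per year, while two nines allow ten times as much. Each extra nine cuts the tolerated downtime by a factor of ten.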

Question 21

Q

Why is it dangerous to improve a system without revising its Service Level Agreement (SLA)?

Answer

A

Because customers will consider the delivered level of reliability to be the agreed level.

(21 cards)