L19 - Site Reliability Engineering Flashcards

1
Q

What is the role of an SRE?

A
  • An implementation of the DevOps principles
  • Responsible for reliability, availability and performance of distributed systems.
2
Q

What other roles do SREs usually collaborate closely with?

A

Software Developers and System Administrators.

3
Q

What principles does the SRE role work by?

A

CALMS

  • Culture
  • Automation
  • Lean
  • Measurements
  • Sharing
4
Q

SRE personnel can either be independent teams or embedded into cross-functional teams. When is each appropriate?

A
  • Independent for large firms such as Google, Microsoft etc.
  • Embedded for smaller, agile organisations.
5
Q

What is a benefit of having independent SRE teams?

A

Easier to share knowledge with other SRE teams across the organisation.

6
Q

What is a benefit of having embedded SRE teams?

A

Less communication overhead when collaborating with the team.

7
Q

Regarding the Culture principle, what are the 2 main aspects of ensuring good culture in an SRE team?

A

Blamelessness -> No finger pointing, culture of confidence and unity.
Shared Knowledge -> Tight communication loops and shared post-mortem reports.

8
Q

What is a Post-mortem report?

A

A log of an incident, the resulting impact, and the actions taken to resolve the issue.

9
Q

Regarding the Automation principle, what are the 4 reasons this is important?

A

Automation helps:
- Eliminate toil tasks
- Reduce human error
- Complete tasks faster
- Make operations more reliable

10
Q

Define a toil task…

A
  • Tasks that are tedious, repetitive, and manual.
11
Q

What type of tasks are ideal for automation?

A

Toil tasks

12
Q

How many incidents should an SRE deal with per shift? Give reasons…

A
  • At most 2 incidents in an 8-12 hour shift.
  • Prevents pager fatigue
  • Ensures higher quality resolutions as opposed to rushing solutions.
  • Reduces mental context shifting
13
Q

How can SREs implement the Lean principle?

A
  • Keep the backlog of tasks and the amount of work in progress low.
  • Use a Control Loop driven by an Error Budget to determine how much downtime the system can still tolerate.
  • Polarizing time -> Ensure SREs know which tasks to work on throughout the day, reducing context switches between tasks and improving productivity.
14
Q

In a Control Loop driven by an Error Budget, what happens if the Error Budget is positive or negative?

A

Positive: Developers can release more features into production.

Negative: Developers can’t release any more features into production.
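
The positive/negative rule can be sketched as a small check. This is a minimal illustration, not a standard implementation: the function names, the request-based budget, and the choice to treat an exactly spent (zero) budget as a freeze are all assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Failures still allowed in the window; negative means the budget is overspent.

    Hypothetical helper: assumes the budget is counted in failed requests.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def can_release(slo_target: float, total_requests: int,
                failed_requests: int) -> bool:
    """Control-loop decision: release only while the error budget is positive.

    Assumption: a budget of exactly zero is treated as exhausted.
    """
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
    print(can_release(0.999, 1_000_000, 400))    # budget left: releases allowed
    print(can_release(0.999, 1_000_000, 1_500))  # budget overspent: releases frozen
```

In practice the budget is usually tracked over a rolling window, so the decision changes as old failures age out.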

15
Q

Why is Measurements an important SRE principle?

A
  • Data should be collected and tracked to compare against benchmark metrics.
16
Q

What are the 3 core measurements for an SRE to monitor?

A
  • Service Level Indicator (SLI): Quantitative measure of some aspect of the service provided.
  • Service Level Objective (SLO): Target value or range of values for an SLI (an internal measure).
  • Service Level Agreement (SLA): Contractual version of the SLOs.
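
To make the three terms concrete, here is a rough sketch that computes an availability SLI and compares it against an SLO and an SLA. The names, the target values, and the idea that the SLA is set looser than the internal SLO are illustrative assumptions, not part of the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevels:
    slo_target: float  # internal objective, e.g. 0.999 (assumed value)
    sla_target: float  # contractual promise, assumed looser than the SLO


def availability_sli(successful: int, total: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful / total


def evaluate(sli: float, levels: ServiceLevels) -> dict:
    """Compare the measured SLI against the internal and contractual targets."""
    return {
        "meets_slo": sli >= levels.slo_target,
        "meets_sla": sli >= levels.sla_target,
    }


if __name__ == "__main__":
    levels = ServiceLevels(slo_target=0.999, sla_target=0.995)
    sli = availability_sli(successful=9_970, total=10_000)
    print(sli, evaluate(sli, levels))
```

With these numbers the service misses its internal SLO while still meeting the SLA, which is the gap a team would use as early warning before a contractual breach.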
17
Q

Why is Sharing an important SRE principle?

A
  • Sharing knowledge across organisation
  • Organisational learning
  • Sharing reports, tools and techniques.