Final Part 2 Flashcards

1
Q

What are two common architecture structures in modern applications?

A

Monoliths and Microservices

2
Q

What type of architecture are microservices?

A

Service-oriented architecture (SOA)

3
Q

What are the key characteristics of service-oriented architecture (SOA)?

A

– Components or units of functionality that logically manage business functions.

– Every unit is self-contained.

– Users don’t need to know how a component works, only how to interact with it.

– Other services can exist within a unit, but components are loosely coupled.

4
Q

Beyond the decoupling of logic, well-architected microservices offer large engineering organizations what?

A

The capability to split teams by components in which each group of engineers “owns” a service (or group of services) from ideation to production.

5
Q

What is the contrast between microservices and a monolith?

A

Microservices can interact freely with each other, and those services pass information around until data needs to be saved to or retrieved from the database.

Microservices involve a much more free-form architecture than a monolith.

6
Q

From an operations perspective, what do microservices simplify?

A

Deployments

7
Q

Because microservices are typically smaller than a monolith, with fewer lines of code, what can you do?

A

You can more easily deploy small changes frequently, thus eliminating a common challenge in adopting continuous delivery or continuous deployment.

8
Q

Microservices enable refined and targeted ____?

A

Scaling

9
Q

Fail-fast system

A

A system that immediately reports at its interface any condition that is likely to indicate a failure. A fail-fast component will fail at the first sign of a problem.
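
The definition above can be illustrated with a minimal Python sketch. The function name `load_config` and the required keys are hypothetical, chosen only for illustration; the point is that the component raises at its interface as soon as a problem is visible instead of passing bad data along.

```python
REQUIRED_KEYS = ("host", "port")  # hypothetical settings, for illustration only


def load_config(raw: dict) -> dict:
    """Validate configuration up front and fail at the first sign of a problem."""
    for key in REQUIRED_KEYS:
        if key not in raw:
            # Report the failure at the interface immediately instead of
            # letting a missing value surface later, deep inside the system.
            raise ValueError(f"missing required config key: {key}")
    port = raw["port"]
    if not isinstance(port, int) or not 0 < port < 65536:
        raise ValueError(f"invalid port: {port!r}")
    return {"host": raw["host"], "port": port}
```

Calling `load_config({"host": "db"})` raises `ValueError` right away, so the caller learns about the missing port before any connection attempt is made.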

10
Q

In modern development, system components act independently and can change behavior if?

A

A failure is detected in a neighboring component.

11
Q

What can fail-fast features do for a system?

A

Make your system more fault tolerant, allowing it to function even as failure is occurring.

12
Q

What happens if you implement your system well, with failure checks at each potential breaking point?

A

It will show failure earlier than would be typical, because you're made aware of the failure well before a cascading series of failures can cause catastrophic consequences.

In other words, each component is treated independently in failure detection, so a domino effect is less likely to occur.

13
Q

Fail-safe system

A

A system that shuts down operation immediately on discovering a failure to ensure the safety of humans, equipment, data, and any other assets that could be damaged.

14
Q

Kaizen

A

The Japanese word for improvement or change for the better.

15
Q

The most important aspect of a DevOps-focused engineering team is the ability to?

A

Fail well

16
Q

Does the ability to fail well have more to do with people or tooling?

A

People

17
Q

What is the most interesting aspect of Kaizen?

A

It embraces non-catastrophic failure and accepts that the process isn’t perfect and you always have room to refine and improve.

18
Q

What is accountability in DevOps?

A

Accountability means taking ownership over your work, your team, and your organization.

Everyone has a role to play in continuous improvement.

19
Q

Ergonomics

A

The study of human psychology and physiology in design.

20
Q

Cognitive ergonomics

A

The study of how humans perceive and reason about their environment.

How do people make decisions or react to certain stimuli? What makes one person extremely reliable and another flaky?

21
Q

Organizational ergonomics

A

The study of systems and structures inside organizations.

How do teams communicate and work together? What makes some teams cooperative and others competitive?

22
Q

Overengineering

A

Code that’s overworked or solutions that are unnecessarily verbose or complex.

23
Q

What are the warning signs that a solution is overengineered?

A

The problem is more easily managed manually.
The code is unusually verbose. (expressed in more words than needed)
The solution wasn’t peer-reviewed.
The code is difficult to understand.
A free or cheap tool exists that solves the problem.

24
Q

What should an incident checklist include?

A

Notify appropriate colleagues.
Deploy a status page to customers.
Rate the incident to help the first responders respond appropriately.
Schedule a post-incident review.

25
Q

What are services?

A

A set of related systems operated for users.

26
Q

Who is responsible for the health of services?

A

Site Reliability Engineers (SREs)

27
Q

What range of activities does successfully operating a service entail?

A

Developing monitoring systems,

planning capacity,

responding to incidents,

ensuring the root causes of outages are addressed

28
Q

How can we characterize the health of a service?

A

Along a range that runs from the most basic requirements needed for a system to function as a service at all, up to the higher levels of function: taking active control of the direction of the service rather than reactively fighting fires.

This understanding is so fundamental to how Google evaluates services that it wasn't explicitly developed until a number of Google SREs needed a way to explain how to increase systems' reliability.

29
Q

From top to bottom, what are the levels of the Service Reliability Hierarchy?

A

Product, Development, Capacity Planning, Testing + Release Procedures, Postmortem / Root Cause Analysis, Incident Response, and Monitoring

30
Q

Why is monitoring important?

A

Without monitoring, you have no way to tell whether the service is even working; absent a thoughtfully designed monitoring infrastructure, you’re flying blind.

You want to be aware of the issues before your users notice them.

31
Q

What is fundamental to running a stable service?

A

Monitoring

32
Q

What does monitoring enable service owners to do?

A

Make rational decisions about the impact of changes to the service, apply the scientific method to incident response, and of course ensure their reason for existence: to measure the service’s alignment with business goals.

33
Q

What are the reasons why monitoring a very large system is challenging?

A

The sheer number of components being analyzed

The need to maintain a reasonably low maintenance burden on the engineers responsible for the system

34
Q

Google’s monitoring systems don’t just measure simple metrics, such as the average response time of a web server. What else do they need to do?

A

They also need to understand the distribution of those response times across all web servers in that region. This knowledge enables us to identify the factors contributing to the latency tail.
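
To see why an average alone hides the latency tail, here is a small Python sketch. The helper `latency_summary` and the sample data are made up for illustration and are not taken from Google's tooling; the percentiles are computed with the standard library's `statistics.quantiles`.

```python
import statistics


def latency_summary(samples_ms):
    """Return the mean plus the 50th and 99th percentiles of latency samples."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p99": cuts[98],
    }


# Made-up data: 98 fast requests and 2 very slow ones.
samples = [100.0] * 98 + [5000.0] * 2
summary = latency_summary(samples)
```

In this invented data set the mean sits far above what a typical request sees and far below what the slowest requests see, which is why the distribution, not just the average, is needed to reason about the tail.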

35
Q

At the scale our systems operate, why is being alerted for single-machine failures unacceptable?

A

Such data is too noisy to be actionable.

Instead, they try to build systems that are robust against failures in the systems they depend on.

36
Q

Rather than requiring management of many individual components, what should a large system do instead?

A

Be designed to aggregate signals and cut outliers.

They need monitoring systems that allow them to alert for high-level service objectives but retain the granularity to inspect individual components as needed.
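
A rough Python sketch of alerting on an aggregated signal while keeping per-component detail available. All names (`service_error_ratio`, `should_page`, `worst_hosts`) and the 1% threshold are assumptions for illustration; the sketch covers only the aggregation idea, not outlier pruning or any real monitoring system.

```python
def service_error_ratio(errors_by_host, requests_by_host):
    """Aggregate per-host counters into one service-level error ratio."""
    total_requests = sum(requests_by_host.values())
    if total_requests == 0:
        return 0.0  # no traffic observed; nothing to evaluate in this sketch
    return sum(errors_by_host.values()) / total_requests


def should_page(errors_by_host, requests_by_host, threshold=0.01):
    """Page on the high-level objective, not on any single machine."""
    return service_error_ratio(errors_by_host, requests_by_host) > threshold


def worst_hosts(errors_by_host, requests_by_host, count=3):
    """Keep per-host granularity available for inspection when needed."""
    ratios = {
        host: errors_by_host.get(host, 0) / requests
        for host, requests in requests_by_host.items()
        if requests > 0
    }
    return sorted(ratios, key=ratios.get, reverse=True)[:count]
```

With 200 hosts serving equal traffic, one completely failing host yields a 0.5% service-level error ratio, so `should_page` stays quiet, while `worst_hosts` still points at the broken machine for anyone who wants to inspect it.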

37
Q

How did Google’s monitoring systems evolve over the course of 10 years?

A

From the traditional model of custom scripts that check responses and alert, wholly separated from the visual display of trends, to a new paradigm.

This new model made the collection of time series the first-class role in the monitoring system and replaced those check scripts with a rich language for manipulating time series into charts and alerts.

38
Q

On-call duty

A

Being available for calls during both working and non-working hours.

39
Q

In the IT context, on-call activities have historically been performed by?

A

Dedicated Ops teams tasked with the primary responsibility of keeping the service(s) for which they are responsible in good health.

40
Q

How are Google SRE teams different from purely operational teams?

A

They place heavy emphasis on the use of engineering to approach problems.

These problems, which typically fall in the operational domain, exist at a scale that would be intractable without software engineering solutions.

41
Q

As the guardians of production systems, on-call engineers take care of?

A

Their assigned operations by managing outages that affect the team and performing and/or vetting production changes.

42
Q

When on-call, an engineer is available to do what?

A

Perform operations on production systems within minutes, according to the paging response times agreed to by the team and the business system owners.

43
Q

What are typical paging response times for user-facing or otherwise highly time-critical services, and for less time-sensitive systems?

A

5 minutes for user-facing or otherwise highly time-critical services, and 30 minutes for less time-sensitive systems.

44
Q

What are response times for an on-call engineer related to?

A

Desired service availability

For example: if a user-facing system must obtain 4 nines of availability in a given quarter (99.99%), the allowed quarterly downtime is around 13 minutes.

For systems with more relaxed SLOs, the reaction time can be on the order of tens of minutes.
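
The "around 13 minutes" figure can be checked with a short calculation. This sketch assumes a 90-day quarter; the function name `allowed_downtime_minutes` is made up for illustration.

```python
def allowed_downtime_minutes(availability, days=90):
    """Downtime budget in minutes for a target availability over a period.

    `days=90` is an assumption standing in for one quarter.
    """
    minutes_in_period = days * 24 * 60
    return (1 - availability) * minutes_in_period


four_nines = allowed_downtime_minutes(0.9999)   # roughly 12.96 minutes
three_nines = allowed_downtime_minutes(0.999)   # roughly 129.6 minutes
```

Under that assumption, four nines leaves about 12.96 minutes per quarter, so a response time of a few minutes already uses a large share of the budget, whereas three nines leaves room for reaction times in the tens of minutes.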

45
Q

As soon as a page is received and acknowledged, what is the on-call engineer expected to do?

A

Triage the problem and work toward its resolution, possibly involving other team members and escalating as needed.