Designing for Reliability Flashcards

Question 1

Q

As an SRE, you are assigned to support several applications. In the past, these applications
have had significant reliability problems. You would like to understand the performance
characteristics of the applications, so you create a set of dashboards. What kind of data
would you display on those dashboards?
A. Metrics and time-series data measuring key performance attributes, such as CPU
utilization
B. Detailed log data from syslog
C. Error messages output from each application
D. Results from the latest acceptance tests

Answer

A

A. The correct answer is A. If the goal is to understand performance characteristics, then
metrics, particularly time-series data, will show the values of key measurements associated with performance, such as utilization of key resources. Option B is incorrect because
detailed log data describes significant events but does not necessarily convey resource utilization or other performance-related data. Option C is incorrect because errors are types of
events that indicate a problem but are not helpful for understanding normal, baseline operations. Option D is incorrect because acceptance tests measure how well a system meets
business requirements but do not provide point-in-time performance information.

Question 2

Q

After determining the optimal combination of CPU and memory resources for nodes in a
Kubernetes cluster, you want to be notified whenever CPU utilization exceeds 85 percent
for 5 minutes or when memory utilization exceeds 90 percent for 1 minute. What would
you have to specify to receive such notifications?
A. An alerting condition
B. An alerting policy
C. A logging message specification
D. An acceptance test

Answer

A

B. The correct answer is B. Alerting policies are sets of conditions, notification specifications, and selection criteria for determining resources to monitor. Option A is incorrect
because one or more conditions are necessary but not sufficient. Option C is incorrect
because a log message specification describes the content written to a log when an event
occurs. Option D is incorrect because acceptance tests are used to assess how well a system
meets business requirements; they are not related to alerting.
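
To make the difference between a condition and a policy concrete, here is a minimal Python sketch. The class names (Condition, AlertingPolicy), the metric names, the notification address, and the filter string are all hypothetical and are not the Cloud Monitoring API; only the thresholds and durations come from the question.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical types for illustration only; not the Cloud Monitoring API.
Sample = Tuple[float, float]  # (timestamp in seconds, utilization in 0.0-1.0)


@dataclass
class Condition:
    metric: str
    threshold: float
    duration_s: float

    def is_met(self, samples: List[Sample]) -> bool:
        """True if every sample in the trailing window exceeds the threshold."""
        if not samples:
            return False
        latest = samples[-1][0]
        # The series must span the whole window before the condition can fire.
        if latest - samples[0][0] < self.duration_s:
            return False
        window = [v for t, v in samples if t >= latest - self.duration_s]
        return all(v > self.threshold for v in window)


@dataclass
class AlertingPolicy:
    conditions: List[Condition]        # what to watch for
    notification_channels: List[str]   # whom to tell
    resource_filter: str               # which resources to monitor (made-up syntax)

    def evaluate(self, series: Dict[str, List[Sample]]) -> List[str]:
        """Return the channels to notify if any condition is met."""
        fired = any(c.is_met(series.get(c.metric, [])) for c in self.conditions)
        return list(self.notification_channels) if fired else []


policy = AlertingPolicy(
    conditions=[
        Condition("cpu_utilization", 0.85, 5 * 60),   # above 85% for 5 minutes
        Condition("memory_utilization", 0.90, 60),    # above 90% for 1 minute
    ],
    notification_channels=["oncall@example.com"],
    resource_filter="kubernetes nodes",
)

cpu_high = [(t, 0.92) for t in range(0, 301, 60)]  # 5 minutes above 85%
mem_ok = [(t, 0.40) for t in range(0, 301, 60)]
notified = policy.evaluate(
    {"cpu_utilization": cpu_high, "memory_utilization": mem_ok}
)
```

In this sketch a condition on its own only answers "is the threshold exceeded?"; the policy is what ties the conditions to notification channels and a set of resources, which is why option B rather than option A is needed to receive notifications.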

Question 3

Q

A compliance review team is seeking information about how your team handles high-risk
administration operations, such as granting operating system users root privileges. Where
could you find data that shows your team tracks changes to user privileges?
A. In metric time-series data
B. In alerting conditions
C. In audit logs
D. In ad hoc notes kept by system administrators

Answer

A

C. The correct answer is C. Audit logs would contain information about changes to user
privileges, especially privilege escalations such as granting root or administrative access.
Option A and Option B are incorrect, as neither records detailed information about access
control changes. Option D may have some information about user privilege changes, but
notes may be changed and otherwise tampered with, so o

Question 4

Q

Release management practices contribute to improving reliability by which one of the
following?
A. Advocating for object-oriented programming practices
B. Enforcing waterfall methodologies
C. Improving the speed and reducing the cost of deploying code
D. Reducing the use of stateful services

Answer

A

C. The correct option is C. Release management practices reduce manual effort to
deploy code. This allows developers to roll out code more frequently and in smaller
units and, if necessary, quickly roll back problematic releases. Option A is incorrect
because release management is not related to programming paradigms. Option B is
incorrect because release management does not require waterfall methodologies. Option
D is incorrect. Release management does not influence the use of stateful or stateless
services.

Question 5

Q

A team of software engineers is using release management practices. They want developers
to check code into the central team code repository several times during the day. The team
also wants to make sure that the code that is checked in is functioning as expected before
building the entire application. What kind of tests should the team run before attempting to
build the application?
A. Unit tests
B. Stress tests
C. Acceptance tests
D. Compliance tests

Answer

A

A. The correct answer is A. Unit tests check the smallest testable unit of code.
These tests should be run before any attempt to build a new version of an application.
Option B is incorrect because a stress test could be run on the unit of code, but it is more
than what is necessary to test if the application should be built. Option C is incorrect
because acceptance tests are used to confirm that business requirements are met; a build
that only partially meets business requirements is still useful for developers to create.
Option D is incorrect because compliance tests is a fictitious term and not an actual class of
tests used in release management.
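
As an illustration of a unit test that could gate a build, the sketch below uses Python's standard unittest module. The function apply_discount and the helper run_unit_tests are hypothetical examples, not part of any particular release pipeline.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical unit under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_applies_percentage(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_invalid_percentage(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)


def run_unit_tests() -> bool:
    """Run the suite and report whether the build may proceed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A build step could call run_unit_tests() on each check-in and skip building the full application when it returns False.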

Question 6

Q

Developers have just deployed a code change to production. They are not routing any traffic
to the new deployment yet, but they are about to send a small amount of traffic to servers
running the new version of code. What kind of deployment are they using?
A. Blue/Green deployment
B. Before/After deployment
C. Canary deployment
D. Stress deployment

Answer

A

C. The correct answer is C. This is a canary deployment, in which a small amount of traffic is routed to the new version first. Option A is incorrect because
Blue/Green deployment uses two fully functional environments and all traffic is routed to
one of those environments at a time. Option B and Option D are incorrect because they are
not actual names of deployment types.
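
The traffic split behind a canary deployment can be sketched as weighted random routing. This is a simplified illustration: the function name choose_version and the 5 percent weight are assumptions, and real load balancers implement traffic splitting in their own ways.

```python
import random


def choose_version(canary_weight: float, rng: random.Random) -> str:
    """Route one request: 'canary' with probability canary_weight, else 'stable'."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    return "canary" if rng.random() < canary_weight else "stable"


rng = random.Random(42)  # seeded so the sketch is repeatable

# Before the canary phase: the new code is deployed but receives no traffic.
before = [choose_version(0.0, rng) for _ in range(1000)]

# Canary phase: roughly 5% of requests reach the new version.
during = [choose_version(0.05, rng) for _ in range(10000)]
canary_share = during.count("canary") / len(during)
```

If the canary servers behave well, the weight can be raised step by step until all traffic reaches the new version; if not, setting it back to 0.0 removes the new version from the request path.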

Question 7

Q

You have been hired to consult with an enterprise software development team that is starting to
adopt agile and DevOps practices. The developers would like advice on tools that they can
use to help them collaborate on software development in the Google Cloud. What version
control software might you recommend?
A. Jenkins and Cloud Source Repositories
B. Syslog and Cloud Build
C. GitHub and Cloud Build
D. GitHub and Cloud Source Repositories

Answer

A

D. The correct answer is D. GitHub and Cloud Source Repositories are version control systems. Option A is incorrect because Jenkins is a CI/CD tool, not a version control system.
Option B is incorrect because neither Syslog nor Cloud Build is a version control system.
Option C is incorrect because Cloud Build is not a version control system.

Question 8

Q

A startup offers a software-as-a-service solution for enterprise customers. Many of the components of the service are stateful, and the system has not been designed to allow incremental rollout of new code. The entire environment has to be running the same version of the
deployed code. What deployment strategy should they use?
A. Rolling deployment
B. Canary deployment
C. Stress deployment
D. Blue/Green deployment

Answer

A

D. The correct answer is D. A Blue/Green deployment is the kind of deployment that allows
developers to deploy new code to an entire environment before switching traffic to it.
Option A and Option B are incorrect because they are incremental deployment strategies.
Option C is not an actual deployment strategy.
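
A minimal sketch of the Blue/Green switch follows. The BlueGreenRouter class and the version labels are hypothetical; the sketch only shows that new code is installed on the idle environment and that all traffic moves at once.

```python
class BlueGreenRouter:
    """Hypothetical router: all traffic goes to exactly one environment."""

    def __init__(self, blue_version: str, green_version: str, live: str = "blue"):
        self.versions = {"blue": blue_version, "green": green_version}
        self.live = live

    def idle(self) -> str:
        """Name of the environment that currently serves no traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        """Install a new version on the environment that serves no traffic."""
        self.versions[self.idle()] = version

    def switch(self) -> None:
        """Cut all traffic over at once; calling it again rolls back."""
        self.live = self.idle()

    def route(self) -> str:
        """Every request is served by the live environment's version."""
        return self.versions[self.live]


router = BlueGreenRouter(blue_version="v1", green_version="v1")
router.deploy_to_idle("v2")
served_before_switch = router.route()
router.switch()
served_after_switch = router.route()
```

Because route() never mixes versions, every request sees the same version of the code at any moment, which is the constraint stated in the question.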

Question 9

Q

A service is experiencing unexpectedly high volumes of traffic. Some components of the
system are able to keep up with the workload, but others are unable to process the volume
of requests. These services are returning a large number of internal server errors. Developers need to release a patch as soon as possible that provides some relief for an overloaded
relational database service. Both memory and CPU utilization are near 100 percent. Horizontally scaling the relational database is not an option, and vertically scaling the database
would require too much downtime. What strategy would be the fastest to implement?
A. Shed load
B. Increase connection pool size in the database
C. Partition the workload
D. Store data in a Pub/Sub topic

Answer

A

A. The correct option is A. The developers should create a patch to shed load. Option B
would not solve the problem, since more connections would allow more clients to connect to the database, but CPU and memory are saturated, so no additional work can be
done. Option C could be part of a long-term architecture change, but it could not be
implemented quickly. Option D could also be part of a longer-term solution to allow
a database to buffer requests and process them at a rate allowed by available database
resources.
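
One possible shape for a load-shedding patch is sketched below. The LoadShedder class, the 0.9 limit, and the 503 response are assumptions chosen for illustration; the source does not specify how the load should be shed.

```python
from typing import Callable, Tuple


class LoadShedder:
    """Hypothetical guard that rejects requests while the database is saturated."""

    def __init__(self, utilization: Callable[[], float], limit: float = 0.9):
        self.utilization = utilization  # returns current utilization in 0.0-1.0
        self.limit = limit
        self.shed_count = 0

    def handle(self, request: str, query: Callable[[str], str]) -> Tuple[int, str]:
        """Return (status, body); shed with 503 instead of querying when overloaded."""
        if self.utilization() >= self.limit:
            self.shed_count += 1
            return (503, "overloaded, retry later")
        return (200, query(request))


load = {"cpu": 0.99}
shedder = LoadShedder(lambda: load["cpu"])
rejected = shedder.handle("GET /orders", lambda r: "rows")

load["cpu"] = 0.50
accepted = shedder.handle("GET /orders", lambda r: "rows")
```

The rejected request never reaches the database, so the saturated CPU and memory are spent only on the requests that are accepted.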

Question 10

Q

A service has detected that a downstream process is returning a large number of errors. The
service automatically slows down the number of messages it sends to the downstream process. This is an example of what kind of strategy?
A. Load shedding
B. Upstream throttling
C. Rebalancing
D. Partitioning

Answer

A

B. The correct answer is B. This is an example of upstream or client throttling. Option A
is incorrect because load is not shed; rather, it is just delayed. Option C is incorrect. There
is no rebalancing of load, such as might be done on a Kafka topic. Option D is incorrect.
There is no mention of partitioning data.
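
Upstream throttling can be sketched as a sender that lowers its own send rate when the downstream error ratio rises. The ThrottledSender class, the 5 percent error limit, and the halve-then-recover rule are assumptions for illustration only.

```python
class ThrottledSender:
    """Hypothetical sender that adapts its send rate to downstream errors."""

    def __init__(self, rate: float = 100.0, min_rate: float = 1.0,
                 max_rate: float = 100.0):
        self.rate = rate          # messages per second
        self.min_rate = min_rate
        self.max_rate = max_rate

    def record_window(self, sent: int, errors: int,
                      error_limit: float = 0.05) -> float:
        """Halve the rate when the error ratio is high; otherwise recover by 10%."""
        ratio = errors / sent if sent else 0.0
        if ratio > error_limit:
            self.rate = max(self.min_rate, self.rate / 2)
        else:
            self.rate = min(self.max_rate, self.rate * 1.1)
        return self.rate

    def delay_between_sends(self) -> float:
        """Seconds to wait between messages at the current rate."""
        return 1.0 / self.rate


sender = ThrottledSender()
after_errors = sender.record_window(sent=100, errors=40)  # downstream failing
after_more = sender.record_window(sent=50, errors=30)     # still failing
after_recovery = sender.record_window(sent=25, errors=0)  # healthy again
```

No message is discarded here; the sender only waits longer between sends, which matches the explanation that the load is delayed rather than shed.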

(10 cards)