Observability Flashcards by Lex Slovik

4 Primary Aspects of Observability

Logs, Metrics, Alarms, Events

Logs

a list of application information, system performance, or user activities that provide a verbose, running commentary of activity

What interface do we use to query and visualize the log data stored in Elasticsearch?

Kibana

What are logs most useful for?

Logs are very useful for diagnosing errors because they contain detailed data about activity as well as stack traces.

Are logs or metrics more costly? Why?

Logs contain detailed information which makes them much more costly than metrics, so we should use logs sparingly. For example, we should use a metric to record that a request succeeded. We should use a log when a request fails, because then the stack trace and other context will be useful.
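
A minimal sketch of the rule above, assuming a hypothetical in-memory `metrics` counter standing in for a real metrics client and a hypothetical `handle_request` wrapper (neither comes from the original card). Success only bumps a cheap counter; failure also writes a log that carries the stack trace.

```python
import logging
from collections import Counter

logger = logging.getLogger("rides")

# Hypothetical in-memory stand-in for a metrics client such as statsd.
metrics = Counter()


def handle_request(handler, payload):
    """Run handler; emit a cheap metric on success and a detailed log on failure."""
    try:
        result = handler(payload)
    except Exception:
        metrics["request.failure"] += 1
        # logger.exception records the message plus the current stack trace.
        logger.exception("request failed, payload=%r", payload)
        return None
    metrics["request.success"] += 1
    return result
```

The failure path still increments a counter, so the failure rate can be tracked as a metric while the log keeps the context needed for diagnosis.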

Metrics

a set of numbers, measured over intervals of time, that give information about a particular process or activity

3 Metric Types

Counters, Timers, Gauges

Counters

statsd events that count occurrences. The two most useful outputs are count:sum (the sum of all increments) and count:rate (the sum of all increments, normalized to increments per second).

How many times did an event occur?

e.g. How many times was the database accessed?
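
As a rough illustration of the two counter outputs, the sketch below computes count:sum and count:rate for one aggregation interval. The function name `summarize_counter` and the 10-second interval are assumptions for the example, not statsd's actual implementation.

```python
def summarize_counter(increments, interval_seconds):
    """Return (count:sum, count:rate) for the increments seen in one interval."""
    total = sum(increments)
    rate = total / interval_seconds  # increments per second
    return total, rate


# Database accesses recorded as increments of 1, 1 and 3 over a 10-second interval.
total, rate = summarize_counter([1, 1, 3], interval_seconds=10)
print(total, rate)  # 5 0.5
```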

Timers

pre-computed quartiles/percentiles based on events coming into statsd. Despite the name, they need not measure time; they are used whenever you need to understand the distribution of a piece of data (such as finding the median, 95th percentile, etc.). Timer metric names are suffixed with timer:[computed value] (for example, timer:p99).

e.g. What was the average response time when accessing the database?
e.g. How many miles was the average ride?
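
To show why a distribution view matters, here is a small sketch that computes percentiles with the nearest-rank method. The `percentile` helper and the sample response times are made up for illustration; statsd may compute its percentiles differently.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical database response times in milliseconds, with one slow outlier.
response_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 15]

print(percentile(response_ms, 50))  # 14
print(percentile(response_ms, 95))  # 240
```

The median stays at 14 ms while the 95th percentile exposes the 240 ms outlier, which a single average would blur.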

Gauges

views over an internal value of an application. For most purposes, gauge:mean will be the most useful, representing the arithmetic mean of all received gauge values.

e.g. How many database connections are in our connection pool?
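
A tiny sketch of gauge:mean, assuming a handful of hypothetical connection-pool readings reported during one interval. Unlike a counter, the readings are averaged rather than summed.

```python
from statistics import mean


def gauge_mean(samples):
    """Arithmetic mean of all gauge values received in one interval."""
    return mean(samples)


# Hypothetical readings of the connection-pool size.
pool_size_samples = [8, 10, 9, 13]
print(gauge_mean(pool_size_samples))
```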

Metric names all start with ______________

the environment they are in, specifically production or staging

Alarms / Alerts

notifications sent to the on-call engineer when the data emitted by one or more metrics passes a certain threshold. The terms “alert” and “alarm” are used interchangeably.

Every alert has a condition that determines when it will fire, and that condition is based on ________

one or more metrics.
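
A minimal sketch of a metric-based alert condition, assuming a simple "value exceeds threshold" rule. The `Alert` class and the metric name are hypothetical and do not reflect how Lighthouse defines alerts.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    metric: str       # name of the metric the condition watches
    threshold: float  # the alert fires when the metric value exceeds this

    def should_fire(self, metric_values):
        """Return True when the watched metric is present and above the threshold."""
        value = metric_values.get(self.metric)
        return value is not None and value > self.threshold


# Hypothetical metric name, prefixed with its environment.
alert = Alert("high-error-rate", "production.api.errors.count:rate", threshold=5.0)
print(alert.should_fire({"production.api.errors.count:rate": 7.5}))  # True
```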

What tool do we use to view alerts?

Lighthouse

What tool do we use to manage incidents?

PagerDuty

Events

structured, nestable key-value pairs that contain information about what it took for a service (e.g. a mobile app, backend app, web client, etc.) to perform a unit of work. Events capture rich details regarding system state changes or any significant occurrences that need to be tracked and propagated.

In that way, events are similar to structured logs (a log with some key-value structure like JSON), but operate at a higher abstraction.

Whereas a structured log records an individual action that happens in a service (for example, that a service launch succeeded or failed), an event carries useful metadata about the unit of work, such as how long it took to launch the service or how many times the service failed before it gave up.
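
The contrast can be sketched with two JSON payloads. All field names and values below are invented for illustration and are not the actual event schema.

```python
import json

# A structured log: one flat record about an individual action.
structured_log = {
    "level": "error",
    "message": "service launch failed",
    "service": "rides",
}

# An event: one nested record describing the whole unit of work.
event = {
    "name": "service_launch",
    "service": "rides",
    "outcome": "succeeded",
    "duration_ms": 5230,
    "attempts": {"failed": 2, "total": 3},
}

print(json.dumps(structured_log))
print(json.dumps(event))
```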

Hive Query

the process for accessing older logs

AWS S3

used as our backup logging datastore. Logs are kept on S3 for 120 days.

Presto

Query engine used to search through the AWS S3 logs

Spellbook

Used to issue search queries

Start a new query in Spellbook. Choose Presto as the Database and raw_events as the Schema. You can then search logs in the event_logging_event Hive table based on date.
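
A sketch of the kind of query text you might paste into Spellbook, built in Python. The date column name `ds` and the `build_log_query` helper are assumptions; check the real column names of event_logging_event before running anything.

```python
from datetime import date


def build_log_query(day, limit=100):
    """Build a Presto SQL query for one day of logs (the `ds` date column is an assumption)."""
    if not isinstance(day, date):
        raise TypeError("day must be a datetime.date")
    return (
        "SELECT *\n"
        "FROM raw_events.event_logging_event\n"
        f"WHERE ds = '{day.isoformat()}'\n"
        f"LIMIT {int(limit)}"
    )


print(build_log_query(date(2024, 1, 15)))
```

Filtering on the date first keeps the scan limited to one day of the logs stored on S3.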

Our in-house time-series database, based on open-source M3, that stores our real-time metrics and alerts. We use statsd to aggregate events and relay them to M3 via the wavefrontproxy service.

Metrics can be queried in Lighthouse using the query builder. Queries are written in the open-source PromQL query language; see the query language reference for details.

The charts in Grafana dashboards (such as the Lyft dashboard) are built with M3 queries. To add a new chart to a Grafana dashboard, you’ll need to write its M3 query first. M3 querying can be very useful in tracking everything from communication between services to ride cancellations.

Dashboards (Grafana)

The primary visual tool for service health monitoring

4 Types of Dashboards in Grafana

Default service dashboards, Managed dashboards, Deploy dashboards, Code Health dashboards

Default service dashboards

standardized dashboards generated for all services that have a pillar defined under ops/config/pillar/per_service. These dashboards are intended to provide a good starting point for monitoring standard Python or Go services. Service owners can configure many aspects of their dashboards by updating settings in the pillar.

Managed dashboards

one-off, highly customizable dashboards kept in ops/config/states/managed_dashboards. They are used when a required dashboard does not fit the usual paradigm of monitoring the health of a Python or Go service.

Deploy dashboards

show canary-only metrics, allowing you to compare the metrics of a deployed canary instance with the rest of the cluster. To generate a deploy dashboard for your service, add deploy_dashboard: True to the service pillar, and a dashboard named {service name}-deploy will be created in Grafana. Deploy dashboards are deployed during the 'extra' deploy step.

Code Health dashboards

show overall code health and statistics related to the deploy pipeline. Code Health dashboards are deployed during the 'extra' deploy step.