Observability Flashcards
4 Primary Aspects of Observability
Logs, Metrics, Alarms, Events
Logs
a list of application information, system performance, or user activities that provide a verbose, running commentary of activity
What interface do we use to query and visualize the log data stored in Elasticsearch?
Kibana
What are logs most useful for?
Logs are very useful for diagnosing errors because they contain detailed data about activity as well as stack traces.
Are logs or metrics more costly? Why?
Logs contain detailed information, which makes them much more costly than metrics, so we should use logs sparingly. For example, we should use a metric to record that a request succeeded. We should use a log when a request fails, because then the stack trace and other context will be useful.
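The rule above can be sketched in a few lines of Python. This is a minimal illustration, not our actual instrumentation: the `metrics` counter is an in-memory stand-in for a real statsd client, and the logger name and metric names are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("example_service")  # hypothetical logger name
metrics = Counter()  # in-memory stand-in for a real statsd client


def handle_request(handler, request):
    """Run handler(request); count successes cheaply, log failures in detail."""
    try:
        result = handler(request)
    except Exception:
        # Failure: the stack trace and request context justify the cost of a log.
        metrics["requests.failure"] += 1
        logger.exception("request failed: %r", request)
        return None
    # Success: a single counter increment is enough.
    metrics["requests.success"] += 1
    return result
```

A successful call only touches the counter; a failing call also writes one log record with the stack trace attached by `logger.exception`.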
Metrics
a set of numbers, measured over intervals of time, that give information about a particular process or activity
3 Metric Types
Counters, Timers, Gauges
Counters
statsd metrics that count how often an event occurs. The two most useful outputs are count:sum (the sum of all increments) and count:rate (the sum of all increments, normalized to increments per second)
How many times did an event occur?
e.g. How many times was the database accessed?
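As a rough sketch of how the two counter outputs relate, the function below computes both from a list of increments received during one aggregation window. The function name and the 10-second window in the usage note are assumptions for illustration, not statsd's actual configuration.

```python
def summarize_counter(increments, interval_seconds):
    """Return count:sum and count:rate for one aggregation window."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    total = sum(increments)
    return {
        "count:sum": total,                      # sum of all increments
        "count:rate": total / interval_seconds,  # increments per second
    }
```

For example, increments of 1, 1, and 3 received within a 10-second window give a count:sum of 5 and a count:rate of 0.5 per second.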
Timers
pre-computed quartiles/percentiles based on events coming into statsd. Despite the name, timers need not measure time; they are used whenever you need to understand the distribution of a piece of data (for example, to find the median or the 95th percentile). Timer metric names are suffixed with timer:[computed value] (for example, timer:p99).
e.g. What was the average response time when accessing the database?
e.g. How many miles was the average ride?
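To make "pre-computed percentile" concrete, here is a minimal sketch using the nearest-rank method. The choice of method is an assumption: the aggregation pipeline may compute percentiles differently (for example, with interpolation or approximate sketches), so treat this only as an illustration of what a value like timer:p99 means.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

With 100 samples valued 1 through 100, `percentile(samples, 99)` returns 99: 99% of the samples are at or below that value.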
Gauges
views over an internal value of an application. For most purposes, gauge:mean will be the most useful, representing the arithmetic mean of all received gauge values.
e.g. How many database connections are in our connection pool?
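A small sketch of what gauge:mean represents, assuming the gauge was sampled several times during one aggregation window. The function name is hypothetical.

```python
from statistics import mean


def summarize_gauge(samples):
    """Return gauge:mean, the arithmetic mean of all received gauge values."""
    if not samples:
        raise ValueError("samples must not be empty")
    return {"gauge:mean": mean(samples)}
```

If the connection pool reported 8, 10, and 12 open connections during the window, gauge:mean is 10.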
Metric names all start with ______________
the environment they are in, specifically production or staging
Alarms / Alerts
notifications sent to the on-call engineer when the data emitted by one or more metrics passes a certain threshold. The terms “alert” and “alarm” are used interchangeably.
Every alert has a condition that determines when it will fire, and that condition is based on ________
one or more metrics.
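The idea of a metric-based alert condition can be sketched as follows. This is not how alerts are actually defined in our tooling; the class, the metric names, and the choice to fire when any single condition is breached are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertCondition:
    metric: str       # hypothetical metric name
    threshold: float  # the alert fires when the value rises above this

    def is_breached(self, latest):
        """latest maps metric names to their most recent values."""
        value = latest.get(self.metric)
        return value is not None and value > self.threshold


def alert_should_fire(conditions, latest):
    """Assumption: the alert fires if any one of its conditions is breached."""
    return any(c.is_breached(latest) for c in conditions)
```

For example, an alert with a condition of more than 5 errors per second on a hypothetical `production.api.errors` metric fires when the latest value is 7 and stays quiet when it is 2.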
What tool do we use to view alerts?
Lighthouse
What tool do we use to manage incidents?
PagerDuty
Events
structured, nestable key-value pairs that contain information about what it took for a service (e.g. a mobile app, backend app, web client, etc.) to perform a unit of work. Events capture rich details regarding system state changes or any significant occurrences that need to be tracked and propagated.
In that way, events are similar to structured logs (a log with some key-value structure like JSON), but operate at a higher abstraction.
Whereas a structured log would have information about an individual action that happens in a service (for example, that a service launch succeeded or failed), events contain useful metadata, such as how long it took to launch the service, or how many times the service failed before it gave up.
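The difference between a structured log and an event can be shown with two small records. All field names and values here are hypothetical; the sketch only illustrates "flat record of one action" versus "nested key-value pairs describing a unit of work".

```python
import json

# A structured log: one action, one flat record.
structured_log = {
    "level": "ERROR",
    "message": "service launch failed",
    "service": "example-service",
}

# An event: nested key-value pairs describing the whole unit of work.
event = {
    "name": "service_launch",
    "service": "example-service",
    "launch": {
        "succeeded": False,
        "duration_ms": 5400,   # how long the launch took
        "failed_attempts": 3,  # how many times it failed before giving up
    },
}


def flatten(pairs, prefix=""):
    """Flatten nested key-value pairs into dotted keys."""
    flat = {}
    for key, value in pairs.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


print(json.dumps(flatten(event), indent=2))
```

Flattening is one possible way to make nested event fields searchable; the structured log is already flat, so flattening leaves it unchanged.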
Hive Query
the process for accessing older logs
AWS S3
used as our backup logging datastore. Logs are kept on S3 for 120 days.
Presto
Query engine used to search through the AWS S3 logs
Spellbook
Used to issue search queries
Start a new query in Spellbook. Choose Presto as the Database and raw_events as the Schema. You can then search logs in the event_logging_event Hive table based on date.
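A query of this kind might look like the sketch below, written as a Python helper that builds the SQL text to paste into Spellbook. The schema and table names come from the card above, but the date column name `ds` and the schema-qualified table reference are assumptions; check the table's actual columns before running anything.

```python
from datetime import date


def build_log_query(day, limit=100):
    """Build a Presto query for one day of logs.

    Assumption: the table has a date column named ds holding YYYY-MM-DD values.
    """
    if not isinstance(day, date):
        raise TypeError("day must be a datetime.date")
    return (
        "SELECT * "
        "FROM raw_events.event_logging_event "
        f"WHERE ds = '{day.isoformat()}' "
        f"LIMIT {int(limit)}"
    )


print(build_log_query(date(2024, 1, 15)))
```

Restricting the search to a single date and adding a LIMIT keeps the scan small, which matters when the underlying data is up to 120 days of logs on S3.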
M3
Our in-house time-series database, based on the open-source M3, which stores our real-time metrics and alerts. We use statsd to aggregate events and relay them to M3 via the wavefrontproxy service.
Metrics can be queried on Lighthouse using the query builder. Queries are written in the open-source PromQL query language; see the query language reference for details.
The charts in Grafana dashboards (such as the Lyft dashboard) are built with M3 queries. To add a new chart to a Grafana dashboard, you’ll need to write its M3 query first. M3 querying can be very useful in tracking everything from communication between services to ride cancellations.
Dashboards (Grafana)
The primary visual tool for service health monitoring
4 Types of Dashboards in Grafana
Default service dashboards, Managed dashboards, Deploy dashboards, Code Health dashboards
Default service dashboards
standardized dashboards generated for all services that have a pillar defined under ops/config/pillar/per_service. These dashboards are intended to provide a good starting point for monitoring standard Python or Go services. Service owners can configure many aspects of their dashboards by updating settings in the pillar.