Datadog Flashcards
(13 cards)
Tags
Labels attached to your data to add context, so that you can filter, group, and correlate that data throughout Datadog.
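As a rough illustration of filtering and grouping by tags, here is a minimal sketch in plain Python. It does not use any Datadog API; the `key:value` tag strings, the metric name, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical data points, each carrying a value and a set of "key:value" tags.
points = [
    {"metric": "cpu.usage", "value": 40.0, "tags": {"env:prod", "service:web"}},
    {"metric": "cpu.usage", "value": 60.0, "tags": {"env:prod", "service:db"}},
    {"metric": "cpu.usage", "value": 10.0, "tags": {"env:staging", "service:web"}},
]

def filter_by_tag(points, tag):
    """Keep only the points that carry the exact tag, e.g. 'env:prod'."""
    return [p for p in points if tag in p["tags"]]

def group_by_tag_key(points, key):
    """Group point values by the value of one tag key, e.g. 'service'."""
    groups = defaultdict(list)
    for p in points:
        for tag in p["tags"]:
            k, _, v = tag.partition(":")
            if k == key:
                groups[v].append(p["value"])
    return dict(groups)

prod_points = filter_by_tag(points, "env:prod")
values_by_service = group_by_tag_key(points, "service")
```

Here `prod_points` holds the two production points, and `values_by_service` maps each service name to its values.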
Kubernetes
A container orchestration platform. It creates dynamic environments by spinning up containers with significantly shorter lifespans than physical hosts.
Cloud Computing
Cloud computing lets you rent computing power and storage on-demand, instead of owning and maintaining your own physical hardware.
It is the delivery of computing services—like servers, storage, databases, networking, software, and analytics—over the internet (“the cloud”) instead of on local machines or data centers.
Datadog
Datadog is a cloud-based monitoring and observability platform.
It helps teams monitor, troubleshoot, and optimize the performance of their applications, infrastructure, and services.
It brings together logs, metrics, traces, and more into a single pane of glass—so that teams can understand what’s happening across their stack, detect issues faster, and collaborate more effectively during incidents.
Metrics
Metrics are the quantitative backbone of monitoring. They are structured, numeric data points collected at regular intervals—ideal for detecting trends, establishing baselines, and triggering alerts. Examples: request latency, error rates, CPU usage.
Relationship to Monitoring:
Monitoring often starts with metrics. You use them to define “healthy” versus “unhealthy” states, and to track system performance over time.
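One way to picture a "healthy" versus "unhealthy" definition is a threshold on an aggregated metric. The sketch below is only an assumption for illustration: the function name, the choice of the mean, and the 500 ms threshold are hypothetical.

```python
from statistics import mean

def is_healthy(latencies_ms, threshold_ms=500.0):
    """Return True when average request latency is at or below the threshold.

    The threshold value is an illustrative assumption, not a recommendation.
    """
    return mean(latencies_ms) <= threshold_ms

recent_ok = [100, 200, 300]      # hypothetical samples, in milliseconds
recent_slow = [400, 700, 900]

healthy_now = is_healthy(recent_ok)
healthy_later = is_healthy(recent_slow)
```

In practice a team would pick the aggregation (mean, percentile, rate) and threshold from its own baselines.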
Monitoring
Monitoring tells you when something breaks, and answers the question, “Is this system working as expected right now?” Its goal is to detect and surface known issues quickly, enabling teams to respond to problems as they arise.
Observability
The ability to analyze and measure the internal states of systems based on their outputs and interactions across assets. The three pillars of observability—metrics, logs, and traces—are used to gain insights into system behavior, identify and resolve performance and security issues to improve system reliability, and provide visibility into complex systems to help identify bottlenecks and failures.
Observability provides visibility and actionable insights into IT performance, security, delivery, reliability, and costs to help organizations address this increasing complexity.
Logs
Logs are detailed chronological records of specific events that occur within a system. Logs offer a granular view of what happened within the system, helping in debugging and understanding specific events. Examples include error messages, transaction records, and system events.
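To make this concrete, the sketch below emits two timestamped log lines with Python's standard `logging` module and captures them in memory. The logger name `checkout`, the messages, and the order ID are hypothetical.

```python
import io
import logging

# Capture log output in memory so the records can be inspected afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep records out of the root logger

# Each call records one event, in chronological order.
logger.info("order placed order_id=%s", 1234)
logger.error("payment failed order_id=%s reason=%s", 1234, "card_declined")

lines = stream.getvalue().splitlines()
```

Each entry in `lines` starts with a timestamp, followed by the level, the logger name, and the event message.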
Traces
Traces are records that track the flow of a request from the frontend to the backend through various components of a system. Traces help in understanding the path and performance of requests, identifying bottlenecks, and diagnosing latency issues. For example, a trace might show how a user request travels through different microservices.
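A trace can be pictured as a set of timed spans, one per component the request passes through. The sketch below is a simplified model, not a real tracing library: the `Span` class, the service names, and the timings are hypothetical, and it assumes child spans under the same parent do not overlap.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: int
    end_ms: int

def self_time_ms(span, spans):
    """Time spent in this span, excluding time spent in its direct children."""
    child_total = sum(
        c.end_ms - c.start_ms for c in spans if c.parent_id == span.span_id
    )
    return (span.end_ms - span.start_ms) - child_total

def bottleneck(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))

# One hypothetical user request flowing through several services.
trace = [
    Span("a", None, "frontend", 0, 120),
    Span("b", "a", "auth-service", 5, 25),
    Span("c", "a", "orders-service", 30, 115),
    Span("d", "c", "database", 40, 110),
]

slowest = bottleneck(trace)
```

With these made-up timings, most of the request's time is spent in the `database` span, which is the kind of latency question a trace helps answer.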
Agents (Collectors)
An agent is a software component that collects and routes telemetry data from systems, applications, or processes. The data is refined, standardized, enriched, tagged, and then exported to an observability platform. A single agent that can collect, process, and route multiple telemetry types provides consistent data collection across technology stacks and makes correlation and troubleshooting easier. Agents should have low overhead (consuming little CPU and memory so they do not affect system performance), be secure by design, and be easy to deploy and manage at any scale, via configuration files or remotely.
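The collect, standardize, tag, and export steps can be sketched as a tiny pipeline. This is a toy model under stated assumptions, not how any particular agent is implemented: the function names, the raw event shape, and the host tags are hypothetical.

```python
def normalize(raw):
    """Standardize a raw reading: trimmed lowercase name, numeric value."""
    return {"metric": raw["name"].strip().lower(), "value": float(raw["value"])}

def enrich(event, host_tags):
    """Attach the host's tags so the event can be correlated later."""
    return {**event, "tags": sorted(host_tags)}

def run_pipeline(raw_events, host_tags, exporter):
    """Process each raw event and hand it to the exporter callable."""
    for raw in raw_events:
        exporter(enrich(normalize(raw), host_tags))

# Stand-in for an observability platform: collect exported events in a list.
exported = []
run_pipeline(
    [{"name": " CPU.Usage ", "value": "41.5"}],
    {"env:prod", "host:web-1"},
    exported.append,
)
```

Passing the exporter as a callable keeps the processing steps independent of where the data is sent.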
OpenTelemetry
An increasingly adopted open source project that provides a set of vendor-neutral standards, APIs, SDKs, and tools for collecting and transferring telemetry data (such as metrics, logs, and traces) from cloud-native applications to various observability platforms.
Telemetry
Telemetry is the automated collection and transmission of data from remote sources so that it can be analyzed for insights into a system’s performance.
AIOps
AIOps (AI for IT operations), when embedded in observability solutions, uses machine learning algorithms to automatically detect anomalies, surface outliers, and find the root cause and blast radius of incidents. This helps teams reduce mean time to detection (MTTD) and mean time to resolution (MTTR).
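As a very simplified stand-in for anomaly detection, the sketch below flags values whose z-score exceeds a threshold. This is a basic statistical check, not the machine learning used by real AIOps products; the function name, the sample data, and the 2.5 threshold are assumptions for illustration.

```python
from statistics import mean, stdev

def find_anomalies(values, z_threshold=2.5):
    """Return indexes of values more than z_threshold sample standard
    deviations away from the mean. Needs at least two values."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]

# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 400]
anomalous_indexes = find_anomalies(latencies)
```

A single large spike inflates both the mean and the standard deviation, so with small samples a lower threshold or a median-based method may work better.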