Monitoring (L13) Flashcards
(27 cards)
What is System Monitoring?
System monitoring is the continuous process of observing and analyzing a system’s
performance, resource usage, and health to make sure it’s running smoothly and efficiently.
Why is monitoring critical in distributed systems?
Because distributed systems are complex and dynamic, monitoring helps with:
● Managing complexity across many services and machines
● Detecting partial failures (since parts may fail independently)
● Ensuring performance stays within acceptable limits
● Debugging quickly when things go wrong
● Managing resources like CPU, memory, and storage to prevent overload
Differentiate between Monitoring, Logging, and Tracing.
● Monitoring: Gives a high-level view of system health over time
● Logging: Records specific events or messages for detailed inspection
● Tracing: Tracks the path of individual requests across system components
What is System Performance?
It describes how well a distributed system handles tasks and requests while meeting quality expectations like speed, uptime, and reliability.
Name the Key Performance Attributes.
● Latency: How fast requests are processed
● Throughput: How many requests are handled over time
● Availability: How often the system is up and running
● Scalability: How well the system handles growth
● Resource Utilization: How efficiently resources are used
What are Metrics?
Metrics are quantitative values (numbers) collected from a system that tell us how it’s behaving—like CPU usage, request counts, or error rates.
What are the three main types of Metrics?
- Resource Metrics: Track infrastructure (CPU, memory, disk, network)
- Application Metrics: Track service behavior (latency, errors, requests/sec)
- Business Metrics: Domain-specific, like logins per hour or purchases made
What are percentile metrics (like p95, p99) used for?
To capture tail latency: the experience of the slowest requests, which an average can hide. For example, p99 latency is the time within which 99% of requests are served; the remaining 1% take longer.
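As an illustration, percentiles can be computed from a list of latency samples. This is a minimal sketch using the nearest-rank method (one of several common definitions); the function name `percentile` and the sample values are made up for the example.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 900, 14]

p50 = percentile(latencies_ms, 50)  # 14: the typical request
p95 = percentile(latencies_ms, 95)  # 900: dominated by the slow tail
p99 = percentile(latencies_ms, 99)  # 900
```

The median (p50) looks healthy here, while p95 and p99 expose the slow requests that a typical-case number would miss.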
Differentiate between Latency and Throughput.
● Latency: Time taken to respond to a single request
● Throughput: Total number of requests handled per time unit
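The two measures are independent, which a small sketch can show. It assumes each completed request is recorded as a hypothetical (start, end) pair in seconds; the function name `summarize` and both traces are illustrative only.

```python
def summarize(requests):
    """requests: list of (start_s, end_s) tuples for completed requests.
    Returns (mean latency in seconds, throughput in requests per second)."""
    latencies = [end - start for start, end in requests]
    mean_latency = sum(latencies) / len(latencies)
    # Observation window: first start to last end.
    window = max(end for _, end in requests) - min(start for start, _ in requests)
    throughput = len(requests) / window
    return mean_latency, throughput

# Four requests handled one after another, each taking 1 second.
sequential = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
# Four requests handled at the same time, each taking 1 second.
concurrent = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]

seq_latency, seq_throughput = summarize(sequential)  # 1.0 s, 1.0 req/s
con_latency, con_throughput = summarize(concurrent)  # 1.0 s, 4.0 req/s
```

Both traces have the same latency per request, but the concurrent one has four times the throughput.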
Differentiate between Bandwidth and Throughput.
● Bandwidth: The maximum possible data transfer rate
● Throughput: The actual achieved data transfer rate (usually lower than bandwidth)
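A rough sketch of the comparison, assuming a hypothetical 100 Mbit/s link and a made-up transfer measurement; the function name `throughput_mbps` is illustrative.

```python
def throughput_mbps(bytes_transferred, seconds):
    """Achieved data rate in megabits per second (1 Mbit = 1,000,000 bits)."""
    return bytes_transferred * 8 / seconds / 1_000_000

link_bandwidth_mbps = 100.0  # assumed link capacity (the maximum)

# Hypothetical measurement: 50 MB moved in 5 seconds.
achieved_mbps = throughput_mbps(bytes_transferred=50_000_000, seconds=5.0)  # 80.0
link_utilization = achieved_mbps / link_bandwidth_mbps  # about 0.8
```

Here the transfer achieves 80 Mbit/s on a 100 Mbit/s link, so throughput is below bandwidth, as expected.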
What are One-way Latency and Round-Trip Time (RTT)?
● One-way Latency: Time for a message to travel from sender to receiver
● RTT: Time for a message to go to the receiver and back
Differentiate between Availability and Reliability.
● Availability: How often the system is up and accessible
● Reliability: How long the system can run without failure or errors
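Availability is commonly reported as the percentage of time the system was up. A minimal sketch, assuming uptime and downtime are measured in seconds over some observation window; the function name and the numbers are hypothetical.

```python
def availability_percent(uptime_seconds, downtime_seconds):
    """Share of the observation window during which the system was up."""
    total = uptime_seconds + downtime_seconds
    if total == 0:
        raise ValueError("observation window must not be empty")
    return uptime_seconds * 100 / total

# Hypothetical 30-day month with 2,592 seconds (about 43 minutes) of downtime.
month_seconds = 30 * 24 * 3600
downtime_seconds = 2_592
availability = availability_percent(month_seconds - downtime_seconds,
                                    downtime_seconds)  # about 99.9
```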
What is Error Rate?
The percentage of failed requests out of the total number.
Formula:
Error Rate = (Failed Requests / Total Requests) × 100
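The formula above translates directly into code. In this sketch, returning 0.0 when there is no traffic is an assumption made for the example, not something the formula defines.

```python
def error_rate(failed_requests, total_requests):
    """Error rate as a percentage of all requests."""
    if total_requests == 0:
        return 0.0  # assumption: with no traffic, report 0% rather than fail
    return failed_requests * 100 / total_requests

rate = error_rate(failed_requests=25, total_requests=1000)  # 2.5 (percent)
```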
What do 4xx and 5xx errors typically indicate?
● 4xx: Client-side issues (e.g., bad request, unauthorized)
● 5xx: Server-side issues (e.g., internal error, service crash)
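A small sketch of how a monitoring script might separate the two classes. The category labels and the sample status codes are made up for illustration.

```python
from collections import Counter

def classify_status(code):
    """Map an HTTP status code to a coarse category."""
    if 400 <= code <= 499:
        return "client_error"  # 4xx: the request itself was the problem
    if 500 <= code <= 599:
        return "server_error"  # 5xx: the server failed to handle the request
    return "other"             # 1xx, 2xx, 3xx

# Hypothetical status codes taken from an access log.
statuses = [200, 404, 500, 200, 401, 503, 200, 301]
counts = Counter(classify_status(code) for code in statuses)
# counts: other=4, client_error=2, server_error=2
```

Counting the classes separately matters because a spike in 5xx usually points at the service, while a spike in 4xx often points at callers.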
What are Round Complexity and Communication Complexity in Secure Distributed Systems?
● Round Complexity: Number of communication steps needed (affects speed & fault
tolerance)
● Communication Complexity: Total data exchanged (affects bandwidth usage)
What are the four main categories of Monitoring Methods/Techniques?
- Metrics Monitoring: Tracks system values like CPU or latency
- Log Monitoring: Analyzes logs for errors and behaviors
- Distributed Tracing: Follows requests across services
- Event Monitoring: Watches for system events like crashes or scale-ups
What is Observability in modern systems?
Observability is the ability to understand the system’s internal state just by looking at outputs like metrics, logs, traces, and events.
Why do we use System Modeling?
To simulate and predict how a system behaves without needing to test on the real system. It helps find bottlenecks, analyze reliability, and make informed design choices.
How do Modeling and Monitoring techniques differ?
● Modeling: Used for prediction and planning (before deployment)
● Monitoring: Used for real-time observation (during operation)
Name four common types of models in distributed systems.
- Queuing Models: Predict load and latency
- Failure Models: Simulate types of faults
- Network Models: Estimate delays and errors in data flow
- Workload Models: Test how the system handles different usage patterns
What happens in a queuing model if utilization (ρ) is greater than 1?
The system is overloaded: requests arrive faster than they can be served, so queues grow without bound and latency keeps increasing.
Name some types of Failure Models.
● Crash Failure: System stops working
● Omission Failure: Messages or responses get lost
● Timing Failure: Responses come too early or too late
● Response Failure: System gives incorrect results
● Byzantine Failure: Components behave maliciously or unpredictably
What are Queuing Models used for?
To understand how requests are queued and processed, and to predict performance using:
● λ (Lambda): Arrival rate
● μ (Mu): Service rate
● ρ (Rho): Utilization = λ / μ
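The three quantities above can be turned into a rough latency prediction. This sketch assumes the simplest textbook model, M/M/1 (one server, random arrivals, exponentially distributed service times), where the mean time a request spends in the system is 1 / (μ - λ). The cards do not name a specific model, so treat the M/M/1 choice, the function names, and the rates as assumptions.

```python
def utilization(arrival_rate, service_rate):
    """rho = lambda / mu."""
    return arrival_rate / service_rate

def mm1_mean_time_in_system(arrival_rate, service_rate):
    """Mean time in system (waiting + service) under the M/M/1 assumption.
    Only finite when rho < 1; otherwise the queue grows without bound."""
    if arrival_rate >= service_rate:
        return float("inf")
    return 1 / (service_rate - arrival_rate)

service_rate = 100  # mu: hypothetical server that completes 100 requests/s

for arrival_rate in (50, 90, 120):  # lambda, in requests/s
    rho = utilization(arrival_rate, service_rate)
    mean_time_s = mm1_mean_time_in_system(arrival_rate, service_rate)
    print(f"lambda={arrival_rate} rho={rho:.2f} mean_time_s={mean_time_s}")
```

At ρ = 0.5 the predicted mean time is 0.02 s, at ρ = 0.9 it is 0.1 s, and at ρ = 1.2 the model has no steady state, so the function returns infinity.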
What might “High throughput + High latency” indicate?
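Likely a system running near saturation: a resource bottleneck (CPU, memory, or disk under strain) makes requests queue, so latency rises even though many requests still complete.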