Monitoring (L13) Flashcards by Adil Musali

What is System Monitoring?

System monitoring is the continuous process of observing and analyzing a system’s
performance, resource usage, and health to make sure it’s running smoothly and efficiently.

Why is monitoring critical in distributed systems?

Because distributed systems are complex and dynamic, monitoring helps with:
● Managing complexity across many services and machines
● Detecting partial failures (since parts may fail independently)
● Ensuring performance stays within acceptable limits
● Debugging quickly when things go wrong
● Managing resources like CPU, memory, and storage to prevent overload

Differentiate between Monitoring, Logging, and Tracing.

● Monitoring: Gives a high-level view of system health over time
● Logging: Records specific events or messages for detailed inspection
● Tracing: Tracks the path of individual requests across system components

What is System Performance?

It describes how well a distributed system handles tasks and requests while meeting quality expectations like speed, uptime, and reliability.

Name the Key Performance Attributes.

● Latency: How fast requests are processed
● Throughput: How many requests are handled over time
● Availability: How often the system is up and running
● Scalability: How well the system handles growth
● Resource Utilization: How efficiently resources are used

What are Metrics?

Metrics are quantitative values (numbers) collected from a system that describe how it is behaving, such as CPU usage, request counts, or error rates.

What are the three main types of Metrics?

● Resource Metrics: Track infrastructure (CPU, memory, disk, network)
● Application Metrics: Track service behavior (latency, errors, requests/sec)
● Business Metrics: Domain-specific, like logins per hour or purchases made

What are percentile metrics (like p95, p99) used for?

To capture tail behavior that averages hide. For example, p99 latency is the time within which 99% of requests are served, so the slowest 1% of requests take at least that long.
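
As a rough illustration of how a percentile is read off a set of samples, here is a minimal sketch using the nearest-rank method. The `percentile` helper and the latency values are hypothetical, and monitoring tools may use interpolation or histogram-based estimates instead.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow outliers.
latencies_ms = [20] * 98 + [900, 1500]

print("mean:", mean(latencies_ms))            # pulled up by the two outliers
print("p95: ", percentile(latencies_ms, 95))  # still 20 ms
print("p99: ", percentile(latencies_ms, 99))  # 900 ms, exposes the slow tail
```

In this made-up data set, p95 looks healthy while p99 reveals the slow requests, which is why tail percentiles are tracked alongside averages.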

Differentiate between Latency and Throughput.

● Latency: Time taken to respond to a single request
● Throughput: Total number of requests handled per time unit
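
The two definitions above can be computed from the same request log. The sketch below assumes a hypothetical list of (start, end) timestamps in seconds; the function names are illustrative only.

```python
from statistics import mean

# Hypothetical request log: (start_time_s, end_time_s) for each request.
requests = [(0.0, 0.2), (0.1, 0.4), (0.5, 0.6), (1.0, 1.5), (1.2, 1.4)]

def mean_latency(reqs):
    """Average time taken per request, in seconds."""
    return mean(end - start for start, end in reqs)

def throughput(reqs, window_s):
    """Requests completed per second over an observation window."""
    return len(reqs) / window_s

print("mean latency (s):", round(mean_latency(requests), 3))
print("throughput (req/s):", throughput(requests, window_s=2.0))
```

Latency is a per-request quantity, while throughput only makes sense relative to a time window, so the two can move independently.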

Differentiate between Bandwidth and Throughput.

● Bandwidth: The maximum possible data transfer rate
● Throughput: The actual achieved data transfer rate (usually lower than bandwidth)

What are One-way Latency and Round-Trip Time (RTT)?

● One-way Latency: Time for a message to travel from sender to receiver
● RTT: Time for a message to go to the receiver and back

Differentiate between Availability and Reliability.

● Availability: How often the system is **up and accessible**
● Reliability: How long the system can run without failure or errors
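
One common way to put numbers on this distinction, not stated in the cards themselves, is to treat mean time between failures (MTBF) as a reliability measure and to estimate steady-state availability as MTBF / (MTBF + MTTR), where MTTR is mean time to repair. The sketch below uses hypothetical figures.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical system A: fails often (low MTBF) but recovers within seconds.
a = availability(mtbf_hours=10, mttr_hours=0.01)
# Hypothetical system B: fails rarely (high MTBF) but takes hours to repair.
b = availability(mtbf_hours=1000, mttr_hours=10)

print(f"A: {a:.4%} available, MTBF 10 h")
print(f"B: {b:.4%} available, MTBF 1000 h")
```

Under these assumed numbers, A is more available yet less reliable than B, which shows that the two attributes are not interchangeable.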

What is Error Rate?

The percentage of failed requests out of the total number.
Formula:
Error Rate = (Failed Requests / Total Requests) × 100
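
The formula translates directly into code. This is a minimal sketch; returning 0.0 when there is no traffic is an assumption made here, not part of the formula.

```python
def error_rate(failed_requests, total_requests):
    """Percentage of failed requests out of all requests."""
    if total_requests == 0:
        return 0.0  # assumption: treat "no traffic" as a 0% error rate
    return failed_requests * 100 / total_requests

print(error_rate(25, 1000))  # 2.5
```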

What do 4xx and 5xx errors typically indicate?

● 4xx: Client-side issues (e.g., bad request, unauthorized)
● 5xx: Server-side issues (e.g., internal error, service crash)
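
A monitoring pipeline might bucket HTTP status codes along these lines. The sketch below is illustrative: the bucket names and the sample codes are hypothetical.

```python
from collections import Counter

def classify_status(code):
    """Map an HTTP status code to a coarse error bucket."""
    if 400 <= code <= 499:
        return "client_error"   # 4xx: problem with the request
    if 500 <= code <= 599:
        return "server_error"   # 5xx: problem on the server
    return "other"              # 1xx, 2xx, 3xx

# Hypothetical status codes taken from an access log.
codes = [200, 404, 500, 200, 401, 503, 200]
counts = Counter(classify_status(c) for c in codes)
print(counts)
```

Counting the two buckets separately helps tell a misbehaving client apart from a failing service.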

What are Round Complexity and Communication Complexity in Secure Distributed Systems?

● Round Complexity: Number of communication steps needed (affects speed & fault tolerance)
● Communication Complexity: Total data exchanged (affects bandwidth usage)

What are the four main categories of Monitoring Methods/Techniques?

● Metrics Monitoring: Tracks system values like CPU or latency
● Log Monitoring: Analyzes logs for errors and behaviors
● Distributed Tracing: Follows requests across services
● Event Monitoring: Watches for system events like crashes or scale-ups

What is Observability in modern systems?

Observability is the ability to understand the system’s internal state just by looking at outputs like metrics, logs, traces, and events.

Why do we use System Modeling?

To simulate and predict how a system behaves without needing to test on the real system. It helps find bottlenecks, analyze reliability, and make informed design choices.

How do Modeling and Monitoring techniques differ?

● Modeling: Used for prediction and planning (before deployment)
● Monitoring: Used for real-time observation (during operation)

Name four common types of models in distributed systems.

● Queuing Models: Predict load and latency
● Failure Models: Simulate types of faults
● Network Models: Estimate delays and errors in data flow
● Workload Models: Test how the system handles different usage patterns

What happens in a queuing model if utilization (ρ) is greater than 1?

The system is overloaded: requests arrive faster than they can be served, so the queue grows without bound and latency keeps increasing.
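
The effect can be seen with a very simple deterministic approximation in which the backlog changes each second by the arrival rate minus the service rate. The function name and the rates are hypothetical, and real queues have random arrivals and service times that this sketch ignores.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Backlog (requests waiting) at the end of each second, starting from empty."""
    backlog = 0.0
    history = []
    for _ in range(seconds):
        # The backlog cannot go below zero: idle capacity is simply lost.
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history

# Utilization 120 / 100 = 1.2: the backlog grows by 20 requests every second.
print(backlog_over_time(arrival_rate=120, service_rate=100, seconds=5))
# Utilization 80 / 100 = 0.8: the server keeps up and the backlog stays empty.
print(backlog_over_time(arrival_rate=80, service_rate=100, seconds=5))
```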

Name some types of Failure Models.

● Crash Failure: System stops working
● Omission Failure: Messages or responses get lost
● Timing Failure: Responses come too early or too late
● Response Failure: System gives incorrect results
● Byzantine Failure: Components behave maliciously or unpredictably

What are Queuing Models used for?

To understand how requests are queued and processed, and to predict performance using:
● λ (Lambda): Arrival rate
● μ (Mu): Service rate
● ρ (Rho): Utilization = λ / μ
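
To show how these parameters feed a prediction, the sketch below computes ρ and, as an added assumption, the mean time a request spends in an M/M/1 queue (single server, random arrivals and service times), which is 1 / (μ − λ). The cards do not name a specific queuing model, so the M/M/1 formula here is only one possible choice.

```python
def utilization(lam, mu):
    """rho = lambda / mu."""
    return lam / mu

def mm1_mean_response_time(lam, mu):
    """Mean time in system for an M/M/1 queue (assumed model): 1 / (mu - lambda)."""
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1 / (mu - lam)

mu = 100  # hypothetical service rate, requests per second
for lam in (50, 90, 99):
    rho = utilization(lam, mu)
    w = mm1_mean_response_time(lam, mu)
    print(f"rho={rho:.2f}  mean response time={w:.2f} s")
```

Under this assumed model, response time rises sharply as ρ approaches 1, long before the queue becomes formally unstable.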

What might “High throughput + High latency” indicate?

Likely a resource bottleneck, such as CPU, memory, or disk under strain.

What might "Low latency + High error rates" indicate?

Possible bugs, misconfigurations, or bad data causing failures even though responses are fast.

If latency rises with high CPU usage, what action is recommended?

**Add capacity**: scale out by adding more servers or containers, or scale up by giving existing nodes more compute resources.

If throughput drops but resources seem fine, what action is recommended?

**Optimize the code**, database queries, or business logic to remove inefficiencies.
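
The last four cards amount to a small set of diagnostic rules. A minimal sketch of how they might be encoded follows; the `diagnose` function and its boolean flags are hypothetical, and deciding what counts as "high" is left to whoever sets the thresholds.

```python
def diagnose(*, high_latency=False, high_throughput=False,
             high_error_rate=False, high_cpu=False, throughput_dropping=False):
    """Return the hints from the cards that match the observed symptoms."""
    hints = []
    if high_throughput and high_latency:
        hints.append("possible resource bottleneck (CPU, memory, or disk)")
    if high_error_rate and not high_latency:
        hints.append("possible bugs, misconfigurations, or bad data")
    if high_latency and high_cpu:
        hints.append("add compute capacity")
    if throughput_dropping and not high_cpu:
        hints.append("optimize code, database queries, or business logic")
    return hints

print(diagnose(high_latency=True, high_cpu=True))
print(diagnose(high_error_rate=True))
```

Rules like these are a starting point for investigation and do not replace looking at logs and traces.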

(27 cards)