Monitoring (L13) Flashcards

(27 cards)

1
Q

What is System Monitoring?

A

System monitoring is the continuous process of observing and analyzing a system’s performance, resource usage, and health to make sure it’s running smoothly and efficiently.

2
Q

Why is monitoring critical in distributed systems?

A

Because distributed systems are complex and dynamic, monitoring helps with:
Managing complexity across many services and machines
Detecting partial failures (since parts may fail independently)
Ensuring performance stays within acceptable limits
Debugging quickly when things go wrong
Managing resources like CPU, memory, and storage to prevent overload

3
Q

Differentiate between Monitoring, Logging, and Tracing.

A

Monitoring: Gives a high-level view of system health over time
Logging: Records specific events or messages for detailed inspection
Tracing: Tracks the path of individual requests across system components

4
Q

What is System Performance?

A

It describes how well a distributed system handles tasks and requests while meeting quality expectations like speed, uptime, and reliability.

5
Q

Name the Key Performance Attributes.

A

Latency: How fast requests are processed
Throughput: How many requests are handled over time
Availability: How often the system is up and running
Scalability: How well the system handles growth
Resource Utilization: How efficiently resources are used

6
Q

What are Metrics?

A

Metrics are quantitative values (numbers) collected from a system that tell us how it’s behaving—like CPU usage, request counts, or error rates.

7
Q

What are the three main types of Metrics?

A
  1. Resource Metrics: Track infrastructure (CPU, memory, disk, network)
  2. Application Metrics: Track service behavior (latency, errors, requests/sec)
  3. Business Metrics: Domain-specific, like logins per hour or purchases made
8
Q

What are percentile metrics (like p95, p99) used for?

A

To capture tail latency: the experience of the slowest requests, which an average can hide. For example, p99 latency is the time within which 99% of requests are served; the slowest 1% take longer than that.
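
A minimal sketch of how a percentile could be computed from raw latency samples, assuming the nearest-rank definition (monitoring tools may interpolate or use histogram buckets instead). The function name `percentile` and the sample values are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [20] * 98 + [900, 1500]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

With this made-up data the median and the mean both look healthy, while p99 exposes the slow tail.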

9
Q

Differentiate between Latency and Throughput.

A

Latency: Time taken to respond to a single request
Throughput: Total number of requests handled per time unit
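
As an illustration, both values can be derived from the same request log. This is a sketch with hypothetical `(start_ms, end_ms)` timestamps; it shows that overlapping requests can give high throughput even when each request is not especially fast.

```python
# Hypothetical request log: (start_ms, end_ms) for each request.
requests = [(0, 200), (100, 300), (200, 400), (300, 500)]

# Latency: time taken by each individual request.
latencies_ms = [end - start for start, end in requests]
average_latency_ms = sum(latencies_ms) / len(latencies_ms)

# Throughput: completed requests divided by the observation window.
window_s = (max(end for _, end in requests) - min(start for start, _ in requests)) / 1000
throughput_rps = len(requests) / window_s
```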

10
Q

Differentiate between Bandwidth and Throughput.

A

Bandwidth: The maximum possible data transfer rate
Throughput: The actual achieved data transfer rate (usually lower than bandwidth)
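
A small sketch of the relationship, assuming a hypothetical link with a 1000 Mbps bandwidth and a measured transfer. The function name and numbers are illustrative only.

```python
def throughput_mbps(bytes_transferred, seconds):
    """Achieved data rate in megabits per second."""
    return bytes_transferred * 8 / seconds / 1_000_000

link_bandwidth_mbps = 1000          # assumed maximum rate of the link
achieved = throughput_mbps(125_000_000, 2)  # 125 MB moved in 2 seconds
link_utilization = achieved / link_bandwidth_mbps
```

Here the achieved rate comes out at half of the link's bandwidth.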

11
Q

What are One-way Latency and Round-Trip Time (RTT)?

A

One-way Latency: Time for a message to travel from sender to receiver
RTT: Time for a message to go to the receiver and back
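
One-way latency is hard to measure directly because it needs synchronized clocks on both ends, so it is often estimated from RTT. The sketch below times a round trip and halves it, which assumes the path is symmetric. `fake_echo` is a hypothetical stand-in for a real network call.

```python
import time

def measure_rtt(send_and_wait):
    """Time one full round trip: send a message and wait for the reply."""
    start = time.perf_counter()
    send_and_wait()
    return time.perf_counter() - start

def fake_echo():
    # Hypothetical stand-in for a network echo; sleeps to simulate delay.
    time.sleep(0.01)

rtt = measure_rtt(fake_echo)
one_way_estimate = rtt / 2  # only valid if both directions take equally long
```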

12
Q

Differentiate between Availability and Reliability.

A

Availability: How often the system is **up and accessible**
Reliability: How long the system can run without failure or errors
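
A sketch of how the two can diverge. It assumes availability is measured as the fraction of time the system is up, and uses mean time between failures (MTBF) as one common proxy for reliability; neither formula appears in the cards above, and the two systems are hypothetical.

```python
def availability(uptime_hours, downtime_hours):
    """Fraction of total time the system was up."""
    return uptime_hours / (uptime_hours + downtime_hours)

def mean_time_between_failures(uptime_hours, failure_count):
    """Average running time between failures (a reliability proxy)."""
    return uptime_hours / failure_count

# Two hypothetical systems with identical uptime and downtime totals.
a_availability = availability(990, 10)
b_availability = availability(990, 10)

# System A failed once (one long outage); system B failed 100 times (many short ones).
a_mtbf = mean_time_between_failures(990, 1)
b_mtbf = mean_time_between_failures(990, 100)
```

Both systems are equally available, but system A runs far longer between failures, so it is the more reliable of the two.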

13
Q

What is Error Rate?

A

The percentage of failed requests out of the total number.
Formula:
Error Rate = (Failed Requests / Total Requests) × 100
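
The formula translates directly into code. This sketch adds one assumption not stated in the card: with zero total requests it returns 0.0 rather than dividing by zero.

```python
def error_rate(failed_requests, total_requests):
    """Percentage of requests that failed."""
    if total_requests <= 0:
        return 0.0  # assumption: no traffic is reported as a 0% error rate
    return 100 * failed_requests / total_requests

rate = error_rate(25, 1000)  # 25 failures out of 1000 requests
```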

14
Q

What do 4xx and 5xx errors typically indicate?

A

4xx: Client-side issues (e.g., bad request, unauthorized)
5xx: Server-side issues (e.g., internal error, service crash)
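
A sketch of how a monitor might bucket HTTP status codes into these two classes. The function name, labels, and sample codes are hypothetical.

```python
from collections import Counter

def classify_status(code):
    """Map an HTTP status code to a coarse error class."""
    if 400 <= code <= 499:
        return "client_error"
    if 500 <= code <= 599:
        return "server_error"
    return "other"

# Hypothetical status codes taken from an access log.
statuses = [200, 200, 404, 500, 401, 503, 200]
counts = Counter(classify_status(code) for code in statuses)
```

Tracking the two classes separately matters because a spike in 5xx points at the service itself, while a spike in 4xx usually points at callers.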

15
Q

What are Round Complexity and Communication Complexity in Secure Distributed Systems?

A

Round Complexity: Number of communication steps needed (affects speed & fault tolerance)
Communication Complexity: Total data exchanged (affects bandwidth usage)

16
Q

What are the four main categories of Monitoring Methods/Techniques?

A
  1. Metrics Monitoring: Tracks system values like CPU or latency
  2. Log Monitoring: Analyzes logs for errors and behaviors
  3. Distributed Tracing: Follows requests across services
  4. Event Monitoring: Watches for system events like crashes or scale-ups
17
Q

What is Observability in modern systems?

A

Observability is the ability to understand the system’s internal state just by looking at outputs like metrics, logs, traces, and events.

18
Q

Why do we use System Modeling?

A

To simulate and predict how a system behaves without needing to test on the real system. It helps find bottlenecks, analyze reliability, and make informed design choices.

19
Q

How do Modeling and Monitoring techniques differ?

A

Modeling: Used for prediction and planning (before deployment)
Monitoring: Used for real-time observation (during operation)

20
Q

Name four common types of models in distributed systems.

A
  1. Queuing Models: Predict load and latency
  2. Failure Models: Simulate types of faults
  3. Network Models: Estimate delays and errors in data flow
  4. Workload Models: Test how the system handles different usage patterns
21
Q

What happens in a queuing model if utilization (ρ) is greater than 1?

A

Requests arrive faster than the system can serve them, so it is overloaded: queues grow without bound and latency keeps rising.
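
A minimal sketch of this effect, assuming a simplified deterministic model in which a fixed number of requests arrive and are served each second (real queuing models are probabilistic). All names and rates are hypothetical.

```python
def simulate_backlog(arrival_rate, service_rate, seconds):
    """Track queued requests second by second under constant rates."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # Each second, new arrivals join and the server drains what it can.
        backlog = max(0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history

overloaded = simulate_backlog(arrival_rate=120, service_rate=100, seconds=5)
healthy = simulate_backlog(arrival_rate=80, service_rate=100, seconds=5)
```

In the overloaded run the backlog grows by 20 requests every second and never recovers; in the healthy run it stays at zero.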

22
Q

Name some types of Failure Models.

A

Crash Failure: System stops working
Omission Failure: Messages or responses get lost
Timing Failure: Responses come too early or too late
Response Failure: System gives incorrect results
Byzantine Failure: Components behave maliciously or unpredictably

23
Q

What are Queuing Models used for?

A

To understand how requests are queued and processed, and to predict performance using:
λ (Lambda): Arrival rate
μ (Mu): Service rate
ρ (Rho): Utilization = λ / μ
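
A sketch of how these parameters can be used for prediction. The utilization formula comes from the card; the latency formula W = 1 / (μ − λ) is an added assumption that holds for the M/M/1 model (single server, random arrivals and service times), which the card does not name.

```python
def utilization(arrival_rate, service_rate):
    """rho = lambda / mu."""
    return arrival_rate / service_rate

def mm1_mean_time_in_system(arrival_rate, service_rate):
    """Mean time a request spends queued plus in service, assuming M/M/1."""
    if arrival_rate >= service_rate:
        return float("inf")  # rho >= 1: the queue never stabilizes
    return 1 / (service_rate - arrival_rate)

# Hypothetical server that can handle 100 requests per second.
rho_moderate = utilization(80, 100)
w_moderate = mm1_mean_time_in_system(80, 100)   # seconds
w_busy = mm1_mean_time_in_system(95, 100)       # seconds
w_overloaded = mm1_mean_time_in_system(120, 100)
```

Under this model, raising the load from 80 to 95 requests per second quadruples the mean time in the system, which is why latency climbs steeply as utilization approaches 1.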

24
Q

What might “High throughput + High latency” indicate?

A

Likely a resource bottleneck, such as CPU, memory, or disk under strain.

25
Q

What might “Low latency + High error rates” indicate?

A

Possible bugs, misconfigurations, or bad data causing failures even though responses are fast.

26
Q

If latency rises with high CPU usage, what action is recommended?

A

Add compute capacity: **scale out** by adding more servers or containers, or **scale up** by giving existing machines more CPU and memory.

27
Q

If throughput drops but resources seem fine, what action is recommended?

A

**Optimize the code**, database queries, or business logic to remove inefficiencies.
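
The interpretation rules from cards 24 to 27 can be sketched as a small rule table. This is an illustrative simplification: the boolean inputs are hypothetical signals that a real monitor would derive from thresholds, and "resources seem fine" is approximated here by CPU not being high.

```python
def diagnose(*, latency_high, throughput_high, throughput_dropped,
             error_rate_high, cpu_high):
    """Return the interpretations from cards 24 to 27 that match the signals."""
    findings = []
    if throughput_high and latency_high:
        findings.append("likely resource bottleneck (CPU, memory, or disk)")
    if error_rate_high and not latency_high:
        findings.append("possible bugs, misconfigurations, or bad data")
    if latency_high and cpu_high:
        findings.append("add compute capacity")
    if throughput_dropped and not cpu_high:
        findings.append("optimize code, queries, or business logic")
    return findings

# Hypothetical snapshot: slow responses while CPU is saturated.
result = diagnose(latency_high=True, throughput_high=False,
                  throughput_dropped=False, error_rate_high=False,
                  cpu_high=True)
```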