# Monitoring vs. Observability

## Monitoring
- Answers: "Is the system healthy?"
- Predefined dashboards, alerts, and thresholds.
- Reactive: detects known failure modes.
- Good for known-unknowns.
## Observability
- Answers: "Why is the system unhealthy?"
- Ability to ask arbitrary questions of the system.
- Proactive: debug novel, unexpected failures.
- Essential for unknown-unknowns.
## The Three Pillars

### Metrics
Metrics are numeric values measured over time. They are lightweight, aggregatable, and ideal for dashboards and alerts.
- Counter: Monotonically increasing value (total requests, total errors). Only goes up.
- Gauge: Value that can go up or down (current memory usage, active connections, queue depth).
- Histogram: Distribution of values (request latency in buckets: 0-10ms, 10-50ms, 50-100ms, etc.).
- Summary: Pre-calculated quantiles (p50, p95, p99 latency) computed on the client side.
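The semantics of these metric types can be sketched in a few lines of plain Python. This is illustrative only; a real client library (e.g., prometheus_client) additionally handles labels, concurrency, and exposition formats.

```python
class Counter:
    """Monotonically increasing value: total requests, total errors."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters only go up"
        self.value += amount

class Gauge:
    """Point-in-time value that can rise or fall: memory, connections."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into cumulative latency buckets."""
    def __init__(self, buckets=(10, 50, 100, float("inf"))):
        self.buckets = buckets
        self.counts = {b: 0 for b in buckets}
    def observe(self, value):
        for b in self.buckets:
            if value <= b:  # cumulative: an observation lands in every bucket >= it
                self.counts[b] += 1

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(42)  # lands in the 50ms, 100ms, and +Inf buckets, but not 10ms
```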
#### The USE Method (for infrastructure)
For every resource (CPU, memory, disk, network), check:
- Utilization: What percentage of the resource is in use?
- Saturation: How much extra work is queued?
- Errors: How many errors have occurred?
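Applied to a single resource, the USE method reduces to three numbers per check. A minimal sketch, using a hypothetical snapshot of a disk device:

```python
# Hypothetical metrics snapshot for one disk device.
snapshot = {
    "capacity_iops": 500,   # what the device can sustain
    "current_iops": 430,    # work currently in flight
    "queued_requests": 12,  # work waiting for the device
    "io_errors": 0,         # errors since the last check
}

utilization = snapshot["current_iops"] / snapshot["capacity_iops"]  # fraction in use
saturation = snapshot["queued_requests"]                            # extra work queued
errors = snapshot["io_errors"]

print(f"U={utilization:.0%} S={saturation} E={errors}")  # U=86% S=12 E=0
```

High saturation with moderate utilization often signals trouble earlier than utilization alone, which is why the method tracks both.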
#### The RED Method (for services)
For every service endpoint, track:
- Rate: Requests per second.
- Errors: Failed requests per second.
- Duration: Latency distribution (p50, p95, p99).
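RED metrics can be derived from a window of request records. A sketch with invented sample data (in practice these come from a metrics pipeline, and percentiles from histograms, not sorted lists):

```python
# Hypothetical one-minute window of (status_code, latency_ms) records.
window_seconds = 60
requests = [
    (200, 12), (200, 18), (500, 950), (200, 25), (404, 8),
]

rate = len(requests) / window_seconds                            # requests/second
error_rate = sum(1 for s, _ in requests if s >= 500) / window_seconds

latencies = sorted(ms for _, ms in requests)                     # [8, 12, 18, 25, 950]
p50 = latencies[len(latencies) // 2]                             # crude median: 18
p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]  # 950
```

Note how a single slow failing request dominates p95 while leaving p50 untouched; this is why tail latencies are tracked separately.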
### Logs
Logs record discrete events with rich context. They are essential for debugging but expensive at scale.
```json
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "user_id": "u-42",
  "message": "Payment processing failed",
  "error": "CardDeclined",
  "latency_ms": 1250,
  "amount": 59.99,
  "currency": "USD"
}
```

Always use structured (JSON) logs rather than free-text logs. Structured logs can be indexed, searched, and aggregated by machines. Include correlation IDs (trace_id, request_id) to link logs across services.
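A structured logger like the one above can be sketched with Python's standard `logging` module and a JSON formatter. The service name and the set of extra fields here are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including correlation fields."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Merge correlation/context fields passed via logging's `extra=` kwarg.
        for key in ("trace_id", "user_id", "error"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payments")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "error": "CardDeclined"})
```

Because every line is a JSON object, a log backend can filter on `trace_id` and stitch together one request's logs across every service it touched.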
### Distributed Traces
A trace follows a single request as it traverses multiple services. Each service creates a "span" with a start time, duration, and metadata. Spans are linked by a shared trace ID.
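The span/trace relationship can be sketched with a toy span model (real tracing SDKs like OpenTelemetry add context propagation, sampling, and export; the class here is invented for illustration):

```python
import time
import uuid

class Span:
    """A toy span: a trace is a tree of spans sharing one trace_id."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole request
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent_id
        self.start = time.monotonic()
        self.duration_ms = None

    def child(self, name):
        # A downstream service would receive trace_id/span_id via request
        # headers and create its own span from them.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000

root = Span("checkout")              # e.g. at the API gateway
payment = root.child("charge-card")  # e.g. in the payment service
payment.finish()
root.finish()
```

A trace viewer reassembles the tree by grouping spans on `trace_id` and linking each span's `parent_id` to its caller.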
Trace context must be propagated to downstream services via request headers (e.g., the W3C `traceparent` header) so that every span joins the same trace.

## SLIs, SLOs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of a specific aspect of service quality. | p99 latency, error rate, availability |
| SLO (Service Level Objective) | A target value or range for an SLI. Internal commitment. | "p99 latency < 200ms", "99.95% availability" |
| SLA (Service Level Agreement) | A contract with consequences for violating the SLO. External commitment to customers. | "99.9% uptime or service credits issued" |
### Error Budgets
If your SLO is 99.9% availability, you have a 0.1% error budget. This means you can afford ~43 minutes of downtime per month. Error budgets balance reliability with development velocity:
- If the error budget is healthy, teams ship features faster.
- If the budget is nearly exhausted, teams focus on reliability work.
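The budget arithmetic is simple enough to write down. A sketch, assuming a 30-day month:

```python
def downtime_budget_minutes(slo_percent, days=30):
    """Convert an availability SLO into allowed downtime per `days`-day period."""
    error_budget = 1 - slo_percent / 100   # e.g. 99.9% -> 0.001
    return days * 24 * 60 * error_budget   # total minutes * budget fraction

print(downtime_budget_minutes(99.9))    # ~43.2 minutes per month
print(downtime_budget_minutes(99.95))   # ~21.6 minutes per month
print(downtime_budget_minutes(99.99))   # ~4.3 minutes per month
```

Each extra "nine" cuts the budget by a factor of ten, which is why tightening an SLO is an expensive engineering decision, not a free marketing claim.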
## Alerting

### Alert Design Principles
- Alert on symptoms, not causes. Alert on "error rate > 5%" (symptom), not "CPU > 80%" (cause). Users care about symptoms.
- Every alert must be actionable. If a human cannot take action, it should not be an alert: make it a dashboard metric instead.
- Avoid alert fatigue. Too many alerts desensitize on-call engineers. Ruthlessly prune noisy alerts.
- Use multiple severity levels. Critical (page immediately), Warning (investigate soon), Info (awareness only).
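One way noisy alerts are pruned in practice is a hold duration: the condition must stay true for a window before the alert fires, as in Prometheus's `for:` clause. A simplified sketch (ignoring scrape gaps and resets):

```python
def should_fire(samples, threshold=0.05, hold_seconds=300):
    """samples: list of (timestamp_seconds, error_rate), oldest first.
    Fire only if every sample in the trailing `hold_seconds` window breaches
    the threshold -- a brief spike alone does not page anyone."""
    if not samples:
        return False
    latest = samples[-1][0]
    window = [v for t, v in samples if t >= latest - hold_seconds]
    return len(window) > 0 and all(v > threshold for v in window)

# One-minute spike, then recovery: should not page.
spike = [(0, 0.01), (60, 0.20), (120, 0.01), (180, 0.01), (240, 0.01), (300, 0.01)]
# Error rate elevated for the full five minutes: should page.
sustained = [(t, 0.10) for t in range(0, 360, 60)]

print(should_fire(spike))      # False
print(should_fire(sustained))  # True
```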
### Alerting Pipeline

```
Metric collected (Prometheus)
  -> Alerting rule evaluated (e.g., error_rate > 0.05 for 5m)
  -> Alert fires (Alertmanager)
  -> Notification routed (PagerDuty, Slack, email)
  -> On-call engineer responds
  -> Incident managed, postmortem written
```

## Observability Stack Example
| Layer | Open Source | Managed / SaaS |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, CloudWatch |
| Logs | ELK (Elasticsearch, Logstash, Kibana), Loki | Splunk, Datadog Logs, CloudWatch Logs |
| Traces | Jaeger, Zipkin, Grafana Tempo | AWS X-Ray, Datadog APM, Honeycomb |
| Instrumentation | OpenTelemetry (unified SDK) | Vendor-specific agents |
| Alerting | Alertmanager, Grafana Alerting | PagerDuty, Opsgenie, VictorOps |
OpenTelemetry (OTel) is the emerging standard for instrumentation. It provides a single SDK that generates metrics, logs, and traces in a vendor-neutral format. Instrument your code with OTel once, and export to any backend (Prometheus, Jaeger, Datadog, etc.).
## Key Takeaways
- The three pillars of observability are metrics (numeric time-series), logs (discrete events), and traces (request paths across services).
- Use the USE method for infrastructure resources and the RED method for service endpoints.
- Define SLIs and SLOs to measure reliability objectively. Use error budgets to balance velocity and reliability.
- Alert on symptoms, make every alert actionable, and guard against alert fatigue.
- Use structured logging and distributed tracing with correlation IDs to debug issues across microservices.
- OpenTelemetry provides vendor-neutral instrumentation for all three pillars.