Monitoring & Observability

Monitoring tells you when something is wrong. Observability tells you why. A production system without observability is a black box: you are flying blind. The three pillars of observability (metrics, logs, and traces) give you the visibility needed to operate, debug, and optimize distributed systems at scale.

Monitoring vs. Observability

Monitoring

  • Answers: "Is the system healthy?"
  • Predefined dashboards, alerts, and thresholds.
  • Reactive: detects known failure modes.
  • Good for known-unknowns.

Observability

  • Answers: "Why is the system unhealthy?"
  • Ability to ask arbitrary questions of the system.
  • Proactive: debug novel, unexpected failures.
  • Essential for unknown-unknowns.

The Three Pillars

  • Metrics: Numeric measurements over time intervals. Tools: Prometheus, Datadog, CloudWatch, Grafana.
  • Logs: Discrete events with context and detail. Tools: ELK Stack, Loki, Splunk, CloudWatch.
  • Traces: Request path across services with timing. Tools: Jaeger, Zipkin, AWS X-Ray, Tempo.

Metrics

Metrics are numeric values measured over time. They are lightweight, aggregatable, and ideal for dashboards and alerts.

  • Counter: Monotonically increasing value (total requests, total errors). Only goes up.
  • Gauge: Value that can go up or down (current memory usage, active connections, queue depth).
  • Histogram: Distribution of values (request latency in buckets: 0-10ms, 10-50ms, 50-100ms, etc.).
  • Summary: Pre-calculated quantiles (p50, p95, p99 latency) computed on the client side.
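The four metric types can be sketched in a few lines of plain Python. These are illustrative classes, not a real client library (production code would use something like prometheus_client); the bucket bounds are the example latency buckets above.

```python
import bisect

class Counter:
    """Monotonically increasing value -- only goes up."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Value that can go up or down."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into cumulative latency buckets (upper bounds in ms)."""
    def __init__(self, buckets=(10, 50, 100, float("inf"))):
        self.buckets = list(buckets)
        self.counts = [0] * len(self.buckets)
    def observe(self, value):
        # Cumulative semantics, as in Prometheus: every bucket whose
        # upper bound covers the value is incremented.
        start = bisect.bisect_left(self.buckets, value)
        for i in range(start, len(self.counts)):
            self.counts[i] += 1

requests = Counter()
requests.inc()
in_flight = Gauge()
in_flight.set(7)
latency = Histogram()
for ms in (4, 12, 250):
    latency.observe(ms)
print(requests.value, in_flight.value, latency.counts)  # 1 7 [1, 2, 2, 3]
```

A summary (client-side quantiles) is omitted here because it requires a streaming quantile estimator; histograms are generally preferred since they can be aggregated across instances.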

The USE Method (for infrastructure)

For every resource (CPU, memory, disk, network), check:

  • Utilization: What percentage of the resource is in use?
  • Saturation: How much extra work is queued?
  • Errors: How many errors have occurred?

The RED Method (for services)

For every service endpoint, track:

  • Rate: Requests per second.
  • Errors: Failed requests per second.
  • Duration: Latency distribution (p50, p95, p99).
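As a rough sketch, RED metrics can be derived from a window of raw request records. The record shape and window length here are assumptions for illustration; real systems compute these from instrumented counters and histograms.

```python
def red_metrics(records, window_seconds=60):
    """records: list of (status_code, latency_ms) observed in the window."""
    n = len(records)
    rate = n / window_seconds                           # Rate: requests/sec
    errors = sum(1 for status, _ in records if status >= 500)
    error_rate = errors / window_seconds                # Errors: failures/sec
    latencies = sorted(ms for _, ms in records)
    def pct(p):                                         # Duration: nearest-rank percentile
        return latencies[min(n - 1, int(p / 100 * n))]
    return {"rate": rate, "error_rate": error_rate,
            "p50": pct(50), "p95": pct(95), "p99": pct(99)}

sample = [(200, 12), (200, 15), (500, 900), (200, 20)] * 30  # 120 requests
m = red_metrics(sample)
print(m["rate"], m["error_rate"], m["p50"], m["p95"])  # 2.0 0.5 20 900
```

Note how the failing requests dominate p95 while barely moving p50, which is exactly why the RED method tracks a latency distribution rather than an average.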

Logs

Logs record discrete events with rich context. They are essential for debugging but expensive at scale.

Structured Log (JSON):

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "user_id": "u-42",
  "message": "Payment processing failed",
  "error": "CardDeclined",
  "latency_ms": 1250,
  "amount": 59.99,
  "currency": "USD"
}
Structured Logging

Always use structured (JSON) logs rather than free-text logs. Structured logs can be indexed, searched, and aggregated by machines. Include correlation IDs (trace_id, request_id) to link logs across services.
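A minimal structured-logging sketch using only the standard library is shown below. Production services typically use a dedicated library (structlog, python-json-logger); the service name and field list here simply mirror the example record above.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "payment-service",   # assumed service name
            "message": record.getMessage(),
        }
        # Attach correlation IDs and structured fields passed via `extra=`.
        for key in ("trace_id", "user_id", "error", "latency_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "error": "CardDeclined",
                    "latency_ms": 1250})
```

Because every record carries trace_id, a log-search query for one ID returns the full cross-service story of a single request.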

Distributed Traces

A trace follows a single request as it traverses multiple services. Each service creates a "span" with a start time, duration, and metadata. Spans are linked by a shared trace ID.

1. Trace starts: The API gateway generates a unique trace_id and creates the root span.
2. Propagation: The trace_id is passed in HTTP headers (e.g., traceparent) to downstream services.
3. Child spans: Each service creates a child span with the same trace_id, recording its own latency and metadata.
4. Collection: Spans are shipped to a tracing backend (Jaeger, Tempo), which stitches them into a trace timeline.
5. Visualization: A waterfall view shows exactly where time was spent across services, revealing bottlenecks.
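The first three steps above can be sketched as header manipulation. This is a simplified illustration of W3C trace-context propagation, not a spec-complete implementation; a real service would rely on an SDK such as OpenTelemetry.

```python
import secrets

def new_traceparent():
    """Root span: the API gateway mints a trace_id and a root span_id."""
    trace_id = secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def child_traceparent(parent_header):
    """Downstream service: keep the trace_id, mint a new span_id."""
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

root = new_traceparent()
child = child_traceparent(root)
# Both headers share the same trace_id, linking the spans into one trace.
assert root.split("-")[1] == child.split("-")[1]
```

Each service would also record its span's start time and duration and export it to the tracing backend, which joins spans on the shared trace_id.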

SLIs, SLOs, and SLAs

  • SLI (Service Level Indicator): A quantitative measure of a specific aspect of service quality. Examples: p99 latency, error rate, availability.
  • SLO (Service Level Objective): A target value or range for an SLI; an internal commitment. Examples: "p99 latency < 200ms", "99.95% availability".
  • SLA (Service Level Agreement): A contract with consequences for violating the SLO; an external commitment to customers. Example: "99.9% uptime or service credits issued".

Error Budgets

If your SLO is 99.9% availability, you have a 0.1% error budget. This means you can afford ~43 minutes of downtime per month. Error budgets balance reliability with development velocity:

  • If the error budget is healthy, teams ship features faster.
  • If the budget is nearly exhausted, teams focus on reliability work.
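The error-budget arithmetic for the 99.9% example is a one-liner (a 30-day month is assumed here):

```python
def error_budget_minutes(slo, days=30):
    """Downtime allowed per month at a given availability SLO."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)
print(round(budget, 1))  # 43.2 -- the "~43 minutes" quoted above
```

Tightening the SLO by one nine (99.99%) shrinks the budget tenfold, to about 4.3 minutes per month, which is why each additional nine is dramatically more expensive to operate.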

Alerting

Alert Design Principles

  • Alert on symptoms, not causes. Alert on "error rate > 5%" (symptom), not "CPU > 80%" (cause). Users care about symptoms.
  • Every alert must be actionable. If a human cannot take action, it should not be an alert: make it a dashboard metric instead.
  • Avoid alert fatigue. Too many alerts desensitize on-call engineers. Ruthlessly prune noisy alerts.
  • Use multiple severity levels. Critical (page immediately), Warning (investigate soon), Info (awareness only).

Alerting Pipeline

Metric collected (Prometheus) -> Alerting rule evaluated (e.g., error_rate > 0.05 for 5m) -> Alert fires (Alertmanager) -> Notification routed (PagerDuty, Slack, email) -> On-call engineer responds -> Incident managed, postmortem written
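The "for 5m" clause in the rule above means the condition must hold continuously before the alert fires, which suppresses brief spikes. A minimal sketch of that evaluation logic (names and fixed sampling interval are assumptions):

```python
def should_fire(samples, threshold=0.05, for_intervals=5):
    """samples: error-rate readings at a fixed interval, oldest first.
    Fires only if the last `for_intervals` readings all exceed the threshold."""
    recent = samples[-for_intervals:]
    return len(recent) == for_intervals and all(s > threshold for s in recent)

print(should_fire([0.01, 0.08, 0.02, 0.09, 0.08]))  # False: transient spikes
print(should_fire([0.06, 0.07, 0.09, 0.08, 0.06]))  # True: sustained breach
```

This debouncing is a major defense against alert fatigue: only sustained symptoms page a human.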

Observability Stack Example

  • Metrics: Prometheus + Grafana (open source); Datadog, New Relic, CloudWatch (managed/SaaS).
  • Logs: ELK (Elasticsearch, Logstash, Kibana), Loki (open source); Splunk, Datadog Logs, CloudWatch Logs (managed/SaaS).
  • Traces: Jaeger, Zipkin, Grafana Tempo (open source); AWS X-Ray, Datadog APM, Honeycomb (managed/SaaS).
  • Instrumentation: OpenTelemetry unified SDK (open source); vendor-specific agents (managed/SaaS).
  • Alerting: Alertmanager, Grafana Alerting (open source); PagerDuty, Opsgenie, VictorOps (managed/SaaS).

OpenTelemetry

OpenTelemetry (OTel) is the emerging standard for instrumentation. It provides a single SDK that generates metrics, logs, and traces in a vendor-neutral format. Instrument your code with OTel once, and export to any backend (Prometheus, Jaeger, Datadog, etc.).

Key Takeaways

  • The three pillars of observability are metrics (numeric time-series), logs (discrete events), and traces (request paths across services).
  • Use the USE method for infrastructure resources and the RED method for service endpoints.
  • Define SLIs and SLOs to measure reliability objectively. Use error budgets to balance velocity and reliability.
  • Alert on symptoms, make every alert actionable, and guard against alert fatigue.
  • Use structured logging and distributed tracing with correlation IDs to debug issues across microservices.
  • OpenTelemetry provides vendor-neutral instrumentation for all three pillars.
