# Monitoring vs. Observability

## Monitoring
- Answers: "Is the system healthy?"
- Predefined dashboards, alerts, and thresholds.
- Reactive: detects known failure modes.
- Good for known-unknowns.
## Observability
- Answers: "Why is the system unhealthy?"
- Ability to ask arbitrary questions of the system.
- Proactive: debug novel, unexpected failures.
- Essential for unknown-unknowns.
## The Three Pillars

### Metrics
Metrics are numeric values measured over time. They are lightweight, aggregatable, and ideal for dashboards and alerts.
- Counter: Monotonically increasing value (total requests, total errors). Only goes up.
- Gauge: Value that can go up or down (current memory usage, active connections, queue depth).
- Histogram: Distribution of values (request latency in buckets: 0-10ms, 10-50ms, 50-100ms, etc.).
- Summary: Pre-calculated quantiles (p50, p95, p99 latency) computed on the client side.
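The semantics of these metric types can be sketched in a few lines of plain Python. This is illustrative only; a real client library (e.g., prometheus_client) additionally handles labels, concurrency, and exposition formats.

```python
class Counter:
    """Monotonically increasing value: total requests, total errors."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters only go up"
        self.value += amount

class Gauge:
    """Point-in-time value that can rise or fall: memory, connections."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into cumulative latency buckets."""
    def __init__(self, buckets=(10, 50, 100, float("inf"))):
        self.buckets = buckets
        self.counts = {b: 0 for b in buckets}
    def observe(self, value):
        for b in self.buckets:
            if value <= b:  # cumulative: an observation lands in every bucket >= it
                self.counts[b] += 1

requests = Counter()
requests.inc()
latency = Histogram()
latency.observe(42)  # lands in the 50ms, 100ms, and +Inf buckets, but not 10ms
```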
#### The USE Method (for infrastructure)
For every resource (CPU, memory, disk, network), check:
- Utilization: What percentage of the resource is in use?
- Saturation: How much extra work is queued?
- Errors: How many errors have occurred?
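Applied to a single resource, the USE method reduces to three numbers per check. A minimal sketch, using a hypothetical snapshot of a disk device:

```python
# Hypothetical metrics snapshot for one disk device.
snapshot = {
    "capacity_iops": 500,   # what the device can sustain
    "current_iops": 430,    # work currently in flight
    "queued_requests": 12,  # work waiting for the device
    "io_errors": 0,         # errors since the last check
}

utilization = snapshot["current_iops"] / snapshot["capacity_iops"]  # fraction in use
saturation = snapshot["queued_requests"]                            # extra work queued
errors = snapshot["io_errors"]

print(f"U={utilization:.0%} S={saturation} E={errors}")  # U=86% S=12 E=0
```

High saturation with moderate utilization often signals trouble earlier than utilization alone, which is why the method tracks both.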
#### The RED Method (for services)
For every service endpoint, track:
- Rate: Requests per second.
- Errors: Failed requests per second.
- Duration: Latency distribution (p50, p95, p99).
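RED metrics can be derived from a window of request records. A sketch with invented sample data (in practice these come from a metrics pipeline, and percentiles from histograms, not sorted lists):

```python
# Hypothetical one-minute window of (status_code, latency_ms) records.
window_seconds = 60
requests = [
    (200, 12), (200, 18), (500, 950), (200, 25), (404, 8),
]

rate = len(requests) / window_seconds                            # requests/second
error_rate = sum(1 for s, _ in requests if s >= 500) / window_seconds

latencies = sorted(ms for _, ms in requests)                     # [8, 12, 18, 25, 950]
p50 = latencies[len(latencies) // 2]                             # crude median: 18
p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]  # 950
```

Note how a single slow failing request dominates p95 while leaving p50 untouched; this is why tail latencies are tracked separately.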
### Logs
Logs record discrete events with rich context. They are essential for debugging but expensive at scale.
```json
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span-789",
  "user_id": "u-42",
  "message": "Payment processing failed",
  "error": "CardDeclined",
  "latency_ms": 1250,
  "amount": 59.99,
  "currency": "USD"
}
```

Always use structured (JSON) logs rather than free-text logs. Structured logs can be indexed, searched, and aggregated by machines. Include correlation IDs (trace_id, request_id) to link logs across services.
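A structured logger like the one above can be sketched with Python's standard `logging` module and a JSON formatter. The service name and the set of extra fields here are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including correlation fields."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payment-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Merge correlation/context fields passed via logging's `extra=` kwarg.
        for key in ("trace_id", "user_id", "error"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payments")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123def456", "error": "CardDeclined"})
```

Because every line is a JSON object, a log backend can filter on `trace_id` and stitch together one request's logs across every service it touched.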
### Distributed Traces
A trace follows a single request as it traverses multiple services. Each service creates a "span" with a start time, duration, and metadata. Spans are linked by a shared trace ID.
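The span/trace relationship can be sketched with a toy span model (real tracing SDKs like OpenTelemetry add context propagation, sampling, and export; the class here is invented for illustration):

```python
import time
import uuid

class Span:
    """A toy span: a trace is a tree of spans sharing one trace_id."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole request
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent_id
        self.start = time.monotonic()
        self.duration_ms = None

    def child(self, name):
        # A downstream service would receive trace_id/span_id via request
        # headers and create its own span from them.
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000

root = Span("checkout")              # e.g. at the API gateway
payment = root.child("charge-card")  # e.g. in the payment service
payment.finish()
root.finish()
```

A trace viewer reassembles the tree by grouping spans on `trace_id` and linking each span's `parent_id` to its caller.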
Trace context must be propagated to downstream services via request headers (e.g., the W3C `traceparent` header) so that every span joins the same trace.

## SLIs, SLOs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of a specific aspect of service quality. | p99 latency, error rate, availability |
| SLO (Service Level Objective) | A target value or range for an SLI. Internal commitment. | "p99 latency < 200ms", "99.95% availability" |
| SLA (Service Level Agreement) | A contract with consequences for violating the SLO. External commitment to customers. | "99.9% uptime or service credits issued" |
### Error Budgets
If your SLO is 99.9% availability, you have a 0.1% error budget. This means you can afford ~43 minutes of downtime per month. Error budgets balance reliability with development velocity:
- If the error budget is healthy, teams ship features faster.
- If the budget is nearly exhausted, teams focus on reliability work.
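The budget arithmetic is simple enough to write down. A sketch, assuming a 30-day month:

```python
def downtime_budget_minutes(slo_percent, days=30):
    """Convert an availability SLO into allowed downtime per `days`-day period."""
    error_budget = 1 - slo_percent / 100   # e.g. 99.9% -> 0.001
    return days * 24 * 60 * error_budget   # total minutes * budget fraction

print(downtime_budget_minutes(99.9))    # ~43.2 minutes per month
print(downtime_budget_minutes(99.95))   # ~21.6 minutes per month
print(downtime_budget_minutes(99.99))   # ~4.3 minutes per month
```

Each extra "nine" cuts the budget by a factor of ten, which is why tightening an SLO is an expensive engineering decision, not a free marketing claim.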
## Alerting

### Alert Design Principles
- Alert on symptoms, not causes. Alert on "error rate > 5%" (symptom), not "CPU > 80%" (cause). Users care about symptoms.
- Every alert must be actionable. If a human cannot take action, it should not be an alert: make it a dashboard metric instead.
- Avoid alert fatigue. Too many alerts desensitize on-call engineers. Ruthlessly prune noisy alerts.
- Use multiple severity levels. Critical (page immediately), Warning (investigate soon), Info (awareness only).
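One way noisy alerts are pruned in practice is a hold duration: the condition must stay true for a window before the alert fires, as in Prometheus's `for:` clause. A simplified sketch (ignoring scrape gaps and resets):

```python
def should_fire(samples, threshold=0.05, hold_seconds=300):
    """samples: list of (timestamp_seconds, error_rate), oldest first.
    Fire only if every sample in the trailing `hold_seconds` window breaches
    the threshold -- a brief spike alone does not page anyone."""
    if not samples:
        return False
    latest = samples[-1][0]
    window = [v for t, v in samples if t >= latest - hold_seconds]
    return len(window) > 0 and all(v > threshold for v in window)

# One-minute spike, then recovery: should not page.
spike = [(0, 0.01), (60, 0.20), (120, 0.01), (180, 0.01), (240, 0.01), (300, 0.01)]
# Error rate elevated for the full five minutes: should page.
sustained = [(t, 0.10) for t in range(0, 360, 60)]

print(should_fire(spike))      # False
print(should_fire(sustained))  # True
```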
### Alerting Pipeline

```
Metric collected (Prometheus)
  -> Alerting rule evaluated (e.g., error_rate > 0.05 for 5m)
  -> Alert fires (Alertmanager)
  -> Notification routed (PagerDuty, Slack, email)
  -> On-call engineer responds
  -> Incident managed, postmortem written
```

## Observability Stack Example
| Layer | Open Source | Managed / SaaS |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, CloudWatch |
| Logs | ELK (Elasticsearch, Logstash, Kibana), Loki | Splunk, Datadog Logs, CloudWatch Logs |
| Traces | Jaeger, Zipkin, Grafana Tempo | AWS X-Ray, Datadog APM, Honeycomb |
| Instrumentation | OpenTelemetry (unified SDK) | Vendor-specific agents |
| Alerting | Alertmanager, Grafana Alerting | PagerDuty, Opsgenie, VictorOps |
OpenTelemetry (OTel) is the emerging standard for instrumentation. It provides a single SDK that generates metrics, logs, and traces in a vendor-neutral format. Instrument your code with OTel once, and export to any backend (Prometheus, Jaeger, Datadog, etc.).
## Key Takeaways
- The three pillars of observability are metrics (numeric time-series), logs (discrete events), and traces (request paths across services).
- Use the USE method for infrastructure resources and the RED method for service endpoints.
- Define SLIs and SLOs to measure reliability objectively. Use error budgets to balance velocity and reliability.
- Alert on symptoms, make every alert actionable, and guard against alert fatigue.
- Use structured logging and distributed tracing with correlation IDs to debug issues across microservices.
- OpenTelemetry provides vendor-neutral instrumentation for all three pillars.