Why Observability Matters
The Problem: In a distributed system with dozens of microservices, when something breaks at 3 AM, you need to quickly answer: What is broken? Where is it broken? Why did it break?
The Solution: Observability gives you the tools to understand your system's internal state from its external outputs: metrics, logs, and traces.
Real Impact: Google's SRE practice popularized the "Four Golden Signals" (latency, traffic, errors, saturation) for monitoring production services, helping teams sustain availability targets of 99.99% and above.
Real-World Analogy
Think of observability like a hospital monitoring system:
- Metrics = Vital signs monitor (heart rate, blood pressure, temperature) - continuous numerical measurements that show trends and alert on thresholds
- Logs = Medical chart notes - detailed records of what happened and when, useful for investigation after an incident
- Traces = Following a patient through the hospital - tracking them from admission through triage, lab work, specialist consultation, and discharge to find bottlenecks
Monitoring vs Observability
Monitoring tells you when something is wrong (alerts fire). Observability helps you figure out why it is wrong and what to do about it. Monitoring is a subset of observability. You need both, but observability is the broader goal.
The Three Pillars of Observability
Metrics and Dashboards
The Four Golden Signals (Google SRE)
Latency
Time to serve a request. Track p50 (median), p95, and p99. Distinguish between successful and failed requests - a fast 500 error is not a "low latency success."
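Percentiles matter because averages hide tail latency. A toy sketch of the math (nearest-rank method on made-up latency samples; Prometheus instead estimates quantiles from histogram buckets):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value such that at least
    p% of the samples are less than or equal to it."""
    xs = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)  # 1-based rank -> 0-based index
    return xs[k]

# Hypothetical latencies in ms: 95% fast requests, 5% slow ones
latencies = [10] * 95 + [500] * 5

print(percentile(latencies, 50))        # -> 10 (median looks healthy)
print(percentile(latencies, 95))        # -> 10 (still healthy)
print(percentile(latencies, 99))        # -> 500 (p99 exposes the tail)
print(sum(latencies) / len(latencies))  # -> 34.5 (mean hides the shape)
```

Note how p50 and even p95 look fine while p99 reveals that 1 in 20 users waits half a second.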
Traffic
Demand on your system. HTTP requests per second, transactions per second, or messages per second. Helps capacity planning and anomaly detection.
Errors
Rate of failed requests. Explicit errors (HTTP 500s), implicit errors (200 with wrong content), or policy violations (responses slower than SLA).
Saturation
How full is your service? CPU utilization, memory usage, disk I/O, queue depth. Most services degrade gracefully before hitting 100%.
# Instrumenting a Flask app with Prometheus metrics
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest
)
from flask import Flask, request, Response
import time

app = Flask(__name__)

# Define metrics (the Four Golden Signals)

# 1. Traffic - request counter
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

# 2. Latency - response time histogram
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[.005, .01, .025, .05, .1,
             .25, .5, 1, 2.5, 5, 10]
)

# 3. Errors - tracked via REQUEST_COUNT status label

# 4. Saturation - active requests gauge
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests in progress"
)

@app.before_request
def before_request():
    request.start_time = time.time()
    IN_PROGRESS.inc()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    IN_PROGRESS.dec()
    return response

@app.route("/metrics")
def metrics():
    return Response(
        generate_latest(),
        mimetype="text/plain"
    )
Structured Logging
Structured logs are machine-parseable (JSON) instead of free-form text. This enables powerful querying, aggregation, and correlation in log management systems.
| Aspect | Unstructured | Structured (JSON) |
|---|---|---|
| Format | 2024-01-15 ERROR Payment failed for user 42 | {"time":"...","level":"error","user_id":42,"msg":"Payment failed"} |
| Searchability | Regex-based, fragile | Field-based queries: user_id=42 AND level=error |
| Aggregation | Hard to group or count | Easy: count errors by service, endpoint, user |
| Correlation | Manual text matching | Join on trace_id, request_id fields |
# Structured logging with Python's structlog
import structlog
import uuid

# Configure structured logging
# (format_exc_info renders exc_info=True as a traceback string)
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    logger_factory=structlog.PrintLoggerFactory(),
)

def create_request_logger(request):
    """Create a logger bound with request context."""
    return structlog.get_logger().bind(
        request_id=str(uuid.uuid4()),
        trace_id=request.headers.get("X-Trace-ID", "unknown"),
        user_id=request.user_id,
        service="payment-service",
        endpoint=request.path,
    )

# Usage in a request handler
def process_payment(request):
    log = create_request_logger(request)
    log.info("payment_started",
             amount=request.amount,
             currency=request.currency)
    try:
        result = charge_card(request.card_token, request.amount)
        log.info("payment_succeeded",
                 transaction_id=result.id,
                 amount=request.amount,
                 latency_ms=result.latency_ms)
        return result
    except PaymentDeclinedError as e:
        log.warning("payment_declined",
                    reason=e.reason,
                    card_last4=request.card_last4)
        raise
    except Exception as e:
        log.error("payment_error",
                  error_type=type(e).__name__,
                  error_msg=str(e),
                  exc_info=True)
        raise

# Output (one JSON line per log entry):
# {"timestamp":"2024-01-15T10:30:00Z","level":"info",
#  "event":"payment_started","request_id":"abc-123",
#  "trace_id":"trace-456","user_id":42,
#  "service":"payment-service","amount":29.99}
Distributed Tracing
In a microservices architecture, a single user request might touch 5-10 services. Distributed tracing follows the entire journey of a request, showing exactly which service is slow and why.
Key Tracing Concepts
- Trace: The complete journey of a request through the system (one trace per user request)
- Span: A single unit of work within a trace (one span per service call, DB query, etc.)
- Trace ID: Unique identifier propagated across all services for correlation
- Parent Span ID: Links child spans to their parent, forming a tree
- Baggage: Key-value pairs propagated through the entire trace (e.g., user tier, region)
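To make these concepts concrete, here is a toy sketch of how spans form a tree: every span carries the shared trace ID, and each child records its parent's span ID. This is plain Python for illustration, not a real tracing library such as OpenTelemetry:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                          # shared by every span in the trace
    parent_span_id: Optional[str] = None   # links a child to its parent
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self):
        self.end = time.time()

def start_trace(name):
    """Root span: fresh trace_id, no parent."""
    return Span(name=name, trace_id=uuid.uuid4().hex)

def start_child(parent, name):
    """Child span: inherits trace_id, records parent_span_id."""
    return Span(name=name, trace_id=parent.trace_id,
                parent_span_id=parent.span_id)

# One user request -> one trace; each downstream call -> one span
root = start_trace("POST /checkout")
db = start_child(root, "SELECT inventory")
db.finish()
payment = start_child(root, "payment-service.charge")
payment.finish()
root.finish()

# All spans share the trace ID; parent IDs form the tree
assert db.trace_id == payment.trace_id == root.trace_id
assert payment.parent_span_id == root.span_id
```

In a real system the trace ID and parent span ID travel between services in request headers (e.g. the W3C traceparent header), which is exactly the propagation that lets Jaeger reassemble the tree.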
Alerting Best Practices
| Principle | Good | Bad |
|---|---|---|
| Alert on symptoms | "Error rate > 1% for 5 minutes" | "CPU > 80%" |
| Include context | Alert with runbook link, dashboard URL, recent deploys | "Something is wrong" |
| Avoid alert fatigue | Page only for customer-impacting issues | Page for every minor anomaly |
| Use severity levels | P1 (page), P2 (ticket), P3 (dashboard) | Everything is P1 |
| Burn rate alerts | "Consuming error budget 10x faster than expected" | "Single error occurred" |
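The "alert on symptoms" row can be sketched as code. A minimal evaluator (plain Python over hypothetical per-minute counts; a real deployment would express this as a Prometheus alerting rule with a `for: 5m` clause) that pages only when the error rate stays above 1% for the whole window:

```python
def should_page(window, threshold=0.01):
    """window: list of (total_requests, errors) per minute, most recent last.
    Page only if the error rate exceeded the threshold in EVERY minute of
    the window - a sustained symptom, not a single blip."""
    if not window:
        return False
    return all(errors / total > threshold for total, errors in window)

# Five minutes of 1000 req/min with a sustained 2% error rate -> page
assert should_page([(1000, 20)] * 5) is True

# One bad minute surrounded by four healthy ones -> no page
assert should_page([(1000, 2)] * 4 + [(1000, 50)]) is False
```

The "for 5 minutes" condition is what separates a symptom worth waking someone for from transient noise.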
SLIs, SLOs, and SLAs
- SLI (Service Level Indicator): A metric that measures service quality (e.g., the fraction of requests completing in under 200ms, currently measuring 99.2%)
- SLO (Service Level Objective): The target value for an SLI (e.g., "99.9% availability per month")
- SLA (Service Level Agreement): A contractual commitment with consequences for missing SLOs (e.g., "If uptime drops below 99.9%, customer gets credits")
- Error Budget: The allowed amount of unreliability. If SLO is 99.9%, your error budget is 0.1% (~43 minutes/month). Spend it wisely on feature releases.
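The error-budget arithmetic above can be checked in a few lines. This is a sketch using a 30-day window; the 14.4x figure echoes Google's multiwindow burn-rate alerting, here simply recomputed:

```python
# SLO: 99.9% availability over a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in the window

# Error budget: the 0.1% of the window you are allowed to fail
budget_minutes = (1 - slo) * window_minutes
print(round(budget_minutes, 1))          # -> 43.2 minutes per month

# Burn rate: how fast you consume budget relative to plan.
# At burn rate 1.0 the budget lasts exactly one window.
def hours_until_exhausted(burn_rate):
    return (window_minutes / burn_rate) / 60

print(round(hours_until_exhausted(14.4), 1))  # -> 50.0 hours, about 2 days
```

At a 14.4x burn rate a month's budget is gone in roughly two days, which is why that threshold warrants a page rather than a ticket.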
Practice Problems
Medium: Design a Monitoring Dashboard
You are running an e-commerce checkout service. Design a Grafana dashboard with panels for:
- The four golden signals for the checkout API
- Business metrics (orders per minute, revenue)
- Dependency health (database, payment gateway, inventory service)
Use rate() for counters, histogram_quantile() for latency percentiles. Include both technical and business metrics on the same dashboard.
# Dashboard panels (PromQL queries)
# 1. LATENCY - p50, p95, p99
# histogram_quantile(0.99,
# rate(http_request_duration_seconds_bucket{
# endpoint="/checkout"}[5m]))
# 2. TRAFFIC - requests per second
# rate(http_requests_total{
# endpoint="/checkout"}[1m])
# 3. ERRORS - error rate percentage
# rate(http_requests_total{
# endpoint="/checkout",status=~"5.."}[5m])
# / rate(http_requests_total{
# endpoint="/checkout"}[5m]) * 100
# 4. SATURATION - active connections
# http_requests_in_progress
# 5. BUSINESS - orders per minute
# rate(orders_completed_total[1m]) * 60
# 6. DEPENDENCY - payment gateway latency
# histogram_quantile(0.95,
# rate(external_call_duration_seconds_bucket{
# service="payment_gateway"}[5m]))
# 7. DEPENDENCY - DB connection pool usage
# db_connections_active / db_connections_max
Medium: Troubleshoot with Traces
Users report the checkout page is slow (taking 3+ seconds). You have access to Jaeger traces. Describe your debugging process:
- How do you find the relevant traces?
- What patterns would indicate a database bottleneck?
- What patterns would indicate a downstream service issue?
Filter traces by endpoint and minimum duration. Look at the waterfall view to see which span takes the most time. Check if the slow span is a DB call, network call, or CPU computation.
# Debugging process:
# 1. FIND SLOW TRACES
# In Jaeger: service=checkout, operation=POST /checkout
# Filter: minDuration=3s
# Sort by duration descending
# 2. DATABASE BOTTLENECK indicators:
# - Multiple DB spans, each taking 500ms+
# - N+1 query pattern (100 small DB spans)
# - Long gaps between spans (connection pool wait)
# Action: check slow query log, add indexes
# 3. DOWNSTREAM SERVICE indicators:
# - Single span to payment-service taking 2.5s
# - Many retries visible in trace
# - Timeouts on external calls
# Action: check payment service health,
# consider circuit breaker, add timeout
# 4. CROSS-REFERENCE with logs:
# Copy trace_id from Jaeger
# Search in Kibana: trace_id="abc-123"
# Find error messages and context
Hard: Design an Alerting Strategy
Your team is experiencing alert fatigue (50+ pages per week). Redesign the alerting strategy:
- Define severity levels and escalation paths
- Convert existing CPU/memory alerts to symptom-based alerts
- Implement error budget-based alerting
Only page for customer-facing impact. Use burn rate alerts: if you are spending your monthly error budget 10x too fast, that is a P1. CPU at 80% is only a P3 (dashboard) unless it causes latency degradation.
# Alerting Strategy Redesign
# SEVERITY LEVELS:
# P1 (Page on-call): Customer-impacting NOW
# - Error rate > 1% for 5 min
# - p99 latency > 5s for 5 min
# - Complete outage of any service
# P2 (Create ticket): Degraded but not critical
# - Error rate > 0.5% for 15 min
# - Elevated latency (p95 > 2s)
# - Disk usage > 85%
# P3 (Dashboard only): Worth watching
# - CPU > 80% (not customer-impacting)
# - Memory > 75%
# - Unusual traffic patterns
# ERROR BUDGET ALERTS:
# SLO: 99.9% success rate (error budget: 0.1%)
# Monthly budget: ~43 minutes of downtime
# Alert if burn rate exceeds:
# - 14.4x in 1 hour -> P1 (page)
# - 6x in 6 hours -> P2 (ticket)
# - 1x over 3 days -> P3 (review)
Quick Reference
Observability Tool Stack
| Pillar | Open Source | Commercial | Cloud-Native |
|---|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic | CloudWatch, Stackdriver |
| Logs | ELK Stack, Loki | Splunk, Sumo Logic | CloudWatch Logs |
| Traces | Jaeger, Zipkin | Datadog APM, Lightstep | X-Ray, Cloud Trace |
| All-in-one | OpenTelemetry (collector) | Datadog, Dynatrace | Azure Monitor |
Observability Checklist
- Instrument early: Add metrics, logs, and traces from day one, not after an outage
- Use OpenTelemetry: Vendor-neutral standard for instrumentation
- Correlate signals: Use trace IDs to link metrics, logs, and traces together
- Define SLOs before alerting: Know your targets before setting thresholds
- Practice runbooks: Every alert should have a documented response procedure
- Conduct blameless postmortems: After incidents, focus on improving systems, not blaming people