Monitoring, Logging & Observability

Why Observability Matters

The Problem: In a distributed system with dozens of microservices, when something breaks at 3 AM, you need to quickly answer: What is broken? Where is it broken? Why did it break?

The Solution: Observability gives you the tools to understand your system's internal state from its external outputs: metrics, logs, and traces.

Real Impact: Google's SRE team uses the "Four Golden Signals" (latency, traffic, errors, saturation) to monitor every production service, enabling them to achieve 99.99%+ uptime across their global infrastructure.

Real-World Analogy

Think of observability like a hospital monitoring system:

  • Metrics = Vital signs monitor (heart rate, blood pressure, temperature) - continuous numerical measurements that show trends and alert on thresholds
  • Logs = Medical chart notes - detailed records of what happened and when, useful for investigation after an incident
  • Traces = Following a patient through the hospital - tracking them from admission through triage, lab work, specialist consultation, and discharge to find bottlenecks

Monitoring vs Observability

Monitoring tells you when something is wrong (alerts fire). Observability helps you figure out why it is wrong and what to do about it. Monitoring is a subset of observability. You need both, but observability is the broader goal.

The Three Pillars of Observability

  • Metrics answer "what is happening?": counters (requests, errors), gauges (CPU, memory), histograms (latency p50/p99), and summaries (quantiles). Tools: Prometheus, Grafana, Datadog.
  • Logs answer "why did it happen?": structured JSON logs, error stack traces, audit trails, request/response details. Tools: ELK Stack, Loki, Splunk.
  • Traces answer "where is the bottleneck?": request flow across services, span timing breakdowns, dependency mapping, latency attribution. Tools: Jaeger, Zipkin, OpenTelemetry.

All three signals are correlated via trace IDs and timestamps.

Metrics and Dashboards

The Four Golden Signals (Google SRE)

Latency

Time to serve a request. Track p50 (median), p95, and p99. Distinguish between successful and failed requests - a fast 500 error is not a "low latency success."

Traffic

Demand on your system. HTTP requests per second, transactions per second, or messages per second. Helps capacity planning and anomaly detection.

Errors

Rate of failed requests. Explicit errors (HTTP 500s), implicit errors (200 with wrong content), or policy violations (responses slower than SLA).

Saturation

How full is your service? CPU utilization, memory usage, disk I/O, queue depth. Most services degrade gracefully before hitting 100%.

prometheus_metrics.py
# Instrumenting a Flask app with Prometheus metrics
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest
)
from flask import Flask, request, Response
import time

app = Flask(__name__)

# Define metrics (the Four Golden Signals)

# 1. Traffic - request counter
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

# 2. Latency - response time histogram
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[.005, .01, .025, .05, .1,
             .25, .5, 1, 2.5, 5, 10]
)

# 3. Errors - tracked via REQUEST_COUNT status label

# 4. Saturation - active requests gauge
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests in progress"
)

@app.before_request
def before_request():
    request.start_time = time.time()
    IN_PROGRESS.inc()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    IN_PROGRESS.dec()
    return response

@app.route("/metrics")
def metrics():
    return Response(
        generate_latest(),
        mimetype="text/plain"
    )

Structured Logging

Structured logs are machine-parseable (JSON) instead of free-form text. This enables powerful querying, aggregation, and correlation in log management systems.

  • Format - unstructured: 2024-01-15 ERROR Payment failed for user 42; structured: {"time":"...","level":"error","user_id":42,"msg":"Payment failed"}
  • Searchability - unstructured: regex-based and fragile; structured: field-based queries such as user_id=42 AND level=error
  • Aggregation - unstructured: hard to group or count; structured: easy to count errors by service, endpoint, or user
  • Correlation - unstructured: manual text matching; structured: join on trace_id and request_id fields
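The field-based querying that log systems provide can be demonstrated in a few lines of plain Python; the log lines below are hypothetical:

```python
import json

# Hypothetical structured log lines, one JSON object per line
log_lines = [
    '{"level":"error","user_id":42,"msg":"Payment failed"}',
    '{"level":"info","user_id":42,"msg":"Cart updated"}',
    '{"level":"error","user_id":7,"msg":"Payment failed"}',
]

# Field-based query: user_id=42 AND level=error, no regex required
matches = [
    entry for entry in map(json.loads, log_lines)
    if entry["user_id"] == 42 and entry["level"] == "error"
]
```

With unstructured text, the same query would need a regex that breaks whenever the message format changes; here the fields are first-class values.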
structured_logging.py
# Structured logging with Python's structlog
import structlog
import uuid

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    logger_factory=structlog.PrintLoggerFactory(),
)

def create_request_logger(request):
    """Create a logger bound with request context."""
    return structlog.get_logger().bind(
        request_id=str(uuid.uuid4()),
        trace_id=request.headers.get("X-Trace-ID", "unknown"),
        user_id=request.user_id,
        service="payment-service",
        endpoint=request.path,
    )

# Usage in a request handler
def process_payment(request):
    log = create_request_logger(request)

    log.info("payment_started",
             amount=request.amount,
             currency=request.currency)

    try:
        result = charge_card(request.card_token, request.amount)
        log.info("payment_succeeded",
                 transaction_id=result.id,
                 amount=request.amount,
                 latency_ms=result.latency_ms)
        return result
    except PaymentDeclinedError as e:
        log.warning("payment_declined",
                    reason=e.reason,
                    card_last4=request.card_last4)
        raise
    except Exception as e:
        log.error("payment_error",
                  error_type=type(e).__name__,
                  error_msg=str(e),
                  exc_info=True)
        raise

# Output (one JSON line per log entry):
# {"timestamp":"2024-01-15T10:30:00Z","level":"info",
#  "event":"payment_started","request_id":"abc-123",
#  "trace_id":"trace-456","user_id":42,
#  "service":"payment-service","amount":29.99}

Distributed Tracing

In a microservices architecture, a single user request might touch 5-10 services. Distributed tracing follows the entire journey of a request, showing exactly which service is slow and why.

Key Tracing Concepts

  • Trace: The complete journey of a request through the system (one trace per user request)
  • Span: A single unit of work within a trace (one span per service call, DB query, etc.)
  • Trace ID: Unique identifier propagated across all services for correlation
  • Parent Span ID: Links child spans to their parent, forming a tree
  • Baggage: Key-value pairs propagated through the entire trace (e.g., user tier, region)
Example trace: order checkout flow (span timings)

  • API Gateway: 187ms (root span)
  • Auth Service: 15ms
  • Order Service: 142ms
  • Inventory Service: 28ms
  • Payment Service: 95ms (slow)
  • Payment DB: 72ms (the bottleneck)
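The concepts above can be sketched with a toy tracer in plain Python. This is only to make trace IDs and parent spans concrete; in practice you would use OpenTelemetry rather than rolling your own:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                    # shared by every span in one trace
    parent_id: Optional[str] = None  # links child spans into a tree
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

    def finish(self) -> None:
        self.duration_ms = (time.monotonic() - self.start) * 1000

def start_trace(name: str) -> Span:
    """Root span: fresh trace ID, no parent."""
    return Span(name=name, trace_id=uuid.uuid4().hex)

def start_child(parent: Span, name: str) -> Span:
    """Child span: inherits the trace ID, records its parent span."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)

# One trace for a checkout request, with two child spans
root = start_trace("POST /checkout")
db = start_child(root, "db.query")
db.finish()
payment = start_child(root, "payment.charge")
payment.finish()
root.finish()
```

Every span carries the same trace_id, so a backend like Jaeger can reassemble the tree; the parent_id links give the waterfall view, and the durations show where the time went.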

Alerting Best Practices

  • Alert on symptoms - good: "Error rate > 1% for 5 minutes"; bad: "CPU > 80%"
  • Include context - good: alert with runbook link, dashboard URL, and recent deploys; bad: "Something is wrong"
  • Avoid alert fatigue - good: page only for customer-impacting issues; bad: page for every minor anomaly
  • Use severity levels - good: P1 (page), P2 (ticket), P3 (dashboard); bad: everything is P1
  • Burn rate alerts - good: "Consuming error budget 10x faster than expected"; bad: paging on a single error

SLIs, SLOs, and SLAs

  • SLI (Service Level Indicator): A measured indicator of service quality (e.g., the percentage of requests that complete in under 200ms, currently 99.2%)
  • SLO (Service Level Objective): The target value for an SLI (e.g., "99.9% availability per month")
  • SLA (Service Level Agreement): A contractual commitment with consequences for missing SLOs (e.g., "If uptime drops below 99.9%, customer gets credits")
  • Error Budget: The allowed amount of unreliability. If SLO is 99.9%, your error budget is 0.1% (~43 minutes/month). Spend it wisely on feature releases.
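The error-budget arithmetic in the last bullet is easy to check. A small sketch (the 30-day window and the 14.4x burn-rate threshold are the conventional SRE choices, not from this document's code):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """At a given burn rate, hours until the entire budget is spent."""
    return window_days * 24 / burn_rate

budget = error_budget_minutes(0.999)   # ~43.2 minutes per 30 days
fast_burn = hours_to_exhaustion(14.4)  # ~50 hours at a 14.4x burn rate
```

This is why a 14.4x burn rate is page-worthy: at that pace the whole month's budget is gone in about two days.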

Practice Problems

Design a Monitoring Dashboard (Medium)

You are running an e-commerce checkout service. Design a Grafana dashboard with panels for:

  1. The four golden signals for the checkout API
  2. Business metrics (orders per minute, revenue)
  3. Dependency health (database, payment gateway, inventory service)

Use rate() for counters, histogram_quantile() for latency percentiles. Include both technical and business metrics on the same dashboard.

# Dashboard panels (PromQL queries)

# 1. LATENCY - p50, p95, p99
# histogram_quantile(0.99,
#   rate(http_request_duration_seconds_bucket{
#     endpoint="/checkout"}[5m]))

# 2. TRAFFIC - requests per second
# rate(http_requests_total{
#   endpoint="/checkout"}[1m])

# 3. ERRORS - error rate percentage
# rate(http_requests_total{
#   endpoint="/checkout",status=~"5.."}[5m])
# / rate(http_requests_total{
#   endpoint="/checkout"}[5m]) * 100

# 4. SATURATION - active connections
# http_requests_in_progress

# 5. BUSINESS - orders per minute
# rate(orders_completed_total[1m]) * 60

# 6. DEPENDENCY - payment gateway latency
# histogram_quantile(0.95,
#   rate(external_call_duration_seconds_bucket{
#     service="payment_gateway"}[5m]))

# 7. DEPENDENCY - DB connection pool usage
# db_connections_active / db_connections_max

Troubleshoot with Traces (Medium)

Users report the checkout page is slow (taking 3+ seconds). You have access to Jaeger traces. Describe your debugging process:

  1. How do you find the relevant traces?
  2. What patterns would indicate a database bottleneck?
  3. What patterns would indicate a downstream service issue?

Filter traces by endpoint and minimum duration. Look at the waterfall view to see which span takes the most time. Check if the slow span is a DB call, network call, or CPU computation.

# Debugging process:

# 1. FIND SLOW TRACES
# In Jaeger: service=checkout, operation=POST /checkout
# Filter: minDuration=3s
# Sort by duration descending

# 2. DATABASE BOTTLENECK indicators:
# - Multiple DB spans, each taking 500ms+
# - N+1 query pattern (100 small DB spans)
# - Long gaps between spans (connection pool wait)
# Action: check slow query log, add indexes

# 3. DOWNSTREAM SERVICE indicators:
# - Single span to payment-service taking 2.5s
# - Many retries visible in trace
# - Timeouts on external calls
# Action: check payment service health,
#         consider circuit breaker, add timeout

# 4. CROSS-REFERENCE with logs:
# Copy trace_id from Jaeger
# Search in Kibana: trace_id="abc-123"
# Find error messages and context

Design an Alerting Strategy (Hard)

Your team is experiencing alert fatigue (50+ pages per week). Redesign the alerting strategy:

  1. Define severity levels and escalation paths
  2. Convert existing CPU/memory alerts to symptom-based alerts
  3. Implement error budget-based alerting

Only page for customer-facing impact. Use burn rate alerts: if you are spending your monthly error budget 10x too fast, that is a P1. CPU at 80% is only a P3 (dashboard) unless it causes latency degradation.

# Alerting Strategy Redesign

# SEVERITY LEVELS:
# P1 (Page on-call): Customer-impacting NOW
#   - Error rate > 1% for 5 min
#   - p99 latency > 5s for 5 min
#   - Complete outage of any service

# P2 (Create ticket): Degraded but not critical
#   - Error rate > 0.5% for 15 min
#   - Elevated latency (p95 > 2s)
#   - Disk usage > 85%

# P3 (Dashboard only): Worth watching
#   - CPU > 80% (not customer-impacting)
#   - Memory > 75%
#   - Unusual traffic patterns

# ERROR BUDGET ALERTS:
# SLO: 99.9% success rate (error budget: 0.1%)
# Monthly budget: ~43 minutes of downtime

# Alert if burn rate exceeds:
# - 14.4x in 1 hour   -> P1 (page)
# - 6x in 6 hours     -> P2 (ticket)
# - 1x over 3 days    -> P3 (review)

Quick Reference

Observability Tool Stack

  • Metrics - open source: Prometheus + Grafana; commercial: Datadog, New Relic; cloud-native: CloudWatch, Stackdriver
  • Logs - open source: ELK Stack, Loki; commercial: Splunk, Sumo Logic; cloud-native: CloudWatch Logs
  • Traces - open source: Jaeger, Zipkin; commercial: Datadog APM, Lightstep; cloud-native: X-Ray, Cloud Trace
  • All-in-one - open source: OpenTelemetry (collector); commercial: Datadog, Dynatrace; cloud-native: Azure Monitor

Key Concepts Summary

Observability Checklist

  • Instrument early: Add metrics, logs, and traces from day one, not after an outage
  • Use OpenTelemetry: Vendor-neutral standard for instrumentation
  • Correlate signals: Use trace IDs to link metrics, logs, and traces together
  • Define SLOs before alerting: Know your targets before setting thresholds
  • Practice runbooks: Every alert should have a documented response procedure
  • Conduct blameless postmortems: After incidents, focus on improving systems, not blaming people