Why Observability Matters
The Problem: In a distributed system with dozens of microservices, when something breaks at 3 AM, you need to quickly answer: What is broken? Where is it broken? Why did it break?
The Solution: Observability gives you the tools to understand your system's internal state from its external outputs: metrics, logs, and traces.
Real Impact: Google's SRE practice popularized the "Four Golden Signals" (latency, traffic, errors, saturation) for monitoring production services, helping teams sustain availability targets of 99.99% and above.
Real-World Analogy
Think of observability like a hospital monitoring system:
- Metrics = Vital signs monitor (heart rate, blood pressure, temperature) - continuous numerical measurements that show trends and alert on thresholds
- Logs = Medical chart notes - detailed records of what happened and when, useful for investigation after an incident
- Traces = Following a patient through the hospital - tracking them from admission through triage, lab work, specialist consultation, and discharge to find bottlenecks
Monitoring vs Observability
Monitoring tells you when something is wrong (alerts fire). Observability helps you figure out why it is wrong and what to do about it. Monitoring is a subset of observability. You need both, but observability is the broader goal.
The Three Pillars of Observability
Metrics and Dashboards
The Four Golden Signals (Google SRE)
Latency
Time to serve a request. Track p50 (median), p95, and p99. Distinguish between successful and failed requests - a fast 500 error is not a "low latency success."
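Percentiles matter because averages hide tail latency. A toy sketch of the math (nearest-rank method on made-up latency samples; Prometheus instead estimates quantiles from histogram buckets):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value such that at least
    p% of the samples are less than or equal to it."""
    xs = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)  # 1-based rank -> 0-based index
    return xs[k]

# Hypothetical latencies in ms: 95% fast requests, 5% slow ones
latencies = [10] * 95 + [500] * 5

print(percentile(latencies, 50))        # -> 10 (median looks healthy)
print(percentile(latencies, 95))        # -> 10 (still healthy)
print(percentile(latencies, 99))        # -> 500 (p99 exposes the tail)
print(sum(latencies) / len(latencies))  # -> 34.5 (mean hides the shape)
```

Note how p50 and even p95 look fine while p99 reveals that 1 in 20 users waits half a second.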
Traffic
Demand on your system. HTTP requests per second, transactions per second, or messages per second. Helps capacity planning and anomaly detection.
Errors
Rate of failed requests. Explicit errors (HTTP 500s), implicit errors (200 with wrong content), or policy violations (responses slower than SLA).
Saturation
How full is your service? CPU utilization, memory usage, disk I/O, queue depth. Most services degrade gracefully before hitting 100%.
# Instrumenting a Flask app with Prometheus metrics
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest
)
from flask import Flask, request, Response
import time

app = Flask(__name__)

# Define metrics (the Four Golden Signals)

# 1. Traffic - request counter
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"]
)

# 2. Latency - response time histogram
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "endpoint"],
    buckets=[.005, .01, .025, .05, .1,
             .25, .5, 1, 2.5, 5, 10]
)

# 3. Errors - tracked via REQUEST_COUNT status label

# 4. Saturation - active requests gauge
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests in progress"
)

@app.before_request
def before_request():
    request.start_time = time.time()
    IN_PROGRESS.inc()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    IN_PROGRESS.dec()
    return response

@app.route("/metrics")
def metrics():
    return Response(
        generate_latest(),
        mimetype="text/plain"
    )
Structured Logging
Structured logs are machine-parseable (JSON) instead of free-form text. This enables powerful querying, aggregation, and correlation in log management systems.
| Aspect | Unstructured | Structured (JSON) |
|---|---|---|
| Format | 2024-01-15 ERROR Payment failed for user 42 | {"time":"...","level":"error","user_id":42,"msg":"Payment failed"} |
| Searchability | Regex-based, fragile | Field-based queries: user_id=42 AND level=error |
| Aggregation | Hard to group or count | Easy: count errors by service, endpoint, user |
| Correlation | Manual text matching | Join on trace_id, request_id fields |
# Structured logging with Python's structlog
import structlog
import uuid

# Configure structured logging
# (format_exc_info renders exc_info=True as a traceback string)
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    logger_factory=structlog.PrintLoggerFactory(),
)

def create_request_logger(request):
    """Create a logger bound with request context."""
    return structlog.get_logger().bind(
        request_id=str(uuid.uuid4()),
        trace_id=request.headers.get("X-Trace-ID", "unknown"),
        user_id=request.user_id,
        service="payment-service",
        endpoint=request.path,
    )

# Usage in a request handler
def process_payment(request):
    log = create_request_logger(request)
    log.info("payment_started",
             amount=request.amount,
             currency=request.currency)
    try:
        result = charge_card(request.card_token, request.amount)
        log.info("payment_succeeded",
                 transaction_id=result.id,
                 amount=request.amount,
                 latency_ms=result.latency_ms)
        return result
    except PaymentDeclinedError as e:
        log.warning("payment_declined",
                    reason=e.reason,
                    card_last4=request.card_last4)
        raise
    except Exception as e:
        log.error("payment_error",
                  error_type=type(e).__name__,
                  error_msg=str(e),
                  exc_info=True)
        raise

# Output (one JSON line per log entry):
# {"timestamp":"2024-01-15T10:30:00Z","level":"info",
#  "event":"payment_started","request_id":"abc-123",
#  "trace_id":"trace-456","user_id":42,
#  "service":"payment-service","amount":29.99}
Distributed Tracing
In a microservices architecture, a single user request might touch 5-10 services. Distributed tracing follows the entire journey of a request, showing exactly which service is slow and why.
Key Tracing Concepts
- Trace: The complete journey of a request through the system (one trace per user request)
- Span: A single unit of work within a trace (one span per service call, DB query, etc.)
- Trace ID: Unique identifier propagated across all services for correlation
- Parent Span ID: Links child spans to their parent, forming a tree
- Baggage: Key-value pairs propagated through the entire trace (e.g., user tier, region)
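To make these concepts concrete, here is a toy sketch of how spans form a tree: every span carries the shared trace ID, and each child records its parent's span ID. This is plain Python for illustration, not a real tracing library such as OpenTelemetry:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    trace_id: str                          # shared by every span in the trace
    parent_span_id: Optional[str] = None   # links a child to its parent
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def finish(self):
        self.end = time.time()

def start_trace(name):
    """Root span: fresh trace_id, no parent."""
    return Span(name=name, trace_id=uuid.uuid4().hex)

def start_child(parent, name):
    """Child span: inherits trace_id, records parent_span_id."""
    return Span(name=name, trace_id=parent.trace_id,
                parent_span_id=parent.span_id)

# One user request -> one trace; each downstream call -> one span
root = start_trace("POST /checkout")
db = start_child(root, "SELECT inventory")
db.finish()
payment = start_child(root, "payment-service.charge")
payment.finish()
root.finish()

# All spans share the trace ID; parent IDs form the tree
assert db.trace_id == payment.trace_id == root.trace_id
assert payment.parent_span_id == root.span_id
```

In a real system the trace ID and parent span ID travel between services in request headers (e.g. the W3C traceparent header), which is exactly the propagation that lets Jaeger reassemble the tree.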
Alerting Best Practices
| Principle | Good | Bad |
|---|---|---|
| Alert on symptoms | "Error rate > 1% for 5 minutes" | "CPU > 80%" |
| Include context | Alert with runbook link, dashboard URL, recent deploys | "Something is wrong" |
| Avoid alert fatigue | Page only for customer-impacting issues | Page for every minor anomaly |
| Use severity levels | P1 (page), P2 (ticket), P3 (dashboard) | Everything is P1 |
| Burn rate alerts | "Consuming error budget 10x faster than expected" | "Single error occurred" |
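The "alert on symptoms" row can be sketched as code. A minimal evaluator (plain Python over hypothetical per-minute counts; a real deployment would express this as a Prometheus alerting rule with a `for: 5m` clause) that pages only when the error rate stays above 1% for the whole window:

```python
def should_page(window, threshold=0.01):
    """window: list of (total_requests, errors) per minute, most recent last.
    Page only if the error rate exceeded the threshold in EVERY minute of
    the window - a sustained symptom, not a single blip."""
    if not window:
        return False
    return all(errors / total > threshold for total, errors in window)

# Five minutes of 1000 req/min with a sustained 2% error rate -> page
assert should_page([(1000, 20)] * 5) is True

# One bad minute surrounded by four healthy ones -> no page
assert should_page([(1000, 2)] * 4 + [(1000, 50)]) is False
```

The "for 5 minutes" condition is what separates a symptom worth waking someone for from transient noise.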
SLIs, SLOs, and SLAs
- SLI (Service Level Indicator): A metric that measures service quality (e.g., the fraction of requests completing in under 200ms, currently measuring 99.2%)
- SLO (Service Level Objective): The target value for an SLI (e.g., "99.9% availability per month")
- SLA (Service Level Agreement): A contractual commitment with consequences for missing SLOs (e.g., "If uptime drops below 99.9%, customer gets credits")
- Error Budget: The allowed amount of unreliability. If SLO is 99.9%, your error budget is 0.1% (~43 minutes/month). Spend it wisely on feature releases.
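The error-budget arithmetic above can be checked in a few lines. This is a sketch using a 30-day window; the 14.4x figure echoes Google's multiwindow burn-rate alerting, here simply recomputed:

```python
# SLO: 99.9% availability over a 30-day window
slo = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in the window

# Error budget: the 0.1% of the window you are allowed to fail
budget_minutes = (1 - slo) * window_minutes
print(round(budget_minutes, 1))          # -> 43.2 minutes per month

# Burn rate: how fast you consume budget relative to plan.
# At burn rate 1.0 the budget lasts exactly one window.
def hours_until_exhausted(burn_rate):
    return (window_minutes / burn_rate) / 60

print(round(hours_until_exhausted(14.4), 1))  # -> 50.0 hours, about 2 days
```

At a 14.4x burn rate a month's budget is gone in roughly two days, which is why that threshold warrants a page rather than a ticket.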
Practice Problems
Medium: Design a Monitoring Dashboard
You are running an e-commerce checkout service. Design a Grafana dashboard with panels for:
- The four golden signals for the checkout API
- Business metrics (orders per minute, revenue)
- Dependency health (database, payment gateway, inventory service)
Use rate() for counters, histogram_quantile() for latency percentiles. Include both technical and business metrics on the same dashboard.
# Dashboard panels (PromQL queries)
# 1. LATENCY - p50, p95, p99
# histogram_quantile(0.99,
# rate(http_request_duration_seconds_bucket{
# endpoint="/checkout"}[5m]))
# 2. TRAFFIC - requests per second
# rate(http_requests_total{
# endpoint="/checkout"}[1m])
# 3. ERRORS - error rate percentage
# rate(http_requests_total{
# endpoint="/checkout",status=~"5.."}[5m])
# / rate(http_requests_total{
# endpoint="/checkout"}[5m]) * 100
# 4. SATURATION - active connections
# http_requests_in_progress
# 5. BUSINESS - orders per minute
# rate(orders_completed_total[1m]) * 60
# 6. DEPENDENCY - payment gateway latency
# histogram_quantile(0.95,
# rate(external_call_duration_seconds_bucket{
# service="payment_gateway"}[5m]))
# 7. DEPENDENCY - DB connection pool usage
# db_connections_active / db_connections_max
Medium: Troubleshoot with Traces
Users report the checkout page is slow (taking 3+ seconds). You have access to Jaeger traces. Describe your debugging process:
- How do you find the relevant traces?
- What patterns would indicate a database bottleneck?
- What patterns would indicate a downstream service issue?
Filter traces by endpoint and minimum duration. Look at the waterfall view to see which span takes the most time. Check if the slow span is a DB call, network call, or CPU computation.
# Debugging process:
# 1. FIND SLOW TRACES
# In Jaeger: service=checkout, operation=POST /checkout
# Filter: minDuration=3s
# Sort by duration descending
# 2. DATABASE BOTTLENECK indicators:
# - Multiple DB spans, each taking 500ms+
# - N+1 query pattern (100 small DB spans)
# - Long gaps between spans (connection pool wait)
# Action: check slow query log, add indexes
# 3. DOWNSTREAM SERVICE indicators:
# - Single span to payment-service taking 2.5s
# - Many retries visible in trace
# - Timeouts on external calls
# Action: check payment service health,
# consider circuit breaker, add timeout
# 4. CROSS-REFERENCE with logs:
# Copy trace_id from Jaeger
# Search in Kibana: trace_id="abc-123"
# Find error messages and context
Hard: Design an Alerting Strategy
Your team is experiencing alert fatigue (50+ pages per week). Redesign the alerting strategy:
- Define severity levels and escalation paths
- Convert existing CPU/memory alerts to symptom-based alerts
- Implement error budget-based alerting
Only page for customer-facing impact. Use burn rate alerts: if you are spending your monthly error budget 10x too fast, that is a P1. CPU at 80% is only a P3 (dashboard) unless it causes latency degradation.
# Alerting Strategy Redesign
# SEVERITY LEVELS:
# P1 (Page on-call): Customer-impacting NOW
# - Error rate > 1% for 5 min
# - p99 latency > 5s for 5 min
# - Complete outage of any service
# P2 (Create ticket): Degraded but not critical
# - Error rate > 0.5% for 15 min
# - Elevated latency (p95 > 2s)
# - Disk usage > 85%
# P3 (Dashboard only): Worth watching
# - CPU > 80% (not customer-impacting)
# - Memory > 75%
# - Unusual traffic patterns
# ERROR BUDGET ALERTS:
# SLO: 99.9% success rate (error budget: 0.1%)
# Monthly budget: ~43 minutes of downtime
# Alert if burn rate exceeds:
# - 14.4x in 1 hour -> P1 (page)
# - 6x in 6 hours -> P2 (ticket)
# - 1x over 3 days -> P3 (review)
Quick Reference
Observability Tool Stack
| Pillar | Open Source | Commercial | Cloud-Native |
|---|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic | CloudWatch, Stackdriver |
| Logs | ELK Stack, Loki | Splunk, Sumo Logic | CloudWatch Logs |
| Traces | Jaeger, Zipkin | Datadog APM, Lightstep | X-Ray, Cloud Trace |
| All-in-one | OpenTelemetry (collector) | Datadog, Dynatrace | Azure Monitor |
Observability Checklist
- Instrument early: Add metrics, logs, and traces from day one, not after an outage
- Use OpenTelemetry: Vendor-neutral standard for instrumentation
- Correlate signals: Use trace IDs to link metrics, logs, and traces together
- Define SLOs before alerting: Know your targets before setting thresholds
- Practice runbooks: Every alert should have a documented response procedure
- Conduct blameless postmortems: After incidents, focus on improving systems, not blaming people