Observability

Part of Module 7: Hands-On Projects

📦 MLflow - Experiment Tracking

What is MLflow?

Meaning: Open-source platform for managing the end-to-end machine learning lifecycle.
Example: Data scientist runs 100 experiments → MLflow tracks parameters, metrics, models → compares results → deploys best model.

Core Components:

  • Tracking: Log experiments, parameters, metrics
  • Projects: Package code for reproducibility
  • Models: Deploy models from various frameworks
  • Registry: Central model store with versioning (see the registry sketch after the tracking example below)
# MLflow Tracking Example
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Load a sample dataset and split it
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Start MLflow run
with mlflow.start_run(run_name="rf_experiment"):
    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth
    )
    model.fit(X_train, y_train)

    # Log metrics
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions, average='weighted')

    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)

    # Log model with an input/output signature
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=infer_signature(X_test, predictions)
    )

    # Log artifacts (plots, data samples); assumes these files were generated earlier
    mlflow.log_artifact("plots/confusion_matrix.png")
    mlflow.log_artifact("data/feature_importance.csv")

📈 Prometheus & Grafana

Metrics Collection & Visualization

Meaning: Prometheus collects and stores time-series metrics; Grafana visualizes them in dashboards.
Example: Model serving latency spikes → Prometheus alerts → Grafana dashboard shows root cause → auto-scaling triggered.
# Prometheus metrics in Python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
prediction_counter = Counter(
    'model_predictions_total', 
    'Total predictions made',
    ['model_name', 'version']
)

latency_histogram = Histogram(
    'prediction_latency_seconds',
    'Prediction latency',
    ['model_name']
)

gpu_utilization = Gauge(
    'gpu_utilization_percent',
    'GPU utilization percentage'
)

# Instrument your code
@latency_histogram.labels(model_name="bert").time()
def predict(input_data):
    # Model prediction logic (`model` is assumed to be a loaded model object)
    result = model.predict(input_data)
    
    # Increment counter
    prediction_counter.labels(
        model_name="bert",
        version="v1.0"
    ).inc()
    
    return result

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)

Grafana Dashboard Config:

# PromQL queries for Grafana
# Request rate
rate(model_predictions_total[5m])

# P95 latency
histogram_quantile(0.95, 
  rate(prediction_latency_seconds_bucket[5m])
)

# Error rate (assumes a model_errors_total counter)
rate(model_errors_total[5m]) / 
  rate(model_predictions_total[5m])

# GPU memory usage
avg(gpu_memory_used_bytes) / 
  avg(gpu_memory_total_bytes) * 100
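
These queries can also be evaluated programmatically through the Prometheus HTTP API, e.g. for automated reports. A minimal sketch, assuming a Prometheus server reachable at http://localhost:9090:

# Query Prometheus directly for the P95 latency shown above
import requests

query = (
    "histogram_quantile(0.95, "
    "rate(prediction_latency_seconds_bucket[5m]))"
)
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": query},
    timeout=5,
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])  # value is [timestamp, "number"]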

🐶 Datadog - APM for ML

Application Performance Monitoring

Meaning: Full-stack observability platform with ML-specific monitoring capabilities.
Example: Trace request from API gateway → feature processing → model inference → response with detailed timing.
# Datadog APM Integration
from ddtrace import patch_all, tracer
from datadog import initialize, statsd

# Initialize the Datadog API/statsd client (keys usually come from env vars)
initialize(
    api_key="YOUR_API_KEY",
    app_key="YOUR_APP_KEY"
)

# Auto-instrument supported libraries (web frameworks, HTTP clients, etc.)
patch_all()

# Custom tracing (`model`, `model_name`, and `preprocess` are assumed to be defined)
@tracer.wrap(service="ml-inference", resource="predict")
def predict_with_model(features):
    with tracer.trace("feature_preprocessing"):
        processed = preprocess(features)
    
    with tracer.trace("model_inference"):
        prediction = model.predict(processed)
        
        # Custom metrics
        statsd.histogram(
            'model.prediction.confidence',
            prediction.confidence,
            tags=[f"model:{model_name}"]
        )
    
    return prediction

# Log correlation: inject trace/span IDs into standard-library log records
import logging
from ddtrace import patch

patch(logging=True)
logging.basicConfig(
    format="%(asctime)s %(levelname)s "
           "[dd.trace_id=%(dd.trace_id)s dd.span_id=%(dd.span_id)s] %(message)s"
)
logger = logging.getLogger(__name__)

🎯 Weights & Biases

ML Experiment Tracking & Visualization

Meaning: Platform for tracking experiments, visualizing results, and collaborating on ML projects.
Example: Team trains 50 models → W&B tracks hyperparameters → visualizes loss curves → compares performance → shares best model.
# W&B Integration
import wandb
import torch
import torch.nn as nn

# Initialize W&B
wandb.init(
    project="model-training",
    config={
        "learning_rate": 0.001,
        "epochs": 100,
        "batch_size": 32,
        "architecture": "ResNet50"
    }
)

# Training loop with logging (model, dataloader, optimizer, train_step,
# validate, val_loader, and get_gpu_util are assumed to be defined)
config = wandb.config
best_acc = 0.0

for epoch in range(config.epochs):
    for batch in dataloader:
        loss = train_step(batch)
        
        # Log metrics
        wandb.log({
            "loss": loss,
            "learning_rate": optimizer.param_groups[0]['lr'],
            "gpu_utilization": get_gpu_util()
        })
    
    # Log validation metrics
    val_loss, val_acc = validate(model, val_loader)
    wandb.log({
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "epoch": epoch
    })
    
    # Save and log a checkpoint when validation accuracy improves
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), "model.pt")
        wandb.save("model.pt")
        
# Log final model
torch.save(model.state_dict(), "final_model.pt")
wandb.save("final_model.pt")
wandb.finish()

W&B Features:

  • Sweeps: Hyperparameter optimization
  • Artifacts: Version datasets and models
  • Reports: Share results with stakeholders
  • Integrations: PyTorch, TensorFlow, Keras, XGBoost
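
The Sweeps and Artifacts features above can be sketched briefly. A minimal sketch, assuming a `train()` function that calls wandb.init() and logs "val_loss", and a hypothetical local file data/train.csv:

# W&B sweep: Bayesian search over learning rate and batch size
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="model-training")
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials

# W&B artifact: version a dataset alongside the runs (inside an active run)
run = wandb.init(project="model-training", job_type="dataset-upload")
artifact = wandb.Artifact("training-data", type="dataset")
artifact.add_file("data/train.csv")
run.log_artifact(artifact)
run.finish()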

🔍 Distributed Tracing & Monitoring

End-to-End ML Pipeline Observability

Key Metrics to Monitor:

  • Model Performance: Accuracy, F1, AUC over time
  • Data Quality: Missing values, schema changes
  • Inference Metrics: Latency, throughput, errors
  • Resource Usage: CPU, GPU, memory, cost
  • Business KPIs: Revenue impact, user engagement
# OpenTelemetry for ML Observability
import time

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Set up tracing and metrics, exporting over OTLP/gRPC to a local collector
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
metrics.set_meter_provider(
    MeterProvider(metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())])
)

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create metrics
prediction_counter = meter.create_counter(
    "predictions",
    description="Number of predictions made"
)

latency_recorder = meter.create_histogram(
    "prediction_latency",
    description="Prediction latency in ms"
)

# `model`, `model_name`, `model_version`, and `preprocess` are assumed to be defined
def predict(request):
    with tracer.start_as_current_span("prediction") as span:
        # Add span attributes
        span.set_attribute("model.name", model_name)
        span.set_attribute("model.version", model_version)
        
        start_time = time.time()
        
        with tracer.start_as_current_span("preprocessing"):
            features = preprocess(request)
        
        with tracer.start_as_current_span("inference"):
            prediction = model.predict(features)
        
        # Record metrics
        latency = (time.time() - start_time) * 1000
        latency_recorder.record(latency)
        prediction_counter.add(1, {"model": model_name})
        
        return prediction

Alert Configuration

Critical Alerts:

  • Model Drift: Performance degradation > 5%
  • High Latency: P99 > SLA threshold
  • Error Rate: Errors > 1% of requests
  • Resource Exhaustion: GPU memory > 90%
  • Data Drift: Feature distribution shift (see the PSI sketch after the alert rules below)
# Prometheus alerting rules (evaluated by Prometheus, routed via Alertmanager)
# alerting_rules.yml
groups:
  - name: ml_alerts
    rules:
      - alert: ModelPerformanceDegradation
        expr: |
          avg_over_time(model_accuracy[5m]) < 0.95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy dropped below threshold"
          description: "Model {{ $labels.model }} accuracy is {{ $value }}"
      
      - alert: HighPredictionLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(prediction_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High prediction latency detected"

✅ Best Practices

ML Observability Guidelines

Implementation Strategy:

  • Start with basic metrics (latency, errors, throughput)
  • Add ML-specific metrics (accuracy, drift)
  • Implement distributed tracing for complex pipelines
  • Set up alerting for critical thresholds
  • Create dashboards for different stakeholders
  • Automate reporting and anomaly detection
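
For the last point, a simple automated check goes a long way before adopting a full anomaly-detection product. A minimal sketch using a rolling z-score over a metric series (e.g. per-minute latency samples):

# Flag points whose rolling z-score exceeds a threshold
import numpy as np

def detect_anomalies(values, window=60, threshold=3.0):
    values = np.asarray(values, dtype=float)
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean, std = history.mean(), history.std()
        if std > 0 and abs(values[i] - mean) / std > threshold:
            anomalies.append(i)  # index of the anomalous sample
    return anomalies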

Monitoring Stack Comparison:

Tool         Best For              Pros                   Cons
MLflow       Experiment tracking   Open-source, simple    Limited monitoring
W&B          Research teams        Great visualizations   Paid for teams
Prometheus   Metrics collection    Powerful, flexible     Complex setup
Datadog      Full-stack APM        Comprehensive          Expensive

Common Pitfalls:

  • Not tracking data drift from day one
  • Missing business metric correlation
  • Over-monitoring (alert fatigue)
  • No baseline metrics before deployment
  • Ignoring cost metrics
  • Not monitoring feature pipeline
