📦 MLflow - Experiment Tracking
What is MLflow?
Meaning: Open-source platform for managing the end-to-end machine learning lifecycle.
Example: Data scientist runs 100 experiments → MLflow tracks parameters, metrics, models → compares results → deploys best model.
Core Components:
- Tracking: Log experiments, parameters, metrics
- Projects: Package code for reproducibility
- Models: Deploy models from various frameworks
- Registry: Central model store with versioning (a registry sketch follows the tracking example below)
```python
# MLflow Tracking Example
# (X_train, y_train, X_test, y_test assumed to be prepared elsewhere)
import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Start MLflow run
with mlflow.start_run(run_name="rf_experiment"):
    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth
    )
    model.fit(X_train, y_train)

    # Log metrics
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions, average='weighted')
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)

    # Log model (signature inferred from matching inputs and outputs)
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=infer_signature(X_test, predictions)
    )

    # Log artifacts (plots, data samples)
    mlflow.log_artifact("plots/confusion_matrix.png")
    mlflow.log_artifact("data/feature_importance.csv")
```
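The Registry component picks up where tracking leaves off: the model logged above can be registered and versioned centrally. A minimal sketch, assuming the run ID was captured from the run above and using the hypothetical registered-model name `churn-classifier` (the alias API needs MLflow 2.3+):

```python
# Register the model logged by the tracking run above
# ("churn-classifier" and run_id are assumptions for illustration)
import mlflow
from mlflow.tracking import MlflowClient

result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",  # run_id, e.g. run.info.run_id
    name="churn-classifier"
)

# Point an alias at the new version so deployments can reference
# "models:/churn-classifier@production" instead of a raw version number
client = MlflowClient()
client.set_registered_model_alias(
    name="churn-classifier",
    alias="production",
    version=result.version
)

# Load the aliased version for serving
model = mlflow.sklearn.load_model("models:/churn-classifier@production")
```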
📈 Prometheus & Grafana
Metrics Collection & Visualization
Meaning: Prometheus collects time-series metrics; Grafana visualizes them in dashboards.
Example: Model serving latency spikes → Prometheus alerts → Grafana dashboard shows root cause → auto-scaling triggered.
```python
# Prometheus metrics in Python
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
prediction_counter = Counter(
    'model_predictions_total',
    'Total predictions made',
    ['model_name', 'version']
)
latency_histogram = Histogram(
    'prediction_latency_seconds',
    'Prediction latency',
    ['model_name']
)
gpu_utilization = Gauge(
    'gpu_utilization_percent',
    'GPU utilization percentage'
)

# Instrument your code (model assumed to be loaded elsewhere)
@latency_histogram.labels(model_name="bert").time()
def predict(input_data):
    # Model prediction logic
    result = model.predict(input_data)

    # Increment counter
    prediction_counter.labels(
        model_name="bert",
        version="v1.0"
    ).inc()
    return result

# Expose metrics at http://localhost:8000/metrics for scraping
start_http_server(8000)
```
Grafana Dashboard Queries (PromQL):
```promql
# PromQL queries for Grafana panels

# Request rate
rate(model_predictions_total[5m])

# P95 latency
histogram_quantile(0.95,
  rate(prediction_latency_seconds_bucket[5m])
)

# Error rate
rate(model_errors_total[5m])
  / rate(model_predictions_total[5m])

# GPU memory usage
avg(gpu_memory_used_bytes) / avg(gpu_memory_total_bytes) * 100
```
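The same queries can also be pulled programmatically, e.g. for automated reports or custom checks. A minimal sketch against Prometheus's HTTP query API, assuming a server at `localhost:9090`:

```python
# Fetch the P95 latency query above via Prometheus's HTTP API
# (the server address is an assumption)
import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"

query = (
    "histogram_quantile(0.95, "
    "rate(prediction_latency_seconds_bucket[5m]))"
)
resp = requests.get(PROMETHEUS_URL, params={"query": query}, timeout=5)
resp.raise_for_status()

# An instant query returns one sample per matching series
for series in resp.json()["data"]["result"]:
    _timestamp, value = series["value"]
    print(f"{series['metric']}: p95 = {float(value):.3f}s")
```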
🐶 Datadog - APM for ML
Application Performance Monitoring
Meaning: Full-stack observability platform with ML-specific monitoring capabilities.
Example: Trace request from API gateway → feature processing → model inference → response with detailed timing.
```python
# Datadog APM Integration
import logging

from ddtrace import patch, tracer
from datadog import initialize, statsd

# Initialize the Datadog API and DogStatsD clients
initialize(
    api_key="YOUR_API_KEY",
    app_key="YOUR_APP_KEY"
)

# Auto-instrument supported libraries; logging=True injects trace IDs
# into log records for log/trace correlation
patch(logging=True)

logger = logging.getLogger(__name__)

# Custom tracing (preprocess, model, model_name assumed defined elsewhere)
@tracer.wrap(service="ml-inference", resource="predict")
def predict_with_model(features):
    with tracer.trace("feature_preprocessing"):
        processed = preprocess(features)

    with tracer.trace("model_inference"):
        prediction = model.predict(processed)

    # Custom metrics via DogStatsD
    statsd.histogram(
        'model.prediction.confidence',
        prediction.confidence,
        tags=[f"model:{model_name}"]
    )
    return prediction
```
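For simple latency tracking without manual spans, the DogStatsD client also ships a timing decorator. A short sketch (the metric name and tag are illustrative):

```python
# Time a function and report the distribution to Datadog
# (metric name and tags are assumptions)
from datadog import statsd

@statsd.timed('model.inference.time', tags=["model:bert"])
def run_inference(features):
    return model.predict(features)  # model assumed loaded elsewhere
```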
🎯 Weights & Biases
ML Experiment Tracking & Visualization
Meaning: Platform for tracking experiments, visualizing results, and collaborating on ML projects.
Example: Team trains 50 models → W&B tracks hyperparameters → visualizes loss curves → compares performance → shares best model.
```python
# W&B Integration
# (model, dataloader, val_loader, optimizer, train_step, validate
# and get_gpu_util are assumed to be defined elsewhere)
import wandb

# Initialize W&B
wandb.init(
    project="model-training",
    config={
        "learning_rate": 0.001,
        "epochs": 100,
        "batch_size": 32,
        "architecture": "ResNet50"
    }
)
config = wandb.config
best_acc = 0.0

# Training loop with logging
for epoch in range(config.epochs):
    for batch in dataloader:
        loss = train_step(batch)

        # Log per-step metrics
        wandb.log({
            "loss": loss,
            "learning_rate": optimizer.param_groups[0]['lr'],
            "gpu_utilization": get_gpu_util()
        })

    # Log validation metrics
    val_loss, val_acc = validate(model, val_loader)
    wandb.log({
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "epoch": epoch
    })

    # Save a checkpoint whenever validation accuracy improves
    if val_acc > best_acc:
        best_acc = val_acc
        wandb.save("model.pt")

# Log final model
wandb.save("final_model.pt")
wandb.finish()
```
W&B Features:
- Sweeps: Hyperparameter optimization (see the sweep sketch after this list)
- Artifacts: Version datasets and models
- Reports: Share results with stakeholders
- Integrations: PyTorch, TensorFlow, Keras, XGBoost
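The Sweeps feature wraps the same `wandb.init`/`wandb.log` loop in a search driver. A minimal sketch, assuming a training helper like the loop above (the search space is an assumption):

```python
# Hyperparameter sweep: W&B samples configs and calls train()
# once per trial (the search space here is illustrative)
import wandb

sweep_config = {
    "method": "bayes",  # grid / random / bayes
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
    },
}

def train():
    # wandb.config carries the sampled hyperparameters for this trial
    run = wandb.init()
    val_loss = train_and_validate(run.config)  # assumed helper
    wandb.log({"val_loss": val_loss})

sweep_id = wandb.sweep(sweep_config, project="model-training")
wandb.agent(sweep_id, function=train, count=20)  # run 20 trials
```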
🔍 Distributed Tracing & Monitoring
End-to-End ML Pipeline Observability
Key Metrics to Monitor:
- Model Performance: Accuracy, F1, AUC over time
- Data Quality: Missing values, schema changes
- Inference Metrics: Latency, throughput, errors
- Resource Usage: CPU, GPU, memory, cost
- Business KPIs: Revenue impact, user engagement
```python
# OpenTelemetry for ML Observability
# (model, model_name, model_version and preprocess assumed defined
# elsewhere; exporter/provider setup, e.g. OTLP, happens at startup)
import time

from opentelemetry import trace, metrics

# Tracer and meter come from the globally configured providers
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Create metrics
prediction_counter = meter.create_counter(
    "predictions",
    description="Number of predictions made"
)
latency_recorder = meter.create_histogram(
    "prediction_latency",
    description="Prediction latency in ms"
)

def predict(request):
    with tracer.start_as_current_span("prediction") as span:
        # Add span attributes
        span.set_attribute("model.name", model_name)
        span.set_attribute("model.version", model_version)

        start_time = time.time()

        with tracer.start_as_current_span("preprocessing"):
            features = preprocess(request)

        with tracer.start_as_current_span("inference"):
            prediction = model.predict(features)

        # Record metrics
        latency = (time.time() - start_time) * 1000
        latency_recorder.record(latency)
        prediction_counter.add(1, {"model": model_name})

        return prediction
```
Alert Configuration
Critical Alerts:
- Model Drift: Performance degradation > 5%
- High Latency: P99 > SLA threshold
- Error Rate: Errors > 1% of requests
- Resource Exhaustion: GPU memory > 90%
- Data Drift: Feature distribution shift (a PSI sketch follows the alert rules below)
```yaml
# Prometheus alerting rules (evaluated by Prometheus, routed by Alertmanager)
# alerting_rules.yml
groups:
  - name: ml_alerts
    rules:
      - alert: ModelPerformanceDegradation
        expr: |
          avg_over_time(model_accuracy[5m]) < 0.95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy dropped below threshold"
          description: "Model {{ $labels.model }} accuracy is {{ $value }}"

      - alert: HighPredictionLatency
        expr: |
          histogram_quantile(0.99,
            rate(prediction_latency_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High prediction latency detected"
```
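The Data Drift alert above needs a drift score to fire on. One common choice is the Population Stability Index (PSI); a minimal sketch with NumPy (the bin count and the 0.2 warning level are conventions, not requirements):

```python
# Population Stability Index between a training-time baseline and
# live feature values; PSI > 0.2 is a common drift warning level
import numpy as np

def psi(baseline, live, bins=10):
    # Bin edges come from the baseline distribution
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)

    # Avoid log(0) and division by zero for empty bins
    base_pct = np.clip(base_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)

    return float(np.sum((live_pct - base_pct) * np.log(live_pct / base_pct)))

# Export the score as a Prometheus gauge so an alert can fire on it
# (drift_gauge is a hypothetical Gauge like the ones defined earlier)
# drift_gauge.set(psi(train_feature, live_feature))
```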
✅ Best Practices
ML Observability Guidelines
Implementation Strategy:
- Start with basic metrics (latency, errors, throughput)
- Add ML-specific metrics (accuracy, drift)
- Implement distributed tracing for complex pipelines
- Set up alerting for critical thresholds
- Create dashboards for different stakeholders
- Automate reporting and anomaly detection (a minimal sketch follows this list)
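For the last point, even a rolling z-score on a key metric catches many regressions before anyone looks at a dashboard. A minimal sketch (the window size and threshold are assumptions to tune):

```python
# Flag metric values more than 3 sigma from the trailing-window mean
# (window=60 samples and threshold=3.0 are illustrative defaults)
from collections import deque
import statistics

class RollingAnomalyDetector:
    def __init__(self, window=60, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if value is anomalous vs. the trailing window."""
        anomalous = False
        if len(self.values) >= 10:  # need some history first
            mean = statistics.fmean(self.values)
            stdev = statistics.pstdev(self.values)
            if stdev > 0 and abs(value - mean) / stdev > self.threshold:
                anomalous = True
        self.values.append(value)
        return anomalous

detector = RollingAnomalyDetector()
# detector.observe(latest_p95_latency)  # e.g. fed from Prometheus
```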
Monitoring Stack Comparison:
| Tool | Best For | Pros | Cons |
|---|---|---|---|
| MLflow | Experiment tracking | Open-source, simple | Limited monitoring |
| W&B | Research teams | Great visualizations | Paid for teams |
| Prometheus | Metrics collection | Powerful, flexible | Complex setup |
| Datadog | Full-stack APM | Comprehensive | Expensive |
Common Pitfalls:
- Not tracking data drift from day one
- Missing business metric correlation
- Over-monitoring (alert fatigue)
- No baseline metrics before deployment
- Ignoring cost metrics
- Not monitoring the feature pipeline
Module 7: Hands-On Projects
Topics:
- RAG Chatbot
- LLM Deployment
- Observability
- Multi-Agent Workflows