MLOps & AIOps

Part of Module 4: AI in Production

🔄 Understanding the Ops Landscape

DevOps

Meaning: Practices that combine software development and IT operations to shorten the development lifecycle.
Example: Team uses Jenkins CI/CD → automated testing → deploy to Kubernetes → monitor with Datadog.

Core Practices:

  • Continuous Integration (CI)
  • Continuous Deployment (CD)
  • Infrastructure as Code (IaC)
  • Monitoring and Logging
  • Collaboration and Communication

MLOps

Meaning: DevOps principles applied to machine learning systems, adding data and model lifecycle management.
Example: Data pipeline triggers retraining → model validation → A/B testing → gradual rollout → performance monitoring.

Additional Concerns:

  • Data Versioning: Track dataset changes
  • Experiment Tracking: Log all training runs
  • Model Registry: Version and store models
  • Feature Store: Consistent feature computation
  • Drift Detection: Monitor data/concept drift (see the sketch after the pipeline example below)
# MLOps pipeline with DVC and MLflow
# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw
    outs:
      - data/processed
  
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed
    params:
      - train.epochs
      - train.learning_rate
    metrics:
      - metrics.json
    outs:
      - models/model.pkl

# MLflow tracking in train.py (assuming a scikit-learn model)
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metrics(metrics)
    mlflow.sklearn.log_model(model, "model")
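
The drift detection concern above can be made concrete with a simple distribution check. The sketch below compares a recent production feature window against the training-time reference using a two-sample Kolmogorov-Smirnov test; the file paths, column set, and 0.05 threshold are illustrative assumptions rather than part of the pipeline above.
# Drift detection sketch: compare production feature distributions against
# the training reference with a two-sample Kolmogorov-Smirnov test.
# Paths, columns, and the alpha threshold are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(reference: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Return per-feature drift flags (True means the distributions differ)."""
    drifted = {}
    for column in reference.columns:  # assumes numeric feature columns
        statistic, p_value = ks_2samp(reference[column], current[column])
        drifted[column] = p_value < alpha
    return drifted

if __name__ == "__main__":
    reference = pd.read_csv("data/processed/train_features.csv")  # training-time snapshot
    current = pd.read_csv("data/processed/latest_window.csv")     # recent production window
    flags = detect_drift(reference, current)
    if any(flags.values()):
        print("Drift detected in:", [name for name, hit in flags.items() if hit])
        # This is the point where a retraining trigger would fire.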

AIOps

Meaning: Using AI/ML to enhance IT operations through automated anomaly detection, root cause analysis, and self-healing systems.
Example: AI system detects unusual API latency → predicts server failure → auto-scales resources → prevents outage.

Key Capabilities:

  • Anomaly Detection: Identify unusual patterns
  • Predictive Analytics: Forecast failures
  • Root Cause Analysis: Automated debugging
  • Auto-remediation: Self-healing systems
  • Intelligent Alerting: Reduce alert fatigue
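
As a minimal sketch of the anomaly detection capability listed above, the snippet below flags metric samples whose rolling z-score exceeds a threshold; the metric, window size, and threshold are assumptions chosen for illustration.
# Rolling z-score anomaly detection on an operational metric such as API latency.
# Window size and threshold are illustrative assumptions.
import pandas as pd

def flag_anomalies(latency_ms: pd.Series, window: int = 60, threshold: float = 3.0) -> pd.Series:
    """Mark points more than `threshold` std devs from the mean of the previous window."""
    mean = latency_ms.rolling(window=window, min_periods=window).mean().shift(1)
    std = latency_ms.rolling(window=window, min_periods=window).std().shift(1)
    z_scores = (latency_ms - mean) / std
    return z_scores.abs() > threshold

# Toy usage: the spike at 640 ms is flagged, the steady values are not
latency = pd.Series([120.0, 118.0, 125.0, 119.0, 640.0, 122.0])
print(flag_anomalies(latency, window=3, threshold=2.0))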

🚀 CI/CD for Machine Learning

ML Pipeline Architecture

Meaning: Automated workflow from data ingestion through model deployment and monitoring.
Example: GitHub push → triggers data validation → model training → automated testing → staging deployment → production release.
# GitHub Actions ML Pipeline
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 0 * * 0'  # Weekly retraining

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Validate Data Quality
        run: |
          python scripts/validate_data.py
          great_expectations checkpoint run data_quality

  train-model:
    needs: validate-data
    runs-on: ubuntu-latest
    steps:
      - name: Train Model
        run: |
          python scripts/train.py
          python scripts/validate_model.py  # project-level model quality gate
      
      - name: Register Model
        run: |
          python -c "import mlflow; mlflow.register_model('runs:/${{ env.RUN_ID }}/model', 'production-model')"

  deploy:
    needs: train-model
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Staging
        run: |
          kubectl apply -f k8s/staging/
          python scripts/smoke_test.py --env staging
      
      - name: Run A/B Test
        run: |
          python scripts/ab_test.py --traffic 0.1
          sleep 3600  # Monitor for 1 hour
          python scripts/evaluate_ab.py
      
      - name: Promote to Production
        if: success()
        run: |
          kubectl set image deployment/model-server \
            model=${{ env.MODEL_IMAGE }}:${{ env.VERSION }}
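
The evaluate_ab.py step above is project-specific; one possible shape for it is sketched below, where the candidate is gated on its error rate versus the control with a two-proportion z-test. The counts and the significance threshold are illustrative assumptions.
# Possible shape of scripts/evaluate_ab.py (illustrative sketch, not the actual script):
# compare error rates of control vs. candidate and exit non-zero if the
# candidate is significantly worse, which blocks the promotion step above.
import math
import sys

def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int) -> float:
    """z-statistic for H0: candidate error rate equals control error rate."""
    p_pool = (err_a + err_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return ((err_b / n_b) - (err_a / n_a)) / se

if __name__ == "__main__":
    # In practice these counts would come from the monitoring system.
    control_errors, control_requests = 42, 10_000
    candidate_errors, candidate_requests = 5, 1_000
    z = two_proportion_z(control_errors, control_requests,
                         candidate_errors, candidate_requests)
    if z > 1.645:  # one-sided test at alpha = 0.05: candidate worse than control
        sys.exit("Candidate error rate significantly higher; aborting promotion")
    print("Candidate passes the A/B gate")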

Pipeline Components:

  • Data Pipeline: Ingestion → Validation → Transformation
  • Training Pipeline: Feature Engineering → Training → Evaluation
  • Model Pipeline: Validation → Registry → Deployment
  • Monitoring Pipeline: Metrics → Alerts → Retraining Triggers
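
To make the Metrics → Alerts → Retraining Triggers step concrete, here is a minimal sketch of a trigger that re-runs the training pipeline when a monitored metric drops below a floor; the metrics file, threshold, and use of dvc repro as the dispatch mechanism are assumptions.
# Minimal retraining trigger: if the monitored live metric falls below a floor,
# re-run the training pipeline. Metric source, threshold, and dispatch mechanism
# are illustrative assumptions.
import json
import subprocess

ACCURACY_FLOOR = 0.92  # assumed service-level objective for the model

def live_accuracy(metrics_path: str = "monitoring/latest_metrics.json") -> float:
    """Read the most recent accuracy reported by the monitoring pipeline."""
    with open(metrics_path) as f:
        return json.load(f)["accuracy"]

def maybe_trigger_retraining() -> None:
    accuracy = live_accuracy()
    if accuracy < ACCURACY_FLOOR:
        # Re-run the DVC pipeline defined earlier (prepare -> train)
        subprocess.run(["dvc", "repro"], check=True)
    else:
        print(f"Accuracy {accuracy:.3f} above floor; no retraining needed")

if __name__ == "__main__":
    maybe_trigger_retraining()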

Testing Strategies

Types of ML Tests:

  • Data Tests: Schema validation, distribution checks
  • Model Tests: Performance thresholds, fairness metrics
  • Integration Tests: API contracts, latency requirements
  • Drift Tests: Feature/prediction drift detection
# Model testing with pytest
import time

import numpy as np
import pytest
from sklearn.metrics import accuracy_score

from model import load_model
# Project-specific test-data helpers; adjust the import to your repo layout
from data import load_test_data, load_group_data

class TestModel:
    def test_model_performance(self):
        """Test model meets performance threshold"""
        model = load_model("latest")
        X_test, y_test = load_test_data()
        
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        
        assert accuracy > 0.95, f"Accuracy {accuracy} below threshold"
    
    def test_inference_latency(self):
        """Test prediction latency requirement"""
        model = load_model("latest")
        sample = np.random.randn(1, 100)
        
        start = time.time()
        _ = model.predict(sample)
        latency = time.time() - start
        
        assert latency < 0.1, f"Latency {latency}s exceeds 100ms"
    
    def test_model_fairness(self):
        """Test model fairness across groups"""
        model = load_model("latest")
        
        for group in ['group_a', 'group_b']:
            X, y = load_group_data(group)
            accuracy = model.score(X, y)
            assert accuracy > 0.93, f"Bias detected for {group}"
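
The data tests from the list above can follow the same pytest pattern; the sketch below checks schema and basic value ranges on the processed training data, with column names, dtypes, and bounds assumed for illustration.
# Data tests in the same pytest style: schema and range checks on the processed
# dataset. Path, column names, dtypes, and bounds are illustrative assumptions.
import pandas as pd

EXPECTED_COLUMNS = {"user_id": "int64", "amount": "float64", "label": "int64"}

class TestData:
    def setup_method(self):
        self.df = pd.read_csv("data/processed/train.csv")

    def test_schema(self):
        """Columns and dtypes match the expected schema."""
        for column, dtype in EXPECTED_COLUMNS.items():
            assert column in self.df.columns, f"Missing column {column}"
            assert str(self.df[column].dtype) == dtype, f"Wrong dtype for {column}"

    def test_value_ranges(self):
        """No missing labels; amounts are non-negative."""
        assert self.df["label"].isna().sum() == 0
        assert (self.df["amount"] >= 0).all()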

🛠️ MLOps Tool Ecosystem

End-to-End Platforms

Major Platforms:

  • Kubeflow: Kubernetes-native ML workflows
  • MLflow: Open-source lifecycle management
  • Metaflow: Netflix's human-centric framework
  • SageMaker: AWS managed ML platform
  • Vertex AI: Google Cloud ML platform
  • Azure ML: Microsoft's ML platform

Tool Categories:

  • Experiment Tracking: MLflow, W&B, Neptune (log experiments)
  • Data Versioning: DVC, LakeFS, Pachyderm (version datasets)
  • Feature Store: Feast, Tecton, Hopsworks (feature management)
  • Model Registry: MLflow, Seldon, BentoML (model versioning)
  • Orchestration: Airflow, Prefect, Dagster (pipeline automation)
  • Monitoring: Evidently, WhyLabs, Arize (production monitoring)

Platform Comparison

Decision Factors:

  • Cloud vs On-Premise: Managed services vs control
  • Scale: Team size and model volume
  • Complexity: Simple models vs complex pipelines
  • Budget: Open-source vs enterprise
  • Ecosystem: Existing tools and infrastructure

Typical Stack Examples:

  • Startup: MLflow + DVC + GitHub Actions + Heroku
  • Scale-up: Kubeflow + Feast + Seldon + K8s
  • Enterprise: SageMaker + Step Functions + CloudWatch
  • Research: W&B + Papermill + Colab + GCS

✅ MLOps Best Practices

Maturity Model

Level 0: Manual Process

  • Manual, script-driven process
  • Interactive notebooks
  • No CI/CD for ML
  • Manual deployment

Level 1: ML Pipeline Automation

  • Automated training pipeline
  • Experiment tracking
  • Model registry
  • Metadata logging

Level 2: CI/CD Pipeline Automation

  • Source control for code AND data
  • Automated testing
  • CI/CD for training pipeline
  • Automated deployment
  • Production monitoring

Implementation Guidelines:

  • Start with experiment tracking
  • Version everything (code, data, models)
  • Automate gradually
  • Monitor from day one
  • Document model decisions
  • Plan for model updates
  • Build for reproducibility
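
One way to act on "version everything" and "build for reproducibility" is to pin random seeds and record the exact code and data versions with every training run, as in the sketch below; the helper names and paths are illustrative.
# Recording what produced a model: pin seeds and capture the code/data versions
# alongside the run. Helper names and paths are illustrative assumptions.
import hashlib
import random
import subprocess

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Make data shuffling and weight initialization repeatable."""
    random.seed(seed)
    np.random.seed(seed)

def run_fingerprint(data_path: str = "data/processed/train.csv") -> dict:
    """Capture the git commit and a hash of the training data."""
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    return {"git_commit": commit, "data_sha256": data_hash, "seed": 42}

# Log the fingerprint with the experiment tracker, e.g.:
# mlflow.log_params(run_fingerprint())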

Common Anti-patterns:

  • No separation between dev/staging/prod
  • Missing rollback strategy
  • Ignoring data quality issues
  • Manual model deployment
  • No monitoring after deployment
  • Treating models as static artifacts
