📊 Data Phase
Data Collection & Preparation
High-quality data is the foundation of any ML project. This phase covers gathering, cleaning, and preparing data for model training.
Example Workflow
E-commerce company collects user behavior data → removes duplicates and outliers → normalizes features → creates train/validation/test splits → versions dataset for reproducibility.
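A minimal sketch of this workflow in pandas/scikit-learn. The column names, the outlier rule, and the 60/20/20 split ratios are illustrative assumptions, not the company's actual pipeline:

```python
# Minimal sketch of the workflow above; column names, the outlier rule,
# and the 60/20/20 split are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "clicks":    [3, 3, 50, 2, 7, 4, 9, 1, 6, 8],
    "purchased": [0, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df = df[df["clicks"] < df["clicks"].quantile(0.99)].copy()  # crude outlier cut
df["clicks"] = (df["clicks"] - df["clicks"].mean()) / df["clicks"].std()  # standardize

train, rest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))
```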
Key Components:
- Data Versioning: DVC, LakeFS, Delta Lake for tracking data changes
- Feature Stores: Feast, Tecton, AWS SageMaker Feature Store for feature management
- Data Validation: Great Expectations, TFDV for quality checks
- ETL Pipelines: Apache Airflow, Prefect, Dagster for orchestration
```python
# Example: Data versioning with DVC
import pandas as pd
import dvc.api

# Track data versions
data_url = dvc.api.get_url(
    path='data/training_data.csv',
    repo='https://github.com/company/ml-project',
    rev='v2.0'
)

# Load versioned data
df = pd.read_csv(data_url)
print(f"Loaded {len(df)} samples from v2.0")
```
Feature Engineering
Transform raw data into meaningful features that improve model performance; a short sketch of the timestamp case follows the list below.
Common Transformations
- Timestamp → hour_of_day, day_of_week, is_weekend, is_holiday
- Text → TF-IDF vectors, word embeddings, sentence embeddings
- Categorical → one-hot encoding, target encoding, embeddings
- Numerical → normalization, standardization, binning
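A hypothetical sketch of the timestamp transformations from the first bullet; the `event_time` column name and the two-date holiday calendar are assumptions for illustration:

```python
# Hypothetical sketch of the timestamp transformations; the 'event_time'
# column and the two-date holiday calendar are assumptions.
import pandas as pd

df = pd.DataFrame({"event_time": pd.to_datetime(
    ["2024-01-01 09:30", "2024-06-15 22:10", "2024-12-25 07:45"]
)})

df["hour_of_day"] = df["event_time"].dt.hour
df["day_of_week"] = df["event_time"].dt.dayofweek        # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"].isin([5, 6])
holidays = pd.to_datetime(["2024-01-01", "2024-12-25"])  # assumed holiday list
df["is_holiday"] = df["event_time"].dt.normalize().isin(holidays)
print(df)
```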
🎯 Training Phase
Experimentation & Training
Iterative process of training models, tuning hyperparameters, and tracking experiments.
```python
# MLflow experiment tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_val, y_val are assumed to come from the data-phase splits
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = model.score(X_val, y_val)
    mlflow.log_metric("val_accuracy", accuracy)

    # Log model
    mlflow.sklearn.log_model(model, "model")
```
Training Infrastructure Options:
- Local Development: Jupyter notebooks, VS Code
- Cloud Platforms: SageMaker, Vertex AI, Azure ML
- Distributed Training: Ray, Horovod, PyTorch DDP
- AutoML: H2O.ai, AutoGluon, TPOT
Model Validation
Rigorous testing to ensure model quality, fairness, and robustness before deployment; a minimal automated gate for the metric thresholds is sketched after the checklist.
Validation Checklist
- ✓ Performance metrics meet thresholds (accuracy, F1, AUC)
- ✓ No bias across demographic groups
- ✓ Robust to adversarial inputs
- ✓ Business KPIs aligned
- ✓ A/B test design prepared
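A minimal sketch of automating the first checklist item as a hard gate. The threshold values and the `validate()` helper are illustrative assumptions; the other checklist items (bias, robustness, business KPIs) need their own checks:

```python
# Minimal sketch of an automated gate for the metric-threshold item above;
# the threshold values and validate() helper are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

THRESHOLDS = {"accuracy": 0.85, "f1": 0.80, "auc": 0.90}  # assumed release thresholds

def validate(model, X_val, y_val) -> bool:
    """Return True only if every metric clears its threshold."""
    preds = model.predict(X_val)
    proba = model.predict_proba(X_val)[:, 1]
    metrics = {
        "accuracy": round(accuracy_score(y_val, preds), 3),
        "f1": round(f1_score(y_val, preds), 3),
        "auc": round(roc_auc_score(y_val, proba), 3),
    }
    failures = {k: v for k, v in metrics.items() if v < THRESHOLDS[k]}
    if failures:
        print(f"Validation failed: {failures}")
        return False
    print(f"Validation passed: {metrics}")
    return True

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
validate(RandomForestClassifier(random_state=42).fit(X_train, y_train), X_val, y_val)
```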
🚀 Deployment Phase
Model Packaging & Registry
Standardize models for deployment and maintain a central model registry; an ONNX export sketch follows the format list.
Model Formats:
- ONNX: Cross-platform standard
- TorchScript: PyTorch production format
- SavedModel: TensorFlow format
- PMML: Traditional ML standard
- Docker: Containerized models
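A minimal sketch of exporting a toy PyTorch model to ONNX, the cross-platform format above. The architecture, input shape, and file name are illustrative assumptions:

```python
# Minimal sketch of an ONNX export; the architecture, input shape,
# and file name are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 4)  # example input the exporter traces with
torch.onnx.export(model, dummy_input, "model.onnx")
print("Wrote model.onnx")
```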
```python
# Model registry with MLflow
import mlflow
from mlflow.tracking import MlflowClient

# Register model (run_id comes from the training run logged earlier)
model_uri = f"runs:/{run_id}/model"
model_name = "customer_churn_model"
result = mlflow.register_model(
    model_uri=model_uri,
    name=model_name
)

# Promote to production
client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Production"
)
```
Deployment Strategies
Different approaches to rolling out models to production systems; a toy canary router is sketched after the list.
Common Strategies
- Blue-Green: Instant switchover between versions
- Canary: Gradual rollout with monitoring (5% → 25% → 100%)
- Shadow: Run new model alongside old without serving
- A/B Testing: Compare model versions with real traffic
- Multi-Armed Bandit: Dynamic traffic allocation based on performance
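A toy sketch of the canary strategy: route a small, configurable slice of live traffic to the new version. The stand-in predict functions and the 5% starting fraction are assumptions for illustration; real deployments usually do this at the load balancer or service mesh:

```python
# Toy sketch of canary routing; the stand-in predict functions and the
# 5% starting fraction are assumptions for illustration.
import random

CANARY_FRACTION = 0.05  # ramp 5% -> 25% -> 100% as monitoring stays green

def predict_stable(features):
    return "stable-prediction"  # stand-in for the current production model

def predict_canary(features):
    return "canary-prediction"  # stand-in for the new candidate version

def route_request(features):
    """Send a small fraction of live traffic to the canary, the rest to stable."""
    if random.random() < CANARY_FRACTION:
        return predict_canary(features)
    return predict_stable(features)

print(route_request({"user_id": 42}))
```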
📈 Monitoring Phase
Model Monitoring
Continuous tracking of model performance, data drift, and system health in production.
Key Metrics to Track:
- Performance: Accuracy, precision, recall over time
- Data Drift: Input distribution changes
- Concept Drift: Target distribution changes
- System: Latency, throughput, error rates
- Business: Revenue impact, user engagement
```python
# Monitoring with Evidently
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import DataDriftTable

# Create drift report (train_df is the reference window, production_df the live window)
report = Report(metrics=[
    DataDriftTable(),
])
report.run(
    reference_data=train_df,
    current_data=production_df,
    column_mapping=ColumnMapping()
)

# Check for drift
if report.as_dict()['metrics'][0]['result']['dataset_drift']:
    alert_team("Data drift detected!")  # alert_team is a placeholder for your alerting hook
```
Retraining & Updates
Automated or triggered model retraining based on performance degradation or new data; a simple decision function combining these signals is sketched after the list.
Retraining Triggers
- ⏰ Scheduled (daily, weekly, monthly)
- 📉 Performance threshold breach
- 📊 Significant data drift detected
- 📈 New data volume threshold reached
- 📅 Business calendar events
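A minimal sketch combining the first three triggers into one decision function. All threshold values, the weekly schedule, and the `should_retrain()` helper are illustrative assumptions:

```python
# Minimal sketch combining the triggers above into one decision function.
# All thresholds and the weekly schedule are illustrative assumptions.
from datetime import datetime, timedelta

ACCURACY_FLOOR = 0.80      # assumed performance threshold
DRIFT_SHARE_LIMIT = 0.30   # assumed share of drifting features that forces a retrain
RETRAIN_EVERY = timedelta(days=7)

def should_retrain(last_trained: datetime, val_accuracy: float, drift_share: float) -> bool:
    if datetime.now() - last_trained >= RETRAIN_EVERY:
        return True   # scheduled retrain
    if val_accuracy < ACCURACY_FLOOR:
        return True   # performance threshold breach
    if drift_share > DRIFT_SHARE_LIMIT:
        return True   # significant data drift
    return False

# Accuracy breach fires even though the schedule and drift checks pass
print(should_retrain(datetime.now() - timedelta(days=3), 0.75, 0.10))  # True
```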
Best Practices
✅ ML Lifecycle Management
Key Principles:
- Reproducibility: Version everything - code, data, configs
- Automation: CI/CD pipelines for ML
- Monitoring: Track everything in production
- Documentation: Model cards, data sheets
- Governance: Approval workflows, audit trails
Tools Ecosystem:
- End-to-end: Kubeflow, MLflow, Metaflow
- Cloud: SageMaker, Vertex AI, Azure ML
- Monitoring: Evidently, WhyLabs, Arize
- Feature Stores: Feast, Tecton, Hopsworks
- Orchestration: Airflow, Prefect, Dagster
⚠️ Common Pitfalls
- Not versioning training data
- Ignoring data drift monitoring
- Manual deployment processes
- Lack of rollback strategy
- Missing business metric alignment
- No feature drift detection
- Treating models as static artifacts