🔄 ML Lifecycle

Master the complete journey of machine learning projects from data collection to production deployment. Learn industry best practices, MLOps principles, and real-world deployment strategies.

📊 Lifecycle Phases

1. Problem Definition

Identify business objectives, define success metrics, and determine if ML is the right solution.

# Define clear success metrics
business_metrics = {
    'accuracy_threshold': 0.95,
    'latency_requirement': '< 100ms',
    'cost_per_prediction': '< $0.001',
    'roi_target': '3x within 6 months'
}

2. Data Collection & Preparation

Gather, clean, and organize data. Implement feature engineering and data validation pipelines.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Data preparation pipeline
df = pd.read_csv('raw_data.csv')
df = df.dropna()
df = df[df['value'] > 0]

# Feature engineering
df['new_feature'] = df['feature1'] / df['feature2']
# One-hot encode the category column (get_dummies returns a DataFrame,
# so the encoded columns are joined back instead of assigned to a single column)
df = pd.concat([df, pd.get_dummies(df['category'], prefix='category')], axis=1)
df = df.drop('category', axis=1)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'],
    test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
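
The pipeline above cleans and transforms the data, but the validation step mentioned in the description is only implied by the dropna and range filter. Below is a minimal validation sketch with plain pandas; the required column names and specific checks are illustrative assumptions, not part of the original pipeline:

import pandas as pd

# Minimal data validation sketch (assumed schema: feature1, feature2, category, value, target)
REQUIRED_COLUMNS = {'feature1', 'feature2', 'category', 'value', 'target'}

def validate_dataframe(frame):
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if frame['feature2'].eq(0).any():
        raise ValueError("feature2 contains zeros; the ratio feature would divide by zero")
    if frame.duplicated().any():
        raise ValueError("Duplicate rows detected")
    return frame

# Run the checks on the raw file before feature engineering
raw = validate_dataframe(pd.read_csv('raw_data.csv'))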

3. Model Development

Select algorithms, train models, and optimize hyperparameters using cross-validation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Model development workflow
rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    rf, param_grid, cv=5,
    scoring='f1_macro', n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
best_model = grid_search.best_estimator_

4. Model Evaluation

Assess model performance using multiple metrics and validate against business requirements.

from sklearn.metrics import (
    accuracy_score, precision_recall_curve,
    roc_auc_score, confusion_matrix, classification_report
)

# Evaluate model
y_pred = best_model.predict(X_test_scaled)
y_pred_proba = best_model.predict_proba(X_test_scaled)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
auc_score = roc_auc_score(y_test, y_pred_proba[:, 1])

print(f"Accuracy: {accuracy:.3f}")
print(f"AUC Score: {auc_score:.3f}")
print(classification_report(y_test, y_pred))
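
Validating against business requirements can be as simple as comparing these scores to the thresholds agreed in the problem-definition phase. A short sketch using the business_metrics dictionary from section 1 (only the numeric accuracy threshold is checked here):

# Compare evaluation results with the success metrics defined earlier
if accuracy >= business_metrics['accuracy_threshold']:
    print("Model meets the accuracy requirement; proceed to deployment review.")
else:
    print(f"Accuracy {accuracy:.3f} is below the agreed threshold "
          f"of {business_metrics['accuracy_threshold']}; do not promote this model.")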

5. Model Deployment

Deploy model to production environment with proper versioning and rollback capabilities.

# Model deployment with Flask
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('model_v1.pkl')
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    features = scaler.transform([data['features']])
    prediction = model.predict(features)[0]
    confidence = model.predict_proba(features)[0].max()
    return jsonify({
        'prediction': int(prediction),
        'confidence': float(confidence)
    })
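
The service above loads a fixed model_v1.pkl; versioning and rollback need a little extra plumbing. One possible sketch, assuming artifacts are saved as model_<version>.pkl and the active version is recorded in a small text file (both conventions are assumptions, not part of the original setup):

import joblib

ACTIVE_VERSION_FILE = 'active_model_version.txt'  # hypothetical pointer file

def load_active_model():
    with open(ACTIVE_VERSION_FILE) as f:
        version = f.read().strip()                # e.g. "v1"
    return joblib.load(f'model_{version}.pkl'), version

def rollback(previous_version):
    # Rolling back means pointing the file back at the previous artifact
    with open(ACTIVE_VERSION_FILE, 'w') as f:
        f.write(previous_version)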

6. Monitoring & Maintenance

Track model performance, detect drift, and implement retraining pipelines.

# Model monitoring system
class ModelMonitor:
    def __init__(self, baseline_metrics):
        self.baseline = baseline_metrics
        self.alerts = []

    def check_drift(self, current_metrics):
        drift_detected = False
        for metric, value in current_metrics.items():
            baseline_value = self.baseline.get(metric)
            if baseline_value:
                drift = abs(value - baseline_value) / baseline_value
                if drift > 0.1:  # 10% threshold
                    self.alerts.append({
                        'metric': metric,
                        'drift': drift,
                        'action': 'retrain_required'
                    })
                    drift_detected = True
        return drift_detected

🎯 ML Pipeline Simulator

Interactive simulator stages: 📊 Data → 🧪 Train → ✅ Evaluate → 🚀 Deploy → 📈 Monitor, with a live pipeline status panel ("Ready to start...").

🔧 MLOps & DevOps

📦 Version Control

Track code, data, and model versions for reproducibility and collaboration.

# DVC for data version control
$ dvc init
$ dvc add data/training_data.csv
$ git add data/training_data.csv.dvc
$ git commit -m "Add training data v1.0"

# MLflow for model versioning
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")

🔄 CI/CD Pipelines

Automate testing, validation, and deployment of ML models.

# GitHub Actions workflow
name: ML Pipeline

on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        run: |
          pytest tests/
      - name: Train model
        run: |
          python train.py
      - name: Evaluate model
        run: |
          python evaluate.py
      - name: Deploy if passing
        if: success()
        run: |
          python deploy.py

๐Ÿณ Containerization

Package models with dependencies for consistent deployment across environments.

# Dockerfile for ML model
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY model/ ./model/
COPY app.py .

EXPOSE 5000

CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]

📊 Experiment Tracking

Log experiments, compare results, and manage model registry.

# Weights & Biases tracking
import wandb

wandb.init(project="ml-lifecycle")
config = wandb.config
config.learning_rate = 0.001
config.batch_size = 32
config.epochs = 100

for epoch in range(config.epochs):
    train_loss = train_epoch()
    val_loss = validate()
    wandb.log({
        "train_loss": train_loss,
        "val_loss": val_loss,
        "epoch": epoch
    })

๐Ÿ” Security & Compliance

Implement security best practices and ensure regulatory compliance.

# Model security checks
import re

class SecurityValidator:
    def validate_input(self, data):
        # Check for SQL injection patterns
        if re.search(r'(DROP|DELETE|INSERT|UPDATE)', str(data)):
            raise ValueError("Suspicious input detected")

        # Validate data types
        if not isinstance(data, dict):
            raise TypeError("Invalid input format")

        # Check for PII (contains_pii/anonymize_pii are implemented elsewhere)
        if self.contains_pii(data):
            data = self.anonymize_pii(data)

        return data

⚡ Infrastructure as Code

Define and manage ML infrastructure using code for scalability.

# Terraform configuration
resource "aws_sagemaker_model" "ml_model" {
  name               = "ml-lifecycle-model"
  execution_role_arn = aws_iam_role.sagemaker.arn

  primary_container {
    image          = "${var.ecr_uri}:latest"
    model_data_url = "s3://${var.model_bucket}/model.tar.gz"
  }
}

resource "aws_sagemaker_endpoint" "ml_endpoint" {
  name                 = "ml-lifecycle-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.ml_config.name
}

💡 MLOps Best Practices:
  • Automate everything: training, testing, deployment
  • Version control code, data, and models
  • Monitor model performance continuously
  • Implement gradual rollout strategies
  • Maintain reproducibility across environments

🛠️ Tools & Platforms

☁️ Cloud Platforms

Comprehensive ML services from major cloud providers.

  • AWS SageMaker: End-to-end ML platform
  • Google Vertex AI: Unified ML platform
  • Azure ML: Enterprise ML service
  • IBM Watson: AI and ML tools

📈 Experiment Tracking

Tools for tracking experiments and managing models.

  • MLflow: Open-source platform
  • Weights & Biases: Experiment tracking
  • Neptune.ai: Metadata store
  • Comet ML: Model management

🔄 Orchestration

Workflow orchestration and pipeline management tools.

  • Apache Airflow: Workflow management
  • Kubeflow: K8s ML workflows
  • Prefect: Modern dataflow automation
  • Dagster: Data orchestrator

📊 Data Management

Tools for data versioning and feature management.

  • DVC: Data version control
  • Feast: Feature store
  • Tecton: Feature platform
  • Great Expectations: Data validation

🚀 Model Serving

Platforms for deploying and serving ML models.

  • TensorFlow Serving: TF model serving
  • TorchServe: PyTorch model serving
  • Seldon Core: K8s ML deployment
  • BentoML: Model packaging

📉 Monitoring

Tools for monitoring model performance and drift.

  • Evidently AI: ML monitoring
  • WhyLabs: Model observability
  • Arize: ML observability platform
  • Prometheus: Metrics monitoring

๐Ÿ” Platform Comparison Tool

🚀 Deployment Strategies

🔄 Blue-Green Deployment

Deploy new version alongside the old, then switch traffic instantly.

# Blue-Green deployment with Kubernetes
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
    version: green  # Switch between blue/green
  ports:
    - port: 80
      targetPort: 5000

🎯 Canary Deployment

Gradually roll out new model to a small percentage of users.

# Canary deployment configuration
class CanaryDeployment:
    def __init__(self, canary_percentage=5):
        self.canary_percentage = canary_percentage

    def route_request(self, request_id):
        if hash(request_id) % 100 < self.canary_percentage:
            return "new_model"
        return "stable_model"
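
The router above decides which model serves a request, but the gradual part of the rollout still needs a promotion step. A hedged sketch that widens the canary slice while the observed error rate stays acceptable; the 1% error threshold and 10-point step are illustrative assumptions:

# Hypothetical promotion step for the CanaryDeployment above
def promote_if_healthy(deployment, canary_error_rate, max_error_rate=0.01, step=10):
    if canary_error_rate > max_error_rate:
        deployment.canary_percentage = 0  # stop routing traffic to the new model
        return "rolled_back"
    deployment.canary_percentage = min(100, deployment.canary_percentage + step)
    return "promoted" if deployment.canary_percentage == 100 else "expanded"

Run it after each evaluation window, for example once per hour of canary traffic.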

🌊 Rolling Deployment

Update instances one at a time with zero downtime.

# Rolling update configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1

⚡ Serverless Deployment

Deploy models as serverless functions for automatic scaling.

# AWS Lambda deployment
import json
import boto3
import joblib

model = joblib.load('/tmp/model.pkl')

def lambda_handler(event, context):
    features = json.loads(event['body'])['features']
    prediction = model.predict([features])[0]
    return {
        'statusCode': 200,
        'body': json.dumps({
            'prediction': int(prediction)
        })
    }

🔧 Edge Deployment

Deploy models to edge devices for low-latency inference.

# TensorFlow Lite edge deployment
import tensorflow as tf

# Convert model to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save for edge device
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
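
On the device itself, the converted file is loaded with the TFLite interpreter for inference. A minimal sketch; the zero-filled sample is a placeholder whose shape and dtype are read from the model's own input signature:

import numpy as np
import tensorflow as tf

# Load the converted model and prepare its tensors
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input matching the model's expected shape and dtype
sample = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]['index'])

On constrained devices, the lighter tflite-runtime package provides the same Interpreter class without the full TensorFlow dependency.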

🔄 A/B Testing

Compare model versions to determine the best performer.

# A/B testing framework
class ABTest:
    def __init__(self, models, split_ratio):
        self.models = models
        self.split_ratio = split_ratio
        self.metrics = {name: [] for name in models}

    def assign_model(self, user_id):
        if hash(user_id) % 100 < self.split_ratio * 100:
            return 'model_a'
        return 'model_b'

    def record_outcome(self, model_name, metric_value):
        self.metrics[model_name].append(metric_value)
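
Declaring a winner usually comes down to a statistical comparison of the recorded outcomes. A hedged sketch using a two-sample t-test over the metric lists collected by the ABTest class above; the 0.05 significance level is a conventional choice, not something specified here:

from scipy import stats

def compare_variants(ab_test, alpha=0.05):
    a = ab_test.metrics['model_a']
    b = ab_test.metrics['model_b']
    # Welch's t-test on the recorded outcome metric
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    if p_value >= alpha:
        return {'winner': None, 'p_value': p_value}  # no significant difference yet
    winner = 'model_a' if sum(a) / len(a) > sum(b) / len(b) else 'model_b'
    return {'winner': winner, 'p_value': p_value}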

✅ Deployment Checklist:
  • Model versioning and rollback plan ready
  • API documentation and client libraries
  • Load testing and performance benchmarks
  • Monitoring and alerting configured
  • Security audit and compliance check
  • Disaster recovery plan in place

📈 Monitoring & Maintenance

Example production dashboard: 99.9% uptime, 45 ms average latency, 94.2% accuracy, 1.2M daily predictions.

📊 Performance Monitoring

Track model accuracy, latency, and resource utilization.

# Performance monitoring
import time
from prometheus_client import Counter, Histogram, Gauge

# Define metrics
prediction_counter = Counter('ml_predictions_total', 'Total predictions')
latency_histogram = Histogram('ml_prediction_latency_seconds', 'Prediction latency')
accuracy_gauge = Gauge('ml_model_accuracy', 'Current model accuracy')

@latency_histogram.time()
def predict(features):
    prediction_counter.inc()
    result = model.predict(features)
    return result

🎯 Data Drift Detection

Monitor input data distribution changes over time.

# Data drift detection
from scipy import stats
import numpy as np

class DriftDetector:
    def __init__(self, reference_data, threshold=0.05):
        self.reference = reference_data
        self.threshold = threshold

    def detect_drift(self, current_data):
        drift_detected = []
        for column in self.reference.columns:
            # Kolmogorov-Smirnov test
            statistic, p_value = stats.ks_2samp(
                self.reference[column],
                current_data[column]
            )
            if p_value < self.threshold:
                drift_detected.append({
                    'feature': column,
                    'p_value': p_value,
                    'drift': True
                })
        return drift_detected

๐Ÿ” Model Explainability

Understand and explain model predictions for transparency.

# SHAP for model explainability
import shap
import numpy as np
import pandas as pd

# Create explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X_test.columns,
    'importance': np.abs(shap_values).mean(0)
}).sort_values('importance', ascending=False)

# Explain single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])

⚠️ Alerting System

Set up alerts for model degradation and system issues.

# Alert configuration
alerts = {
    'accuracy_drop': {
        'threshold': 0.85,
        'severity': 'critical',
        'action': 'page_oncall'
    },
    'latency_spike': {
        'threshold': 200,  # ms
        'severity': 'warning',
        'action': 'send_email'
    },
    'error_rate': {
        'threshold': 0.01,  # 1%
        'severity': 'critical',
        'action': 'auto_rollback'
    }
}
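
The dictionary only declares thresholds; something still has to compare live metrics against them and trigger the configured action. A minimal evaluation sketch; the direction of each comparison is an assumption based on the metric names:

def evaluate_alerts(current_metrics, alert_config):
    # Return the alerts whose thresholds are breached by the current metrics
    triggered = []
    for name, rule in alert_config.items():
        value = current_metrics.get(name)
        if value is None:
            continue
        # Accuracy fires when the value drops below the threshold;
        # latency and error rate fire when they rise above it (assumed convention)
        breached = value < rule['threshold'] if name == 'accuracy_drop' else value > rule['threshold']
        if breached:
            triggered.append({'alert': name, 'value': value,
                              'severity': rule['severity'], 'action': rule['action']})
    return triggered

# Example: evaluate_alerts({'accuracy_drop': 0.82, 'latency_spike': 150}, alerts)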

🔄 Automated Retraining

Implement pipelines for automatic model retraining.

# Automated retraining pipeline
from datetime import datetime

class RetrainingPipeline:
    def __init__(self, schedule='weekly'):
        self.schedule = schedule
        self.last_training = datetime.now()

    def should_retrain(self):
        conditions = [
            self.check_schedule(),
            self.check_performance_drop(),
            self.check_data_drift(),
            self.check_new_data_volume()
        ]
        return any(conditions)

    def retrain(self):
        # Pull latest data
        new_data = self.fetch_latest_data()

        # Train new model
        new_model = self.train_model(new_data)

        # Validate performance
        if self.validate_model(new_model):
            self.deploy_model(new_model)
            self.last_training = datetime.now()

๐Ÿ“ Audit Logging

Maintain comprehensive logs for debugging and compliance.

# Structured logging
import logging
import json
from datetime import datetime

class MLAuditLogger:
    def __init__(self, model_version='v1'):
        self.logger = logging.getLogger('ml_audit')
        self.model_version = model_version

    def log_prediction(self, request_id, features, prediction, confidence, latency):
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'request_id': request_id,
            'model_version': self.model_version,
            'features_hash': hash(str(features)),
            'prediction': prediction,
            'confidence': confidence,
            'latency_ms': latency
        }
        self.logger.info(json.dumps(log_entry))

📊 Live Monitoring Dashboard

Example live readout: 95.2% live accuracy, 42 ms live latency, 1,234 requests/min, drift status ✅ normal.

🎯 Practice Exercises

๐Ÿ“ Exercise 1: Build a Pipeline

Create an end-to-end ML pipeline for a classification problem.

# Your task: Complete the pipeline
class MLPipeline:
    def __init__(self):
        self.model = None
        self.scaler = None

    def load_data(self, path):
        # TODO: Load and validate data
        pass

    def preprocess(self, data):
        # TODO: Clean and transform data
        pass

    def train(self, X_train, y_train):
        # TODO: Train and optimize model
        pass

    def evaluate(self, X_test, y_test):
        # TODO: Evaluate model performance
        pass

    def deploy(self):
        # TODO: Deploy model to production
        pass

๐Ÿ” Exercise 2: Implement Monitoring

Add monitoring capabilities to track model performance.

# Your task: Add monitoring
class ModelMonitor:
    def __init__(self, model):
        self.model = model
        self.metrics = []

    def log_prediction(self, features, prediction):
        # TODO: Log prediction details
        pass

    def calculate_metrics(self):
        # TODO: Calculate performance metrics
        pass

    def detect_drift(self, new_data):
        # TODO: Check for data drift
        pass

    def send_alert(self, message):
        # TODO: Send alert notification
        pass

🚀 Exercise 3: Deploy with Docker

Containerize and deploy your ML model using Docker.

# Your task: Complete the Dockerfile
FROM python:3.9-slim

WORKDIR /app

# TODO: Copy requirements
COPY ? .

# TODO: Install dependencies
RUN ?

# TODO: Copy application code
COPY ? .

# TODO: Expose port
EXPOSE ?

# TODO: Define entry point
CMD [?]

🎯 Key Takeaways

1. Automate Everything: From data validation to model deployment
2. Version Control: Track code, data, and models
3. Monitor Continuously: Watch for drift and degradation
4. Test Rigorously: Unit tests, integration tests, A/B tests
5. Document Thoroughly: APIs, models, and processes
6. Plan for Failure: Have rollback and recovery strategies