ML Model Lifecycle

Part of Module 4: AI Infrastructure

The ML Model Lifecycle encompasses the entire journey of a machine learning model from initial data collection through deployment and monitoring. Understanding each phase is crucial for building robust, scalable ML systems that deliver value in production environments.

📊 Data Phase

Data Collection & Preparation

The foundation of any ML project is high-quality data. This phase involves gathering, cleaning, and preparing data for model training.

Example Workflow

An e-commerce company collects user behavior data → removes duplicates and outliers → normalizes features → creates train/validation/test splits → versions the dataset for reproducibility.
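
A minimal sketch of that workflow in pandas and scikit-learn, assuming a hypothetical behavior export with a numeric session_length column; the file paths, outlier cutoffs, and split ratios are illustrative:

# Sketch: deduplicate, trim outliers, standardize, and split a behavior dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/user_behavior.csv").drop_duplicates()

# Remove extreme outliers on a numeric column (keep the 1st-99th percentile range)
low, high = df["session_length"].quantile([0.01, 0.99])
df = df[df["session_length"].between(low, high)]

# 70/15/15 train/validation/test split with a fixed seed for reproducibility
train_df, rest_df = train_test_split(df, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)

# Fit the scaler on the training split only, then apply it to every split
numeric_cols = train_df.select_dtypes("number").columns
scaler = StandardScaler().fit(train_df[numeric_cols])
for split in (train_df, val_df, test_df):
    split.loc[:, numeric_cols] = scaler.transform(split[numeric_cols])

# Persist the splits so they can be versioned (e.g., with `dvc add data/splits`)
for name, split in [("train", train_df), ("val", val_df), ("test", test_df)]:
    split.to_csv(f"data/splits/{name}.csv", index=False)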

Key Components:

  • Data Versioning: DVC, LakeFS, Delta Lake for tracking data changes
  • Feature Stores: Feast, Tecton, AWS SageMaker Feature Store for feature management
  • Data Validation: Great Expectations, TFDV for quality checks
  • ETL Pipelines: Apache Airflow, Prefect, Dagster for orchestration

# Example: Data versioning with DVC
import pandas as pd
import dvc.api

# Open the dataset exactly as it existed at revision v2.0 of the project repo
with dvc.api.open(
    path='data/training_data.csv',
    repo='https://github.com/company/ml-project',
    rev='v2.0'
) as f:
    df = pd.read_csv(f)

print(f"Loaded {len(df)} samples from v2.0")

Feature Engineering

Transform raw data into meaningful features that improve model performance.

Common Transformations

  • Timestamp → hour_of_day, day_of_week, is_weekend, is_holiday
  • Text → TF-IDF vectors, word embeddings, sentence embeddings
  • Categorical → one-hot encoding, target encoding, embeddings
  • Numerical → normalization, standardization, binning
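
These transformations are mostly one-liners in pandas; a minimal sketch on a hypothetical two-row dataframe (text features such as TF-IDF are left out to keep it short):

# Sketch: timestamp, categorical, and numerical feature transformations
import pandas as pd

df = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-03-01 09:30", "2024-03-02 22:15"]),
    "country": ["DE", "US"],
    "amount": [19.99, 250.00],
})

# Timestamp -> calendar features
df["hour_of_day"] = df["event_time"].dt.hour
df["day_of_week"] = df["event_time"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6])

# Categorical -> one-hot encoding
df = pd.get_dummies(df, columns=["country"], prefix="country")

# Numerical -> standardization (z-score)
df["amount_std"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()

print(df)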

🎯 Training Phase

Experimentation & Training

Iterative process of training models, tuning hyperparameters, and tracking experiments.

# MLflow experiment tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    
    # Train model (X_train, y_train, X_val, y_val prepared in the data phase)
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    
    # Log metrics
    accuracy = model.score(X_val, y_val)
    mlflow.log_metric("val_accuracy", accuracy)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
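
Hyperparameter tuning usually sits on top of this tracking loop. A minimal sketch with scikit-learn's GridSearchCV, logging the best configuration to MLflow; the search space and scoring choice are illustrative, and X_train / y_train are the same splits used in the tracking example above:

# Grid search over RandomForest hyperparameters, logging the best result
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

with mlflow.start_run():
    mlflow.log_params(search.best_params_)
    mlflow.log_metric("cv_accuracy", search.best_score_)
    mlflow.sklearn.log_model(search.best_estimator_, "model")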

Training Infrastructure Options:

  • Local Development: Jupyter notebooks, VS Code
  • Cloud Platforms: SageMaker, Vertex AI, Azure ML
  • Distributed Training: Ray, Horovod, PyTorch DDP (see the DDP sketch below this list)
  • AutoML: H2O.ai, AutoGluon, TPOT
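
For the distributed option, a minimal single-node PyTorch DDP sketch with a toy model and random data; it assumes two CPU processes and the gloo backend, and in practice you would swap in your real model, dataset, and the nccl backend on GPUs:

# Sketch: single-node, multi-process data-parallel training with PyTorch DDP
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)                 # toy model
    ddp_model = DDP(model)                         # handles gradient synchronization
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for _ in range(10):                            # toy training loop on random data
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(torch.randn(32, 10)), torch.randn(32, 1))
        loss.backward()                            # gradients all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size)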

Model Validation

Rigorous testing to ensure model quality, fairness, and robustness before deployment.

Validation Checklist

  • ✓ Performance metrics meet thresholds (accuracy, F1, AUC)
  • ✓ No bias across demographic groups
  • ✓ Robust to adversarial inputs
  • ✓ Business KPIs aligned
  • ✓ A/B test design prepared
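
The first two items lend themselves to an automated gate in the deployment pipeline. A minimal sketch assuming a binary classifier, NumPy arrays for the labeled validation set, and a per-example group array; the thresholds and allowed fairness gap are illustrative:

# Sketch: automated validation gate run before model promotion
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

THRESHOLDS = {"accuracy": 0.85, "f1": 0.80}   # illustrative minimums
MAX_GROUP_GAP = 0.05                          # max allowed accuracy gap between groups

def passes_validation(model, X_val, y_val, groups):
    """Return True only if metric thresholds and the group-gap check pass."""
    preds = model.predict(X_val)
    metrics = {
        "accuracy": accuracy_score(y_val, preds),
        "f1": f1_score(y_val, preds),
    }
    if any(metrics[name] < floor for name, floor in THRESHOLDS.items()):
        return False

    # Simple fairness proxy: per-group accuracy must stay within MAX_GROUP_GAP
    group_scores = [
        accuracy_score(y_val[groups == g], preds[groups == g])
        for g in np.unique(groups)
    ]
    return max(group_scores) - min(group_scores) <= MAX_GROUP_GAP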

🚀 Deployment Phase

Model Packaging & Registry

Standardize models for deployment and maintain a central model registry.

Model Formats:

  • ONNX: Cross-platform standard (see the export sketch below)
  • TorchScript: PyTorch production format
  • SavedModel: TensorFlow format
  • PMML: Traditional ML standard
  • Docker: Containerized models
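
As an example of the first format, a minimal sketch of converting a scikit-learn classifier to ONNX with skl2onnx and sanity-checking it with onnxruntime (both installed separately); the toy model and feature count stand in for your validated candidate:

# Sketch: export a scikit-learn model to ONNX and run it with onnxruntime
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Toy model standing in for the validated candidate
X = np.random.rand(200, 20).astype(np.float32)
y = np.random.randint(0, 2, 200)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# Convert: declare the input signature (a batch of 20 float features)
onnx_model = convert_sklearn(
    model, initial_types=[("input", FloatTensorType([None, 20]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Sanity-check the exported model with onnxruntime
session = ort.InferenceSession("model.onnx")
preds = session.run(None, {"input": X[:5]})[0]
print("ONNX predictions:", preds)

Packaged models are then tracked in a central registry, as in the MLflow example below:
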
# Model registry with MLflow
import mlflow.pyfunc

# Register the model logged by the training run above (run_id from experiment tracking)
model_uri = f"runs:/{run_id}/model"
model_name = "customer_churn_model"

result = mlflow.register_model(
    model_uri=model_uri,
    name=model_name
)

# Promote to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Production"
)

Deployment Strategies

Different approaches to rolling out models to production systems.

Common Strategies

  • Blue-Green: Instant switchover between versions
  • Canary: Gradual rollout with monitoring (5% → 25% → 100%); see the routing sketch below
  • Shadow: Run the new model alongside the old one without serving its predictions to users
  • A/B Testing: Compare model versions with real traffic
  • Multi-Armed Bandit: Dynamic traffic allocation based on performance
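
A canary rollout needs a stable way to split traffic so that each user consistently sees one version. A minimal sketch of deterministic, hash-based routing; the model names and percentage ladder are illustrative:

# Sketch: deterministic canary routing by user id
import hashlib

CANARY_PERCENT = 5   # current rollout stage: 5 -> 25 -> 100

def pick_model_version(user_id: str) -> str:
    """Route a stable slice of users to the canary model."""
    # Hash the user id into a bucket from 0-99; buckets below the rollout
    # percentage go to the canary, everyone else to the stable model.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "model_v2_canary" if bucket < CANARY_PERCENT else "model_v1_stable"

# The same user always hits the same version between requests
print(pick_model_version("user-42"))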

📈 Monitoring Phase

Model Monitoring

Continuous tracking of model performance, data drift, and system health in production.

Key Metrics to Track:

  • Performance: Accuracy, precision, recall over time
  • Data Drift: Input distribution changes
  • Concept Drift: Target distribution changes
  • System: Latency, throughput, error rates
  • Business: Revenue impact, user engagement

# Monitoring with Evidently
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import DataDriftTable

# Create drift report
report = Report(metrics=[
    DataDriftTable(),
])

report.run(
    reference_data=train_df, 
    current_data=production_df,
    column_mapping=ColumnMapping()
)

# Check for drift (alert_team is a placeholder for your alerting integration)
if report.as_dict()['metrics'][0]['result']['dataset_drift']:
    alert_team("Data drift detected!")

Retraining & Updates

Automated or triggered model retraining based on performance degradation or new data.

Retraining Triggers

  • ⏰ Scheduled (daily, weekly, monthly)
  • 📉 Performance threshold breach
  • 📊 Significant data drift detected
  • 📈 New data volume threshold reached
  • 📅 Business calendar events
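
These triggers are typically folded into one decision function that the orchestrator evaluates on a schedule. A minimal sketch; the interval, accuracy floor, and row threshold are illustrative values:

# Sketch: combining retraining triggers into a single decision
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(days=7)     # scheduled trigger
ACCURACY_FLOOR = 0.85                    # performance trigger
NEW_ROWS_THRESHOLD = 100_000             # data volume trigger

def should_retrain(last_trained, rolling_accuracy, drift_detected, new_rows):
    """Return True if any configured trigger fires."""
    return (
        datetime.now() - last_trained >= RETRAIN_INTERVAL
        or rolling_accuracy < ACCURACY_FLOOR
        or drift_detected
        or new_rows >= NEW_ROWS_THRESHOLD
    )

# Example: stale model with degraded accuracy -> retrain
if should_retrain(datetime(2024, 3, 1), rolling_accuracy=0.82,
                  drift_detected=False, new_rows=40_000):
    print("Kicking off the retraining pipeline")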

Best Practices

✅ ML Lifecycle Management

Key Principles:

  • Reproducibility: Version everything - code, data, configs
  • Automation: CI/CD pipelines for ML
  • Monitoring: Track everything in production
  • Documentation: Model cards, data sheets
  • Governance: Approval workflows, audit trails

Tools Ecosystem:

  • End-to-end: Kubeflow, MLflow, Metaflow
  • Cloud: SageMaker, Vertex AI, Azure ML
  • Monitoring: Evidently, WhyLabs, Arize
  • Feature Stores: Feast, Tecton, Hopsworks
  • Orchestration: Airflow, Prefect, Dagster

⚠️ Common Pitfalls

  • Not versioning training data
  • Ignoring data drift monitoring
  • Manual deployment processes
  • Lack of rollback strategy
  • Missing business metric alignment
  • No feature drift detection
  • Treating models as static artifacts
