📊 Data Phase
Data Collection & Preparation
High-quality data is the foundation of any ML project. This phase covers gathering, cleaning, and preparing data for model training.
Example Workflow
E-commerce company collects user behavior data → removes duplicates and outliers → normalizes features → creates train/validation/test splits → versions dataset for reproducibility.
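A minimal sketch of this workflow in pandas/scikit-learn. The column names, the outlier rule, and the 60/20/20 split ratios are illustrative assumptions, not the company's actual pipeline:

```python
# Minimal sketch of the workflow above; column names, the outlier rule,
# and the 60/20/20 split are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "clicks":    [3, 3, 50, 2, 7, 4, 9, 1, 6, 8],
    "purchased": [0, 0, 1, 0, 1, 0, 1, 0, 1, 1],
})

df = df.drop_duplicates()                                   # remove duplicate rows
df = df[df["clicks"] < df["clicks"].quantile(0.99)].copy()  # crude outlier cut
df["clicks"] = (df["clicks"] - df["clicks"].mean()) / df["clicks"].std()  # standardize

train, rest = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))
```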
Key Components:
- Data Versioning: DVC, LakeFS, Delta Lake for tracking data changes
- Feature Stores: Feast, Tecton, AWS SageMaker Feature Store for feature management
- Data Validation: Great Expectations, TFDV for quality checks
- ETL Pipelines: Apache Airflow, Prefect, Dagster for orchestration
```python
# Example: Data versioning with DVC
import pandas as pd
import dvc.api

# Track data versions
data_url = dvc.api.get_url(
    path='data/training_data.csv',
    repo='https://github.com/company/ml-project',
    rev='v2.0'
)

# Load versioned data
df = pd.read_csv(data_url)
print(f"Loaded {len(df)} samples from v2.0")
```
Feature Engineering
Transform raw data into meaningful features that improve model performance; a short sketch of the timestamp case follows the list below.
Common Transformations
- Timestamp → hour_of_day, day_of_week, is_weekend, is_holiday
- Text → TF-IDF vectors, word embeddings, sentence embeddings
- Categorical → one-hot encoding, target encoding, embeddings
- Numerical → normalization, standardization, binning
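A hypothetical sketch of the timestamp transformations from the first bullet; the `event_time` column name and the two-date holiday calendar are assumptions for illustration:

```python
# Hypothetical sketch of the timestamp transformations; the 'event_time'
# column and the two-date holiday calendar are assumptions.
import pandas as pd

df = pd.DataFrame({"event_time": pd.to_datetime(
    ["2024-01-01 09:30", "2024-06-15 22:10", "2024-12-25 07:45"]
)})

df["hour_of_day"] = df["event_time"].dt.hour
df["day_of_week"] = df["event_time"].dt.dayofweek        # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"].isin([5, 6])
holidays = pd.to_datetime(["2024-01-01", "2024-12-25"])  # assumed holiday list
df["is_holiday"] = df["event_time"].dt.normalize().isin(holidays)
print(df)
```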
🎯 Training Phase
Experimentation & Training
Iterative process of training models, tuning hyperparameters, and tracking experiments.
```python
# MLflow experiment tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_val, y_val are assumed to come from the data-phase splits
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics
    accuracy = model.score(X_val, y_val)
    mlflow.log_metric("val_accuracy", accuracy)

    # Log model
    mlflow.sklearn.log_model(model, "model")
```
Training Infrastructure Options:
- Local Development: Jupyter notebooks, VS Code
- Cloud Platforms: SageMaker, Vertex AI, Azure ML
- Distributed Training: Ray, Horovod, PyTorch DDP
- AutoML: H2O.ai, AutoGluon, TPOT
Model Validation
Rigorous testing to ensure model quality, fairness, and robustness before deployment; a minimal automated gate for the metric thresholds is sketched after the checklist.
Validation Checklist
- ✓ Performance metrics meet thresholds (accuracy, F1, AUC)
- ✓ No bias across demographic groups
- ✓ Robust to adversarial inputs
- ✓ Business KPIs aligned
- ✓ A/B test design prepared
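A minimal sketch of automating the first checklist item as a hard gate. The threshold values and the `validate()` helper are illustrative assumptions; the other checklist items (bias, robustness, business KPIs) need their own checks:

```python
# Minimal sketch of an automated gate for the metric-threshold item above;
# the threshold values and validate() helper are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

THRESHOLDS = {"accuracy": 0.85, "f1": 0.80, "auc": 0.90}  # assumed release thresholds

def validate(model, X_val, y_val) -> bool:
    """Return True only if every metric clears its threshold."""
    preds = model.predict(X_val)
    proba = model.predict_proba(X_val)[:, 1]
    metrics = {
        "accuracy": round(accuracy_score(y_val, preds), 3),
        "f1": round(f1_score(y_val, preds), 3),
        "auc": round(roc_auc_score(y_val, proba), 3),
    }
    failures = {k: v for k, v in metrics.items() if v < THRESHOLDS[k]}
    if failures:
        print(f"Validation failed: {failures}")
        return False
    print(f"Validation passed: {metrics}")
    return True

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
validate(RandomForestClassifier(random_state=42).fit(X_train, y_train), X_val, y_val)
```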
🚀 Deployment Phase
Model Packaging & Registry
Standardize models for deployment and maintain a central model registry; an ONNX export sketch follows the format list.
Model Formats:
- ONNX: Cross-platform standard
- TorchScript: PyTorch production format
- SavedModel: TensorFlow format
- PMML: Traditional ML standard
- Docker: Containerized models
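A minimal sketch of exporting a toy PyTorch model to ONNX, the cross-platform format above. The architecture, input shape, and file name are illustrative assumptions:

```python
# Minimal sketch of an ONNX export; the architecture, input shape,
# and file name are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 4)  # example input the exporter traces with
torch.onnx.export(model, dummy_input, "model.onnx")
print("Wrote model.onnx")
```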
```python
# Model registry with MLflow
import mlflow
from mlflow.tracking import MlflowClient

# Register model (run_id comes from the training run logged earlier)
model_uri = f"runs:/{run_id}/model"
model_name = "customer_churn_model"
result = mlflow.register_model(
    model_uri=model_uri,
    name=model_name
)

# Promote to production
client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Production"
)
```
Deployment Strategies
Different approaches to rolling out models to production systems; a toy canary router is sketched after the list.
Common Strategies
- Blue-Green: Instant switchover between versions
- Canary: Gradual rollout with monitoring (5% → 25% → 100%)
- Shadow: Run new model alongside old without serving
- A/B Testing: Compare model versions with real traffic
- Multi-Armed Bandit: Dynamic traffic allocation based on performance
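A toy sketch of the canary strategy: route a small, configurable slice of live traffic to the new version. The stand-in predict functions and the 5% starting fraction are assumptions for illustration; real deployments usually do this at the load balancer or service mesh:

```python
# Toy sketch of canary routing; the stand-in predict functions and the
# 5% starting fraction are assumptions for illustration.
import random

CANARY_FRACTION = 0.05  # ramp 5% -> 25% -> 100% as monitoring stays green

def predict_stable(features):
    return "stable-prediction"  # stand-in for the current production model

def predict_canary(features):
    return "canary-prediction"  # stand-in for the new candidate version

def route_request(features):
    """Send a small fraction of live traffic to the canary, the rest to stable."""
    if random.random() < CANARY_FRACTION:
        return predict_canary(features)
    return predict_stable(features)

print(route_request({"user_id": 42}))
```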
📈 Monitoring Phase
Model Monitoring
Continuous tracking of model performance, data drift, and system health in production.
Key Metrics to Track:
- Performance: Accuracy, precision, recall over time
- Data Drift: Input distribution changes
- Concept Drift: Target distribution changes
- System: Latency, throughput, error rates
- Business: Revenue impact, user engagement
```python
# Monitoring with Evidently
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import DataDriftTable

# Create drift report (train_df is the reference window, production_df the live window)
report = Report(metrics=[
    DataDriftTable(),
])
report.run(
    reference_data=train_df,
    current_data=production_df,
    column_mapping=ColumnMapping()
)

# Check for drift
if report.as_dict()['metrics'][0]['result']['dataset_drift']:
    alert_team("Data drift detected!")  # alert_team is a placeholder for your alerting hook
```
Retraining & Updates
Automated or triggered model retraining based on performance degradation or new data; a simple decision function combining these signals is sketched after the list.
Retraining Triggers
- ⏰ Scheduled (daily, weekly, monthly)
- 📉 Performance threshold breach
- 📊 Significant data drift detected
- 📈 New data volume threshold reached
- 📅 Business calendar events
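A minimal sketch combining the first three triggers into one decision function. All threshold values, the weekly schedule, and the `should_retrain()` helper are illustrative assumptions:

```python
# Minimal sketch combining the triggers above into one decision function.
# All thresholds and the weekly schedule are illustrative assumptions.
from datetime import datetime, timedelta

ACCURACY_FLOOR = 0.80      # assumed performance threshold
DRIFT_SHARE_LIMIT = 0.30   # assumed share of drifting features that forces a retrain
RETRAIN_EVERY = timedelta(days=7)

def should_retrain(last_trained: datetime, val_accuracy: float, drift_share: float) -> bool:
    if datetime.now() - last_trained >= RETRAIN_EVERY:
        return True   # scheduled retrain
    if val_accuracy < ACCURACY_FLOOR:
        return True   # performance threshold breach
    if drift_share > DRIFT_SHARE_LIMIT:
        return True   # significant data drift
    return False

# Accuracy breach fires even though the schedule and drift checks pass
print(should_retrain(datetime.now() - timedelta(days=3), 0.75, 0.10))  # True
```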
Best Practices
✅ ML Lifecycle Management
Key Principles:
- Reproducibility: Version everything - code, data, configs
- Automation: CI/CD pipelines for ML
- Monitoring: Track everything in production
- Documentation: Model cards, data sheets
- Governance: Approval workflows, audit trails
Tools Ecosystem:
- End-to-end: Kubeflow, MLflow, Metaflow
- Cloud: SageMaker, Vertex AI, Azure ML
- Monitoring: Evidently, WhyLabs, Arize
- Feature Stores: Feast, Tecton, Hopsworks
- Orchestration: Airflow, Prefect, Dagster
⚠️ Common Pitfalls
- Not versioning training data
- Ignoring data drift monitoring
- Manual deployment processes
- Lack of rollback strategy
- Missing business metric alignment
- No feature drift detection
- Treating models as static artifacts