Model serving frameworks are the backbone of production AI systems, handling everything from inference optimization to scalable deployment. This comprehensive guide explores the leading frameworks, their unique strengths, and when to use each one for maximum performance and reliability in your ML pipeline.
Model Serving Architecture Landscape
Core Serving Concepts
Model Serving transforms trained models into production-ready inference endpoints that handle real-world traffic at scale.
Key Performance Metrics:
- Throughput: Requests processed per second (RPS/QPS)
- Latency: Time from request to response (P50, P95, P99 percentiles; see the sketch after this list)
- Resource Utilization: GPU/CPU/Memory efficiency
- Scalability: Ability to handle traffic spikes
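To make these metrics concrete, here is a minimal sketch (assuming you have collected per-request timings from a client or load-test tool; the numbers below are made up) of computing P50/P95/P99 latency and throughput with NumPy:

```python
import numpy as np

# Hypothetical per-request latencies in seconds, e.g. collected during a load test
latencies = np.array([0.012, 0.015, 0.011, 0.042, 0.018, 0.095, 0.013, 0.017])
test_duration_s = 2.0  # wall-clock duration of the test window

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
throughput_rps = len(latencies) / test_duration_s

print(f"P50={p50*1000:.1f}ms  P95={p95*1000:.1f}ms  P99={p99*1000:.1f}ms")
print(f"Throughput: {throughput_rps:.1f} RPS")
```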
Optimization Strategies:
- Batching: Process multiple requests together (a generic sketch follows this list)
- Quantization: Reduce model precision (FP16, INT8)
- Caching: Store frequent results
- Model Compilation: Optimize computation graphs
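As a framework-agnostic illustration of the batching idea, the sketch below queues concurrent requests, waits a few milliseconds to fill a batch, and runs them through the model together; `run_model_batch` is a hypothetical stand-in for a real batched forward pass:

```python
import asyncio

MAX_BATCH = 8
MAX_WAIT_S = 0.005  # wait up to 5 ms to fill a batch

queue: asyncio.Queue = asyncio.Queue()

def run_model_batch(inputs):
    # Hypothetical placeholder: replace with one real batched forward pass
    return [f"result-for-{x}" for x in inputs]

async def batching_loop():
    while True:
        item = await queue.get()  # (input, future) of the first waiting request
        batch = [item]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        results = run_model_batch([inp for inp, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)

async def predict(x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main():
    asyncio.create_task(batching_loop())
    print(await asyncio.gather(*(predict(i) for i in range(10))))

asyncio.run(main())
```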
⚡ vLLM - High-Performance LLM Serving
What is vLLM?
Meaning: Fast and easy-to-use library for LLM inference and serving, optimized for high throughput.
Example: Company serves GPT models 24x faster using vLLM's PagedAttention → reduces GPU memory usage by 50%.
Key Features:
- PagedAttention: Efficient memory management
- Continuous Batching: Dynamic request batching
- Quantization: INT4/INT8 support
- Tensor Parallelism: Multi-GPU serving
- OpenAI Compatible: Drop-in replacement API
```python
# vLLM setup and offline batch inference
from vllm import LLM, SamplingParams

# Initialize model
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    dtype="half"             # FP16 for efficiency
)

# Batch inference
prompts = ["Tell me about AI", "What is ML?"]
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=100
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
Performance Benchmarks:
- 24x throughput vs. Hugging Face Transformers
- 2.2x throughput vs. TGI
- 50% GPU memory reduction
```bash
# Production vLLM deployment with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 4 \
    --dtype half \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95 \
    --host 0.0.0.0 \
    --port 8000
```

```python
# Client usage (drop-in OpenAI replacement, legacy openai<1.0 client style)
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"

response = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    max_tokens=512,
    temperature=0.7
)
print(response.choices[0].message.content)
```
🎯 Ray Serve - Scalable ML Serving
What is Ray Serve?
Meaning: Scalable model serving library built on Ray, supporting complex ML pipelines and multi-model deployments.
Example: E-commerce platform serves ensemble of 5 models → Ray Serve handles load balancing → auto-scales from 10 to 1000 QPS.
Key Capabilities:
- Framework Agnostic: PyTorch, TensorFlow, Scikit-learn
- Composition: Chain multiple models
- Auto-scaling: Dynamic replica management
- Batching: Automatic request batching
- A/B Testing: Traffic splitting
```python
# Ray Serve deployment
import ray
from ray import serve
import torch

# Define deployment
@serve.deployment(
    num_replicas=3,
    ray_actor_options={"num_gpus": 1}
)
class ModelServer:
    def __init__(self):
        self.model = torch.load("model.pt")
        self.model.eval()

    async def __call__(self, request):
        data = await request.json()
        with torch.no_grad():
            prediction = self.model(torch.tensor(data["input"]))
        return {"prediction": prediction.tolist()}

# Deploy
serve.run(ModelServer.bind())
```
```python
# Advanced Ray Serve with model composition
import torch
from ray import serve

# Preprocessing deployment
@serve.deployment(num_replicas=2)
class Preprocessor:
    def preprocess(self, text):
        return text.lower().strip()

# Model deployment with auto-scaling
@serve.deployment(
    num_replicas=1,
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
    },
    ray_actor_options={"num_gpus": 1},
)
class ModelServer:
    def __init__(self):
        self.model = torch.load("sentiment_model.pt")
        self.model.eval()

    async def predict(self, text):
        # Model inference logic (tokenization omitted for brevity)
        with torch.no_grad():
            prediction = self.model(text)
        return prediction.item()

# Pipeline composition
@serve.deployment
class Pipeline:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    async def __call__(self, request):
        data = await request.json()
        # Preprocess
        clean_text = await self.preprocessor.preprocess.remote(data["text"])
        # Predict
        prediction = await self.model.predict.remote(clean_text)
        return {"sentiment": prediction}

# Deploy pipeline
preprocessor = Preprocessor.bind()
model = ModelServer.bind()
pipeline = Pipeline.bind(preprocessor, model)
serve.start(http_options={"host": "0.0.0.0", "port": 8000})
serve.run(pipeline)
```
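The automatic request batching listed above can also be made explicit with Ray Serve's `@serve.batch` decorator; a minimal sketch (the batched computation is a placeholder):

```python
from ray import serve

@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
    async def handle_batch(self, inputs):
        # Receives a list of inputs collected within the wait window;
        # replace this placeholder with one batched forward pass.
        return [len(text) for text in inputs]

    async def __call__(self, request):
        data = await request.json()
        return {"result": await self.handle_batch(data["text"])}

serve.run(BatchedModel.bind())
```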
Advanced Features:
- Multi-model ensembles with weighted voting
- Real-time feature engineering pipelines
- A/B testing with traffic splitting (see the router sketch below)
- Hybrid CPU-GPU workloads optimization
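A/B traffic splitting does not require a dedicated feature; one way to implement it in Ray Serve (a sketch with hypothetical placeholder models and a hard-coded 90/10 split) is a lightweight router deployment:

```python
import random
from ray import serve

@serve.deployment
class ModelA:
    async def predict(self, payload):
        return {"variant": "A", "score": 0.1}  # placeholder model

@serve.deployment
class ModelB:
    async def predict(self, payload):
        return {"variant": "B", "score": 0.2}  # placeholder model

@serve.deployment
class ABRouter:
    def __init__(self, model_a, model_b, b_fraction=0.1):
        self.model_a = model_a
        self.model_b = model_b
        self.b_fraction = b_fraction  # share of traffic sent to the candidate model

    async def __call__(self, request):
        payload = await request.json()
        handle = self.model_b if random.random() < self.b_fraction else self.model_a
        return await handle.predict.remote(payload)

serve.run(ABRouter.bind(ModelA.bind(), ModelB.bind()))
```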
🚀 TensorRT - NVIDIA GPU Optimization
What is TensorRT?
Meaning: NVIDIA's high-performance deep learning inference library that optimizes models for deployment on NVIDIA GPUs.
Example: Computer vision model runs 5x faster after TensorRT optimization → latency drops from 50ms to 10ms on T4 GPU.
Optimization Techniques:
- Layer Fusion: Combines ops to reduce memory bandwidth
- Precision Calibration: INT8/FP16 quantization
- Kernel Auto-tuning: Platform-specific optimization
- Dynamic Tensor Memory: Efficient memory reuse
- Multi-Stream Execution: Parallel inference
```python
# TensorRT optimization via torch2trt
import time
import torch
import torchvision
from torch2trt import torch2trt

# Convert PyTorch model to TensorRT
model = torchvision.models.resnet50(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# Optimize with FP16
model_trt = torch2trt(
    model, [x],
    fp16_mode=True,
    max_batch_size=32
)

# Benchmark
warmup = 50
iterations = 1000

for _ in range(warmup):
    y_trt = model_trt(x)

t0 = time.time()
for _ in range(iterations):
    y_trt = model_trt(x)
t1 = time.time()

print(f"TensorRT FPS: {iterations / (t1 - t0):.1f}")
```
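The example above uses FP16. INT8 additionally needs a small calibration dataset; a sketch of how that might look with torch2trt (the `int8_mode` and `int8_calib_dataset` arguments follow torch2trt's INT8 support, but exact option names can vary by version, so verify against the release you use):

```python
import torch
import torchvision
from torch2trt import torch2trt

model = torchvision.models.resnet50(pretrained=True).eval().cuda()

# A small calibration set of representative inputs (random data here for brevity;
# in practice use real preprocessed images from your dataset).
calib_dataset = [torch.randn(1, 3, 224, 224).cuda() for _ in range(64)]

model_trt_int8 = torch2trt(
    model,
    [calib_dataset[0]],
    int8_mode=True,                    # assumed torch2trt option; verify for your version
    int8_calib_dataset=calib_dataset,  # used to compute INT8 scaling factors
    max_batch_size=32,
)
```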
```python
# TensorRT-LLM for large language models
# NOTE: the builder API below is illustrative; exact names and options differ
# between TensorRT-LLM releases, so check the version you install.
import tensorrt_llm
from tensorrt_llm.runtime import Session, TensorInfo

# Build optimized engine
def build_engine(model_path, engine_path):
    from tensorrt_llm.builder import Builder

    builder = Builder()
    network = builder.create_network()

    # Configure optimization
    config = builder.create_builder_config(
        max_batch_size=32,
        max_seq_len=2048,
        precision='fp16',               # or 'int8' for quantization
        use_gpt_attention_plugin=True,
        use_gemm_plugin=True
    )

    # Build engine
    engine = builder.build_engine(network, config)
    engine.save(engine_path)

# Runtime inference
class TensorRTInference:
    def __init__(self, engine_path):
        self.session = Session.from_serialized_engine(engine_path)

    def generate(self, input_ids, max_new_tokens=100):
        inputs = {
            'input_ids': input_ids,
            'max_new_tokens': max_new_tokens,
            'temperature': 1.0,
            'top_p': 0.9
        }
        outputs = self.session.run(inputs)
        return outputs['output_ids']

# Usage (input_tokens: token IDs produced by your tokenizer)
inference = TensorRTInference("llama_fp16.engine")
result = inference.generate(input_tokens)
print(f"Generated: {result}")
```
Performance Gains by Model:
- 5.4x speedup: ResNet-50 (INT8)
- 4.2x speedup: BERT (FP16)
- 70% latency reduction
🔄 ONNX Runtime - Cross-Platform Inference
What is ONNX Runtime?
Meaning: Cross-platform, high-performance ML inference engine for ONNX (Open Neural Network Exchange) models.
Example: Same model runs on cloud (GPU), edge devices (CPU), and mobile (NPU) using ONNX Runtime → write once, deploy anywhere.
Key Features:
- Hardware Agnostic: CPU, GPU, NPU, TPU support
- Framework Interop: PyTorch, TensorFlow, Scikit-learn
- Optimizations: Graph optimization, quantization
- Language Bindings: Python, C++, C#, Java, JavaScript
- Mobile Support: iOS, Android deployment
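The snippets below assume a `model.onnx` file already exists. A minimal sketch of producing one from a PyTorch model with `torch.onnx.export` (model choice and opset are arbitrary):

```python
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # allow variable batch size
    opset_version=17,
)
```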
```python
# ONNX Runtime inference
import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
result = session.run(None, {input_name: input_data})
output = result[0]

# Quantize for edge deployment
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
```
```python
# Production ONNX Runtime deployment
import onnxruntime as ort
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Global session with optimizations
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
session_options.intra_op_num_threads = 4

providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
    }),
    'CPUExecutionProvider'
]

session = ort.InferenceSession(
    "optimized_model.onnx",
    sess_options=session_options,
    providers=providers
)

# Model warmup
input_name = session.get_inputs()[0].name
warmup_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
for _ in range(10):
    session.run(None, {input_name: warmup_data})

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    input_tensor = np.array(data['input'], dtype=np.float32)

    # Run inference
    result = session.run(None, {input_name: input_tensor})

    return jsonify({
        'prediction': result[0].tolist(),
        'model_version': 'v1.0'
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000, threaded=True)
```
Deployment Scenarios:
- Cloud: Azure ML, AWS SageMaker, GCP AI Platform
- Edge: IoT devices, Raspberry Pi, NVIDIA Jetson
- Mobile: iOS CoreML, Android NNAPI
- Browser: ONNX.js, WebAssembly, WebGL
🔱 NVIDIA Triton - Multi-Framework Server
What is Triton Inference Server?
Meaning: Open-source inference serving software that standardizes AI model deployment across any infrastructure.
Example: Company serves PyTorch, TensorFlow, and ONNX models → single Triton server handles all → dynamic batching increases throughput 10x.
Key Features:
- Multi-Framework: TensorFlow, PyTorch, ONNX, TensorRT
- Dynamic Batching: Automatic request batching
- Model Ensembles: Pipeline multiple models
- Model Repository: Centralized model management
- Concurrent Execution: Multiple model versions
```protobuf
# Triton model configuration (config.pbtxt)
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Dynamic batching configuration
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 500
}

# Instance group for multi-GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```
```python
# Python client for Triton
import tritonclient.http as httpclient
import numpy as np

# Create client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Check server health
print(triton_client.is_server_live())
print(triton_client.get_model_repository_index())

# Prepare inputs
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = []
inputs.append(httpclient.InferInput("INPUT__0", input_data.shape, "FP32"))
inputs[0].set_data_from_numpy(input_data)

# Prepare outputs
outputs = []
outputs.append(httpclient.InferRequestedOutput("OUTPUT__0"))

# Run inference
results = triton_client.infer(
    model_name="resnet50",
    inputs=inputs,
    outputs=outputs
)

# Get results
output_data = results.as_numpy("OUTPUT__0")
print(f"Prediction: {np.argmax(output_data)}")
```
Advanced Capabilities:
- Model Ensembles: Chain preprocessing, inference, and postprocessing
- Backend Support: Custom backends for any framework
- Performance Analyzer: Built-in benchmarking tools
- Kubernetes Native: Helm charts and operators
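Beyond inference, the same Python client can manage the model repository when the server runs with explicit model control (started with `--model-control-mode=explicit`); a short sketch:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load or refresh a model from the repository, inspect it, then unload it
client.load_model("resnet50")
print(client.get_model_metadata("resnet50"))
print(client.get_model_config("resnet50"))
client.unload_model("resnet50")
```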
📊 Framework Comparison & Selection Guide
Comprehensive Framework Analysis
Detailed Framework Comparison:
| Framework | Best Use Case | Performance | Hardware Support | Setup Complexity | Ecosystem |
|---|---|---|---|---|---|
| vLLM | Large Language Models | Excellent | NVIDIA GPUs | Low | OpenAI Compatible |
| Ray Serve | Complex ML Pipelines | Excellent | CPU/GPU Hybrid | Medium | Ray Ecosystem |
| TensorRT | Ultra-Low Latency | Outstanding | NVIDIA Only | High | NVIDIA Suite |
| ONNX Runtime | Cross-Platform Deployment | Good | Universal | Low | Microsoft Backed |
| Triton Server | Multi-Model Production | Excellent | Any | Medium | NVIDIA Enterprise |
Performance Benchmark Summary:
| Metric | vLLM | Ray Serve | TensorRT | ONNX Runtime | Triton |
|---|---|---|---|---|---|
| Throughput (RPS) | 2000+ (LLMs) | 10,000+ | 5000+ | 3000+ | 8000+ |
| Latency (P95) | 50-200ms | 10-50ms | <10ms | 20-100ms | <20ms |
| Memory Efficiency | Excellent | Good | Outstanding | Good | Excellent |
| Scaling | Vertical | Horizontal | Limited | Moderate | Horizontal |
Decision Framework
Choose Based On:
- Model Type: LLMs → vLLM, Computer Vision → TensorRT
- Latency Requirements: <10ms → TensorRT, <100ms → Others
- Throughput Needs: High → vLLM/Triton, Moderate → ONNX/Ray
- Infrastructure: Multi-cloud → ONNX, NVIDIA → TensorRT/vLLM
- Team Expertise: Simple → vLLM/ONNX, Complex → Ray/Triton
Production Integration Patterns:
```python
# FastAPI + vLLM for LLM serving
from fastapi import FastAPI
from vllm import LLM

app = FastAPI()
llm = LLM(model="meta-llama/Llama-2-7b-hf")

@app.post("/generate")
async def generate(prompt: str):
    outputs = llm.generate([prompt])
    return {"response": outputs[0].outputs[0].text}
```

```yaml
# Kubernetes + Triton for multi-model serving
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-repository
              mountPath: /models
      # volume definition for model-repository (e.g. a PVC) omitted for brevity
```
Performance Champions:
- 🏆 Throughput King: vLLM (24x faster than baseline)
- ⚡ Lowest Latency: TensorRT (70% latency reduction)
- 🔧 Most Flexible: Ray Serve (any model, any scale)
- 🌐 Best Compatibility: ONNX Runtime (runs everywhere)
- 🏢 Enterprise Ready: Triton Server (production features)
✅ Production Best Practices & Common Pitfalls
Production Deployment Strategy
🏗️ Architecture Patterns
- Dedicated Model Server: Separate inference service with HTTP/gRPC API
- Sidecar Pattern: Model container alongside application container
- Gateway Pattern: Central router directing requests to multiple models (sketched after this list)
- Ensemble Pattern: Multiple models with voting or averaging logic
- Pipeline Pattern: Sequential model chain for complex workflows
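As a rough sketch of the gateway pattern mentioned above (backend URLs are hypothetical, and a production router would add auth, retries, and per-backend timeouts), a small FastAPI service that forwards requests to per-model backends:

```python
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical mapping from model name to its serving endpoint
MODEL_BACKENDS = {
    "sentiment": "http://sentiment-svc:8000/predict",
    "resnet50": "http://vision-svc:8000/predict",
}

@app.post("/predict/{model_name}")
async def route(model_name: str, payload: dict):
    backend = MODEL_BACKENDS.get(model_name)
    if backend is None:
        raise HTTPException(status_code=404, detail=f"unknown model: {model_name}")
    async with httpx.AsyncClient(timeout=5.0) as client:
        resp = await client.post(backend, json=payload)
    resp.raise_for_status()
    return resp.json()
```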
⚠️ Production Checklist
- ✅ Performance Profiling: Identify bottlenecks before optimization
- ✅ Batch Size Optimization: Balance latency vs throughput
- ✅ Memory Management: Configure GPU memory pooling
- ✅ Request Queuing: Handle traffic spikes gracefully
- ✅ Health Monitoring: Comprehensive metrics and alerts
- ✅ Model Versioning: Blue-green deployment strategy
- ✅ Failover Testing: Graceful degradation on failures
- ✅ Security: Authentication, rate limiting, input validation
- ✅ Logging: Request tracing and audit trails
```python
# Production monitoring setup
import time
from prometheus_client import Counter, Histogram, Gauge

# Metrics
REQUEST_COUNT = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model', 'status']
)
REQUEST_LATENCY = Histogram(
    'inference_duration_seconds',
    'Inference request duration',
    ['model']
)
GPU_MEMORY = Gauge(
    'gpu_memory_usage_bytes',
    'GPU memory usage',
    ['device']
)

# Monitoring decorator
def monitor_inference(model_name):
    def decorator(func):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            status = 'success'
            try:
                return func(*args, **kwargs)
            except Exception:
                status = 'error'
                raise
            finally:
                duration = time.time() - start_time
                REQUEST_COUNT.labels(model=model_name, status=status).inc()
                REQUEST_LATENCY.labels(model=model_name).observe(duration)
        return wrapper
    return decorator

# Usage
@monitor_inference('resnet50')
def predict_image(image_data):
    # Your inference logic
    pass
```
🚨 Common Pitfalls to Avoid
- Cold Start Issues: Not warming up models before serving traffic (see the sketch after this list)
- Memory Leaks: Improper tensor cleanup in long-running services
- Premature Optimization: Optimizing without proper profiling
- Missing Timeouts: No request timeout configuration
- Stateful Dependencies: Models depending on external state
- Insufficient Error Handling: Poor graceful degradation
- Version Management: No rollback strategy for model updates
- Resource Limits: Not setting proper CPU/GPU/memory limits
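Two of these pitfalls, cold starts and missing timeouts, are cheap to address. A sketch that warms the model before the service reports ready and bounds per-request time; `model` stands in for whatever inference callable you load:

```python
import asyncio
import numpy as np

REQUEST_TIMEOUT_S = 2.0

class Predictor:
    def __init__(self, model):
        self.model = model      # any callable: ONNX Runtime wrapper, torch module, etc.
        self.ready = False

    def warmup(self, n_iters=10):
        # Push dummy inputs through the model so kernels, caches, and memory
        # pools are initialized before real traffic arrives (shape is a placeholder)
        dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
        for _ in range(n_iters):
            self.model(dummy)
        self.ready = True       # health checks should only pass after this point

    async def predict(self, x):
        # Bound inference time so one stuck request cannot hold a worker forever
        return await asyncio.wait_for(
            asyncio.to_thread(self.model, x), timeout=REQUEST_TIMEOUT_S
        )
```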
Scaling Strategies
Horizontal vs Vertical Scaling:
- Horizontal: Add more replicas (Ray Serve, Triton)
- Vertical: Bigger GPUs, more memory (vLLM, TensorRT)
- Hybrid: Combine both approaches for optimal cost/performance
Auto-scaling Metrics:
- Request Queue Length: Scale up when queue grows
- Response Time: P95 latency thresholds
- Resource Utilization: GPU/CPU usage targets
- Business Metrics: Revenue impact of latency
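A toy sketch of how these signals might drive a scaling decision (thresholds are invented for illustration; in practice Ray Serve autoscaling or a Kubernetes HPA implements this loop for you):

```python
import numpy as np

def desired_replicas(current, queue_len, recent_latencies_s,
                     max_queue_per_replica=10, p95_target_s=0.1):
    # Scale up when the queue backs up or tail latency exceeds the target
    p95 = float(np.percentile(recent_latencies_s, 95)) if recent_latencies_s else 0.0
    if queue_len > max_queue_per_replica * current or p95 > p95_target_s:
        return current + 1
    # Scale down when both signals sit comfortably below target
    if queue_len < max_queue_per_replica * (current - 1) and p95 < 0.5 * p95_target_s:
        return max(1, current - 1)
    return current

print(desired_replicas(current=2, queue_len=45, recent_latencies_s=[0.08, 0.12, 0.2]))
```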