Serving Frameworks

Part of Module 4: AI in Production

Model serving frameworks are the backbone of production AI systems, handling everything from inference optimization to scalable deployment. This comprehensive guide explores the leading frameworks, their unique strengths, and when to use each one for maximum performance and reliability in your ML pipeline.

Model Serving Architecture Landscape

[Architecture diagram: client applications (web apps, mobile apps, APIs, microservices, batch jobs) reach serving frameworks through a load balancer / API gateway handling request routing and rate limiting. Frameworks shown: vLLM (PagedAttention, continuous batching, OpenAI-compatible), Ray Serve (multi-model, auto-scaling, A/B testing, pipeline composition), TensorRT (layer fusion, INT8/FP16, NVIDIA GPUs), ONNX Runtime (framework-agnostic, edge/mobile), and Triton Server (multi-framework, model ensembles, dynamic batching, multi-GPU), all on shared infrastructure (Kubernetes, Docker, cloud GPUs, monitoring, storage, networking). Deployment patterns: sidecar, dedicated server, API gateway, ensemble.]

Core Serving Concepts

Model Serving transforms trained models into production-ready inference endpoints that handle real-world traffic at scale.

Key Performance Metrics:

  • Throughput: Requests processed per second (RPS/QPS)
  • Latency: Time from request to response (P50, P95, P99; see the measurement sketch after this list)
  • Resource Utilization: GPU/CPU/Memory efficiency
  • Scalability: Ability to handle traffic spikes
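
These metrics can be computed directly from per-request timings. A minimal measurement sketch (synthetic latencies, not real benchmark numbers):

# Computing P50/P95/P99 latency and approximate throughput from request timings
import numpy as np

# Synthetic per-request latencies in seconds; replace with real measurements
latencies = np.random.lognormal(mean=-3.0, sigma=0.5, size=10_000)

p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
throughput_rps = len(latencies) / latencies.sum()  # single serial worker; scale by concurrency

print(f"P50: {p50 * 1000:.1f} ms  P95: {p95 * 1000:.1f} ms  P99: {p99 * 1000:.1f} ms")
print(f"Approx. throughput: {throughput_rps:.0f} RPS")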

Optimization Strategies:

  • Batching: Process multiple requests together (a minimal sketch follows this list)
  • Quantization: Reduce model precision (FP16, INT8)
  • Caching: Store frequent results
  • Model Compilation: Optimize computation graphs
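
To make batching and caching concrete, here is a minimal micro-batcher with a result cache. The run_model_cached function and the timing constants are placeholders; production frameworks such as vLLM and Triton implement far more sophisticated continuous/dynamic batching:

# Minimal micro-batching sketch with a result cache (placeholder model call)
import asyncio
from functools import lru_cache

@lru_cache(maxsize=1024)
def run_model_cached(prompt: str) -> str:
    # Stand-in for an expensive model call; repeated inputs are served from the cache
    return prompt.upper()

class MicroBatcher:
    def __init__(self, max_batch: int = 8, max_wait_s: float = 0.01):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s

    async def submit(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def worker(self):
        while True:
            batch = [await self.queue.get()]
            try:
                # Keep collecting until the batch is full or the wait window expires
                while len(batch) < self.max_batch:
                    batch.append(await asyncio.wait_for(self.queue.get(), self.max_wait_s))
            except asyncio.TimeoutError:
                pass
            # A real server would run the whole batch through the model in one call
            for prompt, fut in batch:
                fut.set_result(run_model_cached(prompt))

async def main():
    batcher = MicroBatcher()
    asyncio.create_task(batcher.worker())
    print(await asyncio.gather(*(batcher.submit(p) for p in ["hi", "hello", "hi"])))

asyncio.run(main())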

⚡ vLLM - High-Performance LLM Serving

What is vLLM?

Meaning: Fast and easy-to-use library for LLM inference and serving, optimized for high throughput.
Example: Company serves GPT models 24x faster using vLLM's PagedAttention → reduces GPU memory usage by 50%.

Key Features:

  • PagedAttention: Efficient memory management
  • Continuous Batching: Dynamic request batching
  • Quantization: INT4/INT8 support
  • Tensor Parallelism: Multi-GPU serving
  • OpenAI Compatible: Drop-in replacement API
# vLLM server setup
from vllm import LLM, SamplingParams

# Initialize model
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    dtype="half"  # FP16 for efficiency
)

# Batch inference
prompts = ["Tell me about AI", "What is ML?"]
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=100
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Performance Benchmarks:

  • 24x throughput vs Hugging Face Transformers
  • 2.2x throughput vs TGI
  • 50% GPU memory reduction
# Production vLLM deployment with OpenAI API
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 4 \
    --dtype half \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95 \
    --host 0.0.0.0 \
    --port 8000

# Client usage (drop-in OpenAI replacement, openai>=1.0 client)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    max_tokens=512,
    temperature=0.7
)
print(response.choices[0].message.content)

🎯 Ray Serve - Scalable ML Serving

What is Ray Serve?

Meaning: Scalable model serving library built on Ray, supporting complex ML pipelines and multi-model deployments.
Example: E-commerce platform serves ensemble of 5 models → Ray Serve handles load balancing → auto-scales from 10 to 1000 QPS.

Key Capabilities:

  • Framework Agnostic: PyTorch, TensorFlow, Scikit-learn
  • Composition: Chain multiple models
  • Auto-scaling: Dynamic replica management
  • Batching: Automatic request batching
  • A/B Testing: Traffic splitting
# Ray Serve deployment
import ray
from ray import serve
import torch

# Define deployment
@serve.deployment(
    num_replicas=3,
    ray_actor_options={"num_gpus": 1}
)
class ModelServer:
    def __init__(self):
        self.model = torch.load("model.pt")
    
    async def __call__(self, request):
        data = await request.json()
        with torch.no_grad():
            prediction = self.model(torch.tensor(data["input"]))
        return {"prediction": prediction.tolist()}

# Deploy
serve.run(ModelServer.bind())
# Advanced Ray Serve with model composition
from ray import serve
import torch

# Preprocessing deployment
@serve.deployment(num_replicas=2)
class Preprocessor:
    def preprocess(self, text):
        return text.lower().strip()

# Model deployment with auto-scaling
@serve.deployment(
    num_replicas=1,
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5,
    },
    ray_actor_options={"num_gpus": 1},
)
class ModelServer:
    def __init__(self):
        self.model = torch.load("sentiment_model.pt")
        self.model.eval()
    
    async def predict(self, text):
        # Model inference logic
        with torch.no_grad():
            prediction = self.model(text)
        return prediction.item()

# Pipeline composition
@serve.deployment
class Pipeline:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model
    
    async def __call__(self, request):
        data = await request.json()
        
        # Preprocess
        clean_text = await self.preprocessor.preprocess.remote(
            data["text"]
        )
        
        # Predict
        prediction = await self.model.predict.remote(clean_text)
        
        return {"sentiment": prediction}

# Deploy pipeline
preprocessor = Preprocessor.bind()
model = ModelServer.bind()
pipeline = Pipeline.bind(preprocessor, model)

serve.run(pipeline, host="0.0.0.0", port=8000)

Advanced Features:

  • Multi-model ensembles with weighted voting
  • Real-time feature engineering pipelines
  • A/B testing with traffic splitting (see the router sketch after this list)
  • Hybrid CPU-GPU workloads optimization
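
As a sketch of the A/B traffic-splitting item, a minimal router built from deployment handles (the split ratio and the dummy models are hypothetical):

# Minimal A/B traffic-splitting router using Ray Serve deployment handles
import random
from ray import serve

@serve.deployment
class ModelA:
    async def __call__(self, text: str) -> str:
        return f"A: {text}"

@serve.deployment
class ModelB:
    async def __call__(self, text: str) -> str:
        return f"B: {text}"

@serve.deployment
class ABRouter:
    def __init__(self, model_a, model_b, split: float = 0.9):
        self.model_a = model_a  # handle to the control model
        self.model_b = model_b  # handle to the candidate model
        self.split = split      # fraction of traffic routed to A

    async def __call__(self, request):
        text = (await request.json())["text"]
        handle = self.model_a if random.random() < self.split else self.model_b
        return {"result": await handle.remote(text)}

serve.run(ABRouter.bind(ModelA.bind(), ModelB.bind()))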

🚀 TensorRT - NVIDIA GPU Optimization

What is TensorRT?

Meaning: NVIDIA's high-performance deep learning inference library that optimizes models for deployment on NVIDIA GPUs.
Example: Computer vision model runs 5x faster after TensorRT optimization → latency drops from 50ms to 10ms on T4 GPU.

Optimization Techniques:

  • Layer Fusion: Combines ops to reduce memory bandwidth
  • Precision Calibration: INT8/FP16 quantization
  • Kernel Auto-tuning: Platform-specific optimization
  • Dynamic Tensor Memory: Efficient memory reuse
  • Multi-Stream Execution: Parallel inference
# TensorRT optimization
import tensorrt as trt
import torch
import torchvision
import torch2trt

# Convert PyTorch model to TensorRT
model = torchvision.models.resnet50(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# Optimize with FP16
model_trt = torch2trt.torch2trt(
    model, 
    [x],
    fp16_mode=True,
    max_batch_size=32
)

# Benchmark
import time
warmup = 50
iterations = 1000

for _ in range(warmup):
    y_trt = model_trt(x)

t0 = time.time()
for _ in range(iterations):
    y_trt = model_trt(x)
t1 = time.time()

print(f"TensorRT FPS: {iterations / (t1 - t0):.1f}")
# TensorRT-LLM for large language models (illustrative sketch; the exact builder and
# runtime APIs vary by TensorRT-LLM version -- consult the official docs)
import tensorrt_llm
from tensorrt_llm.runtime import Session, TensorInfo

# Build optimized engine
def build_engine(model_path, engine_path):
    from tensorrt_llm.builder import Builder
    
    builder = Builder()
    network = builder.create_network()
    
    # Configure optimization
    config = builder.create_builder_config(
        max_batch_size=32,
        max_seq_len=2048,
        precision='fp16',  # or 'int8' for quantization
        use_gpt_attention_plugin=True,
        use_gemm_plugin=True
    )
    
    # Build engine
    engine = builder.build_engine(network, config)
    engine.save(engine_path)
    
# Runtime inference
class TensorRTInference:
    def __init__(self, engine_path):
        self.session = Session.from_serialized_engine(
            engine_path
        )
    
    def generate(self, input_ids, max_new_tokens=100):
        inputs = {
            'input_ids': input_ids,
            'max_new_tokens': max_new_tokens,
            'temperature': 1.0,
            'top_p': 0.9
        }
        
        outputs = self.session.run(inputs)
        return outputs['output_ids']

# Usage
inference = TensorRTInference("llama_fp16.engine")
result = inference.generate(input_tokens)
print(f"Generated: {result}")

Performance Gains by Model:

  • 5.4x speedup: ResNet-50 (INT8)
  • 4.2x speedup: BERT (FP16)
  • 70% latency reduction

🔄 ONNX Runtime - Cross-Platform Inference

What is ONNX Runtime?

Meaning: Cross-platform, high-performance ML inference engine for ONNX (Open Neural Network Exchange) models.
Example: Same model runs on cloud (GPU), edge devices (CPU), and mobile (NPU) using ONNX Runtime → write once, deploy anywhere.
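
The "write once" step is exporting the trained model to ONNX. A minimal PyTorch export sketch (model choice and shapes are illustrative):

# Exporting a PyTorch model to ONNX (illustrative model and shapes)
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # variable batch size
    opset_version=17,
)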

Key Features:

  • Hardware Agnostic: CPU, GPU, NPU, TPU support
  • Framework Interop: PyTorch, TensorFlow, Scikit-learn
  • Optimizations: Graph optimization, quantization
  • Language Bindings: Python, C++, C#, Java, JavaScript
  • Mobile Support: iOS, Android deployment
# ONNX Runtime inference
import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
result = session.run(None, {input_name: input_data})
output = result[0]

# Quantize for edge deployment
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
# Production ONNX Runtime deployment
import onnxruntime as ort
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

# Global session with optimizations
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session_options.execution_mode = ort.ExecutionMode.ORT_PARALLEL
session_options.intra_op_num_threads = 4

providers = [
    ('CUDAExecutionProvider', {
        'device_id': 0,
        'arena_extend_strategy': 'kNextPowerOfTwo',
        'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
        'cudnn_conv_algo_search': 'EXHAUSTIVE',
    }),
    'CPUExecutionProvider'
]

session = ort.InferenceSession(
    "optimized_model.onnx",
    sess_options=session_options,
    providers=providers
)

# Model warmup
input_name = session.get_inputs()[0].name
warmup_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
for _ in range(10):
    session.run(None, {input_name: warmup_data})

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    input_tensor = np.array(data['input'], dtype=np.float32)
    
    # Run inference
    result = session.run(None, {input_name: input_tensor})
    
    return jsonify({
        'prediction': result[0].tolist(),
        'model_version': 'v1.0'
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000, threaded=True)

Deployment Scenarios:

  • Cloud: Azure ML, AWS SageMaker, GCP AI Platform
  • Edge: IoT devices, Raspberry Pi, NVIDIA Jetson
  • Mobile: iOS CoreML, Android NNAPI
  • Browser: ONNX.js, WebAssembly, WebGL

🔱 NVIDIA Triton - Multi-Framework Server

What is Triton Inference Server?

Meaning: Open-source inference serving software that standardizes AI model deployment across any infrastructure.
Example: Company serves PyTorch, TensorFlow, and ONNX models → single Triton server handles all → dynamic batching increases throughput 10x.

Key Features:

  • Multi-Framework: TensorFlow, PyTorch, ONNX, TensorRT
  • Dynamic Batching: Automatic request batching
  • Model Ensembles: Pipeline multiple models
  • Model Repository: Centralized model management
  • Concurrent Execution: Multiple model versions
# Triton model configuration
# config.pbtxt
name: "resnet50"
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Dynamic batching configuration
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 500
}

# Instance group for multi-GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
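
The config.pbtxt above lives inside Triton's versioned model repository. A minimal sketch that lays out that directory structure (file names and the Docker image tag are illustrative):

# Creating a Triton model repository layout (illustrative paths)
from pathlib import Path
import shutil

repo = Path("model_repository")
model_dir = repo / "resnet50"      # must match `name` in config.pbtxt
version_dir = model_dir / "1"      # numeric version subdirectory
version_dir.mkdir(parents=True, exist_ok=True)

# The pytorch_libtorch backend expects the file to be named model.pt
shutil.copy("traced_resnet50.pt", version_dir / "model.pt")
shutil.copy("config.pbtxt", model_dir / "config.pbtxt")

# Serve it (Docker example):
#   docker run --gpus all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
#     -v $PWD/model_repository:/models \
#     nvcr.io/nvidia/tritonserver:<yy.mm>-py3 tritonserver --model-repository=/models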
# Python client for Triton
import tritonclient.http as httpclient
import numpy as np

# Create client
triton_client = httpclient.InferenceServerClient(
    url="localhost:8000"
)

# Check server health
print(triton_client.is_server_live())
print(triton_client.get_model_repository_index())

# Prepare inputs
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = []
inputs.append(httpclient.InferInput("INPUT__0", input_data.shape, "FP32"))
inputs[0].set_data_from_numpy(input_data)

# Prepare outputs
outputs = []
outputs.append(httpclient.InferRequestedOutput("OUTPUT__0"))

# Run inference
results = triton_client.infer(
    model_name="resnet50",
    inputs=inputs,
    outputs=outputs
)

# Get results
output_data = results.as_numpy("OUTPUT__0")
print(f"Prediction: {np.argmax(output_data)}")

Advanced Capabilities:

  • Model Ensembles: Chain preprocessing, inference, and postprocessing
  • Backend Support: Custom backends for any framework
  • Performance Analyzer: Built-in benchmarking tools
  • Kubernetes Native: Helm charts and operators

📊 Framework Comparison & Selection Guide

Comprehensive Framework Analysis

Detailed Framework Comparison:

Framework | Best Use Case | Performance | Hardware Support | Setup Complexity | Ecosystem
vLLM | Large Language Models | Excellent | NVIDIA GPUs | Low | OpenAI Compatible
Ray Serve | Complex ML Pipelines | Excellent | CPU/GPU Hybrid | Medium | Ray Ecosystem
TensorRT | Ultra-Low Latency | Outstanding | NVIDIA Only | High | NVIDIA Suite
ONNX Runtime | Cross-Platform Deployment | Good | Universal | Low | Microsoft Backed
Triton Server | Multi-Model Production | Excellent | Any | Medium | NVIDIA Enterprise

Performance Benchmark Summary:

Metric | vLLM | Ray Serve | TensorRT | ONNX Runtime | Triton
Throughput (RPS) | 2000+ (LLMs) | 10,000+ | 5000+ | 3000+ | 8000+
Latency (P95) | 50-200ms | 10-50ms | <10ms | 20-100ms | <20ms
Memory Efficiency | Excellent | Good | Outstanding | Good | Excellent
Scaling | Vertical | Horizontal | Limited | Moderate | Horizontal

Decision Framework

Choose Based On:

  • Model Type: LLMs → vLLM, Computer Vision → TensorRT
  • Latency Requirements: <10ms → TensorRT, <100ms → Others
  • Throughput Needs: High → vLLM/Triton, Moderate → ONNX/Ray
  • Infrastructure: Multi-cloud → ONNX, NVIDIA → TensorRT/vLLM
  • Team Expertise: Simple → vLLM/ONNX, Complex → Ray/Triton (a toy chooser sketch follows this list)
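
The rules above can be compressed into a toy chooser; purely illustrative heuristics, not an authoritative decision procedure:

# Toy framework chooser mirroring the heuristics above (illustrative only)
def choose_framework(model_type: str, latency_budget_ms: float, nvidia_only: bool) -> str:
    if model_type == "llm":
        return "vLLM"
    if latency_budget_ms < 10 and nvidia_only:
        return "TensorRT"
    if not nvidia_only:
        return "ONNX Runtime"
    return "Ray Serve or Triton"

print(choose_framework("vision", latency_budget_ms=8, nvidia_only=True))  # TensorRT
print(choose_framework("llm", latency_budget_ms=100, nvidia_only=True))   # vLLM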

Production Integration Patterns:

# FastAPI + vLLM for LLM serving
from fastapi import FastAPI
from vllm import LLM

app = FastAPI()
llm = LLM(model="meta-llama/Llama-2-7b-hf")

@app.post("/generate")
async def generate(prompt: str):
    outputs = llm.generate([prompt])
    return {"response": outputs[0].outputs[0].text}

# Kubernetes + Triton for multi-model
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:latest
        args: ["tritonserver", "--model-repository=/models"]
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-repository
          mountPath: /models
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: triton-models  # illustrative claim name

Performance Champions:

  • 🏆 Throughput King: vLLM (24x faster than baseline)
  • ⚡ Lowest Latency: TensorRT (70% latency reduction)
  • 🔧 Most Flexible: Ray Serve (any model, any scale)
  • 🌐 Best Compatibility: ONNX Runtime (runs everywhere)
  • 🏢 Enterprise Ready: Triton Server (production features)

✅ Production Best Practices & Common Pitfalls

Production Deployment Strategy

🏗️ Architecture Patterns

  • Dedicated Model Server: Separate inference service with HTTP/gRPC API
  • Sidecar Pattern: Model container alongside application container
  • Gateway Pattern: Central router directing requests to multiple models (sketched after this list)
  • Ensemble Pattern: Multiple models with voting or averaging logic
  • Pipeline Pattern: Sequential model chain for complex workflows
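
As a concrete example of the gateway pattern, a minimal FastAPI router that forwards requests to per-model backend services (service URLs and model names are hypothetical):

# Minimal API-gateway sketch: route requests to per-model backends (hypothetical URLs)
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

MODEL_BACKENDS = {
    "sentiment": "http://sentiment-svc:8000/predict",
    "resnet50": "http://vision-svc:8000/predict",
}

@app.post("/models/{model_name}/predict")
async def route(model_name: str, payload: dict):
    backend = MODEL_BACKENDS.get(model_name)
    if backend is None:
        raise HTTPException(status_code=404, detail=f"unknown model: {model_name}")
    async with httpx.AsyncClient(timeout=5.0) as client:
        resp = await client.post(backend, json=payload)
    resp.raise_for_status()
    return resp.json()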

⚠️ Production Checklist

  • Performance Profiling: Identify bottlenecks before optimization
  • Batch Size Optimization: Balance latency vs throughput
  • Memory Management: Configure GPU memory pooling
  • Request Queuing: Handle traffic spikes gracefully
  • Health Monitoring: Comprehensive metrics and alerts
  • Model Versioning: Blue-green deployment strategy
  • Failover Testing: Graceful degradation on failures
  • Security: Authentication, rate limiting, input validation
  • Logging: Request tracing and audit trails
# Production monitoring setup
import prometheus_client
from prometheus_client import Counter, Histogram, Gauge
import time

# Metrics
REQUEST_COUNT = Counter(
    'inference_requests_total',
    'Total inference requests',
    ['model', 'status']
)
REQUEST_LATENCY = Histogram(
    'inference_duration_seconds',
    'Inference request duration',
    ['model']
)
GPU_MEMORY = Gauge(
    'gpu_memory_usage_bytes',
    'GPU memory usage',
    ['device']
)

# Monitoring decorator
def monitor_inference(model_name):
    def decorator(func):
        def wrapper(*args, **kwargs):
            start_time = time.time()
            status = 'success'
            
            try:
                result = func(*args, **kwargs)
                return result
            except Exception as e:
                status = 'error'
                raise
            finally:
                duration = time.time() - start_time
                REQUEST_COUNT.labels(model=model_name, status=status).inc()
                REQUEST_LATENCY.labels(model=model_name).observe(duration)
        
        return wrapper
    return decorator

# Usage
@monitor_inference('resnet50')
def predict_image(image_data):
    # Your inference logic
    pass

🚨 Common Pitfalls to Avoid

  • Cold Start Issues: Not warming up models before serving traffic (warmup and timeouts are sketched after this list)
  • Memory Leaks: Improper tensor cleanup in long-running services
  • Premature Optimization: Optimizing without proper profiling
  • Missing Timeouts: No request timeout configuration
  • Stateful Dependencies: Models depending on external state
  • Insufficient Error Handling: Poor graceful degradation
  • Version Management: No rollback strategy for model updates
  • Resource Limits: Not setting proper CPU/GPU/memory limits
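
Two of these pitfalls, cold starts and missing timeouts, are cheap to address. A minimal sketch (the model object and limits are hypothetical):

# Model warmup at startup plus per-request timeouts (hypothetical model object)
import asyncio
import numpy as np

def warmup(model, n_iters: int = 10):
    # Run dummy batches before serving traffic to trigger lazy init, JIT, and cache fills
    dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
    for _ in range(n_iters):
        model.predict(dummy)

async def predict_with_timeout(model, x, timeout_s: float = 2.0):
    try:
        # Offload blocking inference so the event loop stays responsive
        return await asyncio.wait_for(asyncio.to_thread(model.predict, x), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Graceful degradation instead of hanging the caller
        return {"error": "inference timed out"}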

Scaling Strategies

Horizontal vs Vertical Scaling:

  • Horizontal: Add more replicas (Ray Serve, Triton)
  • Vertical: Bigger GPUs, more memory (vLLM, TensorRT)
  • Hybrid: Combine both approaches for optimal cost/performance

Auto-scaling Metrics:

  • Request Queue Length: Scale up when the queue grows (see the scaling sketch after this list)
  • Response Time: P95 latency thresholds
  • Resource Utilization: GPU/CPU usage targets
  • Business Metrics: Revenue impact of latency
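
A common way to turn these signals into a scaling decision is a proportional rule of the same shape as the Kubernetes HPA formula (Ray Serve's autoscaler similarly targets ongoing requests per replica); a minimal sketch with made-up numbers:

# Proportional autoscaling rule (same shape as the Kubernetes HPA formula)
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# Example: 3 replicas averaging 12 ongoing requests each, target of 5 per replica -> 8 replicas
print(desired_replicas(current_replicas=3, current_metric=12, target_metric=5))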
