Serving Frameworks

Part of Module 4: AI in Production

⚡ vLLM - High-Performance LLM Serving

What is vLLM?

Meaning: Fast and easy-to-use library for LLM inference and serving, optimized for high throughput.
Example: Company serves open-source LLMs with up to 24x higher throughput using vLLM's PagedAttention → cuts KV-cache memory waste by roughly 50%.

Key Features:

  • PagedAttention: Efficient memory management
  • Continuous Batching: Dynamic request batching
  • Quantization: INT4/INT8 support
  • Tensor Parallelism: Multi-GPU serving
  • OpenAI Compatible: Drop-in replacement API (see the server sketch after the snippet below)
# vLLM offline batch inference
from vllm import LLM, SamplingParams

# Initialize model
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,  # Use 2 GPUs
    dtype="half"  # FP16 for efficiency
)

# Batch inference
prompts = ["Tell me about AI", "What is ML?"]
sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=100
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
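
The same engine can also be exposed through vLLM's OpenAI-compatible HTTP server, so existing OpenAI client code only needs a new base URL. A minimal sketch, assuming the server was started with: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf (default port 8000):

# Query vLLM's OpenAI-compatible endpoint (server assumed on localhost:8000)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    prompt="Tell me about AI",
    max_tokens=100,
    temperature=0.8
)
print(completion.choices[0].text)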

Performance Benchmarks:

  • Up to 24x higher throughput than HuggingFace Transformers
  • Roughly 2.2x higher throughput than TGI (Text Generation Inference)
  • About 50% less GPU memory thanks to PagedAttention's near-zero KV-cache waste
  • Can serve 100+ concurrent requests on a single A100

🎯 Ray Serve - Scalable ML Serving

What is Ray Serve?

Meaning: Scalable model serving library built on Ray, supporting complex ML pipelines and multi-model deployments.
Example: E-commerce platform serves ensemble of 5 models → Ray Serve handles load balancing → auto-scales from 10 to 1000 QPS.

Key Capabilities:

  • Framework Agnostic: PyTorch, TensorFlow, Scikit-learn
  • Composition: Chain multiple models into pipelines (see the sketch after the snippet below)
  • Auto-scaling: Dynamic replica management
  • Batching: Automatic request batching
  • A/B Testing: Traffic splitting
# Ray Serve deployment
import ray
from ray import serve
import torch

# Define deployment
@serve.deployment(
    num_replicas=3,
    ray_actor_options={"num_gpus": 1}
)
class ModelServer:
    def __init__(self):
        # Load the model once per replica
        self.model = torch.load("model.pt")
        self.model.eval()

    async def __call__(self, request):
        data = await request.json()
        inputs = torch.tensor(data["input"])  # JSON payload -> tensor
        with torch.no_grad():
            prediction = self.model(inputs)
        return {"prediction": prediction.tolist()}

# Deploy
serve.run(ModelServer.bind())
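
Composition and auto-scaling hang off the same decorator. A minimal sketch with hypothetical stage names (Preprocessor, Predictor, Pipeline), using Ray Serve 2.x-style deployment handles:

# Two-stage pipeline composed behind a single ingress deployment
import torch
from ray import serve

@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 10}  # replicas follow load
)
class Preprocessor:
    def transform(self, data: dict) -> list:
        return [float(v) for v in data["input"]]  # placeholder feature engineering

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Predictor:
    def __init__(self):
        self.model = torch.load("model.pt")  # placeholder model file
        self.model.eval()

    def predict(self, features: list) -> list:
        with torch.no_grad():
            return self.model(torch.tensor(features)).tolist()

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor, predictor):
        # Bound deployments are received as deployment handles
        self.preprocessor = preprocessor
        self.predictor = predictor

    async def __call__(self, request):
        data = await request.json()
        features = await self.preprocessor.transform.remote(data)
        return {"prediction": await self.predictor.predict.remote(features)}

serve.run(Pipeline.bind(Preprocessor.bind(), Predictor.bind()))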

Use Cases:

  • Multi-model ensembles
  • Real-time feature engineering
  • Model pipeline orchestration
  • Hybrid CPU-GPU workloads

🚀 TensorRT - NVIDIA GPU Optimization

What is TensorRT?

Meaning: NVIDIA's high-performance deep learning inference library that optimizes models for deployment on NVIDIA GPUs.
Example: Computer vision model runs 5x faster after TensorRT optimization → latency drops from 50ms to 10ms on T4 GPU.

Optimization Techniques:

  • Layer Fusion: Combines ops to reduce memory bandwidth
  • Precision Calibration: INT8/FP16 quantization
  • Kernel Auto-tuning: Platform-specific optimization
  • Dynamic Tensor Memory: Efficient memory reuse
  • Multi-Stream Execution: Parallel inference
# TensorRT optimization via torch2trt
import torch
import torchvision
import torch2trt

# Load a pretrained ResNet-50 and an example input on the GPU
model = torchvision.models.resnet50(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# Convert to a TensorRT engine with FP16 precision
model_trt = torch2trt.torch2trt(
    model,
    [x],
    fp16_mode=True,
    max_batch_size=32
)

# Benchmark (synchronize so GPU work is fully counted)
import time
warmup = 50
iterations = 1000

for _ in range(warmup):
    y_trt = model_trt(x)
torch.cuda.synchronize()

t0 = time.time()
for _ in range(iterations):
    y_trt = model_trt(x)
torch.cuda.synchronize()
t1 = time.time()

print(f"TensorRT FPS: {iterations / (t1 - t0):.1f}")
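
torch2trt is one route; a model exported to ONNX can also be compiled directly with the TensorRT Python API. A rough sketch, assuming a TensorRT 8.x-style API and an existing model.onnx (file names are placeholders):

# Build a serialized TensorRT engine from an ONNX file
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where supported

engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)  # deserialize later with trt.Runtime for inference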

Performance Gains:

  • ResNet-50: ~5.4x speedup with INT8 (vs. FP32)
  • BERT: ~4.2x speedup with FP16
  • GPT-2: ~3.7x speedup
  • Often 70%+ lower latency than the unoptimized FP32 baseline

🔄 ONNX Runtime - Cross-Platform Inference

What is ONNX Runtime?

Meaning: Cross-platform, high-performance ML inference engine for ONNX (Open Neural Network Exchange) models.
Example: Same model runs on cloud (GPU), edge devices (CPU), and mobile (NPU) using ONNX Runtime → write once, deploy anywhere.

Key Features:

  • Hardware Agnostic: CPU, GPU, NPU, TPU support
  • Framework Interop: PyTorch, TensorFlow, Scikit-learn (see the export sketch after the snippet below)
  • Optimizations: Graph optimization, quantization
  • Language Bindings: Python, C++, C#, Java, JavaScript
  • Mobile Support: iOS, Android deployment
# ONNX Runtime inference
import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Prepare input
input_name = session.get_inputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run inference
result = session.run(None, {input_name: input_data})
output = result[0]

# Quantize for edge deployment
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8
)
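
The model.onnx file used above has to be produced first. A minimal PyTorch export sketch (model choice and shapes are illustrative):

# Export a PyTorch model to ONNX so ONNX Runtime can load it
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # variable batch
    opset_version=17
)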

Deployment Scenarios:

  • Cloud: Azure ML, AWS SageMaker
  • Edge: IoT devices, Raspberry Pi
  • Mobile: iOS CoreML, Android NNAPI
  • Browser: ONNX.js, WebAssembly

📊 Framework Comparison

Choosing the Right Framework

Decision Matrix:

Framework    | Best For              | Hardware    | Complexity
vLLM         | LLMs, Text Generation | NVIDIA GPUs | Low
Ray Serve    | Complex Pipelines     | CPU/GPU Mix | Medium
TensorRT     | Low Latency           | NVIDIA Only | High
ONNX Runtime | Cross-Platform        | Any         | Low

Relative Strengths:

  • Throughput King: vLLM (for LLMs)
  • Lowest Latency: TensorRT
  • Most Flexible: Ray Serve
  • Best Compatibility: ONNX Runtime

Integration Examples:

  • FastAPI + vLLM: REST API for LLMs (sketch after this list)
  • Kubernetes + Ray Serve: Scalable ML platform
  • Triton + TensorRT: Multi-model server
  • Azure ML + ONNX: Cloud deployment
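
As a sketch of the FastAPI + vLLM pairing (endpoint name and request schema are made up; a production setup would more likely use vLLM's async engine or its built-in OpenAI-compatible server):

# Minimal FastAPI wrapper around an in-process vLLM engine
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/Llama-2-7b-hf")  # loaded once at process startup

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/generate")
def generate(req: GenerateRequest):
    params = SamplingParams(temperature=0.8, max_tokens=req.max_tokens)
    outputs = llm.generate([req.prompt], params)
    return {"completion": outputs[0].outputs[0].text}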

✅ Best Practices

Production Serving Guidelines

Architecture Patterns:

  • Model Server: Dedicated inference service
  • Sidecar Pattern: Model alongside application
  • Gateway Pattern: Central routing to models (sketch after this list)
  • Ensemble Pattern: Multiple models voting
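
For the gateway pattern, a rough sketch of a single entry point routing to per-model backends (service URLs and route names are placeholders):

# Gateway pattern: one public endpoint fans requests out to model services
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

MODEL_ENDPOINTS = {  # placeholder backend URLs
    "llm": "http://vllm-service:8000/v1/completions",
    "vision": "http://onnx-service:9000/predict",
}

@app.post("/predict/{model_name}")
async def route(model_name: str, payload: dict):
    url = MODEL_ENDPOINTS.get(model_name)
    if url is None:
        raise HTTPException(status_code=404, detail="unknown model")
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.post(url, json=payload)
    resp.raise_for_status()
    return resp.json()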

Optimization Checklist:

  • ✅ Profile model bottlenecks first
  • ✅ Use appropriate batch sizes
  • ✅ Enable GPU memory pooling
  • ✅ Implement request queuing
  • ✅ Set up health checks and monitoring
  • ✅ Plan for model versioning
  • ✅ Test failover scenarios

Common Pitfalls:

  • Not warming up models before serving (see the warm-up sketch after this list)
  • Ignoring memory leaks in long-running services
  • Over-optimizing without profiling
  • Missing request timeout configuration
  • No strategy for model updates
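
Two of these pitfalls (no warm-up, no timeouts) are cheap to avoid. A rough sketch with a placeholder model and input shape:

# Warm up before taking traffic, and bound how long a single request may run
import asyncio
import torch

model = torch.load("model.pt")  # placeholder model file
model.eval()

def warm_up(n_iters: int = 10):
    # Dummy forward passes trigger lazy CUDA init / kernel selection up front
    dummy = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        for _ in range(n_iters):
            model(dummy)

async def predict_with_timeout(batch: torch.Tensor, timeout_s: float = 1.0):
    # Fail fast instead of letting slow requests back up the queue
    def _infer():
        with torch.no_grad():
            return model(batch)
    loop = asyncio.get_running_loop()
    return await asyncio.wait_for(loop.run_in_executor(None, _infer), timeout=timeout_s)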
