Problem Statement & Requirements
Why Model Serving Matters
A model is only valuable when it serves predictions: ChatGPT serves over 100 million users, and Google runs billions of inferences per day. The serving platform must deliver predictions at low latency and high availability while managing dozens of model versions simultaneously.
Think of model serving like a restaurant kitchen. The training pipeline is the recipe development lab. The serving platform is the actual kitchen that must prepare dishes (predictions) on demand, quickly and consistently, for thousands of customers simultaneously.
Functional Requirements
- Model deployment — Deploy any model (PyTorch, TensorFlow, XGBoost, ONNX) with zero downtime
- Version management — Multiple versions running simultaneously
- A/B testing & canary — Route traffic percentages to different model versions
- Auto-scaling — Scale GPU/CPU instances based on load
- Monitoring — Track latency, throughput, error rates, and model-specific metrics
- Rollback — Instant rollback to previous version on degradation
Non-Functional Requirements
- Latency — <50ms p99 for lightweight models, <500ms time-to-first-token for LLMs
- Availability — 99.99% uptime with graceful degradation
- Throughput — 100K+ inferences/second per model
- Cost efficiency — Maximize GPU utilization (>70%)
Back-of-Envelope Estimation
| Parameter | Estimate |
|---|---|
| Models in production | 50-200 |
| Total inference QPS | 500K (peak: 1.5M) |
| Avg model size | 500 MB (traditional) / 7-70 GB (LLM) |
| GPU memory per A100 | 80 GB |
| Latency budget (traditional) | <50ms p99 |
| Latency budget (LLM) | <500ms time-to-first-token |
| GPU fleet size | 100-500 GPUs |
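These rows can be cross-checked with quick arithmetic; the per-GPU throughput below is an assumed figure for illustration, not a benchmark:

```python
import math

def gpus_needed(steady_qps: int, per_gpu_qps: int, headroom: float) -> int:
    """Fleet size = steady-state demand plus reserved headroom for spikes."""
    raw = steady_qps / per_gpu_qps
    return math.ceil(raw * (1 + headroom))

# Assumptions: a traditional model sustains ~2,000 batched inferences/sec
# per GPU, and we reserve 30% headroom for traffic spikes.
print(gpus_needed(500_000, 2_000, 0.3))  # 325 — within the 100-500 range
```

At peak (1.5M QPS) the remaining burst is absorbed by autoscaling and graceful degradation rather than permanently provisioned capacity.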
System API Design
```
# Synchronous inference
POST /api/v1/models/{model_name}/predict
{
  "inputs": [{ "features": [0.5, 1.2, 3.4] }],
  "params": { "version": "v3" }
}

# Streaming inference (LLMs)
POST /api/v1/models/{model_name}/generate
{
  "prompt": "Explain quantum computing",
  "max_tokens": 512,
  "stream": true
}

# Deploy a model version
POST /api/v1/deployments
{
  "model_name": "fraud_detector",
  "version": "v3.1",
  "traffic_percent": 10,
  "resources": { "gpu": "A100", "replicas": 4 }
}

# Update traffic split
PUT /api/v1/deployments/{model_name}/traffic
{ "v3.0": 50, "v3.1": 50 }
```
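For context, a client call against the synchronous endpoint might look like the following; the host name is a placeholder and the request shape mirrors the sketch above:

```python
import json
import urllib.request

# Hypothetical host; the path and body mirror the predict endpoint above.
url = "https://serving.example.com/api/v1/models/fraud_detector/predict"
payload = {
    "inputs": [{"features": [0.5, 1.2, 3.4]}],
    "params": {"version": "v3"},
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# In a live deployment:
# with urllib.request.urlopen(request, timeout=1) as resp:
#     predictions = json.load(resp)
```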
Data Model
```sql
CREATE TABLE models (
  model_name  VARCHAR PRIMARY KEY,
  description TEXT,
  framework   VARCHAR,
  owner       VARCHAR
);

CREATE TABLE deployments (
  deployment_id VARCHAR PRIMARY KEY,
  model_name    VARCHAR,
  version       VARCHAR,
  status        VARCHAR,  -- deploying, active, draining
  replicas      INT,
  gpu_type      VARCHAR,
  traffic_pct   INT,
  created_at    TIMESTAMP
);

CREATE TABLE inference_logs (
  request_id  VARCHAR,
  model_name  VARCHAR,
  version     VARCHAR,
  latency_ms  FLOAT,
  status_code INT,
  timestamp   TIMESTAMP
) PARTITION BY RANGE (timestamp);
```
High-Level Architecture
The serving platform follows a router → model server pattern with centralized control:
API Gateway / Load Balancer
Receives all inference requests. Handles authentication, rate limiting, and request validation. Routes to the appropriate model deployment based on traffic split rules.
Traffic Router
Implements A/B testing and canary logic. Deterministically assigns users to model versions using consistent hashing. Logs experiment assignments for analysis.
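A minimal sketch of the deterministic assignment described above, assuming a 10,000-bucket hash space (the bucket count and function names are illustrative):

```python
import hashlib

def assign_version(user_id: str, splits: dict[str, int]) -> str:
    """Return the model version for user_id, stable across requests.

    splits maps version name -> traffic percentage (must sum to 100).
    """
    # Hash the user into one of 10,000 buckets; same user -> same bucket
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    cumulative = 0
    for version, pct in sorted(splits.items()):
        cumulative += pct * 100  # percent -> buckets out of 10,000
        if bucket < cumulative:
            return version
    return sorted(splits)[-1]  # guard against rounding gaps

splits = {"v3.0": 50, "v3.1": 50}
# Deterministic: the same user always lands on the same version
assert assign_version("user-42", splits) == assign_version("user-42", splits)
```

Because assignment depends only on the user ID and the split table, no per-user state needs to be stored, and a user's experience does not flip between versions mid-session.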
Model Servers
Run inference on GPU/CPU. Each server loads one model version. Implements request batching to maximize GPU utilization. Health checks report model readiness.
Deployment Controller
Manages rollouts: blue-green, canary, or rolling. Monitors health metrics and auto-rolls back on degradation. Scales replicas based on autoscaling policies.
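The auto-rollback check can be sketched as a simple health comparison between canary and baseline; the thresholds and metric names are illustrative assumptions, and a production controller would evaluate them over a sliding window with statistical tests:

```python
def should_rollback(canary: dict, baseline: dict,
                    max_latency_ratio: float = 1.5,
                    max_error_delta: float = 0.01) -> bool:
    """Roll back if the canary's p99 latency or error rate regresses."""
    latency_bad = canary["p99_ms"] > baseline["p99_ms"] * max_latency_ratio
    errors_bad = canary["error_rate"] > baseline["error_rate"] + max_error_delta
    return latency_bad or errors_bad

baseline = {"p99_ms": 40, "error_rate": 0.001}
assert should_rollback({"p99_ms": 70, "error_rate": 0.001}, baseline)      # latency regression
assert not should_rollback({"p99_ms": 45, "error_rate": 0.002}, baseline)  # within tolerance
```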
Deep Dive: Core Components
Inference Modes
| Mode | Latency | Use Case | Example |
|---|---|---|---|
| Online (sync) | <50ms | User-facing requests | Fraud scoring at checkout |
| Streaming | <500ms to first token | LLM generation | Chatbot responses |
| Batch | Minutes-hours | Bulk scoring | Nightly recommendation refresh |
| Async | Seconds-minutes | Non-blocking tasks | Content moderation queue |
Request Batching
```python
import asyncio
import time

class DynamicBatcher:
    """Groups concurrent requests into batches to maximize GPU utilization."""

    def __init__(self, model, max_batch=32, max_wait_ms=5):
        self.model = model
        self.queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def submit(self, req):
        # Callers enqueue a request and await its future for the result
        req.future = asyncio.get_running_loop().create_future()
        await self.queue.put(req)
        return await req.future

    async def process_loop(self):
        while True:
            batch = []
            deadline = time.monotonic() + self.max_wait_ms / 1000
            # Collect requests until the batch is full or the deadline passes
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    req = await asyncio.wait_for(
                        self.queue.get(), timeout=remaining
                    )
                    batch.append(req)
                except asyncio.TimeoutError:
                    break
            if batch:
                # Run batch inference on GPU and resolve each caller's future
                results = await self.model.predict_batch(batch)
                for req, result in zip(batch, results):
                    req.future.set_result(result)
```
Model Optimization Techniques
- Quantization: FP32 → INT8/INT4 reduces memory 4-8x, speeds inference 2-4x with minimal accuracy loss
- Distillation: Train a smaller "student" model to mimic a larger "teacher" model
- Pruning: Remove low-magnitude weights for sparser, faster computation
- ONNX Runtime: Cross-framework optimization and hardware-specific acceleration
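To make the quantization bullet concrete, here is a toy symmetric INT8 scheme: weights are mapped to integers in [-127, 127] with one shared scale, so storage drops 4x versus FP32 and the round-trip error is bounded by half a quantization step. This is an illustrative sketch, not a production quantizer (those calibrate per-channel scales on real activations):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: one scale maps the max |weight| to 127."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.5, -1.2, 3.4, -0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Round-trip error never exceeds half a quantization step
assert max_err <= scale / 2 + 1e-9
```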
LLM-Specific: KV Cache & Speculative Decoding
LLM Serving Optimizations
KV Cache: Store key-value attention states to avoid recomputation during autoregressive generation. Trades memory for speed (can use 10-30 GB per request for long contexts).
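The memory figure can be sanity-checked with the standard KV-cache sizing formula (2 tensors per layer, one per key and value); the model shape below is a typical 7B-class configuration, assumed for illustration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: 2 (K and V) x layers x heads x head_dim x tokens x dtype."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes)
per_token = kv_cache_bytes(32, 32, 128, 1)
print(per_token // 1024)                          # 512 KiB per generated token

ctx_32k = kv_cache_bytes(32, 32, 128, 32_768)
print(ctx_32k / 2**30)                            # 16.0 GiB for one 32k request
```

At half a MiB per token, long-context requests land squarely in the 10-30 GB range quoted above, which is why cache management dominates LLM serving design.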
PagedAttention (vLLM): Manages KV cache like virtual memory pages, eliminating fragmentation and enabling 2-4x higher throughput.
Speculative Decoding: A small draft model generates candidate tokens, the large model verifies them in parallel. 2-3x faster generation.
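A toy sketch of one speculative step, with stand-in models and exact-match verification instead of the probability-ratio acceptance rule used in real systems:

```python
def speculative_step(draft_tokens, target_next_token, k=4):
    """Verify up to k draft tokens; return the accepted prefix plus one target token."""
    accepted = []
    for tok in draft_tokens[:k]:
        if tok == target_next_token(accepted):
            accepted.append(tok)   # target agrees: keep the draft token
        else:
            break                  # first disagreement ends acceptance
    # The target always contributes the next token, so each step emits
    # at least one token even when every draft token is rejected.
    accepted.append(target_next_token(accepted))
    return accepted

# Stand-in "target model": deterministic next token from a fixed sequence.
target_seq = ["the", "cat", "sat", "on", "the", "mat"]
def target_next(prefix):
    return target_seq[len(prefix)]

print(speculative_step(["the", "cat", "dog"], target_next))
# ['the', 'cat', 'sat'] — two draft tokens accepted, one target token added
```

The speedup comes from the target model verifying all k candidates in one parallel forward pass instead of k sequential ones.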
Scaling & Optimization
Auto-Scaling Policies
| Metric | Scale Up When | Scale Down When |
|---|---|---|
| GPU Utilization | >80% for 2 min | <30% for 10 min |
| Request Queue Depth | >50 pending | Queue empty for 5 min |
| P99 Latency | >SLA threshold | <50% of SLA |
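The policies above can be collapsed into a single decision function; this sketch assumes the "for N minutes" window persistence is checked upstream, so it evaluates one already-sustained metrics snapshot:

```python
def scaling_decision(gpu_util: float, queue_depth: int,
                     p99_ms: float, sla_ms: float) -> int:
    """Return +1 to add a replica, -1 to remove one, 0 to hold."""
    # Any single overload signal triggers scale-up (OR)
    if gpu_util > 0.80 or queue_depth > 50 or p99_ms > sla_ms:
        return +1
    # Scale down only when every signal shows slack (AND)
    if gpu_util < 0.30 and queue_depth == 0 and p99_ms < 0.5 * sla_ms:
        return -1
    return 0

assert scaling_decision(0.85, 10, 30, 50) == +1   # hot GPUs
assert scaling_decision(0.20, 0, 20, 50) == -1    # idle fleet
assert scaling_decision(0.50, 5, 30, 50) == 0     # steady state
```

The asymmetry is deliberate: scaling up is eager (any signal) while scaling down is conservative (all signals), which avoids flapping and protects the latency SLA.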
GPU Sharing
Small models (<2 GB) can share a single GPU via NVIDIA Multi-Process Service (MPS) or time-slicing. Pack 4-8 lightweight models onto one A100 to maximize utilization.
Practice Problems
Practice 1: Canary Rollout
You are deploying a new fraud model that is 5% more accurate but 20% slower. Design a canary rollout strategy that detects latency regressions before full deployment.
Practice 2: Multi-Model GPU Packing
You have 30 models ranging from 500MB to 40GB. Design a bin-packing algorithm to minimize the number of A100 GPUs (80GB each) needed while respecting latency SLAs.
Practice 3: Graceful Degradation
During a traffic spike, your GPU fleet is at 100% capacity. Design a degradation strategy: which requests get served by the full model, which get a lightweight fallback, and which are rejected?
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Model Server | Triton / vLLM / TorchServe | GPU inference engine |
| Traffic Routing | Istio / Envoy | A/B testing, canary splits |
| Container Orchestration | Kubernetes + GPU Operator | Scheduling and scaling |
| Optimization | TensorRT / ONNX Runtime | Model compilation and acceleration |
| Monitoring | Prometheus + Grafana | Latency, throughput, GPU metrics |
| LLM Serving | vLLM / TGI | PagedAttention, continuous batching |
Key Takeaways
- Use dynamic batching to maximize GPU utilization at the cost of slight latency increase
- Quantization is the single biggest optimization for inference speed and cost
- Always deploy with canary/shadow before full traffic cutover
- LLM serving requires specialized techniques: KV cache, PagedAttention, speculative decoding
- Design for graceful degradation — fallback models are better than errors