Design a Model Serving Platform

Hard · 30 min read

Problem Statement & Requirements

Why Model Serving Matters

A model is only valuable when it serves predictions: ChatGPT serves 100M+ users, and Google runs billions of inferences per day. The serving platform must deliver predictions at low latency and high availability while managing dozens of model versions simultaneously.

Think of model serving like a restaurant kitchen. The training pipeline is the recipe development lab. The serving platform is the actual kitchen that must prepare dishes (predictions) on demand, quickly and consistently, for thousands of customers simultaneously.

Functional Requirements

  • Serve online (synchronous), streaming, batch, and async inference
  • Deploy multiple versions of each model and split traffic between them (A/B tests, canary rollouts)
  • Manage the model lifecycle: deploy, scale, drain, and roll back versions

Non-Functional Requirements

  • Latency: <50ms p99 for traditional models, <500ms time-to-first-token for LLMs
  • Throughput: 500K QPS sustained, 1.5M QPS at peak
  • Availability: degrade gracefully under overload rather than fail outright
  • Efficiency: maximize utilization of a costly GPU fleet

Back-of-Envelope Estimation

Parameter                     Estimate
Models in production          50-200
Total inference QPS           500K (peak: 1.5M)
Avg model size                500 MB (traditional) / 7-70 GB (LLM)
GPU memory per A100           80 GB
Latency budget (traditional)  <50ms p99
Latency budget (LLM)          <500ms time-to-first-token
GPU fleet size                100-500 GPUs
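The fleet-size row can be sanity-checked with quick arithmetic. The per-GPU throughput figure below is an assumption for illustration (batched traditional-model inference on one A100), not a number from the table:

```python
# Rough fleet sizing from the estimates above.
peak_qps = 1_500_000
per_gpu_qps = 4_000        # assumed: batched traditional-model inference per A100
headroom = 0.7             # target ~70% utilization at peak, leaving burst capacity

gpus_needed = peak_qps / (per_gpu_qps * headroom)
print(round(gpus_needed))  # ~536, i.e. the top of the 100-500 range plus headroom
```

LLM replicas would push this higher, since a single 7-70 GB model can saturate one or more GPUs on its own.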

System API Design

Inference & Management APIs
# Synchronous inference
POST /api/v1/models/{model_name}/predict
{
  "inputs": [{ "features": [0.5, 1.2, 3.4] }],
  "params": { "version": "v3" }
}

# Streaming inference (LLMs)
POST /api/v1/models/{model_name}/generate
{
  "prompt": "Explain quantum computing",
  "max_tokens": 512,
  "stream": true
}

# Deploy a model version
POST /api/v1/deployments
{
  "model_name": "fraud_detector",
  "version": "v3.1",
  "traffic_percent": 10,
  "resources": { "gpu": "A100", "replicas": 4 }
}

# Update traffic split
PUT /api/v1/deployments/{model_name}/traffic
{ "v3.0": 50, "v3.1": 50 }
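A traffic-split update like the one above should be validated server-side before it takes effect. A minimal sketch, assuming percentages must cover exactly 100% of traffic (the API spec above does not define error behavior):

```python
# Reject malformed traffic splits before applying them to the router.
def validate_traffic_split(split: dict[str, int]) -> None:
    if any(pct < 0 for pct in split.values()):
        raise ValueError("traffic percentages must be non-negative")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"traffic percentages must sum to 100, got {total}")

validate_traffic_split({"v3.0": 50, "v3.1": 50})  # ok, no exception
```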

Data Model

Core Schema
CREATE TABLE models (
    model_name    VARCHAR PRIMARY KEY,
    description   TEXT,
    framework     VARCHAR,
    owner         VARCHAR
);
CREATE TABLE deployments (
    deployment_id VARCHAR PRIMARY KEY,
    model_name    VARCHAR,
    version       VARCHAR,
    status        VARCHAR,  -- deploying, active, draining
    replicas      INT,
    gpu_type      VARCHAR,
    traffic_pct   INT,
    created_at    TIMESTAMP
);
CREATE TABLE inference_logs (
    request_id    VARCHAR,
    model_name    VARCHAR,
    version       VARCHAR,
    latency_ms    FLOAT,
    status_code   INT,
    timestamp     TIMESTAMP
) PARTITION BY RANGE (timestamp);

High-Level Architecture

The serving platform follows a router → model server pattern with centralized control:

API Gateway / Load Balancer

Receives all inference requests. Handles authentication, rate limiting, and request validation. Routes to the appropriate model deployment based on traffic split rules.

Traffic Router

Implements A/B testing and canary logic. Deterministically assigns users to model versions using consistent hashing. Logs experiment assignments for analysis.
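The deterministic assignment can be sketched as a stable bucket hash: the same user always maps to the same bucket in [0, 100), so assignments stay sticky for a fixed split. (A simplified stand-in for a full consistent-hashing ring; function names are illustrative.)

```python
import hashlib

# Map a user deterministically to a model version under a percentage split.
def assign_version(user_id: str, split: dict[str, int]) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for version, pct in sorted(split.items()):
        cumulative += pct
        if bucket < cumulative:
            return version
    raise ValueError("traffic split must sum to 100")

v = assign_version("user-42", {"v3.0": 90, "v3.1": 10})
assert v == assign_version("user-42", {"v3.0": 90, "v3.1": 10})  # sticky
```

Logging `(user_id, version)` pairs at this point is what makes the experiment analysis downstream possible.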

Model Servers

Run inference on GPU/CPU. Each server loads one model version. Implements request batching to maximize GPU utilization. Health checks report model readiness.

Deployment Controller

Manages rollouts: blue-green, canary, or rolling. Monitors health metrics and auto-rolls back on degradation. Scales replicas based on autoscaling policies.
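The auto-rollback check can be sketched as a comparison of canary metrics against the stable baseline; the tolerance values below are illustrative assumptions, not platform defaults:

```python
# Roll back the canary when its error rate or p99 latency degrades
# beyond a tolerance relative to the stable deployment.
def should_rollback(baseline: dict, canary: dict,
                    max_error_delta: float = 0.01,
                    max_latency_ratio: float = 1.2) -> bool:
    if canary["error_rate"] > baseline["error_rate"] + max_error_delta:
        return True
    if canary["p99_ms"] > baseline["p99_ms"] * max_latency_ratio:
        return True
    return False

should_rollback({"error_rate": 0.002, "p99_ms": 40},
                {"error_rate": 0.002, "p99_ms": 55})  # True: 55 > 40 * 1.2
```

In practice these metrics come from the monitoring stack (Prometheus in the Quick Reference) and are evaluated over a sliding window, not a single sample.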

Deep Dive: Core Components

Inference Modes

Mode            Latency              Use Case              Example
Online (sync)   <50ms                User-facing requests  Fraud scoring at checkout
Streaming       Time-to-first-token  LLM generation        Chatbot responses
Batch           Minutes-hours        Bulk scoring          Nightly recommendation refresh
Async           Seconds-minutes      Non-blocking tasks    Content moderation queue

Request Batching

Dynamic Batching
import asyncio
import time

class DynamicBatcher:
    def __init__(self, model, max_batch=32, max_wait_ms=5):
        self.model = model            # inference engine exposing predict_batch()
        self.queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def process_loop(self):
        while True:
            batch = []
            deadline = time.time() + self.max_wait_ms / 1000

            # Collect requests until batch full or timeout
            while len(batch) < self.max_batch:
                remaining = deadline - time.time()
                if remaining <= 0:
                    break
                try:
                    req = await asyncio.wait_for(
                        self.queue.get(), timeout=remaining
                    )
                    batch.append(req)
                except asyncio.TimeoutError:
                    break

            if batch:
                # Run batch inference on GPU
                results = await self.model.predict_batch(batch)
                for req, result in zip(batch, results):
                    req.future.set_result(result)
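On the caller side, each request pairs its payload with a future that the batch loop resolves. A self-contained sketch with a stub batch loop standing in for the GPU path (`submit` and the stub are illustrative assumptions, not part of the class above):

```python
import asyncio
from types import SimpleNamespace

# Enqueue a payload with an attached future, then await the batched result.
async def submit(queue: asyncio.Queue, payload):
    loop = asyncio.get_running_loop()
    req = SimpleNamespace(payload=payload, future=loop.create_future())
    await queue.put(req)
    return await req.future           # resolved by the batch loop

async def demo():
    queue = asyncio.Queue()

    async def fake_batch_loop():      # stands in for DynamicBatcher.process_loop
        req = await queue.get()
        req.future.set_result(req.payload[0] * 2)

    asyncio.ensure_future(fake_batch_loop())
    return await submit(queue, [21])

print(asyncio.run(demo()))            # 42
```

The future is what lets hundreds of concurrent callers block independently while the GPU processes them as one batch.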

Model Optimization Techniques

Before a model reaches the fleet, quantization (FP32 → FP16/INT8), compilation (TensorRT, ONNX Runtime), and distillation shrink its latency and memory footprint; of these, quantization typically yields the largest speed and cost win.

LLM-Specific: KV Cache & Speculative Decoding

LLM Serving Optimizations

  • KV Cache: stores key-value attention states to avoid recomputation during autoregressive generation. Trades memory for speed (can consume 10-30 GB per request for long contexts).
  • PagedAttention (vLLM): manages the KV cache like virtual-memory pages, eliminating fragmentation and enabling 2-4x higher throughput.
  • Speculative Decoding: a small draft model generates candidate tokens; the large model verifies them in parallel, for 2-3x faster generation.
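The KV-cache memory claim is easy to check with back-of-envelope arithmetic. The dimensions below assume a Llama-2-7B-class model (32 layers, 32 heads, head dim 128, fp16 values) for illustration:

```python
# Per-request KV-cache size: K and V tensors across all layers and heads.
layers, heads, head_dim, bytes_fp16 = 32, 32, 128, 2
seq_len = 32_768                      # long context

kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_fp16
print(kv_bytes / 2**30)               # 16.0 GiB, inside the 10-30 GB range cited
```

This is why PagedAttention matters: a handful of long-context requests can otherwise exhaust an 80 GB A100 through fragmentation alone.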

Scaling & Optimization

Auto-Scaling Policies

Metric               Scale Up When    Scale Down When
GPU Utilization      >80% for 2 min   <30% for 10 min
Request Queue Depth  >50 pending      Queue empty for 5 min
P99 Latency          >SLA threshold   <50% of SLA
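The policy table above can be sketched as a single decision function; window bookkeeping ("for 2 min", "for 10 min") is omitted for brevity, and the function name is illustrative:

```python
# Combine the three signals from the policy table into one decision.
def scale_decision(gpu_util: float, queue_depth: int,
                   p99_ms: float, sla_ms: float) -> str:
    if gpu_util > 0.80 or queue_depth > 50 or p99_ms > sla_ms:
        return "scale_up"
    if gpu_util < 0.30 and queue_depth == 0 and p99_ms < 0.5 * sla_ms:
        return "scale_down"
    return "hold"

scale_decision(gpu_util=0.85, queue_depth=10, p99_ms=40, sla_ms=50)  # "scale_up"
```

Note the asymmetry: any one signal triggers a scale-up, but all three must be quiet before scaling down, which biases the fleet toward meeting the SLA.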

GPU Sharing

Small models (<2 GB) can share a single GPU via NVIDIA's Multi-Process Service (MPS) or time-slicing. Pack 4-8 lightweight models onto one A100 to maximize utilization.
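Deciding which models share a GPU is a bin-packing problem. A first-fit-decreasing sketch over weight sizes only; a real placer must also respect latency SLAs and leave headroom for activations and KV cache:

```python
# Pack model weight sizes (GB) onto fixed-capacity GPUs, largest first.
def pack_models(sizes_gb: list[float], gpu_gb: float = 80.0) -> list[list[float]]:
    gpus: list[list[float]] = []
    for size in sorted(sizes_gb, reverse=True):
        for gpu in gpus:
            if sum(gpu) + size <= gpu_gb:
                gpu.append(size)   # fits on an existing GPU
                break
        else:
            gpus.append([size])    # open a new GPU
    return gpus

print(len(pack_models([40, 40, 2, 1.5, 0.5, 0.5])))  # 2 GPUs
```

First-fit-decreasing is a simple heuristic, not optimal, but it is within ~22% of optimal in the worst case and usually far closer in practice.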

Practice Problems

Practice 1: Canary Rollout

You are deploying a new fraud model that is 5% more accurate but 20% slower. Design a canary rollout strategy that detects latency regressions before full deployment.

Practice 2: Multi-Model GPU Packing

You have 30 models ranging from 500MB to 40GB. Design a bin-packing algorithm to minimize the number of A100 GPUs (80GB each) needed while respecting latency SLAs.

Practice 3: Graceful Degradation

During a traffic spike, your GPU fleet is at 100% capacity. Design a degradation strategy: which requests get served by the full model, which get a lightweight fallback, and which are rejected?

Quick Reference

Component                Technology                  Purpose
Model Server             Triton / vLLM / TorchServe  GPU inference engine
Traffic Routing          Istio / Envoy               A/B testing, canary splits
Container Orchestration  Kubernetes + GPU Operator   Scheduling and scaling
Optimization             TensorRT / ONNX Runtime     Model compilation and acceleration
Monitoring               Prometheus + Grafana        Latency, throughput, GPU metrics
LLM Serving              vLLM / TGI                  PagedAttention, continuous batching

Key Takeaways

  • Use dynamic batching to maximize GPU utilization at the cost of slight latency increase
  • Quantization is the single biggest optimization for inference speed and cost
  • Always deploy with canary/shadow before full traffic cutover
  • LLM serving requires specialized techniques: KV cache, PagedAttention, speculative decoding
  • Design for graceful degradation — fallback models are better than errors