Problem Statement & Requirements
Why Model Serving Matters
A model is only valuable when it serves predictions: ChatGPT serves over 100 million users, and Google runs billions of inferences per day. The serving platform must deliver predictions at low latency and high availability while managing dozens of model versions simultaneously.
Think of model serving like a restaurant kitchen. The training pipeline is the recipe development lab. The serving platform is the actual kitchen that must prepare dishes (predictions) on demand, quickly and consistently, for thousands of customers simultaneously.
Functional Requirements
- Model deployment — Deploy any model (PyTorch, TensorFlow, XGBoost, ONNX) with zero downtime
- Version management — Multiple versions running simultaneously
- A/B testing & canary — Route traffic percentages to different model versions
- Auto-scaling — Scale GPU/CPU instances based on load
- Monitoring — Track latency, throughput, error rates, and model-specific metrics
- Rollback — Instant rollback to previous version on degradation
Non-Functional Requirements
- Latency — <50ms p99 for lightweight models, <500ms time-to-first-token for LLMs
- Availability — 99.99% uptime with graceful degradation
- Throughput — 100K+ inferences/second per model
- Cost efficiency — Maximize GPU utilization (>70%)
Back-of-Envelope Estimation
| Parameter | Estimate |
|---|---|
| Models in production | 50-200 |
| Total inference QPS | 500K (peak: 1.5M) |
| Avg model size | 500 MB (traditional) / 7-70 GB (LLM) |
| GPU memory per A100 | 80 GB |
| Latency budget (traditional) | <50ms p99 |
| Latency budget (LLM) | <500ms time-to-first-token |
| GPU fleet size | 100-500 GPUs |
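These rows can be cross-checked with quick arithmetic; the per-GPU throughput below is an assumed figure for illustration, not a benchmark:

```python
import math

def gpus_needed(steady_qps: int, per_gpu_qps: int, headroom: float) -> int:
    """Fleet size = steady-state demand plus reserved headroom for spikes."""
    raw = steady_qps / per_gpu_qps
    return math.ceil(raw * (1 + headroom))

# Assumptions: a traditional model sustains ~2,000 batched inferences/sec
# per GPU, and we reserve 30% headroom for traffic spikes.
print(gpus_needed(500_000, 2_000, 0.3))  # 325 — within the 100-500 range
```

At peak (1.5M QPS) the remaining burst is absorbed by autoscaling and graceful degradation rather than permanently provisioned capacity.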
System API Design
```
# Synchronous inference
POST /api/v1/models/{model_name}/predict
{
  "inputs": [{ "features": [0.5, 1.2, 3.4] }],
  "params": { "version": "v3" }
}

# Streaming inference (LLMs)
POST /api/v1/models/{model_name}/generate
{
  "prompt": "Explain quantum computing",
  "max_tokens": 512,
  "stream": true
}

# Deploy a model version
POST /api/v1/deployments
{
  "model_name": "fraud_detector",
  "version": "v3.1",
  "traffic_percent": 10,
  "resources": { "gpu": "A100", "replicas": 4 }
}

# Update traffic split
PUT /api/v1/deployments/{model_name}/traffic
{ "v3.0": 50, "v3.1": 50 }
```
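For context, a client call against the synchronous endpoint might look like the following; the host name is a placeholder and the request shape mirrors the sketch above:

```python
import json
import urllib.request

# Hypothetical host; the path and body mirror the predict endpoint above.
url = "https://serving.example.com/api/v1/models/fraud_detector/predict"
payload = {
    "inputs": [{"features": [0.5, 1.2, 3.4]}],
    "params": {"version": "v3"},
}
request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# In a live deployment:
# with urllib.request.urlopen(request, timeout=1) as resp:
#     predictions = json.load(resp)
```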
Data Model
```sql
CREATE TABLE models (
  model_name  VARCHAR PRIMARY KEY,
  description TEXT,
  framework   VARCHAR,
  owner       VARCHAR
);

CREATE TABLE deployments (
  deployment_id VARCHAR PRIMARY KEY,
  model_name    VARCHAR,
  version       VARCHAR,
  status        VARCHAR,  -- deploying, active, draining
  replicas      INT,
  gpu_type      VARCHAR,
  traffic_pct   INT,
  created_at    TIMESTAMP
);

CREATE TABLE inference_logs (
  request_id  VARCHAR,
  model_name  VARCHAR,
  version     VARCHAR,
  latency_ms  FLOAT,
  status_code INT,
  timestamp   TIMESTAMP
) PARTITION BY RANGE (timestamp);
```
High-Level Architecture
The serving platform follows a router → model server pattern with centralized control:
API Gateway / Load Balancer
Receives all inference requests. Handles authentication, rate limiting, and request validation. Routes to the appropriate model deployment based on traffic split rules.
Traffic Router
Implements A/B testing and canary logic. Deterministically assigns users to model versions using consistent hashing. Logs experiment assignments for analysis.
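A minimal sketch of the deterministic assignment described above, assuming a 10,000-bucket hash space (the bucket count and function names are illustrative):

```python
import hashlib

def assign_version(user_id: str, splits: dict[str, int]) -> str:
    """Return the model version for user_id, stable across requests.

    splits maps version name -> traffic percentage (must sum to 100).
    """
    # Hash the user into one of 10,000 buckets; same user -> same bucket
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    cumulative = 0
    for version, pct in sorted(splits.items()):
        cumulative += pct * 100  # percent -> buckets out of 10,000
        if bucket < cumulative:
            return version
    return sorted(splits)[-1]  # guard against rounding gaps

splits = {"v3.0": 50, "v3.1": 50}
# Deterministic: the same user always lands on the same version
assert assign_version("user-42", splits) == assign_version("user-42", splits)
```

Because assignment depends only on the user ID and the split table, no per-user state needs to be stored, and a user's experience does not flip between versions mid-session.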
Model Servers
Run inference on GPU/CPU. Each server loads one model version. Implements request batching to maximize GPU utilization. Health checks report model readiness.
Deployment Controller
Manages rollouts: blue-green, canary, or rolling. Monitors health metrics and auto-rolls back on degradation. Scales replicas based on autoscaling policies.
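The auto-rollback check can be sketched as a simple health comparison between canary and baseline; the thresholds and metric names are illustrative assumptions, and a production controller would evaluate them over a sliding window with statistical tests:

```python
def should_rollback(canary: dict, baseline: dict,
                    max_latency_ratio: float = 1.5,
                    max_error_delta: float = 0.01) -> bool:
    """Roll back if the canary's p99 latency or error rate regresses."""
    latency_bad = canary["p99_ms"] > baseline["p99_ms"] * max_latency_ratio
    errors_bad = canary["error_rate"] > baseline["error_rate"] + max_error_delta
    return latency_bad or errors_bad

baseline = {"p99_ms": 40, "error_rate": 0.001}
assert should_rollback({"p99_ms": 70, "error_rate": 0.001}, baseline)      # latency regression
assert not should_rollback({"p99_ms": 45, "error_rate": 0.002}, baseline)  # within tolerance
```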
Deep Dive: Core Components
Inference Modes
| Mode | Latency | Use Case | Example |
|---|---|---|---|
| Online (sync) | <50ms | User-facing requests | Fraud scoring at checkout |
| Streaming | <500ms to first token | LLM generation | Chatbot responses |
| Batch | Minutes-hours | Bulk scoring | Nightly recommendation refresh |
| Async | Seconds-minutes | Non-blocking tasks | Content moderation queue |
Request Batching
```python
import asyncio
import time

class DynamicBatcher:
    """Groups concurrent requests into batches to maximize GPU utilization."""

    def __init__(self, model, max_batch=32, max_wait_ms=5):
        self.model = model
        self.queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def submit(self, req):
        # Callers enqueue a request and await its future for the result
        req.future = asyncio.get_running_loop().create_future()
        await self.queue.put(req)
        return await req.future

    async def process_loop(self):
        while True:
            batch = []
            deadline = time.monotonic() + self.max_wait_ms / 1000
            # Collect requests until the batch is full or the deadline passes
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    req = await asyncio.wait_for(
                        self.queue.get(), timeout=remaining
                    )
                    batch.append(req)
                except asyncio.TimeoutError:
                    break
            if batch:
                # Run batch inference on GPU and resolve each caller's future
                results = await self.model.predict_batch(batch)
                for req, result in zip(batch, results):
                    req.future.set_result(result)
```
Model Optimization Techniques
- Quantization: FP32 → INT8/INT4 reduces memory 4-8x, speeds inference 2-4x with minimal accuracy loss
- Distillation: Train a smaller "student" model to mimic a larger "teacher" model
- Pruning: Remove low-magnitude weights for sparser, faster computation
- ONNX Runtime: Cross-framework optimization and hardware-specific acceleration
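To make the quantization bullet concrete, here is a toy symmetric INT8 scheme: weights are mapped to integers in [-127, 127] with one shared scale, so storage drops 4x versus FP32 and the round-trip error is bounded by half a quantization step. This is an illustrative sketch, not a production quantizer (those calibrate per-channel scales on real activations):

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: one scale maps the max |weight| to 127."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.5, -1.2, 3.4, -0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Round-trip error never exceeds half a quantization step
assert max_err <= scale / 2 + 1e-9
```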
LLM-Specific: KV Cache & Speculative Decoding
LLM Serving Optimizations
KV Cache: Store key-value attention states to avoid recomputation during autoregressive generation. Trades memory for speed (can use 10-30 GB per request for long contexts).
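The memory figure can be sanity-checked with the standard KV-cache sizing formula (2 tensors per layer, one per key and value); the model shape below is a typical 7B-class configuration, assumed for illustration:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: 2 (K and V) x layers x heads x head_dim x tokens x dtype."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes)
per_token = kv_cache_bytes(32, 32, 128, 1)
print(per_token // 1024)                          # 512 KiB per generated token

ctx_32k = kv_cache_bytes(32, 32, 128, 32_768)
print(ctx_32k / 2**30)                            # 16.0 GiB for one 32k request
```

At half a MiB per token, long-context requests land squarely in the 10-30 GB range quoted above, which is why cache management dominates LLM serving design.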
PagedAttention (vLLM): Manages KV cache like virtual memory pages, eliminating fragmentation and enabling 2-4x higher throughput.
Speculative Decoding: A small draft model generates candidate tokens, the large model verifies them in parallel. 2-3x faster generation.
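A toy sketch of one speculative step, with stand-in models and exact-match verification instead of the probability-ratio acceptance rule used in real systems:

```python
def speculative_step(draft_tokens, target_next_token, k=4):
    """Verify up to k draft tokens; return the accepted prefix plus one target token."""
    accepted = []
    for tok in draft_tokens[:k]:
        if tok == target_next_token(accepted):
            accepted.append(tok)   # target agrees: keep the draft token
        else:
            break                  # first disagreement ends acceptance
    # The target always contributes the next token, so each step emits
    # at least one token even when every draft token is rejected.
    accepted.append(target_next_token(accepted))
    return accepted

# Stand-in "target model": deterministic next token from a fixed sequence.
target_seq = ["the", "cat", "sat", "on", "the", "mat"]
def target_next(prefix):
    return target_seq[len(prefix)]

print(speculative_step(["the", "cat", "dog"], target_next))
# ['the', 'cat', 'sat'] — two draft tokens accepted, one target token added
```

The speedup comes from the target model verifying all k candidates in one parallel forward pass instead of k sequential ones.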
Scaling & Optimization
Auto-Scaling Policies
| Metric | Scale Up When | Scale Down When |
|---|---|---|
| GPU Utilization | >80% for 2 min | <30% for 10 min |
| Request Queue Depth | >50 pending | Queue empty for 5 min |
| P99 Latency | >SLA threshold | <50% of SLA |
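The policies above can be collapsed into a single decision function; this sketch assumes the "for N minutes" window persistence is checked upstream, so it evaluates one already-sustained metrics snapshot:

```python
def scaling_decision(gpu_util: float, queue_depth: int,
                     p99_ms: float, sla_ms: float) -> int:
    """Return +1 to add a replica, -1 to remove one, 0 to hold."""
    # Any single overload signal triggers scale-up (OR)
    if gpu_util > 0.80 or queue_depth > 50 or p99_ms > sla_ms:
        return +1
    # Scale down only when every signal shows slack (AND)
    if gpu_util < 0.30 and queue_depth == 0 and p99_ms < 0.5 * sla_ms:
        return -1
    return 0

assert scaling_decision(0.85, 10, 30, 50) == +1   # hot GPUs
assert scaling_decision(0.20, 0, 20, 50) == -1    # idle fleet
assert scaling_decision(0.50, 5, 30, 50) == 0     # steady state
```

The asymmetry is deliberate: scaling up is eager (any signal) while scaling down is conservative (all signals), which avoids flapping and protects the latency SLA.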
GPU Sharing
Small models (<2 GB) can share a single GPU via NVIDIA Multi-Process Service (MPS) or time-slicing. Pack 4-8 lightweight models onto one A100 to maximize utilization.
Practice Problems
Practice 1: Canary Rollout
You are deploying a new fraud model that is 5% more accurate but 20% slower. Design a canary rollout strategy that detects latency regressions before full deployment.
Practice 2: Multi-Model GPU Packing
You have 30 models ranging from 500MB to 40GB. Design a bin-packing algorithm to minimize the number of A100 GPUs (80GB each) needed while respecting latency SLAs.
Practice 3: Graceful Degradation
During a traffic spike, your GPU fleet is at 100% capacity. Design a degradation strategy: which requests get served by the full model, which get a lightweight fallback, and which are rejected?
Quick Reference
| Component | Technology | Purpose |
|---|---|---|
| Model Server | Triton / vLLM / TorchServe | GPU inference engine |
| Traffic Routing | Istio / Envoy | A/B testing, canary splits |
| Container Orchestration | Kubernetes + GPU Operator | Scheduling and scaling |
| Optimization | TensorRT / ONNX Runtime | Model compilation and acceleration |
| Monitoring | Prometheus + Grafana | Latency, throughput, GPU metrics |
| LLM Serving | vLLM / TGI | PagedAttention, continuous batching |
Key Takeaways
- Use dynamic batching to maximize GPU utilization at the cost of slight latency increase
- Quantization is the single biggest optimization for inference speed and cost
- Always deploy with canary/shadow before full traffic cutover
- LLM serving requires specialized techniques: KV cache, PagedAttention, speculative decoding
- Design for graceful degradation — fallback models are better than errors