Rate Limiting & Throttling

Why Rate Limiting Matters

The Problem: Without rate limiting, a single user or bot can overwhelm your API with millions of requests, causing outages for everyone. DDoS attacks, buggy client loops, and viral traffic spikes can all take down services.

The Solution: Rate limiting controls how many requests a client can make in a given time window. Excess requests are rejected (HTTP 429) or queued.

Real Impact: GitHub limits authenticated REST API calls to 5,000/hour per user. Twitter caps how many tweets a user can read per day. Stripe throttles payment API calls. Without these limits, a single misbehaving client could degrade service for millions.

Real-World Analogy

Think of rate limiting like a nightclub bouncer:

  • Token Bucket: The bouncer has a bucket of tokens. Each person entering takes one. Tokens refill at a steady rate. If the bucket is empty, people wait in line.
  • Fixed Window: The bouncer counts people entering per hour. At the start of each hour, the count resets to zero.
  • Sliding Window: The bouncer tracks the exact time each person entered and counts how many entered in the last 60 minutes.

Where to Apply Rate Limiting

API Gateway Level

Rate limit at the entry point before requests reach your services. Tools: NGINX, Kong, AWS API Gateway, Cloudflare.

Application Level

Fine-grained limits per endpoint, user tier, or operation. Custom logic for business rules (free vs paid tiers).

Database Level

Connection pooling and query rate limits to protect your database from being overwhelmed by application servers.
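One common way to enforce such a limit at the application/database boundary is a semaphore sized to the connection pool. The sketch below is illustrative; the `ConnectionLimiter` class name and the pool size of 2 are hypothetical.

```python
import threading

class ConnectionLimiter:
    """Cap concurrent database operations with a semaphore."""
    def __init__(self, max_concurrent: int):
        # max_concurrent should match your DB connection pool size
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self) -> bool:
        # Non-blocking: reject immediately instead of queueing
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()

pool = ConnectionLimiter(max_concurrent=2)
print(pool.try_acquire())  # True
print(pool.try_acquire())  # True
print(pool.try_acquire())  # False: both slots in use
pool.release()
```

Rejecting instead of blocking keeps a slow database from stalling every application thread; callers can return a 503 or retry later.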

Client Side

Well-behaved clients implement exponential backoff and respect rate limit headers to avoid hammering the server.

Token Bucket Algorithm

The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens that are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts.

Token Bucket Visualization (figure): a bucket with capacity 5 refills at 1 token/second; each request takes one token. Tokens available → 200 OK; bucket empty → 429 Too Many Requests. Burst capacity: 5 requests at once. Sustained rate: 1 req/sec. Memory: O(1) per user.
token_bucket.py
# Token Bucket Rate Limiter
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: max tokens the bucket can hold (burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # Start full
        self.last_refill = time.time()

    def _refill(self):
        """Add tokens based on elapsed time."""
        now = time.time()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(
            self.capacity,
            self.tokens + new_tokens
        )
        self.last_refill = now

    def allow_request(self, tokens_needed=1) -> bool:
        """Check if request is allowed."""
        self._refill()
        if self.tokens >= tokens_needed:
            self.tokens -= tokens_needed
            return True
        return False

    def wait_time(self) -> float:
        """How long until next token is available."""
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.refill_rate

# Usage: 10 requests/second, burst of 20
limiter = TokenBucket(capacity=20, refill_rate=10)

for i in range(25):
    if limiter.allow_request():
        print(f"Request {i+1}: Allowed")
    else:
        print(f"Request {i+1}: Rejected (retry in {limiter.wait_time():.2f}s)")

Sliding Window Algorithm

The sliding window counter combines the simplicity of fixed windows with the accuracy of a true sliding window. It uses weighted counts from the current and previous window to approximate the request count.

Sliding Window Counter (figure): the previous window (12:00-12:59) saw 42 requests; the current window (13:00-13:59) has 18 so far. At 13:15, 75% of the previous window overlaps the sliding 60-minute window (45 of 60 minutes), so the weighted count = 42 × 0.75 + 18 = 49.5.

Algorithm Comparison

  • Token Bucket — Memory: O(1) per user. Accuracy: good. Bursts: allows controlled bursts up to capacity. Implementation: simple; used by AWS and Stripe.
  • Leaky Bucket — Memory: O(1) per user. Accuracy: good. Bursts: smooths bursts to a fixed output rate. Implementation: like a token bucket, but processes at a fixed rate.
  • Fixed Window — Memory: O(1) per user. Accuracy: low (up to 2x burst at window boundaries). Implementation: simple counter plus reset.
  • Sliding Window Log — Memory: O(N) per user. Accuracy: perfect, no burst issues. Implementation: stores every timestamp; expensive.
  • Sliding Window Counter — Memory: O(1) per user. Accuracy: very good, minimal boundary issues. Implementation: weighted average of two windows; best trade-off.
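The leaky bucket appears in the comparison but isn't implemented elsewhere in this section, so here is a minimal meter-style sketch (the class name and parameters are illustrative; queue-based variants that drain requests at a fixed rate also exist):

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the bucket drains at a fixed rate,
    each request adds one unit of 'water', and a full bucket rejects."""
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity     # bucket size
        self.leak_rate = leak_rate   # units drained per second
        self.water = 0.0
        self.last_leak = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        # Drain according to elapsed time, never below empty
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False

bucket = LeakyBucket(capacity=3, leak_rate=1.0)
results = [bucket.allow_request() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Note the inversion relative to the token bucket: tokens draining *out* grant capacity back, so steady traffic is smoothed to the leak rate rather than allowed to burst.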

Fixed Window Boundary Problem

Problem: With a limit of 100 requests per minute, a user can send 100 requests at 12:00:59 and another 100 at 12:01:00, getting 200 requests through in 2 seconds!

Solution: Use sliding window counters or token buckets instead. They prevent this burst-at-boundary attack.
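The boundary problem is easy to reproduce with a minimal fixed window counter (an illustrative sketch; the class name is made up, and time is passed in explicitly to make the boundary visible):

```python
class FixedWindowCounter:
    """Fixed window counter: the count hard-resets at each boundary."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow_request(self, now: float) -> bool:
        window_start = (now // self.window) * self.window
        if window_start != self.window_start:
            self.window_start = window_start
            self.count = 0   # hard reset: the source of the boundary problem
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# 100 req/min limit: 100 requests at t=59s all pass...
fw = FixedWindowCounter(limit=100, window_seconds=60)
burst1 = sum(fw.allow_request(now=59.0) for _ in range(100))
# ...and 100 more at t=60s also pass: 200 requests in ~2 seconds
burst2 = sum(fw.allow_request(now=60.0) for _ in range(100))
print(burst1, burst2)  # 100 100
```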

Distributed Rate Limiting

When your API runs on multiple servers, each server needs to share rate limit state. A centralized store like Redis is the standard approach.

redis_rate_limiter.py
# Distributed rate limiter using Redis
import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_client, limit: int,
                 window_seconds: int):
        """
        limit: max requests per window
        window_seconds: window size in seconds
        """
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def is_allowed(self, client_id: str) -> dict:
        """Sliding window counter using Redis."""
        now = time.time()
        window_key = f"rate:{client_id}"

        # Use a Redis pipeline for atomicity
        pipe = self.redis.pipeline()

        # Remove old entries outside the window
        pipe.zremrangebyscore(window_key, 0, now - self.window)

        # Count current requests in window
        pipe.zcard(window_key)

        # Add current request with timestamp as score
        member = str(now)
        pipe.zadd(window_key, {member: now})

        # Set TTL so keys auto-expire
        pipe.expire(window_key, self.window)

        results = pipe.execute()
        request_count = results[1]  # zcard result (excludes this request)

        allowed = request_count < self.limit
        if not allowed:
            # Remove the entry so rejected requests don't count
            # toward the limit and lock the client out indefinitely
            self.redis.zrem(window_key, member)
        remaining = max(0, self.limit - request_count - 1)

        return {
            "allowed": allowed,
            "limit": self.limit,
            "remaining": remaining if allowed else 0,
            "retry_after": self.window if not allowed else 0,
        }

# Usage
r = redis.Redis(host="localhost", port=6379)
limiter = RedisRateLimiter(r, limit=100, window_seconds=60)

result = limiter.is_allowed("user:42")
if result["allowed"]:
    print(f"Request allowed. {result['remaining']} remaining.")
else:
    print(f"Rate limited. Retry after {result['retry_after']}s.")

Rate Limiting HTTP Headers

Common Rate Limit Headers (the 429 status comes from RFC 6585; the X-RateLimit-* headers are a widely used convention being standardized in draft-ietf-httpapi-ratelimit-headers)

  • X-RateLimit-Limit: Maximum requests allowed in the window (e.g., 100)
  • X-RateLimit-Remaining: Requests remaining in current window (e.g., 57)
  • X-RateLimit-Reset: Unix timestamp when the window resets (e.g., 1640000000)
  • Retry-After: Seconds to wait before retrying (sent with 429 responses)
rate_limit_middleware.py
# Flask middleware that adds rate limit headers
import time

import redis
from flask import Flask, request, jsonify
from functools import wraps

app = Flask(__name__)

# Reuses RedisRateLimiter from the previous example
limiter = RedisRateLimiter(
    redis.Redis(host="localhost", port=6379),
    limit=100, window_seconds=60,
)

def rate_limit(limit=100, window=60):
    """Decorator to rate limit an endpoint."""
    def decorator(f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            client_id = request.headers.get(
                "X-API-Key", request.remote_addr
            )
            result = limiter.is_allowed(client_id)

            # Always set rate limit headers
            headers = {
                "X-RateLimit-Limit": str(limit),
                "X-RateLimit-Remaining": str(result["remaining"]),
                "X-RateLimit-Reset": str(
                    int(time.time()) + window
                ),
            }

            if not result["allowed"]:
                headers["Retry-After"] = str(window)
                return jsonify({
                    "error": "Rate limit exceeded",
                    "retry_after": window
                }), 429, headers

            response = f(*args, **kwargs)
            return response, 200, headers
        return wrapped
    return decorator

@app.route("/api/data")
@rate_limit(limit=100, window=60)
def get_data():
    return jsonify({"data": "here"})

Practice Problems

Medium: Design a Tiered Rate Limiter

Design a rate limiter that supports different tiers:

  1. Free tier: 10 requests/minute
  2. Basic tier: 100 requests/minute
  3. Premium tier: 1000 requests/minute
  4. Implement graceful degradation (queue excess requests for premium users instead of rejecting)

Use a dictionary mapping user tiers to token bucket configurations. For premium queuing, use a separate queue with a maximum size.

import time
from collections import deque

# Assumes the TokenBucket class from earlier in this section

TIER_LIMITS = {
    "free":    {"capacity": 10,   "rate": 10/60},
    "basic":   {"capacity": 100,  "rate": 100/60},
    "premium": {"capacity": 1000, "rate": 1000/60},
}

class TieredRateLimiter:
    def __init__(self):
        self.buckets = {}
        self.queues = {}  # For premium overflow

    def _get_bucket(self, user_id, tier):
        if user_id not in self.buckets:
            cfg = TIER_LIMITS[tier]
            self.buckets[user_id] = TokenBucket(
                cfg["capacity"], cfg["rate"])
        return self.buckets[user_id]

    def handle_request(self, user_id, tier):
        bucket = self._get_bucket(user_id, tier)
        if bucket.allow_request():
            return {"status": "allowed"}

        # Premium users get queued
        if tier == "premium":
            q = self.queues.setdefault(
                user_id, deque(maxlen=50))
            if len(q) < 50:
                q.append(time.time())
                return {"status": "queued",
                        "position": len(q)}

        return {"status": "rejected",
                "retry_after": bucket.wait_time()}

Medium: Sliding Window Log

Implement a sliding window log rate limiter that:

  1. Stores exact timestamps of each request
  2. Counts requests in the last N seconds
  3. Cleans up old timestamps to bound memory usage

Use a sorted list or deque of timestamps. On each request, remove timestamps older than the window, then check if count is under the limit.

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self) -> bool:
        now = time.time()
        cutoff = now - self.window

        # Remove expired timestamps
        while (self.timestamps and
               self.timestamps[0] <= cutoff):
            self.timestamps.popleft()

        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

# Pros: Perfect accuracy
# Cons: O(N) memory per user where N = limit

Hard: Distributed Rate Limiter Design

Design a rate limiting system for a global API serving 1 million requests per second across 5 data centers. Consider:

  1. How to share rate limit state across data centers
  2. What happens if the Redis cluster is temporarily unavailable
  3. How to handle clock skew between servers

Consider local rate limiting with periodic sync. Use a "fail-open" policy for Redis outages. NTP keeps clocks within milliseconds.

# Distributed Rate Limiter Architecture

# 1. Local + Global approach
# - Each DC has local Redis for fast checks
# - Global Redis syncs counts periodically
# - Local limit = global_limit / num_DCs
# - Allows 20% overflow before global sync

# 2. Fail-open policy
# - If Redis is down, allow requests but log
# - Switch to in-memory local rate limiting
# - Alert ops team immediately

# 3. Clock skew mitigation
# - Use Redis server time (TIME command)
# - NTP keeps servers within ~1ms
# - Window granularity >> clock skew

class DistributedRateLimiter:
    def __init__(self, local_redis, global_redis,
                 global_limit, num_dcs=5):
        self.local = local_redis
        self.global_r = global_redis
        self.local_limit = global_limit // num_dcs
        self.overflow = int(self.local_limit * 0.2)

    def is_allowed(self, client_id):
        try:
            count = self.local.incr(
                f"rl:{client_id}")
            if count == 1:
                self.local.expire(
                    f"rl:{client_id}", 60)
            # Allow the 20% overflow margin until global sync catches up
            return count <= self.local_limit + self.overflow
        except Exception:
            # Fail open: allow but log
            return True

Quick Reference

Algorithm Selection Guide

  • General API rate limiting → Token Bucket: simple, allows bursts, O(1) memory
  • Strict uniform rate → Leaky Bucket: smooths traffic to a constant rate
  • Simple counter per hour → Fixed Window: easy to implement, low overhead
  • Accurate window-based limiting → Sliding Window Counter: best accuracy-to-memory ratio
  • Audit / compliance → Sliding Window Log: perfect accuracy, full log of requests

Key Design Considerations

Rate Limiting Checklist

  • Identify the key: IP address, API key, user ID, or combination
  • Choose the scope: Per endpoint, per user, global
  • Set appropriate limits: Based on service capacity, not arbitrary numbers
  • Return helpful headers: X-RateLimit-Limit, Remaining, Reset, Retry-After
  • Handle distributed state: Redis or similar for multi-server setups
  • Fail-open vs fail-closed: What happens when the rate limiter itself is down?
  • Monitoring: Alert on high rejection rates (may indicate an attack or misconfiguration)