Rate Limiting & Throttling

Why Rate Limiting Matters

The Problem: Without rate limiting, a single user or bot can overwhelm your API with millions of requests, causing outages for everyone. DDoS attacks, buggy client loops, and viral traffic spikes can all take down services.

The Solution: Rate limiting controls how many requests a client can make in a given time window. Excess requests are rejected (HTTP 429) or queued.

Real Impact: GitHub limits authenticated REST API calls to 5,000/hour per user. Twitter caps how many tweets a user can read per day. Stripe throttles payment API calls. Without these limits, a single misbehaving client could degrade service for millions.

Real-World Analogy

Think of rate limiting like a nightclub bouncer:

  • Token Bucket: The bouncer has a bucket of tokens. Each person entering takes one. Tokens refill at a steady rate. If the bucket is empty, people wait in line.
  • Fixed Window: The bouncer counts people entering per hour. At the start of each hour, the count resets to zero.
  • Sliding Window: The bouncer tracks the exact time each person entered and counts how many entered in the last 60 minutes.

Where to Apply Rate Limiting

API Gateway Level

Rate limit at the entry point before requests reach your services. Tools: NGINX, Kong, AWS API Gateway, Cloudflare.

Application Level

Fine-grained limits per endpoint, user tier, or operation. Custom logic for business rules (free vs paid tiers).

Database Level

Connection pooling and query rate limits to protect your database from being overwhelmed by application servers.
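One common way to enforce such a limit at the application/database boundary is a semaphore sized to the connection pool. The sketch below is illustrative; the `ConnectionLimiter` class name and the pool size of 2 are hypothetical.

```python
import threading

class ConnectionLimiter:
    """Cap concurrent database operations with a semaphore."""
    def __init__(self, max_concurrent: int):
        # max_concurrent should match your DB connection pool size
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self) -> bool:
        # Non-blocking: reject immediately instead of queueing
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()

pool = ConnectionLimiter(max_concurrent=2)
print(pool.try_acquire())  # True
print(pool.try_acquire())  # True
print(pool.try_acquire())  # False: both slots in use
pool.release()
```

Rejecting instead of blocking keeps a slow database from stalling every application thread; callers can return a 503 or retry later.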

Client Side

Well-behaved clients implement exponential backoff and respect rate limit headers to avoid hammering the server.

Token Bucket Algorithm

The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens that are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts.

Token Bucket Visualization (figure): a bucket with capacity 5 refills at 1 token/second; each request takes one token. Tokens available → 200 OK; bucket empty → 429 Too Many Requests. Burst capacity: 5 requests at once. Sustained rate: 1 req/sec. Memory: O(1) per user.
token_bucket.py
# Token Bucket Rate Limiter
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: max tokens the bucket can hold (burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # Start full
        self.last_refill = time.time()

    def _refill(self):
        """Add tokens based on elapsed time."""
        now = time.time()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(
            self.capacity,
            self.tokens + new_tokens
        )
        self.last_refill = now

    def allow_request(self, tokens_needed=1) -> bool:
        """Check if request is allowed."""
        self._refill()
        if self.tokens >= tokens_needed:
            self.tokens -= tokens_needed
            return True
        return False

    def wait_time(self) -> float:
        """How long until next token is available."""
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.refill_rate

# Usage: 10 requests/second, burst of 20
limiter = TokenBucket(capacity=20, refill_rate=10)

for i in range(25):
    if limiter.allow_request():
        print(f"Request {i+1}: Allowed")
    else:
        print(f"Request {i+1}: Rejected (retry in {limiter.wait_time():.2f}s)")

Sliding Window Algorithm

The sliding window counter combines the simplicity of fixed windows with the accuracy of a true sliding window. It uses weighted counts from the current and previous window to approximate the request count.

Sliding Window Counter (figure): the previous window (12:00-12:59) saw 42 requests; the current window (13:00-13:59) has 18 so far. At 13:15, 75% of the previous window overlaps the sliding 60-minute window (45 of 60 minutes), so the weighted count = 42 × 0.75 + 18 = 49.5.

Algorithm Comparison

  • Token Bucket — Memory: O(1) per user. Accuracy: good. Bursts: allows controlled bursts up to capacity. Implementation: simple; used by AWS and Stripe.
  • Leaky Bucket — Memory: O(1) per user. Accuracy: good. Bursts: smooths bursts to a fixed output rate. Implementation: like a token bucket, but processes at a fixed rate.
  • Fixed Window — Memory: O(1) per user. Accuracy: low (up to 2x burst at window boundaries). Implementation: simple counter plus reset.
  • Sliding Window Log — Memory: O(N) per user. Accuracy: perfect, no burst issues. Implementation: stores every timestamp; expensive.
  • Sliding Window Counter — Memory: O(1) per user. Accuracy: very good, minimal boundary issues. Implementation: weighted average of two windows; best trade-off.
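The leaky bucket appears in the comparison but isn't implemented elsewhere in this section, so here is a minimal meter-style sketch (the class name and parameters are illustrative; queue-based variants that drain requests at a fixed rate also exist):

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the bucket drains at a fixed rate,
    each request adds one unit of 'water', and a full bucket rejects."""
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity     # bucket size
        self.leak_rate = leak_rate   # units drained per second
        self.water = 0.0
        self.last_leak = time.time()

    def allow_request(self) -> bool:
        now = time.time()
        # Drain according to elapsed time, never below empty
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False

bucket = LeakyBucket(capacity=3, leak_rate=1.0)
results = [bucket.allow_request() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Note the inversion relative to the token bucket: tokens draining *out* grant capacity back, so steady traffic is smoothed to the leak rate rather than allowed to burst.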

Fixed Window Boundary Problem

Problem: With a limit of 100 requests per minute, a user can send 100 requests at 12:00:59 and another 100 at 12:01:00, getting 200 requests through in 2 seconds!

Solution: Use sliding window counters or token buckets instead. They prevent this burst-at-boundary attack.
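The boundary problem is easy to reproduce with a minimal fixed window counter (an illustrative sketch; the class name is made up, and time is passed in explicitly to make the boundary visible):

```python
class FixedWindowCounter:
    """Fixed window counter: the count hard-resets at each boundary."""
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow_request(self, now: float) -> bool:
        window_start = (now // self.window) * self.window
        if window_start != self.window_start:
            self.window_start = window_start
            self.count = 0   # hard reset: the source of the boundary problem
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# 100 req/min limit: 100 requests at t=59s all pass...
fw = FixedWindowCounter(limit=100, window_seconds=60)
burst1 = sum(fw.allow_request(now=59.0) for _ in range(100))
# ...and 100 more at t=60s also pass: 200 requests in ~2 seconds
burst2 = sum(fw.allow_request(now=60.0) for _ in range(100))
print(burst1, burst2)  # 100 100
```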

Distributed Rate Limiting

When your API runs on multiple servers, each server needs to share rate limit state. A centralized store like Redis is the standard approach.

redis_rate_limiter.py
# Distributed rate limiter using Redis
import redis
import time

class RedisRateLimiter:
    def __init__(self, redis_client, limit: int,
                 window_seconds: int):
        """
        limit: max requests per window
        window_seconds: window size in seconds
        """
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def is_allowed(self, client_id: str) -> dict:
        """Sliding window counter using Redis."""
        now = time.time()
        window_key = f"rate:{client_id}"

        # Use a Redis pipeline for atomicity
        pipe = self.redis.pipeline()

        # Remove old entries outside the window
        pipe.zremrangebyscore(window_key, 0, now - self.window)

        # Count current requests in window
        pipe.zcard(window_key)

        # Add current request with timestamp as score
        member = str(now)
        pipe.zadd(window_key, {member: now})

        # Set TTL so keys auto-expire
        pipe.expire(window_key, self.window)

        results = pipe.execute()
        request_count = results[1]  # zcard result (excludes this request)

        allowed = request_count < self.limit
        if not allowed:
            # Remove the entry so rejected requests don't count
            # toward the limit and lock the client out indefinitely
            self.redis.zrem(window_key, member)
        remaining = max(0, self.limit - request_count - 1)

        return {
            "allowed": allowed,
            "limit": self.limit,
            "remaining": remaining if allowed else 0,
            "retry_after": self.window if not allowed else 0,
        }

# Usage
r = redis.Redis(host="localhost", port=6379)
limiter = RedisRateLimiter(r, limit=100, window_seconds=60)

result = limiter.is_allowed("user:42")
if result["allowed"]:
    print(f"Request allowed. {result['remaining']} remaining.")
else:
    print(f"Rate limited. Retry after {result['retry_after']}s.")

Rate Limiting HTTP Headers

Common Rate Limit Headers (the 429 status comes from RFC 6585; the X-RateLimit-* headers are a widely used convention being standardized in draft-ietf-httpapi-ratelimit-headers)

  • X-RateLimit-Limit: Maximum requests allowed in the window (e.g., 100)
  • X-RateLimit-Remaining: Requests remaining in current window (e.g., 57)
  • X-RateLimit-Reset: Unix timestamp when the window resets (e.g., 1640000000)
  • Retry-After: Seconds to wait before retrying (sent with 429 responses)
rate_limit_middleware.py
# Flask middleware that adds rate limit headers
import time

import redis
from flask import Flask, request, jsonify
from functools import wraps

app = Flask(__name__)

# Reuses RedisRateLimiter from the previous example
limiter = RedisRateLimiter(
    redis.Redis(host="localhost", port=6379),
    limit=100, window_seconds=60,
)

def rate_limit(limit=100, window=60):
    """Decorator to rate limit an endpoint."""
    def decorator(f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            client_id = request.headers.get(
                "X-API-Key", request.remote_addr
            )
            result = limiter.is_allowed(client_id)

            # Always set rate limit headers
            headers = {
                "X-RateLimit-Limit": str(limit),
                "X-RateLimit-Remaining": str(result["remaining"]),
                "X-RateLimit-Reset": str(
                    int(time.time()) + window
                ),
            }

            if not result["allowed"]:
                headers["Retry-After"] = str(window)
                return jsonify({
                    "error": "Rate limit exceeded",
                    "retry_after": window
                }), 429, headers

            response = f(*args, **kwargs)
            return response, 200, headers
        return wrapped
    return decorator

@app.route("/api/data")
@rate_limit(limit=100, window=60)
def get_data():
    return jsonify({"data": "here"})

Practice Problems

Medium: Design a Tiered Rate Limiter

Design a rate limiter that supports different tiers:

  1. Free tier: 10 requests/minute
  2. Basic tier: 100 requests/minute
  3. Premium tier: 1000 requests/minute
  4. Implement graceful degradation (queue excess requests for premium users instead of rejecting)

Use a dictionary mapping user tiers to token bucket configurations. For premium queuing, use a separate queue with a maximum size.

import time
from collections import deque

# Assumes the TokenBucket class from earlier in this section

TIER_LIMITS = {
    "free":    {"capacity": 10,   "rate": 10/60},
    "basic":   {"capacity": 100,  "rate": 100/60},
    "premium": {"capacity": 1000, "rate": 1000/60},
}

class TieredRateLimiter:
    def __init__(self):
        self.buckets = {}
        self.queues = {}  # For premium overflow

    def _get_bucket(self, user_id, tier):
        if user_id not in self.buckets:
            cfg = TIER_LIMITS[tier]
            self.buckets[user_id] = TokenBucket(
                cfg["capacity"], cfg["rate"])
        return self.buckets[user_id]

    def handle_request(self, user_id, tier):
        bucket = self._get_bucket(user_id, tier)
        if bucket.allow_request():
            return {"status": "allowed"}

        # Premium users get queued
        if tier == "premium":
            q = self.queues.setdefault(
                user_id, deque(maxlen=50))
            if len(q) < 50:
                q.append(time.time())
                return {"status": "queued",
                        "position": len(q)}

        return {"status": "rejected",
                "retry_after": bucket.wait_time()}

Medium: Sliding Window Log

Implement a sliding window log rate limiter that:

  1. Stores exact timestamps of each request
  2. Counts requests in the last N seconds
  3. Cleans up old timestamps to bound memory usage

Use a sorted list or deque of timestamps. On each request, remove timestamps older than the window, then check if count is under the limit.

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self) -> bool:
        now = time.time()
        cutoff = now - self.window

        # Remove expired timestamps
        while (self.timestamps and
               self.timestamps[0] <= cutoff):
            self.timestamps.popleft()

        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

# Pros: Perfect accuracy
# Cons: O(N) memory per user where N = limit

Hard: Distributed Rate Limiter Design

Design a rate limiting system for a global API serving 1 million requests per second across 5 data centers. Consider:

  1. How to share rate limit state across data centers
  2. What happens if the Redis cluster is temporarily unavailable
  3. How to handle clock skew between servers

Consider local rate limiting with periodic sync. Use a "fail-open" policy for Redis outages. NTP keeps clocks within milliseconds.

# Distributed Rate Limiter Architecture

# 1. Local + Global approach
# - Each DC has local Redis for fast checks
# - Global Redis syncs counts periodically
# - Local limit = global_limit / num_DCs
# - Allows 20% overflow before global sync

# 2. Fail-open policy
# - If Redis is down, allow requests but log
# - Switch to in-memory local rate limiting
# - Alert ops team immediately

# 3. Clock skew mitigation
# - Use Redis server time (TIME command)
# - NTP keeps servers within ~1ms
# - Window granularity >> clock skew

class DistributedRateLimiter:
    def __init__(self, local_redis, global_redis,
                 global_limit, num_dcs=5):
        self.local = local_redis
        self.global_r = global_redis
        self.local_limit = global_limit // num_dcs
        self.overflow = int(self.local_limit * 0.2)

    def is_allowed(self, client_id):
        try:
            count = self.local.incr(
                f"rl:{client_id}")
            if count == 1:
                self.local.expire(
                    f"rl:{client_id}", 60)
            # Allow the 20% overflow margin until global sync catches up
            return count <= self.local_limit + self.overflow
        except Exception:
            # Fail open: allow but log
            return True

Quick Reference

Algorithm Selection Guide

  • General API rate limiting → Token Bucket: simple, allows bursts, O(1) memory
  • Strict uniform rate → Leaky Bucket: smooths traffic to a constant rate
  • Simple counter per hour → Fixed Window: easy to implement, low overhead
  • Accurate window-based limiting → Sliding Window Counter: best accuracy-to-memory ratio
  • Audit / compliance → Sliding Window Log: perfect accuracy, full log of requests

Key Design Considerations

Rate Limiting Checklist

  • Identify the key: IP address, API key, user ID, or combination
  • Choose the scope: Per endpoint, per user, global
  • Set appropriate limits: Based on service capacity, not arbitrary numbers
  • Return helpful headers: X-RateLimit-Limit, Remaining, Reset, Retry-After
  • Handle distributed state: Redis or similar for multi-server setups
  • Fail-open vs fail-closed: What happens when the rate limiter itself is down?
  • Monitoring: Alert on high rejection rates (may indicate an attack or misconfiguration)