Why Rate Limiting Matters
The Problem: Without rate limiting, a single user or bot can overwhelm your API with millions of requests, causing outages for everyone. DDoS attacks, buggy client loops, and viral traffic spikes can all take down services.
The Solution: Rate limiting controls how many requests a client can make in a given time window. Excess requests are rejected (HTTP 429) or queued.
Real Impact: GitHub limits API calls to 5,000/hour per authenticated user. Twitter limits tweet reads. Stripe limits payment API calls. Without these limits, a single misbehaving client could degrade service for millions.
Real-World Analogy
Think of rate limiting like a nightclub bouncer:
- Token Bucket: The bouncer has a bucket of tokens. Each person entering takes one. Tokens refill at a steady rate. If the bucket is empty, people wait in line.
- Fixed Window: The bouncer counts people entering per hour. At the start of each hour, the count resets to zero.
- Sliding Window: The bouncer tracks the exact time each person entered and counts how many entered in the last 60 minutes.
Where to Apply Rate Limiting
API Gateway Level
Rate limit at the entry point before requests reach your services. Tools: NGINX, Kong, AWS API Gateway, Cloudflare.
Application Level
Fine-grained limits per endpoint, user tier, or operation. Custom logic for business rules (free vs paid tiers).
Database Level
Connection pooling and query rate limits to protect your database from being overwhelmed by application servers.
Client Side
Well-behaved clients implement exponential backoff and respect rate limit headers to avoid hammering the server.
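A well-behaved client loop can be sketched as follows. This is a minimal illustration using only the standard library; `do_request` is a placeholder for whatever call your client makes, assumed to return a status code and a headers dict:

```python
import time
import random

def fetch_with_backoff(do_request, max_retries=5):
    """Call do_request(); on a 429, wait per Retry-After or exponential backoff."""
    for attempt in range(max_retries):
        status, headers = do_request()
        if status != 429:
            return status
        # Prefer the server's Retry-After header when present
        delay = float(headers.get("Retry-After", 2 ** attempt))
        # Add jitter so many clients don't retry in lockstep
        time.sleep(delay + random.uniform(0, 0.1 * delay))
    return 429
```

The jitter matters in practice: without it, every client that was limited at the same moment retries at the same moment, recreating the spike.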
Token Bucket Algorithm
The token bucket is the most widely used rate limiting algorithm. A bucket holds tokens that are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts.
```python
# Token Bucket Rate Limiter
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: max tokens the bucket can hold (burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # Start full
        self.last_refill = time.time()

    def _refill(self):
        """Add tokens based on elapsed time."""
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow_request(self, tokens_needed: int = 1) -> bool:
        """Consume tokens if available; return whether the request is allowed."""
        self._refill()
        if self.tokens >= tokens_needed:
            self.tokens -= tokens_needed
            return True
        return False

    def wait_time(self) -> float:
        """How long until the next token is available."""
        self._refill()  # Keep the estimate current even between requests
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.refill_rate

# Usage: 10 requests/second, burst of 20
limiter = TokenBucket(capacity=20, refill_rate=10)
for i in range(25):
    if limiter.allow_request():
        print(f"Request {i+1}: Allowed")
    else:
        print(f"Request {i+1}: Rejected (retry in {limiter.wait_time():.2f}s)")
```
Sliding Window Algorithm
The sliding window counter combines the simplicity of fixed windows with the accuracy of a true sliding window. It uses weighted counts from the current and previous window to approximate the request count.
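The weighted-count approximation can be sketched as follows. This is a minimal single-process version under our own naming; a production limiter would keep these counters per client:

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_start = time.time()  # start of the current window
        self.current_count = 0
        self.previous_count = 0

    def allow_request(self):
        now = time.time()
        elapsed = now - self.current_start
        # Roll windows forward if the current one has ended
        if elapsed >= self.window:
            windows_passed = int(elapsed // self.window)
            # Only the immediately preceding window still overlaps
            self.previous_count = self.current_count if windows_passed == 1 else 0
            self.current_count = 0
            self.current_start += windows_passed * self.window
            elapsed = now - self.current_start
        # Weight the previous window's count by how much it still overlaps
        overlap = (self.window - elapsed) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```

The estimate assumes requests were spread evenly across the previous window, which is why the algorithm is approximate rather than exact.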
Fixed Window vs Sliding Window
| Algorithm | Memory | Accuracy | Burst Handling | Implementation |
|---|---|---|---|---|
| Token Bucket | O(1) per user | Good | Allows controlled bursts up to capacity | Simple; used by AWS, Stripe |
| Leaky Bucket | O(1) per user | Good | Smooths bursts to fixed rate | Like token bucket but processes at fixed rate |
| Fixed Window | O(1) per user | Low | 2x burst at window boundary | Simple counter + reset |
| Sliding Window Log | O(N) per user | Perfect | No burst issues | Store all timestamps; expensive |
| Sliding Window Counter | O(1) per user | Very good | Minimal boundary issues | Weighted average; best trade-off |
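The leaky bucket row above can be sketched as a counter that drains at a fixed rate (a minimal illustration; class and attribute names are our own):

```python
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        """capacity: max queued requests; leak_rate: requests drained per second."""
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.level = 0.0  # current queue depth
        self.last_leak = time.time()

    def allow_request(self):
        now = time.time()
        # Drain the bucket at the fixed leak rate
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level < self.capacity:
            self.level += 1
            return True
        return False
```

Unlike the token bucket, arrivals never pass through faster than the leak rate, which is why the table describes it as smoothing bursts rather than allowing them.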
Fixed Window Boundary Problem
Problem: With a limit of 100 requests per minute, a user can send 100 requests at 12:00:59 and another 100 at 12:01:00, getting 200 requests through in 2 seconds!
Solution: Use sliding window counters or token buckets instead. They prevent this burst-at-boundary attack.
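The boundary burst is easy to reproduce with a simple fixed-window counter. In this sketch we inject timestamps instead of calling `time.time()` so the boundary case can be shown deterministically:

```python
class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = None
        self.count = 0

    def allow_request(self, now):
        # Reset the counter when a new window begins
        if self.window_start is None or now - self.window_start >= self.window:
            self.window_start = now - (now % self.window)
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

# 100 requests at t=59s (end of window 0), 100 more at t=60s (start of window 1)
limiter = FixedWindowCounter(limit=100, window_seconds=60)
burst1 = sum(limiter.allow_request(59.0) for _ in range(100))
burst2 = sum(limiter.allow_request(60.0) for _ in range(100))
print(burst1 + burst2)  # 200 -- double the limit, within about a second
```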
Distributed Rate Limiting
When your API runs on multiple servers, each server needs to share rate limit state. A centralized store like Redis is the standard approach.
```python
# Distributed rate limiter using Redis
import time
import uuid

import redis

class RedisRateLimiter:
    def __init__(self, redis_client, limit: int, window_seconds: int):
        """
        limit: max requests per window
        window_seconds: window size in seconds
        """
        self.redis = redis_client
        self.limit = limit
        self.window = window_seconds

    def is_allowed(self, client_id: str) -> dict:
        """Sliding window log using a Redis sorted set."""
        now = time.time()
        window_key = f"rate:{client_id}"
        # Use a Redis pipeline (MULTI/EXEC) so the commands run atomically
        pipe = self.redis.pipeline()
        # Remove entries that have aged out of the window
        pipe.zremrangebyscore(window_key, 0, now - self.window)
        # Count requests currently in the window
        pipe.zcard(window_key)
        # Add this request; a unique member avoids collisions when two
        # requests arrive with the same timestamp
        pipe.zadd(window_key, {f"{now}:{uuid.uuid4().hex}": now})
        # Set a TTL so idle keys auto-expire
        pipe.expire(window_key, self.window)
        results = pipe.execute()
        request_count = results[1]  # zcard result
        # Note: rejected requests are still recorded, so a client that keeps
        # hammering a full window stays limited until it backs off
        allowed = request_count < self.limit
        remaining = max(0, self.limit - request_count - 1)
        return {
            "allowed": allowed,
            "limit": self.limit,
            "remaining": remaining if allowed else 0,
            "retry_after": self.window if not allowed else 0,
        }

# Usage
r = redis.Redis(host="localhost", port=6379)
limiter = RedisRateLimiter(r, limit=100, window_seconds=60)
result = limiter.is_allowed("user:42")
if result["allowed"]:
    print(f"Request allowed. {result['remaining']} remaining.")
else:
    print(f"Rate limited. Retry after {result['retry_after']}s.")
```
Rate Limiting HTTP Headers
Standard Rate Limit Headers (the 429 status comes from RFC 6585; the X-RateLimit-* names are a widely used de facto convention, with standardized RateLimit fields proposed in draft-ietf-httpapi-ratelimit-headers)
- X-RateLimit-Limit: Maximum requests allowed in the window (e.g., 100)
- X-RateLimit-Remaining: Requests remaining in current window (e.g., 57)
- X-RateLimit-Reset: Unix timestamp when the window resets (e.g., 1640000000)
- Retry-After: Seconds to wait before retrying (sent with 429 responses)
```python
# Flask middleware that adds rate limit headers
import time
from functools import wraps

from flask import Flask, request, jsonify

app = Flask(__name__)

# Assumes `limiter` is a RedisRateLimiter instance from the previous section

def rate_limit(limit=100, window=60):
    """Decorator to rate limit an endpoint."""
    def decorator(f):
        @wraps(f)
        def wrapped(*args, **kwargs):
            # Identify the client by API key, falling back to IP address
            client_id = request.headers.get("X-API-Key", request.remote_addr)
            result = limiter.is_allowed(client_id)
            # Always set rate limit headers
            headers = {
                "X-RateLimit-Limit": str(limit),
                "X-RateLimit-Remaining": str(result["remaining"]),
                "X-RateLimit-Reset": str(int(time.time()) + window),
            }
            if not result["allowed"]:
                headers["Retry-After"] = str(window)
                return jsonify({
                    "error": "Rate limit exceeded",
                    "retry_after": window
                }), 429, headers
            response = f(*args, **kwargs)
            return response, 200, headers
        return wrapped
    return decorator

@app.route("/api/data")
@rate_limit(limit=100, window=60)
def get_data():
    return jsonify({"data": "here"})
```
Practice Problems
Medium Design a Tiered Rate Limiter
Design a rate limiter that supports different tiers:
- Free tier: 10 requests/minute
- Basic tier: 100 requests/minute
- Premium tier: 1000 requests/minute
- Implement graceful degradation (queue excess requests for premium users instead of rejecting)
Use a dictionary mapping user tiers to token bucket configurations. For premium queuing, use a separate queue with a maximum size.
```python
import time
from collections import deque

# Assumes the TokenBucket class defined earlier in this guide

TIER_LIMITS = {
    "free": {"capacity": 10, "rate": 10 / 60},
    "basic": {"capacity": 100, "rate": 100 / 60},
    "premium": {"capacity": 1000, "rate": 1000 / 60},
}

class TieredRateLimiter:
    def __init__(self):
        self.buckets = {}
        self.queues = {}  # For premium overflow

    def _get_bucket(self, user_id, tier):
        if user_id not in self.buckets:
            cfg = TIER_LIMITS[tier]
            self.buckets[user_id] = TokenBucket(cfg["capacity"], cfg["rate"])
        return self.buckets[user_id]

    def handle_request(self, user_id, tier):
        bucket = self._get_bucket(user_id, tier)
        if bucket.allow_request():
            return {"status": "allowed"}
        # Premium users get queued instead of rejected
        if tier == "premium":
            q = self.queues.setdefault(user_id, deque(maxlen=50))
            # Explicit length check: appending to a full deque(maxlen=50)
            # would silently drop the oldest queued request
            if len(q) < 50:
                q.append(time.time())
                return {"status": "queued", "position": len(q)}
        return {"status": "rejected", "retry_after": bucket.wait_time()}
```
Medium Sliding Window Log
Implement a sliding window log rate limiter that:
- Stores exact timestamps of each request
- Counts requests in the last N seconds
- Cleans up old timestamps to bound memory usage
Use a sorted list or deque of timestamps. On each request, remove timestamps older than the window, then check if count is under the limit.
```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow_request(self) -> bool:
        now = time.time()
        cutoff = now - self.window
        # Remove expired timestamps to bound memory
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

# Pros: Perfect accuracy
# Cons: O(N) memory per user where N = limit
```
Hard Distributed Rate Limiter Design
Design a rate limiting system for a global API serving 1 million requests per second across 5 data centers. Consider:
- How to share rate limit state across data centers
- What happens if the Redis cluster is temporarily unavailable
- How to handle clock skew between servers
Consider local rate limiting with periodic sync. Use a "fail-open" policy for Redis outages. NTP keeps clocks within milliseconds.
```python
# Distributed Rate Limiter Architecture
#
# 1. Local + Global approach
#    - Each DC has local Redis for fast checks
#    - Global Redis syncs counts periodically
#    - Local limit = global_limit / num_DCs
#    - Allows 20% overflow before global sync
# 2. Fail-open policy
#    - If Redis is down, allow requests but log
#    - Switch to in-memory local rate limiting
#    - Alert ops team immediately
# 3. Clock skew mitigation
#    - Use Redis server time (TIME command)
#    - NTP keeps servers within ~1ms
#    - Window granularity >> clock skew

class DistributedRateLimiter:
    def __init__(self, local_redis, global_redis, global_limit, num_dcs=5):
        self.local = local_redis
        self.global_r = global_redis
        self.local_limit = global_limit // num_dcs
        # Headroom for uneven traffic across DCs, reconciled at global sync
        self.overflow = int(self.local_limit * 0.2)

    def is_allowed(self, client_id):
        try:
            count = self.local.incr(f"rl:{client_id}")
            if count == 1:
                # First request in this window: start the 60s expiry
                self.local.expire(f"rl:{client_id}", 60)
            return count <= self.local_limit + self.overflow
        except Exception:
            # Fail open: allow the request but log the limiter outage
            return True
```
Quick Reference
Algorithm Selection Guide
| Use Case | Recommended Algorithm | Why |
|---|---|---|
| General API rate limiting | Token Bucket | Simple, allows bursts, O(1) memory |
| Strict uniform rate | Leaky Bucket | Smooths traffic to constant rate |
| Simple counter per hour | Fixed Window | Easy to implement, low overhead |
| Accurate window-based | Sliding Window Counter | Best accuracy-to-memory ratio |
| Audit / compliance | Sliding Window Log | Perfect accuracy, full log of requests |
Key Design Considerations
Rate Limiting Checklist
- Identify the key: IP address, API key, user ID, or combination
- Choose the scope: Per endpoint, per user, global
- Set appropriate limits: Based on service capacity, not arbitrary numbers
- Return helpful headers: X-RateLimit-Limit, Remaining, Reset, Retry-After
- Handle distributed state: Redis or similar for multi-server setups
- Fail-open vs fail-closed: What happens when the rate limiter itself is down?
- Monitoring: Alert on high rejection rates (may indicate an attack or misconfiguration)