Scalability & Performance


Why Scalability Matters

The Problem: Your app works great with 100 users. But what happens when you go from 100 to 100,000 to 10 million users? Without a scalable architecture, your system slows down, crashes, or becomes prohibitively expensive to run.

The Solution: Design your system to grow gracefully by understanding scaling strategies, performance metrics, and optimization techniques.

Real Impact: Netflix scaled from mailing DVDs to streaming video to 250+ million subscribers worldwide by mastering horizontal scaling and microservices.

Real-World Analogy

Think of scalability like a restaurant:

  • Vertical scaling = Hiring a faster chef (bigger, more powerful machine)
  • Horizontal scaling = Opening more locations (adding more machines)
  • Latency = How long a customer waits for their food
  • Throughput = How many customers you can serve per hour
  • Bottleneck = The slowest station in the kitchen

Vertical vs Horizontal Scaling

There are two fundamental approaches to scaling a system. Each has trade-offs, and most real-world systems use a combination of both.

  • Vertical Scaling (Scale Up): grow one machine (e.g. 4 CPU / 8 GB -> 32 CPU / 128 GB). Pros: simple to implement, no code changes needed. Cons: hardware ceiling, single point of failure.
  • Horizontal Scaling (Scale Out): grow the fleet (one server -> many). Pros: no hardware ceiling, built-in redundancy. Cons: more complex architecture, data consistency challenges.
  • When to use each: vertical for databases, early-stage apps, quick wins, and predictable workloads; horizontal for web servers, stateless services, high traffic, and fault tolerance.

Vertical Scaling (Scale Up)

Add more power to your existing machine -- more CPU, more RAM, faster storage. Simple but has an upper limit. A single server can only get so big.

Horizontal Scaling (Scale Out)

Add more machines to your pool. Distributes the load across multiple servers. More complex but virtually unlimited. This is how cloud-native applications scale.
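To make the scale-out idea concrete, here is a minimal sketch of how a load balancer might spread incoming requests across a pool of stateless servers using round-robin; the server names are hypothetical:

```python
from itertools import cycle

# Hypothetical pool of identical, stateless app servers
servers = ["app-1", "app-2", "app-3"]

def make_round_robin(pool):
    """Return a function that picks the next server in rotation."""
    rotation = cycle(pool)
    return lambda: next(rotation)

pick = make_round_robin(servers)

# Six incoming requests are spread evenly across the three servers
assignments = [pick() for _ in range(6)]
print(assignments)
# ['app-1', 'app-2', 'app-3', 'app-1', 'app-2', 'app-3']
```

Real load balancers (NGINX, HAProxy, cloud LBs) add health checks and weighting on top of this, but the core idea is the same: no single machine sees all the traffic.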

Common Pitfall: Premature Optimization

Do not over-engineer for scale you do not have yet. Start simple (vertical), then move to horizontal when you hit limits. Instagram famously ran on a single database server for a long time before sharding.

Latency and Throughput

Latency and throughput are the two most important performance metrics in system design. They are related but measure different things.

| Metric | Definition | Analogy | Optimize By |
|---|---|---|---|
| Latency | Time for a single request to complete | How long one car takes to drive from A to B | Caching, CDNs, faster algorithms |
| Throughput | Number of requests handled per second | How many cars pass a point per hour | Horizontal scaling, async processing |
| Bandwidth | Maximum data that can be transferred per second | Number of lanes on the highway | Compression, better network links |
| P50 / P99 | 50th / 99th percentile response time | Typical vs worst-case drive time | Identify outliers, optimize tail latency |

Understanding Percentiles

Why Averages Lie

Averages can hide problems. If 99% of requests take 50ms but 1% take 5 seconds, the average looks fine (~100ms) but 1 in 100 users has a terrible experience.
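The arithmetic in this example can be checked directly with the standard library's statistics module:

```python
import statistics

# 99 requests at 50 ms and 1 request at 5,000 ms
latencies = [50.0] * 99 + [5000.0]

mean_ms = statistics.mean(latencies)    # 99.5 -- looks healthy
p50_ms = statistics.median(latencies)   # 50.0 -- the typical request
worst_ms = max(latencies)               # 5000.0 -- what 1 in 100 users sees

print(f"Mean: {mean_ms} ms, P50: {p50_ms} ms, Max: {worst_ms} ms")
```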

Use percentiles instead:

  • P50 (median): 50% of requests are faster than this
  • P95: 95% of requests are faster -- catches most slow requests
  • P99: 99% of requests are faster -- the "tail latency"
  • P99.9: Used by high-traffic services like Amazon (even 0.1% = thousands of unhappy users)
measure_latency.py
import time
import numpy as np
import requests

def benchmark_endpoint(url, num_requests=1000):
    """Measure latency percentiles for an endpoint."""
    latencies = []

    for _ in range(num_requests):
        start = time.perf_counter()
        response = requests.get(url)
        end = time.perf_counter()
        latency_ms = (end - start) * 1000
        latencies.append(latency_ms)

    latencies = np.array(latencies)

    print(f"--- Latency Report ({num_requests} requests) ---")
    print(f"P50  (median): {np.percentile(latencies, 50):.1f} ms")
    print(f"P95:           {np.percentile(latencies, 95):.1f} ms")
    print(f"P99:           {np.percentile(latencies, 99):.1f} ms")
    print(f"P99.9:         {np.percentile(latencies, 99.9):.1f} ms")
    print(f"Max:           {np.max(latencies):.1f} ms")
    print(f"Throughput:     {num_requests / (np.sum(latencies) / 1000):.0f} req/s")

# Example usage
benchmark_endpoint("http://localhost:8080/api/health")

# Sample output:
# --- Latency Report (1000 requests) ---
# P50  (median): 12.3 ms
# P95:           45.7 ms
# P99:           123.4 ms
# P99.9:         456.2 ms
# Max:           892.1 ms
# Throughput:     847 req/s

Performance Optimization Strategies

Once you understand your performance metrics, here are the key strategies to improve them.

Caching

Store frequently accessed data in fast storage (memory). Reduces database load and latency. Use Redis or Memcached for application-level caching.
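Here is a minimal sketch of the cache-aside pattern, using a plain in-process dict in place of Redis; `load_user` is a hypothetical stand-in for a slow database query:

```python
import time

_cache = {}       # key -> (value, expiry timestamp)
TTL_SECONDS = 60  # evict entries after a minute

db_calls = 0  # count how often we hit the "database"

def load_user(user_id):
    """Hypothetical slow database lookup."""
    global db_calls
    db_calls += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the loader."""
    entry = _cache.get(user_id)
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]  # cache hit: no database call
    value = load_user(user_id)  # cache miss: load, then store with a TTL
    _cache[user_id] = (value, time.monotonic() + TTL_SECONDS)
    return value

get_user(42)      # miss -> loads from the "database"
get_user(42)      # hit  -> served from memory
print(db_calls)   # 1
```

Redis gives you the same pattern across many app servers, plus eviction policies and persistence; the TTL matters either way, since stale cache entries are the classic failure mode of caching.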

Database Indexing

Create indexes on frequently queried columns. Turns O(n) table scans into O(log n) lookups. The single biggest performance win for most applications.
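One way to see this in action without a full database setup is SQLite's EXPLAIN QUERY PLAN, which reports a full table scan before the index exists and an index search afterwards:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(10_000)],
)

query = "SELECT * FROM users WHERE email = ?"
params = ("user42@example.com",)

# Before the index: SQLite must scan every row -- O(n)
plan_before = conn.execute(f"EXPLAIN QUERY PLAN {query}", params).fetchall()
print(plan_before)  # plan detail mentions a SCAN of users

# Create an index on the queried column
conn.execute("CREATE INDEX idx_users_email ON users(email)")

# After the index: a B-tree search -- O(log n)
plan_after = conn.execute(f"EXPLAIN QUERY PLAN {query}", params).fetchall()
print(plan_after)   # plan detail mentions SEARCH ... USING INDEX idx_users_email
```

The same EXPLAIN-style workflow (e.g. EXPLAIN ANALYZE in Postgres) is how you verify an index is actually being used in production databases.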

Connection Pooling

Reuse database connections instead of creating new ones for each request. Eliminates the overhead of TCP handshakes and authentication.

Async Processing

Move slow operations (email sending, image processing, analytics) to background job queues. Return immediately to the user and process later.

Content Delivery Networks

Serve static assets (images, CSS, JS) from edge servers close to users. Reduces latency for users far from your origin server.

Data Compression

Compress responses with gzip or Brotli. Reduces bandwidth usage and transfer time, especially for text-based content like JSON and HTML.
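A quick sketch with Python's built-in gzip module shows how well repetitive text-based payloads compress:

```python
import gzip
import json

# A repetitive JSON payload, typical of API responses
payload = json.dumps([{"id": i, "status": "ok"} for i in range(500)]).encode()

compressed = gzip.compress(payload)

print(f"Original:   {len(payload):,} bytes")
print(f"Compressed: {len(compressed):,} bytes")
print(f"Ratio:      {len(payload) / len(compressed):.1f}x smaller")
```

In practice you rarely call this yourself: web servers and reverse proxies (NGINX, CDNs) negotiate gzip or Brotli via the Accept-Encoding header and compress responses transparently.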

Connection Pooling Example

connection_pool.py
import psycopg2
from psycopg2 import pool
import time

# WITHOUT connection pooling (slow)
def query_without_pool(query, n_queries=100):
    start = time.time()
    for _ in range(n_queries):
        # New connection for EVERY query (expensive!)
        conn = psycopg2.connect(
            host="localhost",
            database="mydb",
            user="user",
            password="pass"
        )
        cursor = conn.cursor()
        cursor.execute(query)
        cursor.fetchall()
        conn.close()  # Connection destroyed
    elapsed = time.time() - start
    print(f"Without pool: {elapsed:.2f}s for {n_queries} queries")

# WITH connection pooling (fast)
def query_with_pool(query, n_queries=100):
    # Create pool once (reuse connections)
    connection_pool = pool.SimpleConnectionPool(
        minconn=5,   # Minimum connections to keep open
        maxconn=20,  # Maximum connections allowed
        host="localhost",
        database="mydb",
        user="user",
        password="pass"
    )

    start = time.time()
    for _ in range(n_queries):
        conn = connection_pool.getconn()  # Borrow from pool
        cursor = conn.cursor()
        cursor.execute(query)
        cursor.fetchall()
        connection_pool.putconn(conn)     # Return to pool
    elapsed = time.time() - start
    print(f"With pool:    {elapsed:.2f}s for {n_queries} queries")
    connection_pool.closeall()

# Typical results:
# Without pool: 4.52s for 100 queries
# With pool:    0.31s for 100 queries  (~15x faster!)

Async Processing with Task Queues

async_tasks.py
from celery import Celery
import time

# Create a Celery app connected to Redis as broker
app = Celery("tasks", broker="redis://localhost:6379/0")

# Define a slow task that runs in the background
@app.task
def send_welcome_email(user_email):
    """Simulate sending an email (takes 3 seconds)."""
    time.sleep(3)  # Simulate SMTP delay
    print(f"Email sent to {user_email}")
    return True

@app.task
def process_uploaded_image(image_path):
    """Resize and optimize an uploaded image."""
    time.sleep(5)  # Simulate image processing
    print(f"Image processed: {image_path}")
    return True

# In your API handler (responds instantly!):
def register_user(user_data):
    # Save user to database (fast)
    user = save_to_db(user_data)

    # Queue email for background processing (instant)
    send_welcome_email.delay(user.email)

    # Return response immediately -- don't wait for email
    return {"status": "registered", "user_id": user.id}

Practice Problems

Identify the Bottleneck (Easy)

A web application has the following performance profile:

  • Web server response time: 5ms
  • Application logic: 15ms
  • Database query: 200ms
  • Network round-trip: 30ms

What is the total latency? Where is the bottleneck? How would you optimize it?

Hint: Add up all the times for total latency. The bottleneck is the slowest component. Think about caching, indexing, and query optimization.

# Total latency = 5 + 15 + 200 + 30 = 250ms
# Bottleneck: Database query (200ms = 80% of total)

# Optimization strategies:
# 1. Add database indexes on queried columns
# 2. Cache frequent queries in Redis
#    - Cache hit: ~1ms vs 200ms DB query
# 3. Optimize the SQL query (EXPLAIN ANALYZE)
# 4. Add read replicas for read-heavy workloads

# After optimization:
# Cache hit:  5 + 15 + 1 + 30 = 51ms  (5x faster)
# Cache miss: 5 + 15 + 50 + 30 = 100ms (with index)

Scaling Strategy Design (Medium)

You run an e-commerce platform with:

  • Normal traffic: 1,000 requests/second
  • Black Friday peak: 50,000 requests/second
  • Single server handles 2,000 requests/second

Design a scaling strategy. How many servers at peak? What components need horizontal scaling?

Hint: Calculate the minimum number of servers needed at peak, then add a safety buffer. Think about which layers are stateless (easy to scale) vs stateful (harder).

# Minimum servers at peak
peak_rps = 50_000
server_capacity = 2_000
min_servers = peak_rps / server_capacity  # 25 servers

# Add 20% safety buffer
target_servers = int(min_servers * 1.2)  # 30 servers

# Scaling strategy:
# 1. Web/App servers: Horizontal (stateless)
#    - Auto-scale from 2 to 30 based on CPU/RPS
#    - Use load balancer to distribute traffic
# 2. Database: Vertical + read replicas
#    - Primary for writes, 3-5 replicas for reads
# 3. Cache layer: Redis cluster (3+ nodes)
#    - Cache product pages, session data
# 4. CDN for static assets
#    - Offload image/CSS/JS serving
# 5. Queue for order processing
#    - Decouple checkout from fulfillment

Amdahl's Law (Medium)

Your application spends 60% of its time on database queries and 40% on computation. If you make the database queries 10x faster, what is the theoretical speedup of the whole system?

Hint: Use Amdahl's Law: Speedup = 1 / ((1 - P) + P/S), where P is the fraction of time improved and S is the speedup factor for that fraction.

# Amdahl's Law: Speedup = 1 / ((1 - P) + P/S)
# P = fraction that can be improved = 0.60
# S = speedup of that fraction = 10x

P = 0.60  # 60% is DB queries
S = 10    # 10x faster DB

speedup = 1 / ((1 - P) + P / S)
print(f"Theoretical speedup: {speedup:.2f}x")
# Output: Theoretical speedup: 2.17x

# Even making DB 10x faster only gives 2.17x overall!
# The 40% computation becomes the new bottleneck.

# Maximum possible speedup (DB infinitely fast):
max_speedup = 1 / (1 - P)
print(f"Max possible: {max_speedup:.2f}x")
# Output: Max possible: 2.50x

# Lesson: You must optimize ALL bottlenecks,
# not just the obvious one.

Quick Reference

Scaling Strategies Comparison

| Strategy | Type | Best For | Complexity |
|---|---|---|---|
| Vertical Scaling | Scale Up | Databases, quick fixes | Low |
| Horizontal Scaling | Scale Out | Stateless web servers | Medium |
| Caching | Optimization | Read-heavy workloads | Low-Medium |
| Database Sharding | Scale Out | Large datasets | High |
| CDN | Edge Distribution | Static content, global users | Low |
| Async Processing | Decoupling | Slow background tasks | Medium |
| Read Replicas | Scale Out | Read-heavy databases | Medium |

Key Latency Numbers

| Operation | Time | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | Fastest possible access |
| L2 cache reference | 7 ns | 14x L1 |
| Main memory reference | 100 ns | 200x L1 |
| Redis GET | ~0.1 ms | In-memory, but over the network |
| SSD random read | ~0.15 ms | ~1,500x main memory |
| Database query (indexed) | ~1-10 ms | Depends on query complexity |
| Same-datacenter round-trip | ~0.5 ms | Within one datacenter |
| Cross-continent round-trip | ~150 ms | Bounded by the speed of light |

Key Takeaways

  • Start with vertical scaling, move to horizontal when you hit limits
  • Measure with percentiles (P50, P95, P99), not averages
  • Identify bottlenecks before optimizing -- use Amdahl's Law
  • Caching is almost always the first optimization to try
  • Move slow work to background queues