Why Scalability Matters
The Problem: Your app works great with 100 users. But what happens when you go from 100 to 100,000 to 10 million users? Without a scalable architecture, your system slows down, crashes, or becomes prohibitively expensive to run.
The Solution: Design your system to grow gracefully by understanding scaling strategies, performance metrics, and optimization techniques.
Real Impact: Netflix scaled from mailing DVDs to streaming video to 250+ million subscribers worldwide by mastering horizontal scaling and microservices.
Real-World Analogy
Think of scalability like a restaurant:
- Vertical scaling = Hiring a faster chef (bigger, more powerful machine)
- Horizontal scaling = Opening more locations (adding more machines)
- Latency = How long a customer waits for their food
- Throughput = How many customers you can serve per hour
- Bottleneck = The slowest station in the kitchen
Vertical vs Horizontal Scaling
There are two fundamental approaches to scaling a system. Each has trade-offs, and most real-world systems use a combination of both.
Vertical Scaling (Scale Up)
Add more power to your existing machine -- more CPU, more RAM, faster storage. Simple but has an upper limit. A single server can only get so big.
Horizontal Scaling (Scale Out)
Add more machines to your pool. Distributes the load across multiple servers. More complex but virtually unlimited. This is how cloud-native applications scale.
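The core idea of scaling out can be sketched with a round-robin dispatcher, the simplest load-balancing strategy: each incoming request goes to the next server in rotation. The server names here are hypothetical:

```python
from itertools import cycle

# Hypothetical pool of identical, stateless app servers
servers = ["app-1", "app-2", "app-3"]
next_server = cycle(servers)

def route_request(request_id):
    """Assign each incoming request to the next server in rotation."""
    return next(next_server)

# Ten requests spread evenly across three servers
assignments = [route_request(i) for i in range(10)]
print(assignments[:4])  # ['app-1', 'app-2', 'app-3', 'app-1']
```

Real load balancers add health checks and smarter policies (least connections, weighted), but the principle is the same: any server can handle any request, so capacity grows by adding servers.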
Common Pitfall: Premature Optimization
Do not over-engineer for scale you do not have yet. Start simple (vertical), then move to horizontal when you hit limits. Instagram famously ran on a single database server for a long time before sharding.
Latency and Throughput
Latency and throughput are the two most important performance metrics in system design. They are related but measure different things.
| Metric | Definition | Analogy | Optimize By |
|---|---|---|---|
| Latency | Time for a single request to complete | How long one car takes to drive from A to B | Caching, CDNs, faster algorithms |
| Throughput | Number of requests handled per second | How many cars pass a point per hour | Horizontal scaling, async processing |
| Bandwidth | Maximum data that can be transferred | Number of lanes on the highway | Compression, better network links |
| P50 / P99 | 50th / 99th percentile response time | Typical vs worst-case drive time | Identify outliers, optimize tail latency |
Understanding Percentiles
Why Averages Lie
Averages can hide problems. If 99% of requests take 50ms but 1% take 5 seconds, the average looks fine (~100ms) but 1 in 100 users has a terrible experience.
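The arithmetic behind that deceptive average is easy to verify:

```python
# 99% of requests take 50 ms, 1% take 5,000 ms
mean_ms = 0.99 * 50 + 0.01 * 5000
print(mean_ms)  # 99.5 -- looks "fine", yet 1 in 100 users waits 5 seconds
```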
Use percentiles instead:
- P50 (median): 50% of requests are faster than this
- P95: 95% of requests are faster -- catches most slow requests
- P99: 99% of requests are faster -- the "tail latency"
- P99.9: Used by high-traffic services like Amazon (even 0.1% = thousands of unhappy users)
```python
import time

import numpy as np
import requests

def benchmark_endpoint(url, num_requests=1000):
    """Measure latency percentiles for an endpoint."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        response = requests.get(url)
        end = time.perf_counter()
        latency_ms = (end - start) * 1000
        latencies.append(latency_ms)

    latencies = np.array(latencies)
    print(f"--- Latency Report ({num_requests} requests) ---")
    print(f"P50 (median): {np.percentile(latencies, 50):.1f} ms")
    print(f"P95: {np.percentile(latencies, 95):.1f} ms")
    print(f"P99: {np.percentile(latencies, 99):.1f} ms")
    print(f"P99.9: {np.percentile(latencies, 99.9):.1f} ms")
    print(f"Max: {np.max(latencies):.1f} ms")
    # Sequential throughput: requests are issued one at a time,
    # so this is simply total requests / total elapsed time
    print(f"Throughput: {num_requests / (np.sum(latencies) / 1000):.0f} req/s")

# Example usage
benchmark_endpoint("http://localhost:8080/api/health")

# Sample output:
# --- Latency Report (1000 requests) ---
# P50 (median): 12.3 ms
# P95: 45.7 ms
# P99: 123.4 ms
# P99.9: 456.2 ms
# Max: 892.1 ms
# Throughput: 53 req/s
```
Performance Optimization Strategies
Once you understand your performance metrics, here are the key strategies to improve them.
Caching
Store frequently accessed data in fast storage (memory). Reduces database load and latency. Use Redis or Memcached for application-level caching.
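The usual shape of this is the cache-aside pattern: check the cache first, and only fall through to the database on a miss. A minimal sketch, with a plain dict standing in for Redis and a hypothetical `get_user_from_db` playing the slow query (in production you would use redis-py's `get`/`setex` with a TTL):

```python
cache = {}  # stand-in for Redis; a real cache also needs expiry (TTL)

def get_user_from_db(user_id):
    """Hypothetical slow database lookup (~200 ms in a real system)."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                      # cache hit: in-memory, ~microseconds
        return cache[key]
    user = get_user_from_db(user_id)      # cache miss: pay the DB cost once
    cache[key] = user                     # populate for subsequent requests
    return user

get_user(42)  # miss -- goes to the database
get_user(42)  # hit -- served from memory
```

The trade-off is staleness: cached data can lag behind the database, which is why real deployments pair this with TTLs or explicit invalidation on writes.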
Database Indexing
Create indexes on frequently queried columns. Turns O(n) table scans into O(log n) lookups. The single biggest performance win for most applications.
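The effect is visible in the query planner. A small demonstration using SQLite's `EXPLAIN QUERY PLAN` (the table and index names are illustrative); the same idea applies to `EXPLAIN` in PostgreSQL or MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"u{i}@example.com") for i in range(10_000)],
)

# Without an index: the planner reports a full table scan -- O(n)
scan_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("u42@example.com",),
).fetchall()
print(scan_plan)   # detail column shows: SCAN users

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index: a B-tree search -- O(log n)
index_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
    ("u42@example.com",),
).fetchall()
print(index_plan)  # detail column shows: SEARCH users USING INDEX idx_users_email
```

Indexes are not free -- each one slows writes slightly and takes space -- so index the columns that appear in your `WHERE` and `JOIN` clauses, not everything.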
Connection Pooling
Reuse database connections instead of creating new ones for each request. Eliminates the overhead of TCP handshakes and authentication.
Async Processing
Move slow operations (email sending, image processing, analytics) to background job queues. Return immediately to the user and process later.
Content Delivery Networks
Serve static assets (images, CSS, JS) from edge servers close to users. Reduces latency for users far from your origin server.
Data Compression
Compress responses with gzip or Brotli. Reduces bandwidth usage and transfer time, especially for text-based content like JSON and HTML.
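In practice the web server (or a middleware) handles this for you, but the win is easy to see with Python's standard-library `gzip` on a repetitive JSON payload of the kind APIs typically return:

```python
import gzip
import json

# A repetitive JSON payload, typical of an API list response
payload = json.dumps(
    [{"id": i, "status": "active", "plan": "free"} for i in range(500)]
).encode()

compressed = gzip.compress(payload)

print(f"original:   {len(payload)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ratio:      {len(payload) / len(compressed):.1f}x")
```

Text formats like JSON and HTML compress well because of their repeated keys and tags; already-compressed media (JPEG, MP4) gains almost nothing, so compress selectively by content type.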
Connection Pooling Example
```python
import time

import psycopg2
from psycopg2 import pool

# WITHOUT connection pooling (slow)
def query_without_pool(query, n_queries=100):
    start = time.time()
    for _ in range(n_queries):
        # New connection for EVERY query (expensive!)
        conn = psycopg2.connect(
            host="localhost",
            database="mydb",
            user="user",
            password="pass",
        )
        cursor = conn.cursor()
        cursor.execute(query)
        cursor.fetchall()
        conn.close()  # Connection destroyed
    elapsed = time.time() - start
    print(f"Without pool: {elapsed:.2f}s for {n_queries} queries")

# WITH connection pooling (fast)
def query_with_pool(query, n_queries=100):
    # Create the pool once (connections are reused)
    connection_pool = pool.SimpleConnectionPool(
        minconn=5,   # Minimum connections to keep open
        maxconn=20,  # Maximum connections allowed
        host="localhost",
        database="mydb",
        user="user",
        password="pass",
    )
    start = time.time()
    for _ in range(n_queries):
        conn = connection_pool.getconn()  # Borrow from the pool
        cursor = conn.cursor()
        cursor.execute(query)
        cursor.fetchall()
        connection_pool.putconn(conn)  # Return to the pool
    elapsed = time.time() - start
    print(f"With pool: {elapsed:.2f}s for {n_queries} queries")
    connection_pool.closeall()

# Typical results:
# Without pool: 4.52s for 100 queries
# With pool: 0.31s for 100 queries (~15x faster!)
```
Async Processing with Task Queues
```python
import time

from celery import Celery

# Create a Celery app connected to Redis as the broker
app = Celery("tasks", broker="redis://localhost:6379/0")

# Define slow tasks that run in the background
@app.task
def send_welcome_email(user_email):
    """Simulate sending an email (takes 3 seconds)."""
    time.sleep(3)  # Simulate SMTP delay
    print(f"Email sent to {user_email}")
    return True

@app.task
def process_uploaded_image(image_path):
    """Resize and optimize an uploaded image."""
    time.sleep(5)  # Simulate image processing
    print(f"Image processed: {image_path}")
    return True

# In your API handler (responds instantly!):
def register_user(user_data):
    # Save the user to the database (fast);
    # save_to_db is application code, not shown here
    user = save_to_db(user_data)
    # Queue the email for background processing (returns immediately)
    send_welcome_email.delay(user.email)
    # Respond right away -- don't wait for the email
    return {"status": "registered", "user_id": user.id}
```
Practice Problems
Easy: Identify the Bottleneck
A web application has the following performance profile:
- Web server response time: 5ms
- Application logic: 15ms
- Database query: 200ms
- Network round-trip: 30ms
What is the total latency? Where is the bottleneck? How would you optimize it?
Hint: Add up all the times for total latency. The bottleneck is the slowest component. Think about caching, indexing, and query optimization.
```python
# Total latency = 5 + 15 + 200 + 30 = 250 ms
# Bottleneck: the database query (200 ms = 80% of total)
#
# Optimization strategies:
# 1. Add database indexes on queried columns
# 2. Cache frequent queries in Redis
#    - Cache hit: ~1 ms vs a 200 ms DB query
# 3. Optimize the SQL query (EXPLAIN ANALYZE)
# 4. Add read replicas for read-heavy workloads
#
# After optimization:
# Cache hit:  5 + 15 + 1 + 30 = 51 ms (~5x faster)
# Cache miss: 5 + 15 + 50 + 30 = 100 ms (with index)
```
Medium: Scaling Strategy Design
You run an e-commerce platform with:
- Normal traffic: 1,000 requests/second
- Black Friday peak: 50,000 requests/second
- Single server handles 2,000 requests/second
Design a scaling strategy. How many servers at peak? What components need horizontal scaling?
Hint: Calculate the minimum servers needed at peak. Add a safety buffer. Think about which layers can be stateless (easy to scale) vs stateful (harder).
```python
# Minimum servers at peak
peak_rps = 50_000
server_capacity = 2_000
min_servers = peak_rps / server_capacity  # 25 servers

# Add a 20% safety buffer
target_servers = int(min_servers * 1.2)  # 30 servers

# Scaling strategy:
# 1. Web/app servers: horizontal (stateless)
#    - Auto-scale from 2 to 30 based on CPU/RPS
#    - Use a load balancer to distribute traffic
# 2. Database: vertical + read replicas
#    - Primary for writes, 3-5 replicas for reads
# 3. Cache layer: Redis cluster (3+ nodes)
#    - Cache product pages, session data
# 4. CDN for static assets
#    - Offload image/CSS/JS serving
# 5. Queue for order processing
#    - Decouple checkout from fulfillment
```
Medium: Amdahl's Law
Your application spends 60% of its time on database queries and 40% on computation. If you make the database queries 10x faster, what is the theoretical speedup of the whole system?
Hint: Use Amdahl's Law: Speedup = 1 / ((1 - P) + P/S), where P is the fraction improved and S is the speedup factor.
```python
# Amdahl's Law: Speedup = 1 / ((1 - P) + P/S)
P = 0.60  # fraction that can be improved (60% is DB queries)
S = 10    # speedup of that fraction (10x faster DB)

speedup = 1 / ((1 - P) + P / S)
print(f"Theoretical speedup: {speedup:.2f}x")
# Output: Theoretical speedup: 2.17x

# Even making the DB 10x faster only gives 2.17x overall!
# The 40% computation becomes the new bottleneck.

# Maximum possible speedup (DB infinitely fast):
max_speedup = 1 / (1 - P)
print(f"Max possible: {max_speedup:.2f}x")
# Output: Max possible: 2.50x

# Lesson: you must optimize ALL bottlenecks,
# not just the obvious one.
```
Quick Reference
Scaling Strategies Comparison
| Strategy | Type | Best For | Complexity |
|---|---|---|---|
| Vertical Scaling | Scale Up | Databases, quick fixes | Low |
| Horizontal Scaling | Scale Out | Stateless web servers | Medium |
| Caching | Optimization | Read-heavy workloads | Low-Medium |
| Database Sharding | Scale Out | Large datasets | High |
| CDN | Edge Distribution | Static content, global users | Low |
| Async Processing | Decoupling | Slow background tasks | Medium |
| Read Replicas | Scale Out | Read-heavy databases | Medium |
Key Latency Numbers
| Operation | Time | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | Fastest possible access |
| L2 cache reference | 7 ns | 14x L1 |
| Main memory | 100 ns | 200x L1 |
| Redis GET | ~0.1 ms | In-memory, over network |
| SSD random read | ~0.15 ms | ~1,500x main memory |
| Database query (indexed) | ~1-10 ms | Depends on query complexity |
| Datacenter round-trip | ~0.5 ms | Within same datacenter |
| Cross-continent round-trip | ~150 ms | Speed of light limit |
Key Takeaways
- Start with vertical scaling, move to horizontal when you hit limits
- Measure with percentiles (P50, P95, P99), not averages
- Identify bottlenecks before optimizing -- use Amdahl's Law
- Caching is almost always the first optimization to try
- Move slow work to background queues