What is System Design?
Why System Design Matters
The Problem: Building software that works on your laptop is easy. Building software that serves millions of users reliably, efficiently, and at scale is an entirely different challenge.
The Solution: System design gives you the principles, patterns, and building blocks to architect software systems that are reliable, scalable, and maintainable.
Real Impact: Every major tech company -- from Google to Netflix to Amazon -- relies on solid system design to serve billions of requests per day.
Real-World Analogy
Think of system design like city planning:
- Roads = Network connections between services
- Traffic lights = Load balancers distributing traffic
- Warehouses = Databases storing your data
- Post offices = Message queues handling async communication
- Building codes = Design principles ensuring reliability
Just as a city needs thoughtful planning to handle growth, your software systems need careful design to handle increasing users and data.
What You Will Learn
System design covers a broad set of topics. In this tutorial series, you will learn how to think about and build large-scale distributed systems. Here is a roadmap of what we will cover:
Fundamentals
Client-server model, design principles, networking basics, and the vocabulary you need to discuss systems.
Building Blocks
Databases, caches, load balancers, CDNs, message queues, and other core infrastructure components.
Patterns & Strategies
Microservices, data replication, rate limiting, and proven approaches to common challenges.
Real-World Designs
End-to-end design of systems like URL shorteners, chat apps, news feeds, and video streaming platforms.
Client-Server Architecture
The client-server model is the foundation of virtually every modern web application. A client (like your web browser or mobile app) sends requests to a server, which processes those requests and returns responses.
How a Request Flows
Step-by-Step Request Flow
- DNS Resolution: The client resolves the domain name (e.g., api.example.com) to an IP address
- TCP Connection: A TCP connection is established via a three-way handshake
- HTTP Request: The client sends an HTTP request (GET, POST, PUT, DELETE)
- Server Processing: The server receives the request, processes it, and queries the database if needed
- HTTP Response: The server sends back a response with a status code and data
- Client Rendering: The client receives the response and renders it for the user
# Server: run this first
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Set response status code
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        # Create response body
        response = {
            "message": "Hello from the server!",
            "status": "healthy",
            "version": "1.0.0"
        }
        # Send JSON response
        self.wfile.write(json.dumps(response).encode())

# Start the server on port 8080
server = HTTPServer(("0.0.0.0", 8080), SimpleHandler)
print("Server running on http://localhost:8080")
server.serve_forever()

# Client: run in a second terminal (requires the third-party requests package)
import requests

# Send a GET request to the server
response = requests.get("http://localhost:8080")
# Check the status code
print(f"Status Code: {response.status_code}")
print(f"Response: {response.json()}")
# Output:
# Status Code: 200
# Response: {'message': 'Hello from the server!', 'status': 'healthy', 'version': '1.0.0'}
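The request flow listed earlier can also be traced by hand with a raw socket. The following is a minimal sketch, not how production clients are written: it starts a throwaway local server on an ephemeral port and talks to it over plain HTTP, so no external network is needed. The names `EchoHandler` and `fetch` are hypothetical and introduced only for this illustration.

```python
import json
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that echoes the requested path as JSON."""

    def do_GET(self):
        body = json.dumps({"path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(host, port, path):
    # Step 1: name resolution (for a public host this would query DNS).
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )[0]
    # Step 2: TCP connection; connect() performs the three-way handshake.
    with socket.socket(family, socktype, proto) as sock:
        sock.connect(addr)
        # Step 3: the HTTP request is plain text sent over the connection.
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        # Steps 4-5: the server processes the request and sends its response.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
    status_code = int(head.split(b"\r\n")[0].split()[1].decode())
    # Step 6: the client interprets ("renders") the payload.
    return status_code, json.loads(body)


# Bind to port 0 so the OS picks a free port for this demo.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

status_code, payload = fetch("localhost", port, "/hello")
print(status_code, payload)  # prints: 200 {'path': '/hello'}

server.shutdown()
server.server_close()
```

Libraries such as `requests` hide steps 1 through 3 behind a single call, which is why the client snippet above is so short.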
Key Design Principles
Every well-designed system balances four fundamental properties. Understanding these trade-offs is at the heart of system design.
Reliability
The system continues to work correctly even when things go wrong -- hardware faults, software bugs, or human errors. A reliable system is fault-tolerant and resilient.
Availability
The system is accessible and operational when users need it. Measured in "nines" -- 99.9% availability means at most 8.76 hours of downtime per year.
Scalability
The system can handle growth -- whether in data volume, traffic, or complexity -- without degrading performance. You can scale up (move to bigger machines) or scale out (add more machines).
Maintainability
The system is easy to operate, understand, and evolve. Good maintainability means new engineers can quickly become productive and changes are easy to make safely.
Availability Nines
| Availability | Uptime % | Downtime per Year | Downtime per Month |
|---|---|---|---|
| Two nines | 99% | 3.65 days | 7.31 hours |
| Three nines | 99.9% | 8.76 hours | 43.83 minutes |
| Four nines | 99.99% | 52.6 minutes | 4.38 minutes |
| Five nines | 99.999% | 5.26 minutes | 26.3 seconds |
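The figures in the table follow directly from the availability percentage. As a rough sketch, the helper below (the function name `allowed_downtime_hours` is made up for this example) reproduces them, assuming a 365-day year and an average month of 365.25 / 12 days, which is what the monthly column appears to use.

```python
HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 365.25 * 24 / 12  # assumed average month length


def allowed_downtime_hours(availability_percent, period_hours):
    """Maximum downtime (in hours) that still meets the availability target."""
    return (1 - availability_percent / 100) * period_hours


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year:.2f} h/year, {per_month * 60:.2f} min/month")
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every additional nine tends to be much harder to achieve than the previous one.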
Common Pitfall: Ignoring Trade-Offs
You cannot maximize all four properties simultaneously. Every design decision involves trade-offs:
- Consistency vs. Availability: The CAP theorem says that when a network partition occurs, a distributed system must give up either consistency or availability; it cannot guarantee both
- Latency vs. Throughput: Optimizing for one often comes at the expense of the other
- Simplicity vs. Flexibility: More features mean more complexity
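The latency vs. throughput trade-off from the list above can be made concrete with a toy batching model. This is only a sketch under invented numbers: it assumes every batch pays a fixed 10 ms overhead plus 1 ms per item, and it ignores the time items spend waiting for a batch to fill.

```python
FIXED_OVERHEAD_MS = 10.0  # assumed cost paid once per batch (made-up value)
PER_ITEM_MS = 1.0         # assumed cost per item in the batch (made-up value)


def batch_latency_ms(batch_size):
    """Time until every item in the batch has been processed."""
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size


def throughput_per_second(batch_size):
    """Items completed per second when batches run back to back."""
    return batch_size / batch_latency_ms(batch_size) * 1000


for size in (1, 10, 100):
    print(
        f"batch={size}: latency={batch_latency_ms(size):.0f} ms, "
        f"throughput={throughput_per_second(size):.0f} items/s"
    )
```

Under these assumptions, larger batches amortize the fixed overhead and raise throughput, but each item waits longer for its result.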
The System Design Interview Framework
Whether you are designing a system at work or tackling a system design interview, following a structured framework ensures you cover all the important aspects.
The RESHADED Framework
Use this step-by-step approach for any system design problem:
1. Requirements Clarification
Ask questions to understand functional and non-functional requirements. Who are the users? What are the core features? What scale are we targeting?
2. Estimation
Do back-of-the-envelope calculations. How many users? How much data? What throughput? These numbers drive your design decisions.
3. Storage Schema
Define your data model. What entities do you need? What are the relationships? SQL or NoSQL? How will data grow over time?
4. High-Level Design
Draw the big picture -- clients, servers, databases, caches, queues. Show how components connect and data flows through the system.
5. API Design
Define the interface. What endpoints do you need? What are the request/response formats? REST, GraphQL, or gRPC?
6. Detailed Design
Dive deep into critical components. How does the caching layer work? How do you handle failures? What algorithms power the core features?
7. Evaluate & Optimize
Identify bottlenecks. How does the system handle edge cases? What are single points of failure? How would you monitor and alert?
8. Distinguish Yourself
Discuss trade-offs you made and alternatives you considered. Show depth by addressing security, cost optimization, or regional deployment.
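Steps 3 (Storage Schema) and 5 (API Design) are easier to discuss with something concrete on the page. Below is a minimal sketch for a hypothetical URL shortener, using an in-memory SQLite database. The table name `urls`, its columns, and the functions `create_short_url` and `resolve` are all assumptions made for illustration, not a prescribed design.

```python
import sqlite3

# Step 3 (Storage Schema): one table mapping a short key to the original URL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE urls (
        short_key  TEXT PRIMARY KEY,
        long_url   TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


# Step 5 (API Design): two operations that hypothetical HTTP endpoints
# (for example POST /urls and GET /<short_key>) could delegate to.
def create_short_url(short_key, long_url):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO urls (short_key, long_url) VALUES (?, ?)",
            (short_key, long_url),
        )


def resolve(short_key):
    row = conn.execute(
        "SELECT long_url FROM urls WHERE short_key = ?", (short_key,)
    ).fetchone()
    return row[0] if row else None


create_short_url("abc123", "https://example.com/a/very/long/path")
print(resolve("abc123"))   # prints: https://example.com/a/very/long/path
print(resolve("missing"))  # prints: None
```

Writing the schema and the operations side by side makes it easier to spot mismatches early, such as an endpoint that needs a field the table does not store.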
Back-of-the-Envelope Estimation
Key Numbers Every Engineer Should Know
- L1 cache reference: ~0.5 ns
- L2 cache reference: ~7 ns
- Main memory reference: ~100 ns
- SSD random read: ~150 us
- HDD seek: ~10 ms
- Round trip within same datacenter: ~0.5 ms
- Round trip CA to Netherlands: ~150 ms
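These numbers are easier to remember as ratios than as absolute values. The short sketch below expresses each figure from the list as a multiple of a main memory reference; the dictionary name `LATENCY_NS` is just a label chosen for this example.

```python
# Approximate latencies from the list above, converted to nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "SSD random read": 150_000,
    "HDD seek": 10_000_000,
    "Round trip within same datacenter": 500_000,
    "Round trip CA to Netherlands": 150_000_000,
}

baseline = LATENCY_NS["Main memory reference"]
for name, ns in LATENCY_NS.items():
    print(f"{name}: {ns:,} ns = {ns / baseline:,.3f}x a memory reference")
```

With these figures, one intercontinental round trip costs about as much as 1.5 million memory references, which is why reducing the number of network hops often matters more than micro-optimizing in-memory work.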
# Back-of-the-envelope estimation for a Twitter-like service
# User assumptions
total_users = 500_000_000 # 500M total users
daily_active_users = 200_000_000 # 200M DAU (40%)
tweets_per_user_per_day = 2 # average tweets/day
# Write throughput
tweets_per_day = daily_active_users * tweets_per_user_per_day
tweets_per_second = tweets_per_day / (24 * 3600)
print(f"Tweets/day: {tweets_per_day:,.0f}")
print(f"Tweets/second: {tweets_per_second:,.0f}")
# Storage estimation
avg_tweet_size_bytes = 300 # text + metadata
daily_storage_gb = (tweets_per_day * avg_tweet_size_bytes) / (1024**3)
yearly_storage_tb = daily_storage_gb * 365 / 1024
print(f"Daily storage: {daily_storage_gb:.1f} GB")
print(f"Yearly storage: {yearly_storage_tb:.1f} TB")
# Read throughput (reads >> writes, ~100:1 ratio)
reads_per_second = tweets_per_second * 100
print(f"Reads/second: {reads_per_second:,.0f}")
# Output:
# Tweets/day: 400,000,000
# Tweets/second: 4,630
# Daily storage: 111.8 GB
# Yearly storage: 39.8 TB
# Reads/second: 462,963
Practice Problems
Easy: Identify the Components
For a simple e-commerce website, identify:
- What are the clients?
- What server-side components would you need?
- What data needs to be stored?
- What are the key design principles to prioritize?
Hint: Think about the different types of users (shoppers, admins) and the different types of data (products, orders, users). Consider which design principles matter most for an e-commerce platform.
Solution:
# Clients:
# - Web browsers (desktop shoppers)
# - Mobile apps (iOS/Android)
# - Admin dashboard
# Server-side components:
# - Web server (Nginx) for static files
# - Application server (product catalog, cart, checkout)
# - Database (products, users, orders)
# - Cache (popular products, session data)
# - Payment processing service
# - Search service (product search)
# Data to store:
# - Product catalog (name, price, description, images)
# - User accounts (email, password hash, addresses)
# - Orders (items, totals, shipping status)
# - Shopping cart state
# - Reviews and ratings
# Key principles:
# - Availability (store must be accessible 24/7)
# - Reliability (orders must never be lost)
# - Scalability (handle Black Friday traffic spikes)
Easy: Estimation Practice
Estimate the storage requirements for a photo-sharing service:
- 100 million daily active users
- Each user uploads 2 photos per day on average
- Average photo size is 2 MB
- How much storage do you need per day? Per year?
Hint: Multiply users * photos/day * size/photo. Remember to account for thumbnails and multiple resolutions. Convert the result to TB or PB for readability.
Solution:
# Storage estimation
dau = 100_000_000 # 100M daily active users
photos_per_day = 2 # average uploads/user
avg_photo_mb = 2 # MB per photo
# Daily storage
daily_photos = dau * photos_per_day # 200M photos
daily_storage_tb = (daily_photos * avg_photo_mb) / (1024 * 1024)
print(f"Daily: {daily_storage_tb:.0f} TB") # ~381 TB
# Yearly storage
yearly_storage_pb = daily_storage_tb * 365 / 1024
print(f"Yearly: {yearly_storage_pb:.1f} PB") # ~136 PB
# With thumbnails (3 sizes: small, medium, large)
# Roughly 1.3x overhead for extra sizes
total_yearly_pb = yearly_storage_pb * 1.3
print(f"With thumbnails: {total_yearly_pb:.1f} PB")
Medium: Design a Simple URL Shortener
Using the interview framework, outline the high-level design for a URL shortener:
- List the functional requirements
- List the non-functional requirements
- Estimate the scale (assume 100M URLs created per month)
- Draw a high-level architecture
Hint: Functional requirements include shortening a URL, redirecting, and an optional custom alias. Non-functional requirements include low-latency redirects and high availability. For estimation, calculate reads vs. writes (a 100:1 ratio is a common assumption).
Solution:
# Functional Requirements:
# 1. Given a long URL, generate a short URL
# 2. Given a short URL, redirect to the original
# 3. Optional: custom short URLs
# 4. Optional: analytics (click count, referrers)
# Non-Functional Requirements:
# 1. Low latency redirects (< 100ms)
# 2. High availability (99.99%)
# 3. Short URLs should not be guessable
# Scale Estimation:
writes_per_month = 100_000_000 # 100M new URLs/month
writes_per_second = writes_per_month / (30 * 24 * 3600)
reads_per_second = writes_per_second * 100
print(f"Writes/sec: {writes_per_second:.0f}") # ~39
print(f"Reads/sec: {reads_per_second:.0f}") # ~3,900
# High-Level Architecture:
# Client -> Load Balancer -> App Server -> Database
#                                       -> Cache (Redis)
# App Server generates short key using base62 encoding
# Cache stores hot URLs for fast redirects
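The base62 step mentioned in the solution can be sketched as follows. This assumes the short key is derived from a numeric ID (for example, a database sequence), which the solution does not specify; the alphabet ordering and the function names `base62_encode` and `base62_decode` are likewise choices made for this example.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters (this ordering is an arbitrary choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(number):
    """Convert a non-negative integer ID into a base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def base62_decode(key):
    """Convert a base62 string back into the integer ID."""
    number = 0
    for ch in key:
        number = number * 62 + ALPHABET.index(ch)
    return number


print(base62_encode(62))       # prints: 10
print(base62_encode(125))      # prints: 21
print(base62_decode("21"))     # prints: 125
print(62 ** 7)                 # prints: 3521614606208
```

Seven base62 characters cover about 3.5 trillion keys, far more than 100M URLs per month requires. Note that encoding sequential IDs yields predictable keys, so a design that must satisfy the "not guessable" requirement would need an extra step such as randomized IDs.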
Quick Reference
System Design Vocabulary
| Term | Definition | Example |
|---|---|---|
| Latency | Time to complete a single request | 200ms to load a page |
| Throughput | Number of requests handled per unit time | 10,000 requests/second |
| Bandwidth | Maximum data transfer rate | 1 Gbps network link |
| SLA | Service Level Agreement: a provider's commitment to a measurable service level, most often availability | 99.99% uptime guarantee |
| Fault Tolerance | Ability to continue operating despite failures | Replicated databases |
| Redundancy | Duplication of components for reliability | Multiple load balancers |
| Horizontal Scaling | Adding more machines to handle load | Adding more web servers |
| Vertical Scaling | Adding more power to an existing machine | Upgrading CPU/RAM |