System Design Fundamentals

Easy 20 min read

What is System Design?

Why System Design Matters

The Problem: Building software that works on your laptop is easy. Building software that serves millions of users reliably, efficiently, and at scale is an entirely different challenge.

The Solution: System design gives you the principles, patterns, and building blocks to architect software systems that are reliable, scalable, and maintainable.

Real Impact: Every major tech company -- from Google to Netflix to Amazon -- relies on solid system design to serve billions of requests per day.

Real-World Analogy

Think of system design like city planning:

  • Roads = Network connections between services
  • Traffic lights = Load balancers distributing traffic
  • Warehouses = Databases storing your data
  • Post offices = Message queues handling async communication
  • Building codes = Design principles ensuring reliability

Just as a city needs thoughtful planning to handle growth, your software systems need careful design to handle increasing users and data.

What You Will Learn

System design covers a broad set of topics. In this tutorial series, you will learn how to think about and build large-scale distributed systems. Here is a roadmap of what we will cover:

Fundamentals

Client-server model, design principles, networking basics, and the vocabulary you need to discuss systems.

Building Blocks

Databases, caches, load balancers, CDNs, message queues, and other core infrastructure components.

Patterns & Strategies

Microservices, data replication, rate limiting, and proven approaches to common challenges.

Real-World Designs

End-to-end design of systems like URL shorteners, chat apps, news feeds, and video streaming platforms.

Client-Server Architecture

The client-server model is the foundation of virtually every modern web application. A client (like your web browser or mobile app) sends requests to a server, which processes those requests and returns responses.

[Figure: Client-Server Architecture. Clients (web browser, mobile app, API client) send HTTP requests over the Internet to the server side (Nginx web server, application server, database, Redis cache) and receive HTTP responses, forming the request/response cycle.]

How a Request Flows

Step-by-Step Request Flow

  1. DNS Resolution: The client resolves the domain name (e.g., api.example.com) to an IP address
  2. TCP Connection: A TCP connection is established via a three-way handshake
  3. HTTP Request: The client sends an HTTP request (GET, POST, PUT, DELETE)
  4. Server Processing: The server receives the request, processes it, and queries the database if needed
  5. HTTP Response: The server sends back a response with a status code and data
  6. Client Rendering: The client receives the response and renders it for the user
simple_server.py
from http.server import HTTPServer, BaseHTTPRequestHandler
import json

class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Set response status code
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()

        # Create response body
        response = {
            "message": "Hello from the server!",
            "status": "healthy",
            "version": "1.0.0"
        }

        # Send JSON response
        self.wfile.write(json.dumps(response).encode())

# Start the server on port 8080
server = HTTPServer(("0.0.0.0", 8080), SimpleHandler)
print("Server running on http://localhost:8080")
server.serve_forever()
client_request.py
import requests

# Send a GET request to the server
response = requests.get("http://localhost:8080")

# Check the status code
print(f"Status Code: {response.status_code}")
print(f"Response: {response.json()}")

# Output:
# Status Code: 200
# Response: {'message': 'Hello from the server!', 'status': 'healthy', 'version': '1.0.0'}
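
To make steps 1 to 5 of the request flow more concrete, the sketch below performs them by hand with Python's standard socket module instead of the requests library. The helper names build_request, parse_response, and fetch are illustrative and not part of the original example. The sketch assumes a server like simple_server.py that closes the connection after each response, and it forces IPv4 because that server binds to 0.0.0.0.

```python
import socket


def build_request(host: str, path: str = "/") -> bytes:
    """Step 3: serialize a minimal HTTP/1.0 GET request."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")


def parse_response(raw: bytes) -> tuple[int, bytes]:
    """Step 5: split a raw HTTP response into (status code, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0]  # e.g. b"HTTP/1.0 200 OK"
    return int(status_line.split()[1]), body


def fetch(host: str, port: int, path: str = "/") -> tuple[int, bytes]:
    """Run steps 1-5 against host:port and return (status code, body)."""
    # Step 1: DNS resolution (IPv4 only, an assumption of this sketch).
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )[0]
    # Step 2: connect() performs the TCP three-way handshake.
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(5)
        sock.connect(sockaddr)
        # Step 3: send the HTTP request.
        sock.sendall(build_request(host, path))
        # Step 4 happens on the server; read until it closes the connection.
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return parse_response(b"".join(chunks))
```

With simple_server.py running, calling fetch("localhost", 8080) should return status 200 and the same JSON body that client_request.py prints. Step 6 (rendering) is left to the caller.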

Key Design Principles

Every well-designed system balances four fundamental properties. Understanding these trade-offs is at the heart of system design.

Reliability

The system continues to work correctly even when things go wrong -- hardware faults, software bugs, or human errors. A reliable system is fault-tolerant and resilient.

Availability

The system is accessible and operational when users need it. Availability is measured in "nines": 99.9% availability (three nines) allows at most 8.76 hours of downtime per year.

Scalability

The system can handle growth -- whether in data volume, traffic, or complexity -- without degrading performance. You can scale up (bigger machines) or scale out (more machines).

Maintainability

The system is easy to operate, understand, and evolve. Good maintainability means new engineers can quickly become productive and changes are easy to make safely.

Availability Nines

Availability    Uptime %    Downtime per Year    Downtime per Month
Two nines       99%         3.65 days            7.31 hours
Three nines     99.9%       8.76 hours           43.83 minutes
Four nines      99.99%      52.6 minutes         4.38 minutes
Five nines      99.999%     5.26 minutes         26.3 seconds
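
The yearly downtime figures follow directly from the uptime percentage. The short sketch below reproduces them, assuming a 365-day year; the function name downtime_hours_per_year is illustrative. The monthly column in the table appears to use an average month of about 30.44 days, so dividing these results by 12 can differ from it slightly in the last digit.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours per year the system may be down at the given availability."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


levels = [("Two nines", 99.0), ("Three nines", 99.9),
          ("Four nines", 99.99), ("Five nines", 99.999)]

for label, pct in levels:
    hours = downtime_hours_per_year(pct)
    print(f"{label:<12} {pct:>7}%  {hours:8.3f} h/year  {hours * 60:9.2f} min/year")
```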

Common Pitfall: Ignoring Trade-Offs

You cannot maximize all four properties simultaneously. Every design decision involves trade-offs:

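  • Consistency vs. Availability: The CAP theorem states that a distributed system cannot guarantee Consistency, Availability, and Partition tolerance all at once. Since network partitions cannot be ruled out in practice, the real choice is between consistency and availability while a partition lasts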
  • Latency vs. Throughput: Optimizing for one often comes at the expense of the other
  • Simplicity vs. Flexibility: More features mean more complexity
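
The latency vs. throughput trade-off can be illustrated with batching. The sketch below uses a made-up cost model (a fixed 10 ms overhead per call plus 1 ms per item); both constants and the function names are assumptions chosen only for illustration.

```python
FIXED_OVERHEAD_MS = 10.0  # hypothetical per-call cost (network, parsing)
PER_ITEM_MS = 1.0         # hypothetical per-item processing cost


def batch_latency_ms(batch_size: int) -> float:
    """Time until a whole batch completes; every item in it waits this long."""
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size


def throughput_per_second(batch_size: int) -> float:
    """Items completed per second when batches run back to back."""
    return batch_size / batch_latency_ms(batch_size) * 1000


for size in (1, 10, 100):
    print(f"batch={size:>3}  latency={batch_latency_ms(size):6.1f} ms  "
          f"throughput={throughput_per_second(size):6.1f} items/s")
```

Under these assumed costs, growing the batch from 1 to 100 items raises throughput roughly tenfold (about 91 to about 909 items per second), while the latency seen by each item grows from 11 ms to 110 ms.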

The System Design Interview Framework

Whether you are designing a system at work or tackling a system design interview, following a structured framework ensures you cover all the important aspects.

The RESHADED Framework

Use this step-by-step approach for any system design problem:

1. Requirements Clarification

Ask questions to understand functional and non-functional requirements. Who are the users? What are the core features? What scale are we targeting?

2. Estimation

Do back-of-the-envelope calculations. How many users? How much data? What throughput? These numbers drive your design decisions.

3. Storage Schema

Define your data model. What entities do you need? What are the relationships? SQL or NoSQL? How will data grow over time?

4. High-Level Design

Draw the big picture -- clients, servers, databases, caches, queues. Show how components connect and data flows through the system.

5. API Design

Define the interface. What endpoints do you need? What are the request/response formats? REST, GraphQL, or gRPC?

6. Detailed Design

Dive deep into critical components. How does the caching layer work? How do you handle failures? What algorithms power the core features?

7. Evaluate & Optimize

Identify bottlenecks. How does the system handle edge cases? What are single points of failure? How would you monitor and alert?

8. Distinguish Yourself

Discuss trade-offs you made and alternatives you considered. Show depth by addressing security, cost optimization, or regional deployment.

Back-of-the-Envelope Estimation

Key Numbers Every Engineer Should Know

  • L1 cache reference: ~0.5 ns
  • L2 cache reference: ~7 ns
  • Main memory reference: ~100 ns
  • SSD random read: ~150 us
  • HDD seek: ~10 ms
  • Round trip within same datacenter: ~0.5 ms
  • Round trip CA to Netherlands: ~150 ms
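
These numbers are most useful as ratios. The sketch below converts the figures listed above to nanoseconds and compares them; the dictionary and the ratio helper are illustrative names, and the values are the approximate ones from the list, not measurements.

```python
# Approximate latency figures from the list above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "SSD random read": 150_000,                     # 150 us
    "HDD seek": 10_000_000,                         # 10 ms
    "Round trip within same datacenter": 500_000,   # 0.5 ms
    "Round trip CA to Netherlands": 150_000_000,    # 150 ms
}


def ratio(slow: str, fast: str) -> float:
    """How many `fast` operations fit in the time of one `slow` operation."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


print(f"{ratio('Round trip within same datacenter', 'Main memory reference'):,.0f} "
      "memory references per datacenter round trip")
print(f"{ratio('Round trip CA to Netherlands', 'Round trip within same datacenter'):,.0f} "
      "datacenter round trips per CA-Netherlands round trip")
print(f"{ratio('HDD seek', 'SSD random read'):,.0f} "
      "SSD random reads per HDD seek")
```

Ratios like these explain common design choices, for example why a cache in the same datacenter is usually preferable to a cross-continent call.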
estimation_example.py
# Back-of-the-envelope estimation for a Twitter-like service

# User assumptions
total_users = 500_000_000       # 500M total users
daily_active_users = 200_000_000 # 200M DAU (40%)
tweets_per_user_per_day = 2     # average tweets/day

# Write throughput
tweets_per_day = daily_active_users * tweets_per_user_per_day
tweets_per_second = tweets_per_day / (24 * 3600)
print(f"Tweets/day: {tweets_per_day:,.0f}")
print(f"Tweets/second: {tweets_per_second:,.0f}")

# Storage estimation
avg_tweet_size_bytes = 300  # text + metadata
daily_storage_gb = (tweets_per_day * avg_tweet_size_bytes) / (1024**3)
yearly_storage_tb = daily_storage_gb * 365 / 1024
print(f"Daily storage: {daily_storage_gb:.1f} GB")
print(f"Yearly storage: {yearly_storage_tb:.1f} TB")

# Read throughput (reads >> writes, ~100:1 ratio)
reads_per_second = tweets_per_second * 100
print(f"Reads/second: {reads_per_second:,.0f}")

# Output:
# Tweets/day: 400,000,000
# Tweets/second: 4,630
# Daily storage: 111.8 GB
# Yearly storage: 39.8 TB
# Reads/second: 462,963

Practice Problems

Easy: Identify the Components

For a simple e-commerce website, identify:

  1. What are the clients?
  2. What server-side components would you need?
  3. What data needs to be stored?
  4. What are the key design principles to prioritize?

Think about the different types of users (shoppers, admins) and the different types of data (products, orders, users). Consider which design principles matter most for an e-commerce platform.

# Clients:
# - Web browsers (desktop shoppers)
# - Mobile apps (iOS/Android)
# - Admin dashboard

# Server-side components:
# - Web server (Nginx) for static files
# - Application server (product catalog, cart, checkout)
# - Database (products, users, orders)
# - Cache (popular products, session data)
# - Payment processing service
# - Search service (product search)

# Data to store:
# - Product catalog (name, price, description, images)
# - User accounts (email, password hash, addresses)
# - Orders (items, totals, shipping status)
# - Shopping cart state
# - Reviews and ratings

# Key principles:
# - Availability (store must be accessible 24/7)
# - Reliability (orders must never be lost)
# - Scalability (handle Black Friday traffic spikes)

Easy: Estimation Practice

Estimate the storage requirements for a photo-sharing service:

  1. 100 million daily active users
  2. Each user uploads 2 photos per day on average
  3. Average photo size is 2 MB
  4. How much storage do you need per day? Per year?

Multiply users * photos/day * size/photo. Remember to account for thumbnails and multiple resolutions. Convert MB to TB (and PB for the yearly total) for readability.

# Storage estimation
dau = 100_000_000          # 100M daily active users
photos_per_day = 2          # average uploads/user
avg_photo_mb = 2            # MB per photo

# Daily storage
daily_photos = dau * photos_per_day  # 200M photos
daily_storage_tb = (daily_photos * avg_photo_mb) / (1024 * 1024)
print(f"Daily: {daily_storage_tb:.0f} TB")  # ~381 TB

# Yearly storage
yearly_storage_pb = daily_storage_tb * 365 / 1024
print(f"Yearly: {yearly_storage_pb:.1f} PB")  # ~136 PB

# With thumbnails (3 sizes: small, medium, large)
# Roughly 1.3x overhead for extra sizes
total_yearly_pb = yearly_storage_pb * 1.3
print(f"With thumbnails: {total_yearly_pb:.1f} PB")

Medium: Design a Simple URL Shortener

Using the interview framework, outline the high-level design for a URL shortener:

  1. List the functional requirements
  2. List the non-functional requirements
  3. Estimate the scale (assume 100M URLs created per month)
  4. Draw a high-level architecture

Functional: shorten URL, redirect, optional custom alias. Non-functional: low latency redirects, high availability. For estimation, calculate reads vs writes (100:1 ratio is typical).

# Functional Requirements:
# 1. Given a long URL, generate a short URL
# 2. Given a short URL, redirect to the original
# 3. Optional: custom short URLs
# 4. Optional: analytics (click count, referrers)

# Non-Functional Requirements:
# 1. Low latency redirects (< 100ms)
# 2. High availability (99.99%)
# 3. Short URLs should not be guessable

# Scale Estimation:
writes_per_month = 100_000_000     # 100M new URLs/month
writes_per_second = writes_per_month / (30 * 24 * 3600)
reads_per_second = writes_per_second * 100

print(f"Writes/sec: {writes_per_second:.0f}")   # ~39
print(f"Reads/sec: {reads_per_second:.0f}")    # ~3,900

# High-Level Architecture:
# Client -> Load Balancer -> App Server -> Database
#                                      -> Cache (Redis)
# App Server generates short key using base62 encoding
# Cache stores hot URLs for fast redirects
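
The base62 step in the architecture notes can be sketched as follows. This is one possible encoding, assuming the app server already has a unique integer ID for each URL (for example, a database sequence); the alphabet ordering and the function names are choices made for this sketch, not a standard.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters. The ordering is a convention
# chosen for this sketch.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(number: int) -> str:
    """Convert a non-negative integer ID into a short base62 key."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base62_decode(key: str) -> int:
    """Convert a base62 key back into the integer ID it encodes."""
    number = 0
    for char in key:
        number = number * 62 + ALPHABET.index(char)
    return number


print(base62_encode(125))  # "21", because 125 = 2 * 62 + 1
print(62 ** 7)             # 3521614606208 distinct 7-character keys
```

Seven characters give about 3.5 trillion keys, far more than 100M new URLs per month would use up. Note that encoding sequential IDs produces predictable keys, which conflicts with the "not guessable" requirement; a real design would likely need random IDs or some other source of unpredictability.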

Quick Reference

System Design Vocabulary

Term                  Definition                                          Example
Latency               Time to complete a single request                   200ms to load a page
Throughput            Number of requests handled per unit time            10,000 requests/second
Bandwidth             Maximum data transfer rate                          1 Gbps network link
SLA                   Service Level Agreement -- promised availability    99.99% uptime guarantee
Fault Tolerance       Ability to continue operating despite failures      Replicated databases
Redundancy            Duplication of components for reliability           Multiple load balancers
Horizontal Scaling    Adding more machines to handle load                 Adding more web servers
Vertical Scaling      Adding more power to existing machine               Upgrading CPU/RAM

Useful Resources