Databases & Storage


Why Database Choice Matters

The Problem: Your database is the foundation of your system. Choose the wrong database and you will face performance bottlenecks, data inconsistencies, or an inability to scale when you need to.

The Solution: Understand the trade-offs between different database types, indexing strategies, and partitioning approaches so you can make informed decisions for your specific use case.

Real Impact: Uber migrated from PostgreSQL to a custom MySQL-based solution when their data grew beyond what a single relational database could handle efficiently.

Real-World Analogy

Think of databases like different types of filing systems:

  • SQL databases = A perfectly organized filing cabinet with labeled folders, cross-references, and a strict filing system
  • NoSQL document stores = A box of file folders -- each folder can contain different things, flexible but less organized
  • Key-value stores = A dictionary or phone book -- look up by key, get the value instantly
  • Graph databases = A web of sticky notes connected by strings -- great for relationships

SQL vs NoSQL

SQL vs NoSQL Comparison
[Diagram: side-by-side comparison of a SQL table and NoSQL documents]

SQL (Relational) -- structured tables with a fixed schema:
  + Strong consistency (ACID)
  + Complex queries (JOINs)
  + Mature ecosystem
  - Hard to scale horizontally
  - Rigid schema (migrations)
  Examples: PostgreSQL, MySQL, SQLite

NoSQL (Non-Relational) -- flexible documents, no fixed schema:
  + Easy horizontal scaling
  + Flexible schema
  + High write throughput
  - Eventual consistency (usually)
  - Limited JOIN support
  Examples: MongoDB, Cassandra, DynamoDB
Feature          SQL                             NoSQL
Schema           Fixed (schema-on-write)         Flexible (schema-on-read)
Scaling          Vertical (scale up)             Horizontal (scale out)
Relationships    JOINs across tables             Denormalized / embedded
Consistency      Strong (ACID)                   Eventual (BASE)
Query Language   SQL (standardized)              Varies by database
Best For         Complex queries, transactions   High throughput, flexible data
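
The schema-on-write vs schema-on-read distinction is easy to see in a few lines of Python -- a minimal sketch using the standard-library sqlite3 module for the relational side and JSON strings for the document side (the `users` table and its fields are illustrative, not from any real system):

```python
import json
import sqlite3

# Relational side: the schema is enforced at write time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Alice"))

# Writing a column the schema doesn't define fails immediately:
try:
    conn.execute("INSERT INTO users (id, name, hobbies) VALUES (?, ?, ?)",
                 (2, "Bob", "code"))
except sqlite3.OperationalError as e:
    print(f"Rejected at write time: {e}")

# Document side: any shape is accepted; the *reader* interprets it.
docs = [
    json.dumps({"id": 1, "name": "Alice", "hobbies": ["code"]}),
    json.dumps({"id": 2, "name": "Bob", "age": 30}),  # different fields, no migration
]
for raw in docs:
    doc = json.loads(raw)
    # Schema-on-read: handle missing fields at query time.
    print(doc["name"], "hobbies:", doc.get("hobbies", "none recorded"))
```

The flexibility cuts both ways: the document store happily accepted two differently-shaped records, which means every reader must now cope with fields that may or may not exist.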

ACID Properties

ACID guarantees are what make relational databases reliable for transactions like banking, e-commerce, and any operation where data integrity is critical.

Atomicity

All operations in a transaction either complete entirely or not at all. If a bank transfer debits one account, it must credit the other -- no partial updates.

Consistency

Every transaction brings the database from one valid state to another. All constraints, triggers, and rules are satisfied after the transaction completes.

Isolation

Concurrent transactions execute as if they were running sequentially. One transaction cannot see the intermediate state of another.

Durability

Once a transaction is committed, it stays committed -- even if the system crashes. Data is written to non-volatile storage (disk).

acid_transaction.py
import psycopg2

def transfer_money(from_account, to_account, amount):
    """Transfer money between accounts with ACID guarantees."""
    conn = psycopg2.connect(
        host="localhost", database="bank",
        user="admin", password="secret"
    )

    try:
        # psycopg2 opens a transaction implicitly on the first execute
        # (Atomicity: everything below commits or rolls back as a unit)
        cursor = conn.cursor()

        # Check sufficient balance (Consistency)
        cursor.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            (from_account,)
        )
        balance = cursor.fetchone()[0]

        if balance < amount:
            raise ValueError("Insufficient funds")

        # Debit sender
        cursor.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account)
        )

        # Credit receiver
        cursor.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account)
        )

        # Record the transaction
        cursor.execute(
            "INSERT INTO transfers (from_id, to_id, amount) VALUES (%s, %s, %s)",
            (from_account, to_account, amount)
        )

        # Commit (Durability: persisted to disk)
        conn.commit()
        print(f"Transferred ${amount} successfully")

    except Exception as e:
        # Rollback on any error (Atomicity)
        conn.rollback()
        print(f"Transfer failed: {e}")
    finally:
        conn.close()

Database Indexing

An index is a data structure that speeds up data retrieval at the cost of additional storage and slower writes. Without indexes, the database must scan every row in a table (full table scan).

B-Tree Index Structure
[Diagram: B-tree index. The root node holds keys (30 | 60); internal nodes hold (10 | 20), (40 | 50), and (70 | 80); leaf nodes hold the sorted key blocks. Searching for key 42 follows Root (30, 60) -> Internal (40, 50) -> Leaf -- only 3 steps, O(log n).]
indexing_examples.sql
-- Without index: Full table scan O(n)
-- With 10M rows, this scans ALL 10M rows
SELECT * FROM users WHERE email = '[email protected]';

-- Create an index on the email column
CREATE INDEX idx_users_email ON users(email);

-- Now the same query uses O(log n) lookup!
-- With 10M rows: ~24 comparisons vs 10,000,000

-- Composite index for multi-column queries
CREATE INDEX idx_orders_user_date
ON orders(user_id, created_at DESC);

-- This query benefits from the composite index
SELECT * FROM orders
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 10;

-- Check query execution plan
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = '[email protected]';
-- Without index: Seq Scan, cost=0..1234, time=450ms
-- With index:    Index Scan, cost=0..8, time=0.1ms

Common Pitfall: Over-Indexing

Indexes speed up reads but slow down writes (every INSERT/UPDATE must also update the index). Do not create indexes on every column. Only index columns that appear in WHERE, JOIN, and ORDER BY clauses of frequent queries.
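
The read-side benefit is easy to verify yourself. A minimal sketch using Python's built-in sqlite3 (table, column, and index names are illustrative): the query planner reports a full scan before the index exists and an index search after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

query = "SELECT * FROM users WHERE email = '[email protected]'"

# Before indexing: the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(plan_before)

# After indexing: the planner switches to a B-tree index search.
conn.execute("CREATE INDEX idx_users_email ON users(email)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(plan_after)
```

The exact wording varies by SQLite version, but the plan changes from a SCAN of the table to a SEARCH using idx_users_email -- the same shift PostgreSQL's EXPLAIN ANALYZE shows above as Seq Scan vs Index Scan.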

Sharding and Partitioning

When a single database server cannot handle your data volume or traffic, you split the data across multiple servers. This is called sharding (horizontal partitioning).

Database Sharding by User ID
[Diagram: the application server sends queries to a shard router that computes user_id % 4. Shard 0 holds user_id % 4 = 0 (IDs 4, 8, 12, 16, ...), Shard 1 holds remainder 1 (IDs 1, 5, 9, 13, ...), Shard 2 holds remainder 2, Shard 3 holds remainder 3 -- roughly 2.5M users each.]
sharding_example.py
class ShardRouter:
    """Route queries to the correct database shard."""

    def __init__(self, shard_connections):
        self.shards = shard_connections  # List of DB connections
        self.num_shards = len(shard_connections)

    def get_shard(self, user_id):
        """Determine which shard holds this user's data."""
        shard_index = user_id % self.num_shards
        return self.shards[shard_index]

    def get_user(self, user_id):
        """Fetch a user from the correct shard."""
        shard = self.get_shard(user_id)
        cursor = shard.cursor()
        cursor.execute(
            "SELECT * FROM users WHERE id = %s",
            (user_id,)
        )
        return cursor.fetchone()

    def create_user(self, user_id, name, email):
        """Insert a user into the correct shard."""
        shard = self.get_shard(user_id)
        cursor = shard.cursor()
        cursor.execute(
            "INSERT INTO users (id, name, email) VALUES (%s, %s, %s)",
            (user_id, name, email)
        )
        shard.commit()

    def query_all_shards(self, query):
        """Fan-out query to all shards (expensive!)."""
        results = []
        for shard in self.shards:
            cursor = shard.cursor()
            cursor.execute(query)
            results.extend(cursor.fetchall())
        return results

# Usage
router = ShardRouter([db_shard_0, db_shard_1, db_shard_2, db_shard_3])
user = router.get_user(42)  # Routes to shard 42 % 4 = shard 2
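
One drawback of modulo routing: changing the shard count remaps almost every key, forcing a massive data migration. A consistent-hash ring keeps most keys in place when a shard is added. This is a standalone sketch (not part of the ShardRouter above); shard names and the replica count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to shards so adding a shard moves only ~1/N of the keys."""

    def __init__(self, shard_names, replicas=100):
        # Each shard gets many "virtual nodes" to smooth the distribution.
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, shard_name)
        for name in shard_names:
            self.add_shard(name)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, name):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{name}:{i}"), name))
        self.ring.sort()

    def get_shard(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(str(key))
        idx = bisect.bisect(self.ring, (h,))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["shard0", "shard1", "shard2", "shard3"])
before = {uid: ring.get_shard(uid) for uid in range(10_000)}
ring.add_shard("shard4")
after = {uid: ring.get_shard(uid) for uid in range(10_000)}
moved = sum(1 for uid in before if before[uid] != after[uid])
print(f"{moved / 10_000:.0%} of keys moved")  # roughly 1/5, vs ~80% with modulo
```

Going from 4 to 5 shards with `user_id % N` moves every key whose remainder changes (about 80% of them); the ring moves only the keys that now belong to the new shard.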

Practice Problems

Easy Choose the Right Database

For each use case, recommend SQL or NoSQL and which specific database:

  1. An e-commerce platform with complex product relationships
  2. A real-time analytics dashboard ingesting millions of events/second
  3. A social network storing user profiles with varying fields

Consider: structured vs unstructured data, read vs write heavy, consistency requirements, query complexity.

# 1. E-commerce: SQL (PostgreSQL)
#    - Products, orders, users have clear relationships
#    - Need JOINs: orders -> order_items -> products
#    - ACID needed for payment transactions

# 2. Analytics: NoSQL (Apache Cassandra or ClickHouse)
#    - Write-heavy (millions of events/sec)
#    - Time-series data, append-only
#    - Horizontal scaling is critical
#    - Eventual consistency is acceptable

# 3. Social profiles: NoSQL (MongoDB)
#    - Varying fields per profile (flexible schema)
#    - Read-heavy (view profiles often)
#    - Document model maps naturally to profile data
#    - Easy to add new profile fields without migrations

Medium Design a Sharding Strategy

You have a messaging app with 100M users. Each user sends an average of 50 messages/day. Design a sharding strategy for the messages table.

  1. What shard key would you use?
  2. How many shards do you need?
  3. How do you handle conversations between users on different shards?

Think about the most common query pattern: fetching messages for a specific user. Shard by user_id or conversation_id. Consider data locality.

# Sharding strategy for messaging app

# 1. Shard key: conversation_id
#    - Most queries: "get messages in conversation X"
#    - All messages in a conversation on same shard
#    - Avoids cross-shard queries for chat history

# 2. Capacity planning:
messages_per_day = 100_000_000 * 50  # 5B messages/day
msg_size_bytes = 500               # avg message size
daily_data_tb = (messages_per_day * msg_size_bytes) / 1e12
# ~2.5 TB/day, ~912 TB/year

# Per shard: target 500GB-1TB storage
# Need ~16-32 shards initially, grow to 64+
# Use consistent hashing for easy reshard

# 3. Cross-shard conversations:
#    - Each conversation lives on ONE shard
#    - User inbox is a lookup table:
#      user_id -> [conversation_ids + shard locations]
#    - "Get my conversations" = query inbox, then
#      fetch latest message from each shard (parallel)

Medium Index Optimization

Given this query pattern for an orders table, design the optimal indexes:

-- Query 1: Orders by user, sorted by date
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC;
-- Query 2: Recent orders by status
SELECT * FROM orders WHERE status = 'pending' AND created_at > NOW() - INTERVAL '1 day';
-- Query 3: Order total by user
SELECT user_id, SUM(total) FROM orders GROUP BY user_id;

Use composite indexes that match the query patterns. The column order in a composite index matters -- put equality conditions first, then range/sort conditions.

-- Index for Query 1: user + date (composite)
CREATE INDEX idx_orders_user_date
ON orders(user_id, created_at DESC);
-- Equality on user_id, then sort by created_at

-- Index for Query 2: status + date (composite)
CREATE INDEX idx_orders_status_date
ON orders(status, created_at);
-- Equality on status, then range on created_at

-- Query 3: idx_orders_user_date partially helps
-- For better performance, consider a covering index:
CREATE INDEX idx_orders_user_total
ON orders(user_id) INCLUDE (total);
-- Covers the query without touching the table!

Quick Reference

Database Selection Guide

Database Type     Examples            Best For                             Avoid When
Relational (SQL)  PostgreSQL, MySQL   Complex queries, ACID transactions   Massive write throughput needed
Document          MongoDB, CouchDB    Flexible schema, nested data         Heavy cross-document joins
Key-Value         Redis, DynamoDB     Caching, sessions, simple lookups    Complex queries needed
Wide-Column       Cassandra, HBase    Time-series, high write throughput   Need strong consistency
Graph             Neo4j, Neptune      Relationship-heavy data              Simple CRUD operations
Search Engine     Elasticsearch       Full-text search, log analysis       Primary data store