Databases & Storage


Why Database Choice Matters

The Problem: Your database is the foundation of your system. Choose the wrong database and you will face performance bottlenecks, data inconsistencies, or an inability to scale when you need to.

The Solution: Understand the trade-offs between different database types, indexing strategies, and partitioning approaches so you can make informed decisions for your specific use case.

Real Impact: Uber migrated from PostgreSQL to a custom MySQL-based solution when their data grew beyond what a single relational database could handle efficiently.

Real-World Analogy

Think of databases like different types of filing systems:

  • SQL databases = A perfectly organized filing cabinet with labeled folders, cross-references, and a strict filing system
  • NoSQL document stores = A box of file folders -- each folder can contain different things, flexible but less organized
  • Key-value stores = A dictionary or phone book -- look up by key, get the value instantly
  • Graph databases = A web of sticky notes connected by strings -- great for relationships

SQL vs NoSQL

SQL vs NoSQL Comparison
[Diagram: side-by-side comparison of a SQL table and NoSQL documents]

SQL (Relational) -- structured tables with a fixed schema:
  + Strong consistency (ACID)
  + Complex queries (JOINs)
  + Mature ecosystem
  - Hard to scale horizontally
  - Rigid schema (migrations)
  Examples: PostgreSQL, MySQL, SQLite

NoSQL (Non-Relational) -- flexible documents, no fixed schema:
  + Easy horizontal scaling
  + Flexible schema
  + High write throughput
  - Eventual consistency (usually)
  - Limited JOIN support
  Examples: MongoDB, Cassandra, DynamoDB
Feature          SQL                             NoSQL
Schema           Fixed (schema-on-write)         Flexible (schema-on-read)
Scaling          Vertical (scale up)             Horizontal (scale out)
Relationships    JOINs across tables             Denormalized / embedded
Consistency      Strong (ACID)                   Eventual (BASE)
Query Language   SQL (standardized)              Varies by database
Best For         Complex queries, transactions   High throughput, flexible data
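
The schema-on-write vs schema-on-read distinction is easy to see in a few lines of Python -- a minimal sketch using the standard-library sqlite3 module for the relational side and JSON strings for the document side (the `users` table and its fields are illustrative, not from any real system):

```python
import json
import sqlite3

# Relational side: the schema is enforced at write time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Alice"))

# Writing a column the schema doesn't define fails immediately:
try:
    conn.execute("INSERT INTO users (id, name, hobbies) VALUES (?, ?, ?)",
                 (2, "Bob", "code"))
except sqlite3.OperationalError as e:
    print(f"Rejected at write time: {e}")

# Document side: any shape is accepted; the *reader* interprets it.
docs = [
    json.dumps({"id": 1, "name": "Alice", "hobbies": ["code"]}),
    json.dumps({"id": 2, "name": "Bob", "age": 30}),  # different fields, no migration
]
for raw in docs:
    doc = json.loads(raw)
    # Schema-on-read: handle missing fields at query time.
    print(doc["name"], "hobbies:", doc.get("hobbies", "none recorded"))
```

The flexibility cuts both ways: the document store happily accepted two differently-shaped records, which means every reader must now cope with fields that may or may not exist.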

ACID Properties

ACID guarantees are what make relational databases reliable for transactions like banking, e-commerce, and any operation where data integrity is critical.

Atomicity

All operations in a transaction either complete entirely or not at all. If a bank transfer debits one account, it must credit the other -- no partial updates.

Consistency

Every transaction brings the database from one valid state to another. All constraints, triggers, and rules are satisfied after the transaction completes.

Isolation

Concurrent transactions execute as if they were running sequentially. One transaction cannot see the intermediate state of another.

Durability

Once a transaction is committed, it stays committed -- even if the system crashes. Data is written to non-volatile storage (disk).

acid_transaction.py
import psycopg2

def transfer_money(from_account, to_account, amount):
    """Transfer money between accounts with ACID guarantees."""
    conn = psycopg2.connect(
        host="localhost", database="bank",
        user="admin", password="secret"
    )

    try:
        # psycopg2 opens a transaction implicitly on the first execute
        # (Atomicity: everything below commits or rolls back as a unit)
        cursor = conn.cursor()

        # Check sufficient balance (Consistency)
        cursor.execute(
            "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
            (from_account,)
        )
        balance = cursor.fetchone()[0]

        if balance < amount:
            raise ValueError("Insufficient funds")

        # Debit sender
        cursor.execute(
            "UPDATE accounts SET balance = balance - %s WHERE id = %s",
            (amount, from_account)
        )

        # Credit receiver
        cursor.execute(
            "UPDATE accounts SET balance = balance + %s WHERE id = %s",
            (amount, to_account)
        )

        # Record the transaction
        cursor.execute(
            "INSERT INTO transfers (from_id, to_id, amount) VALUES (%s, %s, %s)",
            (from_account, to_account, amount)
        )

        # Commit (Durability: persisted to disk)
        conn.commit()
        print(f"Transferred ${amount} successfully")

    except Exception as e:
        # Rollback on any error (Atomicity)
        conn.rollback()
        print(f"Transfer failed: {e}")
    finally:
        conn.close()

Database Indexing

An index is a data structure that speeds up data retrieval at the cost of additional storage and slower writes. Without indexes, the database must scan every row in a table (full table scan).

B-Tree Index Structure
[Diagram: B-tree index. The root node holds keys (30 | 60); internal nodes hold (10 | 20), (40 | 50), and (70 | 80); leaf nodes hold the sorted key blocks. Searching for key 42 follows Root (30, 60) -> Internal (40, 50) -> Leaf -- only 3 steps, O(log n).]
indexing_examples.sql
-- Without index: Full table scan O(n)
-- With 10M rows, this scans ALL 10M rows
SELECT * FROM users WHERE email = '[email protected]';

-- Create an index on the email column
CREATE INDEX idx_users_email ON users(email);

-- Now the same query uses O(log n) lookup!
-- With 10M rows: ~24 comparisons vs 10,000,000

-- Composite index for multi-column queries
CREATE INDEX idx_orders_user_date
ON orders(user_id, created_at DESC);

-- This query benefits from the composite index
SELECT * FROM orders
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 10;

-- Check query execution plan
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = '[email protected]';
-- Without index: Seq Scan, cost=0..1234, time=450ms
-- With index:    Index Scan, cost=0..8, time=0.1ms

Common Pitfall: Over-Indexing

Indexes speed up reads but slow down writes (every INSERT/UPDATE must also update the index). Do not create indexes on every column. Only index columns that appear in WHERE, JOIN, and ORDER BY clauses of frequent queries.
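
The read-side benefit is easy to verify yourself. A minimal sketch using Python's built-in sqlite3 (table, column, and index names are illustrative): the query planner reports a full scan before the index exists and an index search after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

query = "SELECT * FROM users WHERE email = '[email protected]'"

# Before indexing: the planner has no choice but a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(plan_before)

# After indexing: the planner switches to a B-tree index search.
conn.execute("CREATE INDEX idx_users_email ON users(email)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(plan_after)
```

The exact wording varies by SQLite version, but the plan changes from a SCAN of the table to a SEARCH using idx_users_email -- the same shift PostgreSQL's EXPLAIN ANALYZE shows above as Seq Scan vs Index Scan.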

Sharding and Partitioning

When a single database server cannot handle your data volume or traffic, you split the data across multiple servers. This is called sharding (horizontal partitioning).

Database Sharding by User ID
[Diagram: the application server sends queries to a shard router that computes user_id % 4. Shard 0 holds user_id % 4 = 0 (IDs 4, 8, 12, 16, ...), Shard 1 holds remainder 1 (IDs 1, 5, 9, 13, ...), Shard 2 holds remainder 2, Shard 3 holds remainder 3 -- roughly 2.5M users each.]
sharding_example.py
class ShardRouter:
    """Route queries to the correct database shard."""

    def __init__(self, shard_connections):
        self.shards = shard_connections  # List of DB connections
        self.num_shards = len(shard_connections)

    def get_shard(self, user_id):
        """Determine which shard holds this user's data."""
        shard_index = user_id % self.num_shards
        return self.shards[shard_index]

    def get_user(self, user_id):
        """Fetch a user from the correct shard."""
        shard = self.get_shard(user_id)
        cursor = shard.cursor()
        cursor.execute(
            "SELECT * FROM users WHERE id = %s",
            (user_id,)
        )
        return cursor.fetchone()

    def create_user(self, user_id, name, email):
        """Insert a user into the correct shard."""
        shard = self.get_shard(user_id)
        cursor = shard.cursor()
        cursor.execute(
            "INSERT INTO users (id, name, email) VALUES (%s, %s, %s)",
            (user_id, name, email)
        )
        shard.commit()

    def query_all_shards(self, query):
        """Fan-out query to all shards (expensive!)."""
        results = []
        for shard in self.shards:
            cursor = shard.cursor()
            cursor.execute(query)
            results.extend(cursor.fetchall())
        return results

# Usage
router = ShardRouter([db_shard_0, db_shard_1, db_shard_2, db_shard_3])
user = router.get_user(42)  # Routes to shard 42 % 4 = shard 2
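
One drawback of modulo routing: changing the shard count remaps almost every key, forcing a massive data migration. A consistent-hash ring keeps most keys in place when a shard is added. This is a standalone sketch (not part of the ShardRouter above); shard names and the replica count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to shards so adding a shard moves only ~1/N of the keys."""

    def __init__(self, shard_names, replicas=100):
        # Each shard gets many "virtual nodes" to smooth the distribution.
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, shard_name)
        for name in shard_names:
            self.add_shard(name)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_shard(self, name):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{name}:{i}"), name))
        self.ring.sort()

    def get_shard(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(str(key))
        idx = bisect.bisect(self.ring, (h,))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["shard0", "shard1", "shard2", "shard3"])
before = {uid: ring.get_shard(uid) for uid in range(10_000)}
ring.add_shard("shard4")
after = {uid: ring.get_shard(uid) for uid in range(10_000)}
moved = sum(1 for uid in before if before[uid] != after[uid])
print(f"{moved / 10_000:.0%} of keys moved")  # roughly 1/5, vs ~80% with modulo
```

Going from 4 to 5 shards with `user_id % N` moves every key whose remainder changes (about 80% of them); the ring moves only the keys that now belong to the new shard.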

Practice Problems

Easy Choose the Right Database

For each use case, recommend SQL or NoSQL and which specific database:

  1. An e-commerce platform with complex product relationships
  2. A real-time analytics dashboard ingesting millions of events/second
  3. A social network storing user profiles with varying fields

Consider: structured vs unstructured data, read vs write heavy, consistency requirements, query complexity.

# 1. E-commerce: SQL (PostgreSQL)
#    - Products, orders, users have clear relationships
#    - Need JOINs: orders -> order_items -> products
#    - ACID needed for payment transactions

# 2. Analytics: NoSQL (Apache Cassandra or ClickHouse)
#    - Write-heavy (millions of events/sec)
#    - Time-series data, append-only
#    - Horizontal scaling is critical
#    - Eventual consistency is acceptable

# 3. Social profiles: NoSQL (MongoDB)
#    - Varying fields per profile (flexible schema)
#    - Read-heavy (view profiles often)
#    - Document model maps naturally to profile data
#    - Easy to add new profile fields without migrations

Medium Design a Sharding Strategy

You have a messaging app with 100M users. Each user sends an average of 50 messages/day. Design a sharding strategy for the messages table.

  1. What shard key would you use?
  2. How many shards do you need?
  3. How do you handle conversations between users on different shards?

Think about the most common query pattern: fetching messages for a specific user. Shard by user_id or conversation_id. Consider data locality.

# Sharding strategy for messaging app

# 1. Shard key: conversation_id
#    - Most queries: "get messages in conversation X"
#    - All messages in a conversation on same shard
#    - Avoids cross-shard queries for chat history

# 2. Capacity planning:
messages_per_day = 100_000_000 * 50  # 5B messages/day
msg_size_bytes = 500               # avg message size
daily_data_tb = (messages_per_day * msg_size_bytes) / 1e12
# ~2.5 TB/day, ~912 TB/year

# Per shard: target 500GB-1TB storage
# Need ~16-32 shards initially, grow to 64+
# Use consistent hashing for easy reshard

# 3. Cross-shard conversations:
#    - Each conversation lives on ONE shard
#    - User inbox is a lookup table:
#      user_id -> [conversation_ids + shard locations]
#    - "Get my conversations" = query inbox, then
#      fetch latest message from each shard (parallel)

Medium Index Optimization

Given this query pattern for an orders table, design the optimal indexes:

-- Query 1: Orders by user, sorted by date
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC;
-- Query 2: Recent orders by status
SELECT * FROM orders WHERE status = 'pending' AND created_at > NOW() - INTERVAL '1 day';
-- Query 3: Order total by user
SELECT user_id, SUM(total) FROM orders GROUP BY user_id;

Use composite indexes that match the query patterns. The column order in a composite index matters -- put equality conditions first, then range/sort conditions.

-- Index for Query 1: user + date (composite)
CREATE INDEX idx_orders_user_date
ON orders(user_id, created_at DESC);
-- Equality on user_id, then sort by created_at

-- Index for Query 2: status + date (composite)
CREATE INDEX idx_orders_status_date
ON orders(status, created_at);
-- Equality on status, then range on created_at

-- Query 3: idx_orders_user_date partially helps
-- For better performance, consider a covering index:
CREATE INDEX idx_orders_user_total
ON orders(user_id) INCLUDE (total);
-- Covers the query without touching the table!

Quick Reference

Database Selection Guide

Database Type     Examples            Best For                             Avoid When
Relational (SQL)  PostgreSQL, MySQL   Complex queries, ACID transactions   Massive write throughput needed
Document          MongoDB, CouchDB    Flexible schema, nested data         Heavy cross-document joins
Key-Value         Redis, DynamoDB     Caching, sessions, simple lookups    Complex queries needed
Wide-Column       Cassandra, HBase    Time-series, high write throughput   Need strong consistency
Graph             Neo4j, Neptune      Relationship-heavy data              Simple CRUD operations
Search Engine     Elasticsearch       Full-text search, log analysis       Primary data store