Data Replication & Consistency


Why Replication Matters

The Problem: A single database server is a single point of failure. If it goes down, your entire application is unavailable. It also limits read throughput and geographic reach.

The Solution: Replication keeps copies of data on multiple machines, providing high availability, fault tolerance, and improved read performance.

Real Impact: Amazon's DynamoDB replicates data across three availability zones, achieving 99.999% availability. That works out to about 5 minutes of downtime per year.

Real-World Analogy

Think of data replication like a chain of bookstores:

  • Leader-Follower: One flagship store (leader) receives all new books first. Branch stores (followers) get copies. Customers can browse at any branch, but new book orders go to the flagship.
  • Multi-Leader: Several flagship stores in different cities, each can accept new books. They synchronize inventory periodically, but sometimes conflicts arise (two stores ordering different quantities).
  • Leaderless: No flagship. Any store can accept new books. When a customer asks for a book, they check multiple stores and take the newest version.

Reasons to Replicate Data

High Availability

If one replica fails, others continue serving requests. No single point of failure. Critical for 99.99%+ uptime SLAs.

Read Scalability

Distribute read queries across replicas. One leader handles writes, many followers handle reads. Great for read-heavy workloads (100:1 read/write ratio).

Geographic Proximity

Place replicas close to users around the world. A user in Tokyo reads from a Tokyo replica instead of crossing the Pacific to reach a US server.

Disaster Recovery

If an entire data center is destroyed, replicas in other regions have your data. Recovery Point Objective (RPO) can be near zero.

Leader-Follower Replication

The most common replication strategy. One node (the leader) accepts all writes and propagates changes to followers. Followers are read-only replicas.

Leader-Follower Replication Flow
[Diagram: a client sends writes to the leader, which handles reads and writes and asynchronously replicates changes to Followers 1-3; each follower is read-only and serves its own readers.]

Synchronous vs Asynchronous Replication

| Aspect | Synchronous | Asynchronous |
| --- | --- | --- |
| Write confirmed when | Leader + follower(s) acknowledge | Leader acknowledges immediately |
| Consistency | Strong: followers always up-to-date | Eventual: followers may lag |
| Write latency | Higher (must wait for replicas) | Lower (fire-and-forget) |
| Availability | Writes block if a follower is down | Leader can continue alone |
| Data loss risk | None on leader failure | Recent writes may be lost |
| Used by | Financial systems, inventory | Most web applications |
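The trade-off in the table comes down to when the leader acknowledges the client. A minimal sketch (illustrative classes, not a real database API) makes the difference concrete:

```python
# Sketch: synchronous vs asynchronous replication is a question of
# when the write is acknowledged. Class and method names are
# illustrative, not a real replication protocol.
from typing import Dict, List, Tuple

class Follower:
    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.store[key] = value

class Leader:
    def __init__(self, followers: List[Follower], synchronous: bool):
        self.store: Dict[str, str] = {}
        self.followers = followers
        self.synchronous = synchronous
        self.pending: List[Tuple[str, str]] = []  # queued async replication

    def write(self, key: str, value: str) -> None:
        self.store[key] = value
        if self.synchronous:
            # Synchronous: block until every follower has applied the write
            for f in self.followers:
                f.apply(key, value)
        else:
            # Asynchronous: acknowledge now, replicate in the background
            self.pending.append((key, value))

    def flush(self) -> None:
        # Background replication catching followers up
        for key, value in self.pending:
            for f in self.followers:
                f.apply(key, value)
        self.pending.clear()
```

With `synchronous=True` a follower is always up-to-date after `write` returns; with `synchronous=False` it lags until `flush` runs, which is exactly the replication-lag window described next.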

Replication Lag Problem

Scenario: A user updates their profile (write goes to leader), then immediately views their profile (read hits a follower that hasn't replicated yet). They see stale data!

Solutions:

  • Read-your-writes consistency: After a write, route subsequent reads to the leader (for that user only)
  • Monotonic reads: Ensure a user always reads from the same replica
  • Consistent prefix reads: If write A happened before B, readers see A before B
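One simple way to implement monotonic reads is to pin each user to a replica deterministically, for example by hashing the user ID (a sketch; the replica names are placeholders):

```python
# Sketch: pin each user to one follower by hashing their ID, so
# successive reads by the same user never jump between replicas.
# Replica identifiers here are illustrative, not a real driver API.
import hashlib

def pick_replica(user_id: str, replicas: list) -> str:
    # SHA-256 is deterministic: the same user always maps to the
    # same follower, giving monotonic reads for that user.
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]

replicas = ["follower-0", "follower-1", "follower-2"]
assert pick_replica("user:42", replicas) == pick_replica("user:42", replicas)
```

The user can still see lag relative to the leader, but they never observe time moving backwards across successive reads.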

Multi-Leader Replication

Multiple nodes accept writes. Each leader replicates to all other leaders. Used for multi-datacenter setups where you need local write performance.

When to Use Multi-Leader

Multi-datacenter operation (each DC has a leader), offline-capable applications (phone acts as a leader), collaborative editing (Google Docs).

The Write Conflict Problem

Two leaders accept conflicting writes simultaneously (e.g., two users edit the same row). You need a conflict resolution strategy.

Conflict Resolution Strategies

| Strategy | How It Works | Pros / Cons |
| --- | --- | --- |
| Last-Writer-Wins (LWW) | Attach a timestamp; highest timestamp wins | Simple but can silently lose data |
| Merge Values | Combine conflicting values (e.g., union of sets) | Application-specific; preserves data |
| Custom Handler | Application code resolves conflicts on read or write | Most flexible but complex |
| CRDTs | Conflict-free Replicated Data Types that merge automatically | Elegant but limited to certain data structures |
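As a concrete (if minimal) example of a CRDT, a grow-only counter gives each replica its own slot and merges by taking the element-wise maximum, so concurrent increments never conflict:

```python
# Minimal G-Counter (grow-only counter) CRDT sketch.
# Each replica increments only its own slot; merge is element-wise max,
# which is commutative, associative, and idempotent, so replicas
# converge no matter how merges are ordered or repeated.
from typing import Dict

class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

a = GCounter("nyc")
b = GCounter("lon")
a.increment(3)   # 3 increments recorded in New York
b.increment(2)   # 2 increments recorded in London
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # replicas converge
```

Counters, sets, and registers have well-known CRDT formulations; arbitrary data structures do not, which is the limitation noted in the table above.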

Leaderless Replication

No designated leader. Any replica can accept reads and writes. The client sends writes to multiple replicas in parallel. Used by Amazon Dynamo, Apache Cassandra, and Riak.

How Leaderless Writes Work

The client sends a write to all N replicas simultaneously. The write is considered successful when W replicas acknowledge it (W is the write quorum). Similarly, reads query R replicas and take the most recent value.

Consistency Models

| Model | Guarantee | Performance | Example Systems |
| --- | --- | --- | --- |
| Strong (Linearizable) | Every read returns the most recent write; appears as if there is one copy | Slowest: requires coordination | ZooKeeper, Spanner, etcd |
| Sequential | All nodes see operations in the same order (not necessarily real-time) | Fast: no real-time requirement | ZooKeeper (for writes) |
| Causal | If operation A caused operation B, everyone sees A before B | Good balance of consistency and speed | MongoDB (causal sessions) |
| Eventual | All replicas converge to the same value eventually; reads may return stale data temporarily | Fastest: no coordination needed | DynamoDB, Cassandra, S3 |

CAP Theorem Connection

The CAP theorem states that in a network partition, you must choose between Consistency (C) and Availability (A). Strong consistency requires sacrificing availability during partitions (CP systems like ZooKeeper). Eventual consistency favors availability (AP systems like Cassandra). In practice, most systems let you tune this trade-off per query.
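Tuning the trade-off per query usually means choosing how many replicas must respond before an operation returns. A sketch in the style of Cassandra's consistency levels (the level names are illustrative):

```python
# Sketch of per-query tunable consistency: the client picks how many
# of N replicas must answer before the operation completes. Level
# names mirror Cassandra-style levels but are illustrative here.
def required_responses(n: int, level: str) -> int:
    # "one" favors availability and latency; "quorum" favors
    # consistency; "all" is strongest but least available.
    levels = {"one": 1, "quorum": n // 2 + 1, "all": n}
    return levels[level]

assert required_responses(3, "one") == 1
assert required_responses(3, "quorum") == 2
assert required_responses(5, "quorum") == 3
assert required_responses(5, "all") == 5
```

A quorum read paired with a quorum write satisfies W + R > N, which is exactly the overlap condition developed in the next section.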

Quorum Reads and Writes

In a leaderless system with N replicas, a write quorum W and read quorum R ensure consistency when W + R > N. This guarantees that at least one node in a read overlaps with the nodes that acknowledged the write.

Quorum Read/Write (N=3, W=2, R=2)
[Diagram: with N=3, W=2, R=2 and replica R3 down, a write succeeds with acks from R1 and R2 (2 >= W). A read queries R1 (v2, new) and R2 (v1, old) and returns v2, the newest of the two responses. W + R > N (2 + 2 > 3) guarantees the read set overlaps the write set.]

Common Quorum Configurations

| Configuration | N | W | R | Use Case |
| --- | --- | --- | --- | --- |
| Standard | 3 | 2 | 2 | Balanced read/write consistency |
| Write-heavy | 3 | 1 | 3 | Fast writes; reads check all replicas |
| Read-heavy | 3 | 3 | 1 | Slow writes (all must ack), fast reads from any replica |
| Large cluster | 5 | 3 | 3 | Tolerates 2 node failures |
replication_simulator.py
# Simulating quorum-based replication
import time
from dataclasses import dataclass, field
from typing import Optional, Dict, List

@dataclass
class VersionedValue:
    value: str
    version: int
    timestamp: float = field(default_factory=time.time)

class Replica:
    def __init__(self, replica_id: str, is_healthy=True):
        self.replica_id = replica_id
        self.is_healthy = is_healthy
        self.store: Dict[str, VersionedValue] = {}

    def write(self, key: str, value: str,
              version: int) -> bool:
        if not self.is_healthy:
            raise ConnectionError(f"{self.replica_id} is down")
        # Only accept newer versions
        existing = self.store.get(key)
        if existing and existing.version >= version:
            return False
        self.store[key] = VersionedValue(value, version)
        return True

    def read(self, key: str) -> Optional[VersionedValue]:
        if not self.is_healthy:
            raise ConnectionError(f"{self.replica_id} is down")
        return self.store.get(key)

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        self.n = n
        self.w = w
        self.r = r
        self.replicas = [Replica(f"R{i}") for i in range(n)]
        self.version_counter = 0

    def write(self, key: str, value: str) -> bool:
        """Write to W replicas for quorum."""
        self.version_counter += 1
        acks = 0
        for replica in self.replicas:
            try:
                if replica.write(key, value, self.version_counter):
                    acks += 1
            except ConnectionError:
                continue
        success = acks >= self.w
        print(f"Write '{key}'='{value}' v{self.version_counter}: "
              f"{acks}/{self.n} acks ({'OK' if success else 'FAIL'})")
        return success

    def read(self, key: str) -> Optional[str]:
        """Read from R replicas, return newest."""
        responses: List[VersionedValue] = []
        for replica in self.replicas:
            try:
                val = replica.read(key)
                if val:
                    responses.append(val)
            except ConnectionError:
                continue

        if len(responses) < self.r:
            print(f"Read '{key}': quorum not met")
            return None

        # Return the value with highest version
        newest = max(responses, key=lambda v: v.version)
        print(f"Read '{key}': '{newest.value}' (v{newest.version})")
        return newest.value

# Demo
store = QuorumStore(n=3, w=2, r=2)
store.write("user:1", "Alice")
store.read("user:1")

# Simulate a node failure
store.replicas[2].is_healthy = False
store.write("user:1", "Alice Updated")  # Still succeeds (2 acks)
store.read("user:1")   # Still succeeds (2 responses)

Practice Problems

Medium Replication Lag Scenario

You have a leader-follower setup with 3 followers. A user updates their email, then immediately reads their profile. They see the old email. How do you fix this?

  1. Explain why this happens
  2. Propose two different solutions
  3. Discuss the trade-offs of each solution

Think about "read-your-writes" consistency. Where should the read go? Can you track the write timestamp?

# Why it happens:
# Write goes to leader, but read hits a follower
# that hasn't received the replication yet.

# Solution 1: Read-your-writes consistency
# After a user writes, route their reads to the
# leader for a short window (e.g., 10 seconds).
# (leader and followers are assumed connection handles)
import time
import random

def get_read_node(user_id, last_write_time):
    if time.time() - last_write_time < 10:
        return leader  # Read from leader
    return random.choice(followers)

# Solution 2: Version-based routing
# Track the write version in the client session.
# Only read from replicas that have caught up.
def read_with_min_version(key, min_version):
    for replica in replicas:
        if replica.version >= min_version:
            return replica.read(key)
    return leader.read(key)  # Fallback

# Trade-offs:
# Solution 1: Simple, but increases leader load
# Solution 2: More complex, but keeps load balanced

Hard Quorum Configuration

You have a Cassandra cluster with 5 nodes (N=5). Design quorum configurations for:

  1. A social media "likes" counter (writes much more frequent than reads)
  2. A banking transaction ledger (strong consistency required)
  3. A user session store (fast reads, tolerance for slight staleness)

Remember: W + R > N guarantees consistency. For strong consistency, maximize overlap. For speed, minimize the quorum for your hot path (reads or writes).

# N=5 for all configurations

# 1. Likes counter (write-heavy)
# W=1, R=5: writes are fast (only 1 ack needed);
# reads check all nodes for the latest value.
# W + R = 6 > 5, so reads that succeed see the newest write,
# but reads block if any single node is down.
# Since a like count tolerates slight staleness, W=1, R=1
# with eventual consistency is also a reasonable choice.

# 2. Banking ledger (strong consistency)
# W=3, R=3: Majority quorum for both
# W + R = 6 > 5 (always consistent)
# Tolerates 2 node failures for reads/writes
# Slower but guarantees no stale reads

# 3. Session store (read-heavy)
# W=3, R=1: Fast reads from any single node
# W + R = 4 < 5 (NOT strictly consistent!)
# Better: W=5, R=1 for consistency
# Or accept eventual consistency: W=2, R=1

Hard Conflict Resolution

In a multi-leader setup, User A in New York and User B in London simultaneously edit the same document title. User A changes it to "Project Alpha" and User B changes it to "Project Beta". Design a conflict resolution strategy that:

  1. Does not silently lose either edit
  2. Provides a deterministic resolution across all replicas
  3. Allows human review of conflicts

Consider storing both versions as "siblings" (like Riak). Use vector clocks to detect conflicts. Present conflicts to the user on the next access.

from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ConflictRecord:
    key: str
    siblings: List[Dict]  # All conflicting values
    resolved: bool = False

class ConflictResolver:
    def detect(self, v1, v2):
        # Use vector clocks to detect
        # concurrent writes (neither dominates)
        if v1.clock.concurrent_with(v2.clock):
            return ConflictRecord(
                key=v1.key,
                siblings=[
                    {"value": v1.value, "origin": "NYC"},
                    {"value": v2.value, "origin": "LON"},
                ]
            )
        # One dominates -> no conflict
        return None

    def auto_resolve(self, conflict):
        # Deterministic: sort by origin, pick first
        # Both replicas will reach same result
        sorted_siblings = sorted(
            conflict.siblings,
            key=lambda s: s["origin"]
        )
        return sorted_siblings[0]["value"]

    def human_resolve(self, conflict, chosen_idx):
        # User picks which version to keep
        conflict.resolved = True
        return conflict.siblings[chosen_idx]["value"]

Quick Reference

Replication Strategy Comparison

| Strategy | Write Perf | Consistency | Fault Tolerance | Complexity |
| --- | --- | --- | --- | --- |
| Leader-Follower | Good (single leader) | Strong (sync) / Eventual (async) | Failover needed | Low |
| Multi-Leader | Great (local writes) | Eventual + conflicts | High | High |
| Leaderless | Good (parallel writes) | Tunable (W + R > N) | Very high | Medium |

Key Formulas

Essential Replication Formulas

  • Quorum condition: W + R > N (guarantees at least one overlapping node)
  • Write availability: Tolerates N - W node failures for writes
  • Read availability: Tolerates N - R node failures for reads
  • Strong consistency: W = N and R = 1, or W = 1 and R = N, or majority quorums
  • Sloppy quorum: W + R > N may not hold during network partitions (Dynamo-style)
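These formulas can be checked mechanically. A small helper (illustrative, not from any library) evaluates a quorum configuration against them:

```python
# Evaluate an (N, W, R) quorum configuration against the formulas above.
def quorum_properties(n: int, w: int, r: int) -> dict:
    return {
        "consistent": w + r > n,          # quorum condition: read/write sets overlap
        "write_fault_tolerance": n - w,   # node failures writes survive
        "read_fault_tolerance": n - r,    # node failures reads survive
    }

# Standard N=3, W=2, R=2: consistent, tolerates 1 failure either way
assert quorum_properties(3, 2, 2) == {
    "consistent": True,
    "write_fault_tolerance": 1,
    "read_fault_tolerance": 1,
}

# Read-heavy W=3, R=1 with N=5 is NOT strictly consistent (3 + 1 <= 5)
assert quorum_properties(5, 3, 1)["consistent"] is False
```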