Design a Chat System

Hard 35 min read

Problem Statement & Requirements

Why Chat Systems Are a Classic Interview Question

The Challenge: Real-time messaging requires persistent connections, guaranteed delivery, and the ability to handle millions of concurrent users.

The Complexity: Chat systems combine networking, storage, queueing, and presence management into one cohesive design.

Real Scale: WhatsApp handles 100 billion messages per day with just 50 engineers. Slack serves 12 million daily active users.

Real-World Analogy

Think of a chat system like a postal service for the digital age:

  • WebSocket connection = Having a dedicated phone line to the post office
  • Chat server = The sorting facility that routes messages
  • Message queue = The mail truck holding messages for offline recipients
  • Presence service = The "open/closed" sign on your mailbox
  • Push notification = The doorbell that rings when a package arrives

Functional Requirements

1:1 Messaging

Users can send and receive text messages in real-time. Messages must be delivered in order with at-least-once guarantee.

Group Chat

Support group conversations with up to 500 members. All members receive messages in the same order.

Online Presence

Show whether a user is online, offline, or away. Last-seen timestamp for offline users.

Message History

Persist all messages. Users can load chat history when they open a conversation or switch devices.

Non-Functional Requirements

Key Constraints

  • Low Latency: Messages delivered within 100ms for online users
  • High Availability: 99.99% uptime -- people depend on chat for work and personal life
  • Consistency: Messages must never be lost and must maintain causal ordering
  • Scale: 50 million daily active users, 500 million messages per day
  • Multi-Device: Users can be logged in on phone, tablet, and desktop simultaneously

High-Level Architecture

Chat System High-Level Architecture
Sender Alice WebSocket Gateway Persistent conn management Chat Service Message routing & orchestration Message Queue Kafka / RabbitMQ Message Store Cassandra / HBase Presence Service Redis Push Service APNs / FCM WebSocket Gateway Receiver's connection Receiver Bob WS RPC Forward msg Persist Store Update If offline

WebSocket vs Long Polling

Feature WebSocket Long Polling Server-Sent Events
Direction Full duplex (bidirectional) Client-initiated only Server to client only
Connection Persistent Repeatedly opened/closed Persistent (one-way)
Latency Very low (~50ms) Higher (~300ms+) Low (~100ms)
Server overhead Connection memory Repeated handshakes Connection memory
Best for Chat, gaming, trading Compatibility fallback News feeds, dashboards
WebSocket Connection Lifecycle
Client Server HTTP Upgrade Request 101 Switching Protocols Send: "Hello Bob!" Receive: "Hey Alice!" Send: "How are you?" Ping (heartbeat) Pong
websocket_server.py
import asyncio
import websockets
import json
from collections import defaultdict

# Map: user_id -> set of websocket connections (multi-device)
connections: dict[str, set] = defaultdict(set)

async def handle_connection(websocket, path):
    """Handle a new WebSocket connection."""
    user_id = await authenticate(websocket)
    connections[user_id].add(websocket)

    try:
        async for raw_message in websocket:
            message = json.loads(raw_message)
            await process_message(user_id, message)
    finally:
        connections[user_id].discard(websocket)
        if not connections[user_id]:
            del connections[user_id]
            await update_presence(user_id, "offline")


async def process_message(sender_id: str, message: dict):
    """Route message to recipient(s)."""
    msg_type = message["type"]

    if msg_type == "direct":
        recipient_id = message["to"]
        payload = {
            "from": sender_id,
            "text": message["text"],
            "timestamp": get_timestamp(),
            "msg_id": generate_msg_id()
        }

        # Persist to message store
        await store_message(sender_id, recipient_id, payload)

        # Deliver to online recipient (all devices)
        if recipient_id in connections:
            for ws in connections[recipient_id]:
                await ws.send(json.dumps(payload))
        else:
            # User offline: send push notification
            await send_push_notification(recipient_id, payload)

    elif msg_type == "group":
        await handle_group_message(sender_id, message)


# Start WebSocket server
server = websockets.serve(handle_connection, "0.0.0.0", 8765)
asyncio.get_event_loop().run_until_complete(server)
asyncio.get_event_loop().run_forever()

Message Delivery Flow

Message Delivery: Sender to Receiver
1. Alice sends msg via WebSocket 2. WS Gateway Receives & validates 3. Chat Service Assigns ID + timestamp 4. Write to Queue Kafka partition by chat_id 5. Persist Message Write to Cassandra 6. Lookup Bob's server Session service (Redis) 7. Forward to Bob's WS Push via gateway 8. Bob receives Sends ACK Message Delivery Status Sent Delivered Read Failed (retry)

Message Ordering Guarantee

Problem: With distributed servers, messages can arrive out of order.

Solution: Use a monotonically increasing message ID per conversation. The client re-orders messages locally using this ID. Kafka partitioning by chat_id ensures ordering within a conversation.

Group Chat Fan-Out

Group Message Fan-Out Strategy
Alice sends to group Fan-Out Service 1. Look up members 2. Write to each member's inbox queue Bob's Queue Carol's Queue Dave's Queue Eve's Queue Frank (offline) Bob Carol Dave Eve

Fan-Out on Write vs Fan-Out on Read

Fan-Out on Write (Push Model): When a message is sent, immediately deliver it to all group members' inboxes. Fast reads but expensive writes for large groups.

Fan-Out on Read (Pull Model): Store the message once. When a member opens the group, fetch new messages. Cheap writes but slower reads.

Hybrid: Use push for small groups (under 500 members) and pull for large groups or celebrity accounts.

Online/Offline Status

presence_service.py
import redis
import time

class PresenceService:
    """
    Tracks online/offline status using Redis with TTL.

    Approach: Heartbeat-based presence detection.
    - Client sends heartbeat every 5 seconds
    - If no heartbeat for 30 seconds, user is offline
    - Uses Redis key expiry for automatic cleanup
    """

    def __init__(self):
        self.redis = redis.Redis(host='localhost', port=6379)
        self.HEARTBEAT_TTL = 30  # seconds

    def heartbeat(self, user_id: str):
        """Update user's last-seen timestamp with TTL."""
        key = f"presence:{user_id}"
        self.redis.setex(key, self.HEARTBEAT_TTL, str(time.time()))

    def is_online(self, user_id: str) -> bool:
        """Check if user has an active heartbeat."""
        return self.redis.exists(f"presence:{user_id}") > 0

    def get_last_seen(self, user_id: str) -> float:
        """Get user's last heartbeat timestamp."""
        ts = self.redis.get(f"presence:{user_id}")
        return float(ts) if ts else 0

    def get_online_friends(self, user_id: str) -> list:
        """Get list of online friends using pipeline for efficiency."""
        friends = get_friends_list(user_id)
        pipe = self.redis.pipeline()
        for friend_id in friends:
            pipe.exists(f"presence:{friend_id}")
        results = pipe.execute()
        return [f for f, online in zip(friends, results) if online]

Presence at Scale: The Hard Problem

With 50M users, broadcasting status changes to all friends creates a storm of events. Solutions:

  • Lazy updates: Only fetch presence when a user opens a chat
  • Pub/Sub channels: Subscribe to friends' presence changes only while app is active
  • Batching: Aggregate status changes and push every 10 seconds instead of instantly

Message Storage & Retrieval

Why Cassandra for Messages?

Chat messages have a write-heavy workload with sequential reads by conversation. Cassandra excels at this pattern with its LSM-tree storage engine and partition-key-based data locality.

message_schema.cql
-- Cassandra schema for chat messages
-- Partition key: chat_id (groups messages by conversation)
-- Clustering key: msg_id (sorts messages within conversation)

CREATE TABLE messages (
    chat_id     UUID,
    msg_id      TIMEUUID,
    sender_id   UUID,
    content     TEXT,
    msg_type    TEXT,      -- 'text', 'image', 'file'
    created_at  TIMESTAMP,
    PRIMARY KEY (chat_id, msg_id)
) WITH CLUSTERING ORDER BY (msg_id DESC);

-- Efficient query: "Get latest 50 messages in chat"
-- SELECT * FROM messages WHERE chat_id = ? LIMIT 50;

-- User's chat list (most recent message per chat)
CREATE TABLE user_chats (
    user_id     UUID,
    chat_id     UUID,
    last_msg    TEXT,
    last_time   TIMESTAMP,
    unread      INT,
    PRIMARY KEY (user_id, last_time)
) WITH CLUSTERING ORDER BY (last_time DESC);

Push Notifications

Push Notification Flow

  • Step 1: Chat service detects recipient is offline (no active WebSocket)
  • Step 2: Publish notification event to a dedicated push queue
  • Step 3: Push service looks up device tokens from the user's device registry
  • Step 4: Send via APNs (iOS) or FCM (Android) with message preview
  • Step 5: Handle token invalidation (uninstalled apps) and rate limiting

Push Notification Pitfalls

Notification storms: In active group chats, batch notifications instead of sending one per message. Show "5 new messages in Team Chat" instead of five separate alerts.

Badge counts: Maintain an atomic unread counter per user in Redis. Increment on message delivery, reset when user opens the chat.

Practice Problems

Medium End-to-End Encryption

Design end-to-end encryption for the chat system so that even the server cannot read messages. Consider:

  1. How do you exchange encryption keys between users?
  2. How does E2E encryption work with multi-device support?
  3. How do you handle group chat encryption?

Look into the Signal Protocol (Double Ratchet Algorithm). Each device has its own key pair. For group chats, consider the Sender Key protocol where the sender shares a symmetric key with all group members.

# E2E Encryption Architecture (Signal Protocol)

# 1. Key Exchange (X3DH - Extended Triple Diffie-Hellman)
# - Each user registers public keys on the server
# - Identity Key (long-term), Signed Pre-Key, One-Time Pre-Keys
# - Sender fetches recipient's key bundle to establish session

# 2. Multi-device: Each device has its own key pair
# - Message is encrypted separately for each device
# - Server stores encrypted copies per device

# 3. Group Chat (Sender Keys Protocol)
# - Sender generates a symmetric "sender key"
# - Distributes sender key to all members (E2E encrypted)
# - All messages encrypted with sender key (efficient)
# - Rotate sender key when members join/leave

# Server stores only ciphertext -- cannot decrypt

Hard Message Search

Add full-text search to the chat system so users can search across all their conversations. How would you index messages while maintaining E2E encryption?

  1. What search engine would you use?
  2. How do you keep the search index in sync?
  3. Can you search encrypted messages?

For non-E2E: use Elasticsearch fed by a CDC pipeline from Cassandra. For E2E: search must happen client-side. Consider building a local search index on the device.

# Non-E2E Search: Server-side Elasticsearch
# - CDC (Change Data Capture) from Cassandra to Kafka
# - Kafka consumer indexes into Elasticsearch
# - Search API: GET /api/search?q=meeting&chat_id=xxx
# - Access control: only search user's own chats

# E2E Search: Client-side indexing
# - Build inverted index locally (SQLite FTS5)
# - Index decrypted messages on-device
# - Sync encrypted index blob across devices
# - Tradeoff: limited to messages stored on device

Hard Media Messages

Extend the chat system to support sending images and videos up to 100MB. How would you handle upload, storage, thumbnails, and delivery?

  1. Where do you store media files?
  2. How do you generate thumbnails for preview?
  3. How do you handle large file uploads reliably?

Use pre-signed URLs for direct upload to object storage (S3). Generate thumbnails asynchronously. Store only the media URL in the message record, not the file itself.

# Media Message Flow
# 1. Client requests upload URL: POST /api/media/upload
# 2. Server returns pre-signed S3 URL (15 min expiry)
# 3. Client uploads directly to S3 (chunked for large files)
# 4. S3 triggers Lambda for thumbnail generation
# 5. Client sends message with media_url reference
# 6. Recipient downloads media via CDN

# Storage: S3 with lifecycle policies
# - Hot: S3 Standard (first 30 days)
# - Warm: S3 IA (30-90 days)
# - Cold: S3 Glacier (90+ days)

# Thumbnails: Lambda + Sharp.js
# - Generate 150x150, 300x300, 600x600 variants
# - Store in separate S3 prefix: /thumbnails/

Quick Reference

Chat System Design Summary

Component Technology Rationale
Real-time Transport WebSocket Full duplex, low latency, persistent
Message Queue Kafka Ordered delivery, replay, high throughput
Message Storage Cassandra Write-optimized, partition by chat_id
Presence Redis TTL-based heartbeat, sub-ms reads
Session Store Redis Map user_id to WS gateway server
Push APNs + FCM Native mobile notifications
Media Storage S3 + CDN Scalable blob storage, global delivery

Key Takeaways

Interview Tips

  • Start by clarifying: 1:1 only or group chat? Text only or media? How many users?
  • WebSocket is the correct choice for real-time chat -- explain why HTTP polling falls short
  • Message ordering is subtle: use per-conversation sequence numbers, not wall-clock time
  • Separate the concerns: WS gateway, chat service, presence, push are all independent services
  • For group chat, discuss fan-out-on-write vs fan-out-on-read tradeoffs
  • Don't forget offline message delivery and multi-device synchronization