Problem Statement & Requirements
Why Chat Systems Are a Classic Interview Question
The Challenge: Real-time messaging requires persistent connections, guaranteed delivery, and the ability to handle millions of concurrent users.
The Complexity: Chat systems combine networking, storage, queueing, and presence management into one cohesive design.
Real Scale: WhatsApp handles 100 billion messages per day with just 50 engineers. Slack serves 12 million daily active users.
Real-World Analogy
Think of a chat system like a postal service for the digital age:
- WebSocket connection = Having a dedicated phone line to the post office
- Chat server = The sorting facility that routes messages
- Message queue = The mail truck holding messages for offline recipients
- Presence service = The "open/closed" sign on your mailbox
- Push notification = The doorbell that rings when a package arrives
Functional Requirements
1:1 Messaging
Users can send and receive text messages in real-time. Messages must be delivered in order with at-least-once guarantee.
Group Chat
Support group conversations with up to 500 members. All members receive messages in the same order.
Online Presence
Show whether a user is online, offline, or away. Last-seen timestamp for offline users.
Message History
Persist all messages. Users can load chat history when they open a conversation or switch devices.
Non-Functional Requirements
Key Constraints
- Low Latency: Messages delivered within 100ms for online users
- High Availability: 99.99% uptime -- people depend on chat for work and personal life
- Consistency: Messages must never be lost and must maintain causal ordering
- Scale: 50 million daily active users, 500 million messages per day
- Multi-Device: Users can be logged in on phone, tablet, and desktop simultaneously
High-Level Architecture
WebSocket vs Long Polling
| Feature | WebSocket | Long Polling | Server-Sent Events |
|---|---|---|---|
| Direction | Full duplex (bidirectional) | Client-initiated only | Server to client only |
| Connection | Persistent | Repeatedly opened/closed | Persistent (one-way) |
| Latency | Very low (~50ms) | Higher (~300ms+) | Low (~100ms) |
| Server overhead | Connection memory | Repeated handshakes | Connection memory |
| Best for | Chat, gaming, trading | Compatibility fallback | News feeds, dashboards |
import asyncio
import websockets
import json
from collections import defaultdict
# Map: user_id -> set of websocket connections (multi-device)
connections: dict[str, set] = defaultdict(set)
async def handle_connection(websocket, path):
"""Handle a new WebSocket connection."""
user_id = await authenticate(websocket)
connections[user_id].add(websocket)
try:
async for raw_message in websocket:
message = json.loads(raw_message)
await process_message(user_id, message)
finally:
connections[user_id].discard(websocket)
if not connections[user_id]:
del connections[user_id]
await update_presence(user_id, "offline")
async def process_message(sender_id: str, message: dict):
"""Route message to recipient(s)."""
msg_type = message["type"]
if msg_type == "direct":
recipient_id = message["to"]
payload = {
"from": sender_id,
"text": message["text"],
"timestamp": get_timestamp(),
"msg_id": generate_msg_id()
}
# Persist to message store
await store_message(sender_id, recipient_id, payload)
# Deliver to online recipient (all devices)
if recipient_id in connections:
for ws in connections[recipient_id]:
await ws.send(json.dumps(payload))
else:
# User offline: send push notification
await send_push_notification(recipient_id, payload)
elif msg_type == "group":
await handle_group_message(sender_id, message)
# Start WebSocket server
server = websockets.serve(handle_connection, "0.0.0.0", 8765)
asyncio.get_event_loop().run_until_complete(server)
asyncio.get_event_loop().run_forever()
Message Delivery Flow
Message Ordering Guarantee
Problem: With distributed servers, messages can arrive out of order.
Solution: Use a monotonically increasing message ID per conversation. The client re-orders messages locally using this ID. Kafka partitioning by chat_id ensures ordering within a conversation.
Group Chat Fan-Out
Fan-Out on Write vs Fan-Out on Read
Fan-Out on Write (Push Model): When a message is sent, immediately deliver it to all group members' inboxes. Fast reads but expensive writes for large groups.
Fan-Out on Read (Pull Model): Store the message once. When a member opens the group, fetch new messages. Cheap writes but slower reads.
Hybrid: Use push for small groups (under 500 members) and pull for large groups or celebrity accounts.
Online/Offline Status
import redis
import time
class PresenceService:
"""
Tracks online/offline status using Redis with TTL.
Approach: Heartbeat-based presence detection.
- Client sends heartbeat every 5 seconds
- If no heartbeat for 30 seconds, user is offline
- Uses Redis key expiry for automatic cleanup
"""
def __init__(self):
self.redis = redis.Redis(host='localhost', port=6379)
self.HEARTBEAT_TTL = 30 # seconds
def heartbeat(self, user_id: str):
"""Update user's last-seen timestamp with TTL."""
key = f"presence:{user_id}"
self.redis.setex(key, self.HEARTBEAT_TTL, str(time.time()))
def is_online(self, user_id: str) -> bool:
"""Check if user has an active heartbeat."""
return self.redis.exists(f"presence:{user_id}") > 0
def get_last_seen(self, user_id: str) -> float:
"""Get user's last heartbeat timestamp."""
ts = self.redis.get(f"presence:{user_id}")
return float(ts) if ts else 0
def get_online_friends(self, user_id: str) -> list:
"""Get list of online friends using pipeline for efficiency."""
friends = get_friends_list(user_id)
pipe = self.redis.pipeline()
for friend_id in friends:
pipe.exists(f"presence:{friend_id}")
results = pipe.execute()
return [f for f, online in zip(friends, results) if online]
Presence at Scale: The Hard Problem
With 50M users, broadcasting status changes to all friends creates a storm of events. Solutions:
- Lazy updates: Only fetch presence when a user opens a chat
- Pub/Sub channels: Subscribe to friends' presence changes only while app is active
- Batching: Aggregate status changes and push every 10 seconds instead of instantly
Message Storage & Retrieval
Why Cassandra for Messages?
Chat messages have a write-heavy workload with sequential reads by conversation. Cassandra excels at this pattern with its LSM-tree storage engine and partition-key-based data locality.
-- Cassandra schema for chat messages
-- Partition key: chat_id (groups messages by conversation)
-- Clustering key: msg_id (sorts messages within conversation)
CREATE TABLE messages (
chat_id UUID,
msg_id TIMEUUID,
sender_id UUID,
content TEXT,
msg_type TEXT, -- 'text', 'image', 'file'
created_at TIMESTAMP,
PRIMARY KEY (chat_id, msg_id)
) WITH CLUSTERING ORDER BY (msg_id DESC);
-- Efficient query: "Get latest 50 messages in chat"
-- SELECT * FROM messages WHERE chat_id = ? LIMIT 50;
-- User's chat list (most recent message per chat)
CREATE TABLE user_chats (
user_id UUID,
chat_id UUID,
last_msg TEXT,
last_time TIMESTAMP,
unread INT,
PRIMARY KEY (user_id, last_time)
) WITH CLUSTERING ORDER BY (last_time DESC);
Push Notifications
Push Notification Flow
- Step 1: Chat service detects recipient is offline (no active WebSocket)
- Step 2: Publish notification event to a dedicated push queue
- Step 3: Push service looks up device tokens from the user's device registry
- Step 4: Send via APNs (iOS) or FCM (Android) with message preview
- Step 5: Handle token invalidation (uninstalled apps) and rate limiting
Push Notification Pitfalls
Notification storms: In active group chats, batch notifications instead of sending one per message. Show "5 new messages in Team Chat" instead of five separate alerts.
Badge counts: Maintain an atomic unread counter per user in Redis. Increment on message delivery, reset when user opens the chat.
Practice Problems
Medium End-to-End Encryption
Design end-to-end encryption for the chat system so that even the server cannot read messages. Consider:
- How do you exchange encryption keys between users?
- How does E2E encryption work with multi-device support?
- How do you handle group chat encryption?
Look into the Signal Protocol (Double Ratchet Algorithm). Each device has its own key pair. For group chats, consider the Sender Key protocol where the sender shares a symmetric key with all group members.
# E2E Encryption Architecture (Signal Protocol)
# 1. Key Exchange (X3DH - Extended Triple Diffie-Hellman)
# - Each user registers public keys on the server
# - Identity Key (long-term), Signed Pre-Key, One-Time Pre-Keys
# - Sender fetches recipient's key bundle to establish session
# 2. Multi-device: Each device has its own key pair
# - Message is encrypted separately for each device
# - Server stores encrypted copies per device
# 3. Group Chat (Sender Keys Protocol)
# - Sender generates a symmetric "sender key"
# - Distributes sender key to all members (E2E encrypted)
# - All messages encrypted with sender key (efficient)
# - Rotate sender key when members join/leave
# Server stores only ciphertext -- cannot decrypt
Hard Message Search
Add full-text search to the chat system so users can search across all their conversations. How would you index messages while maintaining E2E encryption?
- What search engine would you use?
- How do you keep the search index in sync?
- Can you search encrypted messages?
For non-E2E: use Elasticsearch fed by a CDC pipeline from Cassandra. For E2E: search must happen client-side. Consider building a local search index on the device.
# Non-E2E Search: Server-side Elasticsearch
# - CDC (Change Data Capture) from Cassandra to Kafka
# - Kafka consumer indexes into Elasticsearch
# - Search API: GET /api/search?q=meeting&chat_id=xxx
# - Access control: only search user's own chats
# E2E Search: Client-side indexing
# - Build inverted index locally (SQLite FTS5)
# - Index decrypted messages on-device
# - Sync encrypted index blob across devices
# - Tradeoff: limited to messages stored on device
Hard Media Messages
Extend the chat system to support sending images and videos up to 100MB. How would you handle upload, storage, thumbnails, and delivery?
- Where do you store media files?
- How do you generate thumbnails for preview?
- How do you handle large file uploads reliably?
Use pre-signed URLs for direct upload to object storage (S3). Generate thumbnails asynchronously. Store only the media URL in the message record, not the file itself.
# Media Message Flow
# 1. Client requests upload URL: POST /api/media/upload
# 2. Server returns pre-signed S3 URL (15 min expiry)
# 3. Client uploads directly to S3 (chunked for large files)
# 4. S3 triggers Lambda for thumbnail generation
# 5. Client sends message with media_url reference
# 6. Recipient downloads media via CDN
# Storage: S3 with lifecycle policies
# - Hot: S3 Standard (first 30 days)
# - Warm: S3 IA (30-90 days)
# - Cold: S3 Glacier (90+ days)
# Thumbnails: Lambda + Sharp.js
# - Generate 150x150, 300x300, 600x600 variants
# - Store in separate S3 prefix: /thumbnails/
Quick Reference
Chat System Design Summary
| Component | Technology | Rationale |
|---|---|---|
| Real-time Transport | WebSocket | Full duplex, low latency, persistent |
| Message Queue | Kafka | Ordered delivery, replay, high throughput |
| Message Storage | Cassandra | Write-optimized, partition by chat_id |
| Presence | Redis | TTL-based heartbeat, sub-ms reads |
| Session Store | Redis | Map user_id to WS gateway server |
| Push | APNs + FCM | Native mobile notifications |
| Media Storage | S3 + CDN | Scalable blob storage, global delivery |
Key Takeaways
Interview Tips
- Start by clarifying: 1:1 only or group chat? Text only or media? How many users?
- WebSocket is the correct choice for real-time chat -- explain why HTTP polling falls short
- Message ordering is subtle: use per-conversation sequence numbers, not wall-clock time
- Separate the concerns: WS gateway, chat service, presence, push are all independent services
- For group chat, discuss fan-out-on-write vs fan-out-on-read tradeoffs
- Don't forget offline message delivery and multi-device synchronization