Message Queues & Event-Driven Architecture

Why Message Queues Matter

The Problem: Synchronous communication between services creates tight coupling. If Service B is down, Service A fails too. If Service B is slow, Service A blocks waiting.

The Solution: Message queues decouple producers from consumers. The producer sends a message and moves on. The consumer processes it when ready.

Real Impact: LinkedIn processes over 7 trillion messages per day through Apache Kafka, enabling real-time analytics, notifications, and data pipelines.

Real-World Analogy

Think of a message queue like a post office:

  • Producer = Person dropping off a letter (sends and forgets)
  • Queue = Post office mailbox (stores messages safely)
  • Consumer = Recipient (picks up mail on their schedule)
  • Topic = Address routing (messages go to the right destination)
  • Acknowledgment = Delivery confirmation (consumer confirms receipt)

Benefits of Asynchronous Messaging

Decoupling

Producers and consumers can be developed, deployed, and scaled independently. Neither needs to know about the other's implementation.

Resilience

If a consumer crashes, messages persist in the queue. When it recovers, processing resumes from where it left off. No data loss.

Load Leveling

Queues absorb traffic spikes. Even if 10,000 requests arrive in a second, consumers process them at their own sustainable rate.
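
As a toy illustration of load leveling (an in-memory sketch using Python's queue module, not a real broker), a burst of 100 messages is absorbed instantly by the queue while a single consumer drains them at its own pace, losing nothing:

```python
import queue
import threading

buffer = queue.Queue()          # unbounded in-memory queue stands in for the broker
processed = []

def consumer():
    while True:
        msg = buffer.get()
        if msg is None:         # sentinel: shut down
            break
        processed.append(msg)   # stand-in for real work

worker = threading.Thread(target=consumer)
worker.start()

for i in range(100):            # traffic spike: 100 messages arrive at once
    buffer.put(i)

buffer.put(None)                # signal the consumer to stop
worker.join()
print(len(processed))           # 100 -- nothing lost
```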

Fan-Out

A single event can trigger multiple independent consumers: send an email, update analytics, notify a partner system -- all from one message.

Point-to-Point vs Pub/Sub

Pub/Sub Architecture

[Diagram: Producers A and B publish to topic order-events (partitions P0-P2) on the message broker. Consumer groups Email (sends order confirmations), Analytics (updates dashboards), and Inventory (adjusts stock counts) each read the topic.]

Each consumer group gets every message independently. Within a group, partitions are load-balanced across consumers.
Aspect   | Point-to-Point                 | Pub/Sub
Delivery | One message to one consumer    | One message to many subscribers
Use Case | Task distribution, work queues | Event broadcasting, notifications
Coupling | Producer aware of queue        | Producer unaware of subscribers
Example  | Job processing queue           | Order placed triggers email + analytics + inventory
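
The difference can be sketched with a toy in-memory broker (the classes are illustrative, not a real messaging API): a point-to-point queue hands each message to exactly one consumer, while a pub/sub topic delivers every message to every subscriber:

```python
from collections import defaultdict
from itertools import cycle

class PointToPointQueue:
    """Each message goes to exactly one consumer (round-robin)."""
    def __init__(self, consumers):
        self._consumers = cycle(consumers)
    def send(self, msg):
        next(self._consumers)(msg)

class PubSubTopic:
    """Each message is delivered to every subscriber."""
    def __init__(self):
        self._subscribers = []
    def subscribe(self, fn):
        self._subscribers.append(fn)
    def publish(self, msg):
        for fn in self._subscribers:
            fn(msg)

# Point-to-point: four tasks split between two workers
inbox = defaultdict(list)
q = PointToPointQueue([inbox['a'].append, inbox['b'].append])
for m in ('t1', 't2', 't3', 't4'):
    q.send(m)
print(len(inbox['a']), len(inbox['b']))   # 2 2

# Pub/sub: one event reaches both subscribers
email, analytics = [], []
topic = PubSubTopic()
topic.subscribe(email.append)
topic.subscribe(analytics.append)
topic.publish('order_placed')
print(len(email), len(analytics))         # 1 1
```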

Apache Kafka Deep Dive

Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. Originally developed at LinkedIn, it is now used by over 80% of Fortune 100 companies for real-time data pipelines and streaming applications.

Kafka Partition Architecture

[Diagram: topic user-events split into three partitions, each an ordered sequence of offsets with its leader on a different broker: Partition 0 (offsets 0-4, leader: Broker 1), Partition 1 (offsets 0-2, leader: Broker 2), Partition 2 (offsets 0-3, leader: Broker 3).]

Messages are ordered within a partition. Partitions enable parallel consumption.

Key Kafka Concepts

Kafka Architecture Components

  • Broker: A single Kafka server that stores data and serves clients. A cluster has multiple brokers.
  • Topic: A category or feed name to which records are published. Analogous to a database table.
  • Partition: A topic is split into partitions for parallelism. Each partition is an ordered, immutable sequence of records.
  • Offset: A sequential ID uniquely identifying each record within a partition.
  • Consumer Group: A set of consumers that cooperatively consume a topic. Each partition is assigned to exactly one consumer in the group.
  • Replication Factor: Number of copies of each partition across brokers. A factor of 3 tolerates 2 broker failures.
kafka_example.py
from kafka import KafkaProducer, KafkaConsumer
import json

# ─── Producer: Send order events ───

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8'),
    acks='all',           # Wait for all replicas
    retries=3,             # Retry on transient failures
    linger_ms=10,           # Batch messages for efficiency
)

def publish_order_event(order):
    """Publish an order event, keyed by user_id for partition ordering."""
    future = producer.send(
        topic='order-events',
        key=str(order['user_id']),
        value={
            'event_type': 'order_placed',
            'order_id': order['id'],
            'user_id': order['user_id'],
            'total': order['total'],
            'items': order['items'],
        }
    )
    # Block until sent (or timeout)
    record_metadata = future.get(timeout=10)
    print(f"Sent to partition {record_metadata.partition} offset {record_metadata.offset}")


# ─── Consumer: Process order events ───

consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers=['localhost:9092'],
    group_id='email-notifications',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest',    # Start from beginning if no offset
    enable_auto_commit=False,       # Manual commit for reliability
)

for message in consumer:
    event = message.value
    print(f"Processing: {event['event_type']} for order {event['order_id']}")

    # Process the event (send email, etc.)
    send_order_confirmation_email(event)

    # Commit offset after successful processing
    consumer.commit()

RabbitMQ vs Kafka

Feature           | RabbitMQ                               | Apache Kafka
Model             | Message broker (push)                  | Distributed log (pull)
Message Retention | Deleted after acknowledgment           | Retained for a configurable duration
Ordering          | Per-queue FIFO                         | Per-partition ordering
Throughput        | ~50K messages/sec                      | ~1M+ messages/sec
Replay            | Not supported by classic queues (message consumed once) | Consumers can rewind to any offset
Best For          | Task queues, routing, complex patterns | Event streaming, data pipelines, logs
Protocol          | AMQP, MQTT, STOMP                      | Custom binary protocol over TCP

Event-Driven Architecture Patterns

Event Notification

Services emit events when something happens. Consumers decide what to do. Minimal coupling, but consumers may need to call back for full details.

Event-Carried State Transfer

Events contain the full state needed by consumers. No callbacks required. Higher bandwidth but complete decoupling.
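
A side-by-side sketch of the two event styles (the payload fields are illustrative, not a real schema):

```python
# Event Notification: minimal payload. A consumer that needs more than
# the ID must call back to the producer (e.g. fetch the order by ID).
notification_event = {
    'event_type': 'order_placed',
    'order_id': 982,
}

# Event-Carried State Transfer: the event carries everything downstream
# consumers need, so no callback to the producer is required.
state_transfer_event = {
    'event_type': 'order_placed',
    'order_id': 982,
    'customer': {'id': 17, 'email': 'a@example.com'},
    'items': [{'sku': 'SKU-1', 'qty': 2, 'price': 19.99}],
    'total': 39.98,
}
```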

Event Sourcing

Store all changes as a sequence of events rather than current state. Enables full audit trail, time-travel, and rebuilding state from events.

CQRS (Command Query Responsibility Segregation)

Separate write (command) and read (query) models. Write to an event store, project into read-optimized views. Scales reads and writes independently.
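
A minimal CQRS sketch (the event shapes and handler names are illustrative): commands append events to the write-side log, and a projector folds them into a read-optimized view that queries hit directly:

```python
event_log = []                         # write model: append-only event store

def handle_place_order(order_id, total):
    event_log.append({'type': 'order_placed', 'order_id': order_id, 'total': total})

def handle_cancel_order(order_id):
    event_log.append({'type': 'order_cancelled', 'order_id': order_id})

def project_open_orders(events):
    """Read model: order_id -> total for orders not yet cancelled."""
    view = {}
    for e in events:
        if e['type'] == 'order_placed':
            view[e['order_id']] = e['total']
        elif e['type'] == 'order_cancelled':
            view.pop(e['order_id'], None)
    return view

handle_place_order(1, 50.0)
handle_place_order(2, 75.0)
handle_cancel_order(1)
print(project_open_orders(event_log))   # {2: 75.0}
```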

Common Pitfall: Distributed Transactions

Problem: You need to update a database AND publish an event atomically. If either fails, the system becomes inconsistent.

Solution: Use the Outbox Pattern: write the event to an outbox table in the same database transaction, then a separate process reads the outbox and publishes to the message broker. This guarantees at-least-once delivery without distributed transactions.
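
A minimal sketch of the outbox pattern using SQLite (table and column names are illustrative): the business write and the outbox insert share one transaction, so either both commit or neither does; a separate relay step publishes pending rows. If the relay crashes after publishing but before marking a row, the event is published again on the next run, which is the at-least-once behavior described above:

```python
import sqlite3
import json

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id, total):
    with db:  # one transaction covers both writes
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({'event': 'order_placed', 'order_id': order_id}),))

def relay_once(publish):
    """Separate process: publish pending events, then mark them sent."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))          # e.g. producer.send(...)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(1, 99.0)
sent = []
relay_once(sent.append)
print(sent)   # [{'event': 'order_placed', 'order_id': 1}]
```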

Delivery Guarantees

Message delivery guarantees are one of the hardest problems in distributed systems. Understanding the trade-offs is essential.

Guarantee     | Description                           | Trade-off
At-Most-Once  | Message may be lost, never duplicated | Fire-and-forget. Fastest, but messages can be lost.
At-Least-Once | Message never lost, may be duplicated | Retries ensure delivery. Consumer must be idempotent.
Exactly-Once  | Message delivered exactly once        | Highest overhead. Kafka achieves this with idempotent producers + transactional consumers.

Pro Tip: Idempotency is Key

In practice, most systems use at-least-once delivery with idempotent consumers. This means processing the same message twice produces the same result. Use techniques like:

  • Deduplication table: Store processed message IDs and skip duplicates
  • Idempotency keys: Use a unique request ID to ensure operations are applied only once
  • Database upserts: Use INSERT ON CONFLICT UPDATE instead of plain INSERT

Practice Problems

Order Processing Pipeline (Medium)

Design a message queue system for an e-commerce order pipeline:

  1. Order placed triggers: payment processing, inventory update, email notification
  2. Payment success triggers: shipping label creation
  3. Handle failures at each stage without losing orders

Use a topic per event type (order-placed, payment-completed, etc.). Each downstream service subscribes to the relevant topic. Use dead-letter queues for failed messages.

# Topics:
# 1. order-events (order placed, updated, cancelled)
# 2. payment-events (payment success, failure)
# 3. shipping-events (label created, shipped)
# 4. notification-events (email, SMS triggers)

# Consumer Groups:
# - payment-service subscribes to order-events
# - inventory-service subscribes to order-events
# - email-service subscribes to order-events + payment-events
# - shipping-service subscribes to payment-events

# Dead Letter Queue (DLQ) pattern
# (process_order and TransientError are application-defined):
import time

def process_with_retry(message, max_retries=3):
    for attempt in range(max_retries):
        try:
            process_order(message)
            return True
        except TransientError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff between attempts
    # All retries failed -- send to DLQ
    producer.send('order-events-dlq', value=message)
    return False

Kafka Partition Strategy (Medium)

You have a Kafka topic receiving 100K events/sec for a ride-sharing app:

  1. Choose an appropriate partition key to ensure ride events for the same ride are ordered
  2. Determine the number of partitions needed if each consumer handles 5K events/sec
  3. Handle the hot-partition problem if some cities have 10x more rides

Use ride_id as the partition key for ordering. You need at least 100K/5K = 20 partitions. For hot partitions, consider a compound key or random suffix for high-volume cities.

# 1. Partition key: ride_id (ensures all events for one ride go to same partition)
producer.send('ride-events', key=ride_id, value=event)

# 2. Partitions needed: ceil(100K / 5K) = 20 minimum
#    Add headroom: 30-40 partitions for growth

# 3. Hot-partition mitigation: compound key (deterministic, so all
#    events for one ride still map to the same partition)
import zlib

def partition_key(ride_id, city):
    if city in HIGH_VOLUME_CITIES:
        # Stable hash: Python's built-in hash() is salted per process,
        # which would remap rides to different partitions on restart
        shard = zlib.crc32(ride_id.encode()) % 10
        return f"{city}-{shard}"
    return ride_id  # Normal cities key directly on ride_id

Event Sourcing Design (Hard)

Design an event-sourced banking system:

  1. Account balance is derived from a stream of debit/credit events
  2. Support querying balance at any point in time
  3. Handle concurrent transactions on the same account

Store events in an append-only log with sequence numbers. Use snapshots every N events for fast state recovery. Use optimistic concurrency control (expected version) to handle conflicts.

class ConcurrencyError(Exception):
    pass

class InsufficientFunds(Exception):
    pass

class EventSourcedAccount:
    def __init__(self, account_id):
        self.account_id = account_id
        self.balance = 0
        self.version = 0
        self.events = []

    def apply_event(self, event):
        if event['type'] == 'credited':
            self.balance += event['amount']
        elif event['type'] == 'debited':
            self.balance -= event['amount']
        self.version += 1

    def _record(self, event):
        self.apply_event(event)
        self.events.append(event)

    def credit(self, amount):
        self._record({'type': 'credited', 'amount': amount})

    def debit(self, amount, expected_version):
        # Optimistic concurrency control
        if self.version != expected_version:
            raise ConcurrencyError("Account modified")
        if self.balance < amount:
            raise InsufficientFunds()
        self._record({'type': 'debited', 'amount': amount})

    def balance_at(self, target_version):
        """Replay events up to a given version (point in time)."""
        balance = 0
        for event in self.events[:target_version]:
            if event['type'] == 'credited':
                balance += event['amount']
            elif event['type'] == 'debited':
                balance -= event['amount']
        return balance

Quick Reference

Message Queue Decision Guide

Requirement         | Recommended Tool    | Reason
Task distribution   | RabbitMQ / SQS      | Push-based, per-message acknowledgment
Event streaming     | Apache Kafka        | Log-based, replay, high throughput
Serverless fan-out  | AWS SNS + SQS       | Managed, no infrastructure to run
Real-time analytics | Kafka + Flink/Spark | Stream processing on top of the event log
Simple pub/sub      | Redis Pub/Sub       | Lightweight, in-memory, no persistence

Key Takeaways

  • Message queues decouple services and enable asynchronous processing
  • Kafka is a distributed log; RabbitMQ is a traditional message broker
  • Choose partition keys carefully to maintain ordering and avoid hot spots
  • Design consumers to be idempotent for at-least-once delivery
  • Use the outbox pattern for atomic database + event operations
  • Dead-letter queues catch messages that fail processing repeatedly