Production Operations

Cluster management, upgrades, capacity planning, and incident response.


Rolling Upgrades

Kafka supports zero-downtime upgrades by rolling through brokers one at a time. The key: never upgrade more than one broker at a time, and wait for ISR (In-Sync Replicas) to fully recover before proceeding.

1. Stop one broker

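Gracefully shut down the broker. During a controlled shutdown, leadership for its partitions moves to in-sync replicas on other brokers; the replica assignments themselves do not change.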

2. Upgrade the broker

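Replace binaries, update configs. Don't change inter.broker.protocol.version yet: keep it pinned to the version the cluster is currently running so upgraded and not-yet-upgraded brokers can still communicate. (This setting applies to ZooKeeper-based clusters; KRaft clusters use metadata.version instead.)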

3. Start the broker

Wait for it to rejoin the cluster and catch up on all partition replicas.

4. Verify ISR

Confirm the UnderReplicatedPartitions metric is 0 on every broker before proceeding to the next one.

5. Repeat for all brokers

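After all brokers are running the new version, update inter.broker.protocol.version and perform a second rolling restart, again one broker at a time, so the change takes effect.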
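
The steps above can be sketched as a small orchestration loop. This is a minimal sketch under stated assumptions, not a production tool: stop, upgrade, start, and count_under_replicated are hypothetical callables you would supply yourself (for example, wrappers around your service manager and a check of the UnderReplicatedPartitions metric), and the timeout values are placeholders.

```python
import time
from typing import Callable, Iterable, List


def wait_for_full_isr(
    count_under_replicated: Callable[[], int],
    timeout_s: float = 1800.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until the cluster reports zero under-replicated partitions."""
    waited = 0.0
    while count_under_replicated() != 0:
        if waited >= timeout_s:
            raise TimeoutError("ISR did not recover; stopping the rollout")
        sleep(poll_s)
        waited += poll_s


def rolling_upgrade(
    broker_ids: Iterable[int],
    stop: Callable[[int], None],
    upgrade: Callable[[int], None],
    start: Callable[[int], None],
    count_under_replicated: Callable[[], int],
    **wait_kwargs,
) -> List[int]:
    """Upgrade brokers strictly one at a time, gating each step on ISR health."""
    done: List[int] = []
    for broker_id in broker_ids:
        # Refuse to take a broker down while the cluster is already degraded.
        wait_for_full_isr(count_under_replicated, **wait_kwargs)
        stop(broker_id)      # step 1: graceful shutdown
        upgrade(broker_id)   # step 2: new binaries, protocol version unchanged
        start(broker_id)     # step 3: rejoin the cluster
        # Step 4: do not move on until every replica has caught up.
        wait_for_full_isr(count_under_replicated, **wait_kwargs)
        done.append(broker_id)
    return done
```

Raising on timeout rather than continuing is deliberate: if the ISR does not recover, the safest default is to halt the rollout and investigate with only one broker affected.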

Capacity Planning

Resource | Rule of Thumb | Monitoring
Disk | Daily ingest × retention days × replication factor × 1.2 | Alert at 70% usage
Network | Peak ingest + replication traffic + consumer reads | Alert at 70% NIC capacity
Memory | Allocate OS page cache, not just JVM heap | Watch page cache hit ratio
CPU | Usually not the bottleneck unless heavy compression | Watch RequestHandlerAvgIdlePercent
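
To make the disk and network rules concrete, here is a minimal sketch of the arithmetic. The function names and example numbers are illustrative assumptions, not figures from a real cluster, and the network estimate assumes every consumer group reads the full stream once.

```python
def disk_needed_gb(daily_ingest_gb: float, retention_days: float,
                   replication_factor: int, headroom: float = 1.2) -> float:
    """Cluster-wide disk: daily ingest x retention days x replication x 1.2."""
    return daily_ingest_gb * retention_days * replication_factor * headroom


def network_needed_mb_s(peak_ingest_mb_s: float, replication_factor: int,
                        consumer_groups: int) -> float:
    """Cluster-wide traffic: producer writes + replication + consumer reads.

    Assumption: each consumer group reads the whole ingest stream once.
    """
    replication = peak_ingest_mb_s * (replication_factor - 1)
    consumer_reads = peak_ingest_mb_s * consumer_groups
    return peak_ingest_mb_s + replication + consumer_reads


def should_alert(used: float, capacity: float, threshold: float = 0.70) -> bool:
    """True once usage reaches the 70% alert line from the table."""
    return used / capacity >= threshold


if __name__ == "__main__":
    total = disk_needed_gb(daily_ingest_gb=500, retention_days=7,
                           replication_factor=3)
    print(f"disk: {total:.0f} GB cluster-wide")
    print(f"network: {network_needed_mb_s(100, 3, 2):.0f} MB/s cluster-wide")
```

Both results are cluster-wide totals; divide by the broker count, and allow for uneven partition placement, to get a per-broker figure.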

Incident Runbook

Broker Down

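Check the broker logs for the cause, then restart the broker. If the ISR shrinks, prioritize bringing the broker back. Don't rebalance partitions during recovery: reassignment adds replication traffic while the cluster is already degraded.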

Consumer Lag Growing

Check per-message processing time, check for slow external dependencies, and add consumers up to the partition count. Consumers beyond the partition count in the same group sit idle.
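
The checks above reduce to a little arithmetic. This is a minimal sketch: the offsets are passed in as plain dictionaries (how you fetch them is up to your client or tooling), and all function names are hypothetical.

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int],
                  committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset - last committed offset."""
    # Simplification: a partition with no committed offset is counted from 0,
    # which ignores the log start offset.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


def useful_consumers_to_add(partition_count: int, current_consumers: int) -> int:
    """Consumers beyond the partition count sit idle, so cap the scale-out."""
    return max(0, partition_count - current_consumers)


def lag_is_growing(samples: List[int]) -> bool:
    """True if total lag rose between every pair of consecutive samples."""
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
```

If useful_consumers_to_add returns 0, adding instances will not help; the remaining levers are faster processing or more partitions.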

Disk Full

Reduce retention on the largest topics, delete topics that are no longer needed, or add disk. NEVER delete log segments manually; let the broker remove them through retention.
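
Reducing retention is done through topic configuration rather than by touching files. The sketch below builds a kafka-configs.sh invocation that sets retention.ms on one topic. The helper names, the topic, and the bootstrap address are hypothetical, and the script is assumed to be on PATH; it defaults to a dry run that only prints the command.

```python
import subprocess
from typing import List


def retention_command(bootstrap: str, topic: str, retention_hours: int,
                      script: str = "kafka-configs.sh") -> List[str]:
    """Build the argv that sets retention.ms for a single topic."""
    if retention_hours <= 0:
        raise ValueError("retention must be positive")
    retention_ms = retention_hours * 60 * 60 * 1000
    return [
        script, "--bootstrap-server", bootstrap,
        "--alter", "--entity-type", "topics", "--entity-name", topic,
        "--add-config", f"retention.ms={retention_ms}",
    ]


def apply_retention(bootstrap: str, topic: str, retention_hours: int,
                    dry_run: bool = True) -> List[str]:
    """Print the command, and run it only when dry_run is False."""
    cmd = retention_command(bootstrap, topic, retention_hours)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

Space is not freed instantly: the broker removes eligible segments on its own retention check, so expect a delay between the config change and the disk usage dropping.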

UnderReplicated Partitions

Check the lagging broker's health, network, and disk I/O. If one broker is persistently overloaded, you may need to reassign some of its partitions.
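
To find which broker is lagging, compare each partition's replica list with its ISR. This is a minimal sketch that assumes you have already collected that information into dictionaries with "replicas" and "isr" keys (a hypothetical shape; adapt it to whatever your tooling returns).

```python
from collections import Counter
from typing import Dict, Iterable, List


def lagging_brokers(partitions: Iterable[Dict[str, List[int]]]) -> Counter:
    """Count, per broker, the partitions it replicates but is not in sync for."""
    missing: Counter = Counter()
    for partition in partitions:
        # Any assigned replica that is absent from the ISR has fallen behind.
        for broker in set(partition["replicas"]) - set(partition["isr"]):
            missing[broker] += 1
    return missing
```

If a single broker dominates the counts, start with that host's disk and network. If the missing replicas are spread across many brokers, a cluster-wide cause such as network saturation is the more likely suspect.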

Key Takeaway: Practice your upgrade and failover procedures in staging before production. Document runbooks for common incidents. Monitor disk usage and consumer lag as your primary health indicators.

Practice Exercises

Production Scenario (Hard)

Design a solution using these concepts for a real-world production system.

Performance Analysis (Hard)

Benchmark two different approaches and explain which is better and why.