Rolling Upgrades
Kafka supports zero-downtime upgrades by rolling through brokers one at a time. The rule: take down only one broker at a time, and wait for every partition's in-sync replica (ISR) set to recover fully before moving to the next broker.
1. Stop one broker
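Gracefully shut down the broker. During a controlled shutdown, leadership for the partitions it leads moves to other in-sync replicas; the replica assignment itself does not change.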
2. Upgrade the broker
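Replace the binaries and update configs. Don't change inter.broker.protocol.version yet: upgraded and not-yet-upgraded brokers must keep speaking the old protocol to each other until the whole cluster is on the new binaries.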
3. Start the broker
Wait for it to rejoin the cluster and catch up on all partition replicas.
4. Verify ISR
Check UnderReplicatedPartitions = 0 before proceeding to the next broker.
5. Repeat for all brokers
After all brokers are upgraded, update inter.broker.protocol.version to the new version and apply it with a second rolling restart, again one broker at a time.
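The steps above can be sketched as a loop that gates each broker on ISR recovery. This is a minimal sketch, not a definitive tool: `stop`, `upgrade`, `start`, and `under_replicated` are hypothetical callables you would supply yourself (for example, wrappers around your service manager and a reader for the UnderReplicatedPartitions metric), and the polling interval and timeout are assumed values.

```python
import time
from typing import Callable, Iterable


def wait_for_isr(
    under_replicated: Callable[[], int],
    poll_seconds: float,
    timeout_seconds: float,
    sleep: Callable[[float], None],
) -> None:
    """Block until the under-replicated partition count is 0, or time out."""
    waited = 0.0
    while under_replicated() != 0:
        if waited >= timeout_seconds:
            raise TimeoutError("ISR did not recover; halting the rollout")
        sleep(poll_seconds)
        waited += poll_seconds


def rolling_upgrade(
    brokers: Iterable[int],
    stop: Callable[[int], None],        # hypothetical: graceful shutdown
    upgrade: Callable[[int], None],     # hypothetical: swap binaries/configs
    start: Callable[[int], None],       # hypothetical: start the broker
    under_replicated: Callable[[], int],  # hypothetical: metric reader
    poll_seconds: float = 10.0,         # assumed default
    timeout_seconds: float = 1800.0,    # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Upgrade brokers strictly one at a time."""
    for broker_id in brokers:
        # Do not touch a broker while the cluster is already degraded.
        wait_for_isr(under_replicated, poll_seconds, timeout_seconds, sleep)
        stop(broker_id)
        upgrade(broker_id)
        start(broker_id)
        # Step 4: verify ISR before moving to the next broker.
        wait_for_isr(under_replicated, poll_seconds, timeout_seconds, sleep)
```

Raising on timeout rather than continuing is a deliberate choice here: if replicas do not catch up, the rollout stops with at most one broker affected.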
Capacity Planning
| Resource | Rule of Thumb | Monitoring |
|---|---|---|
| Disk | Daily ingest x retention days x replication factor x 1.2 | Alert at 70% usage |
| Network | Peak ingest + replication traffic + consumer reads | Alert at 70% NIC capacity |
| Memory | Leave RAM for the OS page cache, not just JVM heap | Watch page cache hit ratio |
| CPU | Usually not the bottleneck unless heavy compression | Watch RequestHandlerAvgIdlePercent |
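The disk rule of thumb in the table is plain arithmetic, so a short worked sketch may help. The function names and the sample figures (500 GB/day, 7 days, replication factor 3) are illustrative assumptions, not values from any real cluster; the 1.2 headroom factor and the 70% alert threshold come from the table.

```python
def required_disk_gb(
    daily_ingest_gb: float,
    retention_days: float,
    replication_factor: int,
    headroom: float = 1.2,  # the 1.2 safety factor from the table
) -> float:
    """Cluster-wide disk needed: ingest x retention x replication x headroom."""
    return daily_ingest_gb * retention_days * replication_factor * headroom


def disk_alert(used_gb: float, capacity_gb: float, threshold: float = 0.70) -> bool:
    """True once usage reaches the alert threshold (70% in the table)."""
    return used_gb / capacity_gb >= threshold


if __name__ == "__main__":
    # Hypothetical workload: 500 GB/day, 7-day retention, replication factor 3.
    print(round(required_disk_gb(500, 7, 3)), "GB across the cluster")
```

With those assumed inputs, the estimate works out to about 12,600 GB across the cluster (500 × 7 × 3 × 1.2), which is then divided among the brokers.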
Incident Runbook
Broker Down
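Check the broker logs for the cause, then restart the broker. If ISRs have shrunk, prioritize getting the broker back and caught up. Don't reassign partitions during recovery, because that adds replication traffic on top of the catch-up load.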
Consumer Lag Growing
Check per-message processing time and look for slow external dependencies. Add consumers to the group if needed, but only up to the partition count: any consumers beyond that sit idle.
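As a rough illustration of the lag check and the partition-count cap, the sketch below computes per-partition lag from two offset maps. It assumes you have already fetched the log end offsets and the group's committed offsets by some other means; the function names are hypothetical, and treating a missing committed offset as 0 is a simplification.

```python
from typing import Dict


def partition_lag(
    end_offsets: Dict[int, int],
    committed: Dict[int, int],
) -> Dict[int, int]:
    """Lag per partition: log end offset minus the group's committed offset."""
    return {
        # Missing committed offsets are counted from 0 (a simplification).
        p: max(end - committed.get(p, 0), 0)
        for p, end in end_offsets.items()
    }


def useful_consumers(requested: int, partition_count: int) -> int:
    """Consumers in one group beyond the partition count would sit idle."""
    return min(requested, partition_count)
```

Sampling `partition_lag` at intervals and comparing the totals shows whether lag is actually growing or just fluctuating before you decide to scale out.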
Disk Full
Reduce retention, delete old topics, add disk. NEVER delete log segments manually.
UnderReplicated Partitions
Check lagging broker health, network, disk I/O. May need to reassign partitions.
Practice Exercises
Hard Production Scenario
Design a solution using these concepts for a real-world production system.
Hard Performance Analysis
Benchmark two different approaches and explain which is better and why.