Production Operations

Rolling Upgrades

Kafka supports zero-downtime upgrades by rolling through brokers one at a time. The key rule: never take down more than one broker at a time, and wait for the ISR (in-sync replica) set of every partition to fully recover before moving on to the next broker.

1. Stop one broker

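Gracefully shut down the broker. During a controlled shutdown, leadership for its partitions moves to in-sync replicas on other brokers; the replica assignment itself does not change.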

2. Upgrade the broker

Replace binaries, update configs. Don't change inter.broker.protocol.version yet.

3. Start the broker

Wait for it to rejoin the cluster and catch up on all partition replicas.

4. Verify ISR

Check that UnderReplicatedPartitions is 0 on every broker before proceeding to the next broker.

5. Repeat for all brokers

After all brokers are upgraded, update inter.broker.protocol.version to the new version, then do a second rolling restart so the change takes effect.
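
The five steps above can be sketched as a small orchestration loop. This is a minimal sketch, not a definitive tool: the callables stop, upgrade, start, and under_replicated_count are hypothetical hooks that you would wire to your own service manager and metrics system, and the polling interval and timeout are assumed defaults.

```python
import time
from typing import Callable, Iterable


def rolling_upgrade(
    broker_ids: Iterable[int],
    stop: Callable[[int], None],
    upgrade: Callable[[int], None],
    start: Callable[[int], None],
    under_replicated_count: Callable[[], int],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
) -> None:
    """Upgrade brokers one at a time, waiting for ISR recovery in between.

    All four callables are hypothetical hooks supplied by the caller.
    """
    for broker_id in broker_ids:
        stop(broker_id)     # step 1: graceful shutdown
        upgrade(broker_id)  # step 2: replace binaries, update configs
        start(broker_id)    # step 3: restart and rejoin the cluster

        # Step 4: block until no partition is under-replicated.
        deadline = time.monotonic() + timeout_seconds
        while under_replicated_count() > 0:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"ISR did not recover after upgrading broker {broker_id}"
                )
            time.sleep(poll_seconds)
        # Step 5: the loop moves on to the next broker.
```

Raising on timeout instead of continuing is deliberate: if the ISR has not recovered, taking down a second broker is exactly what the procedure is meant to prevent.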

Capacity Planning

Resource	Rule of Thumb	Monitoring
Disk	Daily ingest x retention days x replication factor x 1.2	Alert at 70% usage
Network	Peak ingest + replication traffic + consumer reads	Alert at 70% NIC capacity
Memory	Leave RAM for the OS page cache, not just JVM heap	Watch page cache hit ratio
CPU	Usually not the bottleneck unless compression is heavy	Watch RequestHandlerAvgIdlePercent
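
The disk rule of thumb from the table can be worked through in code. This is a minimal sketch: the function names and the example figures (500 GB/day, 7 days, replication factor 3) are illustrative assumptions, while the 1.2 headroom factor and the 70% alert threshold come from the table.

```python
def required_disk_gb(
    daily_ingest_gb: float,
    retention_days: float,
    replication_factor: int,
    headroom: float = 1.2,
) -> float:
    """Cluster-wide disk needed: ingest x retention x replication x headroom."""
    return daily_ingest_gb * retention_days * replication_factor * headroom


def should_alert(used: float, capacity: float, threshold: float = 0.70) -> bool:
    """True once usage reaches the alert threshold (70% by default)."""
    return used / capacity >= threshold


if __name__ == "__main__":
    # Illustrative numbers only: 500 GB/day, 7-day retention, RF 3.
    total = required_disk_gb(500, 7, 3)
    print(f"Provision about {total:.0f} GB across the cluster")
    print("Alert now?", should_alert(used=9_500, capacity=total))
```

With these example inputs the requirement comes to roughly 12,600 GB across the cluster. The same should_alert check applies to the network row if you pass throughput and NIC capacity instead of disk figures.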

Incident Runbook

Broker Down

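Check the broker logs for the cause, then restart the broker. If ISRs have shrunk, prioritize getting the broker back in sync. Don't rebalance partitions during recovery, because reassignment traffic competes with replica catch-up.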

Consumer Lag Growing

Check processing time, add consumers (up to partition count), check for slow external dependencies.
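
To make "lag" and the partition-count ceiling concrete, here is a minimal sketch. It assumes you have already fetched the log end offsets and committed offsets per partition by some other means; the function names are hypothetical, and treating a missing commit as offset 0 is a simplification.

```python
from typing import Dict


def partition_lag(
    end_offsets: Dict[int, int], committed: Dict[int, int]
) -> Dict[int, int]:
    """Lag per partition = log end offset - committed offset (never negative).

    Simplification: a partition with no committed offset counts from 0.
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def useful_consumer_count(requested: int, partition_count: int) -> int:
    """Consumers beyond the partition count sit idle, so cap at that count."""
    return min(requested, partition_count)
```

Comparing the per-partition lag over a few samples shows whether lag is growing everywhere (likely slow processing or a slow dependency) or only on a few partitions (likely a skewed key or one stuck consumer).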

Disk Full

Reduce retention, delete old topics, add disk. NEVER delete log segments manually.
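
Reducing retention is usually done with a topic-level retention.ms override rather than by touching files on disk. The sketch below only builds the kafka-configs.sh command line. It assumes the script is on your PATH, and the function name, topic name, and bootstrap address in the example are hypothetical.

```python
from typing import List


def retention_override_command(
    bootstrap_server: str, topic: str, retention_hours: int
) -> List[str]:
    """Build a kafka-configs.sh command that sets a topic's retention.ms."""
    if retention_hours <= 0:
        raise ValueError("retention_hours must be positive")
    retention_ms = retention_hours * 60 * 60 * 1000
    return [
        "kafka-configs.sh",  # assumed to be on PATH
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--entity-type", "topics",
        "--entity-name", topic,
        "--add-config", f"retention.ms={retention_ms}",
    ]


if __name__ == "__main__":
    # Hypothetical topic and address, shown for illustration only.
    print(" ".join(retention_override_command("localhost:9092", "clickstream", 24)))
```

You could pass the resulting list to subprocess.run(cmd, check=True). The broker then removes expired segments itself, which keeps its indexes and checkpoints consistent in a way that manual deletion does not.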

Under-Replicated Partitions

Check the lagging broker's health, network, and disk I/O. If the broker cannot catch up, you may need to reassign its partitions.
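
One way to find the lagging broker is to count how often each broker is assigned as a replica but missing from the ISR. This is a minimal sketch: it assumes you have already collected (replicas, isr) pairs per partition, for example from a topic describe, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def brokers_missing_from_isr(
    partitions: Iterable[Tuple[List[int], List[int]]]
) -> Counter:
    """Count, per broker, the partitions where it is a replica but not in ISR."""
    missing: Counter = Counter()
    for replicas, isr in partitions:
        for broker in set(replicas) - set(isr):
            missing[broker] += 1
    return missing


if __name__ == "__main__":
    # Illustrative data: broker 3 has fallen out of the ISR for two partitions.
    sample = [([1, 2, 3], [1, 2]), ([2, 3, 1], [2, 1]), ([1, 2, 3], [1, 2, 3])]
    print(brokers_missing_from_isr(sample).most_common(1))
```

If a single broker dominates the count, start with that machine's disk and network. If the misses are spread evenly, the cause is more likely cluster-wide, such as a saturated network.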

Key Takeaway: Practice your upgrade and failover procedures in staging before production. Document runbooks for common incidents. Monitor disk usage and consumer lag as your primary health indicators.
