Critical Kafka Metrics
Monitoring Kafka is essential: a cluster that looks healthy can silently fall behind, lose replicas, or exhaust disk. These are the metrics that should trigger alerts:
| Metric | Where | Alert When | Meaning |
|---|---|---|---|
| UnderReplicatedPartitions | Broker | > 0 | Followers have fallen out of sync with the leader; data is at risk if the leader fails |
| ActiveControllerCount | Cluster (sum across brokers) | != 1 | 0 means no controller, so no leader election is possible; more than 1 means conflicting controllers |
| OfflinePartitionsCount | Controller | > 0 | Partitions with no leader; data unavailable |
| consumer_lag | Consumer | Growing steadily | Consumer can't keep up with producer rate |
| RequestHandlerAvgIdlePercent | Broker | < 0.3 | Request handler threads are idle less than 30% of the time; broker overloaded, add capacity |
| LogFlushRateAndTimeMs | Broker | 99th percentile > 1000 ms | Disk I/O bottleneck |
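The numeric thresholds in the table can be expressed as a small shell check. This is a minimal sketch, not a replacement for real alert rules: the function name `check_metric` is hypothetical, the metric names are taken from the table, and the values are assumed to have been collected already (for example from JMX or Prometheus).

```shell
# Evaluate one metric sample against the alert thresholds from the table.
# Prints ALERT, OK, or UNKNOWN. check_metric is a hypothetical helper name.
check_metric() {
  awk -v name="$1" -v v="$2" 'BEGIN {
    if (name == "UnderReplicatedPartitions")         alert = (v > 0)
    else if (name == "ActiveControllerCount")        alert = (v != 1)   # cluster-wide sum
    else if (name == "OfflinePartitionsCount")       alert = (v > 0)
    else if (name == "RequestHandlerAvgIdlePercent") alert = (v < 0.3)
    else { print "UNKNOWN"; exit }
    print (alert ? "ALERT" : "OK")
  }'
}
```

For example, `check_metric RequestHandlerAvgIdlePercent 0.25` reports an alert because the handlers are idle less than 30% of the time. The consumer lag and flush-time rows are left out because they depend on a trend or a percentile rather than a single value.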
Prometheus + Grafana Setup
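A common monitoring stack for Kafka is JMX Exporter + Prometheus + Grafana. Brokers publish their metrics over JMX; the exporter runs as a Java agent inside each broker and serves those metrics over HTTP in Prometheus format. Here's how it works: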
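# 1. Add JMX Exporter to broker startup (appending keeps any existing KAFKA_OPTS)
# Set this only for the broker process: CLI tools started from the same shell
# would also load the agent and fail to bind port 7071
export KAFKA_OPTS="$KAFKA_OPTS -javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker.yml"
# 2. Prometheus scrapes metrics from each broker on port 7071
# prometheus.yml:
# scrape_configs:
#   - job_name: 'kafka'
#     static_configs:
#       - targets: ['broker1:7071', 'broker2:7071', 'broker3:7071']
# 3. Grafana dashboards visualize the metrics
# Import dashboard ID 7589 as a starting point, then check that its queries
# match the metric names your exporter actually emits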
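Before pointing Prometheus at a broker, it can help to confirm that the exporter is serving the metrics you plan to alert on. The sketch below reads Prometheus text-format output from stdin and prints the value of one metric. The helper name `metric_value` is hypothetical, and the metric name in the usage line is an assumption: the names the exporter emits depend on the rules in your `kafka-broker.yml`. It also assumes samples carry no trailing timestamp.

```shell
# Print the value of the first sample whose metric name equals $1.
# Input: Prometheus text exposition format on stdin.
metric_value() {
  awk -v m="$1" '
    /^#/ { next }                  # skip HELP and TYPE comment lines
    {
      name = $1
      sub(/[{].*/, "", name)       # drop the label set, if present
      if (name == m) { print $NF; exit }
    }'
}

# Usage against a live broker (metric name is an assumption, see above):
#   curl -s http://broker1:7071/metrics | metric_value kafka_server_replicamanager_underreplicatedpartitions
```

If the command prints nothing, the metric is either not exported under that name or is filtered out by the exporter rules.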
Consumer Lag Monitoring
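Consumer lag is arguably the most important metric for Kafka applications. For each partition, it is the difference between the log end offset and the consumer group's last committed offset, so it tells you how many messages behind your consumers are: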
# Check lag via CLI
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-app
# Monitor continuously with Burrow (LinkedIn's consumer lag monitoring tool)
# Or use kafka-lag-exporter for Prometheus integration
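The CLI reports lag per partition, so a single number per group takes a little post-processing. The sketch below sums the `LAG` column of the `--describe` output. It locates the column by its header name rather than by position because the column layout has varied between Kafka versions. The helper name `total_lag` is hypothetical and the sample rows are made-up illustration data, not real output.

```shell
# Sum the LAG column of `kafka-consumer-groups.sh --describe` output (stdin).
total_lag() {
  awk '
    col == 0 {                          # still looking for the header row
      for (i = 1; i <= NF; i++) if ($i == "LAG") col = i
      next
    }
    $col ~ /^[0-9]+$/ { sum += $col }   # skip "-" (no committed offset yet)
    END { print sum + 0 }'
}

# Illustrative input; in practice pipe the real CLI output into total_lag.
total_lag <<'EOF'
GROUP   TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
my-app  orders  0          1500            1520            20   consumer-1   /10.0.0.5  client-1
my-app  orders  1          900             1000            100  consumer-1   /10.0.0.5  client-1
my-app  orders  2          -               50              -    -            -          -
EOF
# prints 120
```

Against a live cluster this would be `kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-app | total_lag`.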
Start with three alerts: UnderReplicatedPartitions > 0 (data safety), OfflinePartitionsCount > 0 (availability), and steadily growing consumer_lag (processing bottleneck). As a rule of thumb, these three catch the large majority of Kafka issues.
⚠️ Common Mistake: Only monitoring broker health
Brokers can be 100% healthy while your consumers are falling behind. Always monitor consumer lag separately. A growing lag means your application is losing real-time capability.
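What counts as "growing" lag is a judgment call. One simple interpretation, sketched below, is that every sample in a window is strictly larger than the one before it. The helper name `lag_is_growing` is hypothetical, and the window length and sampling interval are left to the caller; real alerting would usually also tolerate brief dips.

```shell
# Succeed (exit 0) if each lag sample is strictly greater than the previous one.
# Arguments: total lag values for one group, oldest first.
lag_is_growing() {
  [ "$#" -ge 2 ] || return 1       # a trend needs at least two samples
  prev="$1"; shift
  for cur in "$@"; do
    [ "$cur" -gt "$prev" ] || return 1
    prev="$cur"
  done
  return 0
}

# Example: four samples taken one minute apart
if lag_is_growing 100 250 400 900; then
  echo "consumer lag is growing steadily"
fi
```

A single spike followed by recovery, such as `100 900 200 150`, does not trigger this check, which matches the "growing steadily" condition rather than a fixed threshold.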
Practice Exercises
Build a Mini Project (Medium)
Combine concepts from this tutorial to build a small utility or tool.
Debug Challenge (Medium)
Introduce a bug in one of the code examples and practice finding and fixing it.
Refactoring Exercise (Hard)
Rewrite one example using a different approach and compare the tradeoffs.