Monitoring & Metrics

Key metrics, JMX, Prometheus, Grafana dashboards, and alerting.

Level: Intermediate | 30 min read | Topic: Kafka

Critical Kafka Metrics

Monitoring Kafka is essential, not optional. A cluster that looks healthy can silently fall behind, lose replicas, or run out of disk. These are the metrics that should trigger alerts:

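| Metric | Where | Alert When | Meaning |
|---|---|---|---|
| UnderReplicatedPartitions | Broker | > 0 | Replicas are falling behind the leader; data is at risk |
| ActiveControllerCount | Cluster (sum across brokers) | != 1 | 0 means no controller, so no leader election is possible; more than 1 means brokers disagree about who the controller is |
| OfflinePartitionsCount | Controller | > 0 | Partitions with no leader; their data is unavailable |
| consumer_lag | Consumer | Growing steadily | Consumers cannot keep up with the producer rate |
| RequestHandlerAvgIdlePercent | Broker | < 0.3 | Broker is overloaded; add capacity |
| LogFlushRateAndTimeMs | Broker | 99th percentile > 1000 ms | Disk I/O bottleneck |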
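
The thresholds in the table can be sketched as a small shell check. This is a minimal illustration, not a production alerting setup: the function name `check_broker_metrics` and the `MetricName value` input format are assumptions made for the example (Kafka does not emit this format itself), and ActiveControllerCount is assumed to be already summed across brokers. The percentile and lag-trend rows are left out because they need more than a single sample.

```shell
#!/usr/bin/env bash
# Sketch: evaluate four of the alert thresholds from the table.
# Input (stdin): one "MetricName value" pair per line. This input format is
# hypothetical; collecting the values is left to your own tooling.
check_broker_metrics() {
  awk '
    # Any under-replicated partition puts data at risk.
    $1 == "UnderReplicatedPartitions"    && $2 > 0   { print "ALERT UnderReplicatedPartitions=" $2 }
    # Cluster-wide sum must be exactly 1 (assumed pre-summed by the caller).
    $1 == "ActiveControllerCount"        && $2 != 1  { print "ALERT ActiveControllerCount=" $2 }
    # Any offline partition means unavailable data.
    $1 == "OfflinePartitionsCount"       && $2 > 0   { print "ALERT OfflinePartitionsCount=" $2 }
    # Request handlers idle less than 30% of the time.
    $1 == "RequestHandlerAvgIdlePercent" && $2 < 0.3 { print "ALERT RequestHandlerAvgIdlePercent=" $2 }
  '
}

# Example:
#   printf "UnderReplicatedPartitions 2\nActiveControllerCount 1\n" | check_broker_metrics
```

The function prints one line per violated threshold and nothing when all checked metrics are within bounds, so an empty result can be treated as healthy.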

Prometheus + Grafana Setup

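A common monitoring stack for Kafka is JMX Exporter + Prometheus + Grafana. The JMX Exporter runs as a Java agent inside each broker and exposes the broker's JMX metrics over HTTP, Prometheus scrapes those endpoints, and Grafana charts the results: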

# 1. Add JMX Exporter to broker startup
export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker.yml"

# 2. Prometheus scrapes metrics from each broker on port 7071
# prometheus.yml:
# scrape_configs:
#   - job_name: 'kafka'
#     static_configs:
#       - targets: ['broker1:7071', 'broker2:7071', 'broker3:7071']

# 3. Grafana dashboards visualize the metrics
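# Import dashboard ID 7589 as a starting point, then check that its queries
# match the metric names your exporter rules produce. A dashboard built for a
# different exporter will show empty panels until the queries are adjusted.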
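
Before wiring up Prometheus, it can help to confirm that a broker's exporter endpoint actually serves the metric you plan to alert on. The sketch below reads Prometheus text-format output and prints the value of one metric. The helper name `metric_value` is made up for this example, the metric name in the usage line is an assumption (the real name depends on the rules in your `kafka-broker.yml`), and the parsing assumes samples carry no trailing timestamp.

```shell
#!/usr/bin/env bash
# Sketch: print the value of every sample of one metric from Prometheus
# text-format output read on stdin. Assumes no timestamp after the value.
metric_value() {
  awk -v name="$1" '
    /^#/ || NF == 0 { next }       # skip HELP/TYPE comments and blank lines
    {
      base = $0
      sub(/[{ ].*/, "", base)      # keep the text before the first "{" or space
      if (base == name) print $NF  # the value is the last field on the line
    }
  '
}

# Example (metric name is an assumption; check your exporter rules):
#   curl -s http://broker1:7071/metrics \
#     | metric_value kafka_server_replicamanager_underreplicatedpartitions
```

If the command prints nothing, either the agent is not loaded or the exporter rules do not produce a metric with that name.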

Consumer Lag Monitoring

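Consumer lag is the single most important metric for Kafka applications. For each partition it is the difference between the log end offset (the latest message written) and the consumer group's committed offset, so it tells you how many messages behind your consumers are: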

# Check lag via CLI
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-app

# Monitor continuously with Burrow (LinkedIn's consumer lag monitoring tool)
# Or use kafka-lag-exporter for Prometheus integration
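
For a quick check without extra tooling, the per-partition output of the CLI can be reduced to a single number. This is a rough sketch: the helper name `total_lag` is made up for the example, and it assumes the `--describe` output has a header row containing a `LAG` column, which it locates by name because the column position differs between Kafka versions.

```shell
#!/usr/bin/env bash
# Sketch: sum the LAG column of `kafka-consumer-groups.sh --describe` output.
# The LAG column is located from the header row rather than hard-coded.
total_lag() {
  awk '
    lag_col == 0 {                              # still looking for the header row
      for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
      next
    }
    $lag_col ~ /^[0-9]+$/ { total += $lag_col } # skip "-" (no committed offset yet)
    END { print total + 0 }                     # print 0 when nothing matched
  '
}

# Example:
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
#     --describe --group my-app | total_lag
```

A single total hides which partition is behind, so use it as a coarse signal and fall back to the full table when it rises.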

Key Takeaway: Alert on UnderReplicatedPartitions > 0 (data safety), OfflinePartitionsCount > 0 (availability), and steadily growing consumer_lag (processing bottleneck). Together these three alerts cover the most common Kafka failure modes.

⚠️ Common Mistake: Only monitoring broker health

Brokers can be 100% healthy while your consumers are falling behind. Always monitor consumer lag separately. A growing lag means your application is losing real-time capability.
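
What counts as "growing" needs a concrete rule before it can drive an alert. One possible heuristic, shown here only as a sketch, is to look at a window of lag samples and flag it when no sample is lower than the one before it and the last is higher than the first. The helper name `lag_trend` and the one-sample-per-line input format are assumptions for the example; real setups often use a rate or slope over a time window instead.

```shell
#!/usr/bin/env bash
# Sketch: classify a series of lag samples (one per line, oldest first).
# Prints "growing" if the series never decreases and ends above its start,
# otherwise "stable". This rule is an illustrative heuristic only.
lag_trend() {
  awk '
    NR == 1 { first = $1 + 0; prev = first; next }
    ($1 + 0) < prev { dips++ }        # count any decrease between samples
    { prev = $1 + 0 }
    END {
      if (NR >= 2 && dips == 0 && prev > first) print "growing"
      else print "stable"
    }
  '
}

# Example (lag.log is a hypothetical file with one lag sample per line):
#   tail -n 10 lag.log | lag_trend
```

A short burst of lag that drains again is reported as stable, which matches the point above: the problem is lag that keeps climbing, not lag that briefly appears.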

Critical Kafka Metrics (summary)

UnderReplicated: partitions at risk
Consumer Lag: messages behind
Request Rate: throughput
ISR Shrink: replica health
Log Size: disk usage
Network I/O: bandwidth

Practice Exercises

Build a Mini Project (Medium)

Combine concepts from this tutorial to build a small utility or tool.

Debug Challenge (Medium)

Introduce a bug in one of the code examples and practice finding and fixing it.

Refactoring Exercise (Hard)

Rewrite one example using a different approach and compare the tradeoffs.