Critical Kafka Metrics

Monitoring Kafka is essential: a cluster that looks healthy can silently fall behind, lose replicas, or run out of disk. These are the metrics that should trigger alerts:

Metric	Where	Alert When	Meaning
`UnderReplicatedPartitions`	Broker	> 0	Replicas falling behind leader — data at risk
`ActiveControllerCount`	Cluster (sum across brokers)	!= 1	0 = no controller, so no leader election is possible; > 1 = conflicting controllers
`OfflinePartitionsCount`	Controller	> 0	Partitions with no leader — data unavailable
`consumer_lag`	Consumer	Growing steadily	Consumer can't keep up with producer rate
`RequestHandlerAvgIdlePercent`	Broker	< 0.3	Broker overloaded — add capacity
`LogFlushRateAndTimeMs`	Broker	99th > 1000ms	Disk I/O bottleneck
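
The threshold rules in the table can be written down as data. The following is a minimal sketch, not a production alerting setup: it assumes the metric values have already been collected into a plain dictionary, and the key `LogFlushRateAndTimeMs_p99` is a hypothetical name for the 99th-percentile flush time. `consumer_lag` is left out because its rule depends on a trend, not a single value.

```python
from typing import Callable, Dict, List

# Alert conditions copied from the table above. The keys mirror the metric
# names; "LogFlushRateAndTimeMs_p99" is a hypothetical key for the p99 value.
ALERT_RULES: Dict[str, Callable[[float], bool]] = {
    "UnderReplicatedPartitions": lambda v: v > 0,
    "ActiveControllerCount": lambda v: v != 1,   # summed across all brokers
    "OfflinePartitionsCount": lambda v: v > 0,
    "RequestHandlerAvgIdlePercent": lambda v: v < 0.3,
    "LogFlushRateAndTimeMs_p99": lambda v: v > 1000,
}


def firing_alerts(snapshot: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics in `snapshot` that violate their rule."""
    return sorted(
        name
        for name, value in snapshot.items()
        if name in ALERT_RULES and ALERT_RULES[name](value)
    )


# Made-up values, for illustration only.
EXAMPLE_SNAPSHOT = {
    "UnderReplicatedPartitions": 3,
    "ActiveControllerCount": 1,
    "OfflinePartitionsCount": 0,
    "RequestHandlerAvgIdlePercent": 0.22,
    "LogFlushRateAndTimeMs_p99": 180.0,
}
print(firing_alerts(EXAMPLE_SNAPSHOT))
```

In practice these conditions would more likely live in Prometheus alerting rules; the sketch only makes the comparison logic explicit.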

Prometheus + Grafana Setup

The standard monitoring stack for Kafka is JMX Exporter + Prometheus + Grafana. Here's how it works:

# 1. Add JMX Exporter to broker startup
export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker.yml"

# 2. Prometheus scrapes metrics from each broker on port 7071
# prometheus.yml:
# scrape_configs:
#   - job_name: 'kafka'
#     static_configs:
#       - targets: ['broker1:7071', 'broker2:7071', 'broker3:7071']

# 3. Grafana dashboards visualize the metrics
# Import dashboard ID 7589 as a starting point, then check that its panel
# queries match the metric names your exporter rules actually produce
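
To make step 2 concrete, here is a rough sketch of reading the plain-text format that the exporter serves on port 7071 and that Prometheus scrapes. The metric names in the sample are assumptions: the real names depend on the rules in `kafka-broker.yml`. The sample text is made up, and no network call is made.

```python
from typing import Dict


def parse_exposition(text: str) -> Dict[str, float]:
    """Map each series (metric name plus label set) to its sample value."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if not rest:
            continue  # malformed line without a value
        samples[series] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples


def total(samples: Dict[str, float], metric: str) -> float:
    """Sum every series whose metric name equals `metric`."""
    return sum(v for s, v in samples.items() if s.split("{", 1)[0] == metric)


# Made-up scrape output; metric names are assumed, not guaranteed.
SAMPLE = """\
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 2.0
kafka_controller_kafkacontroller_activecontrollercount 1.0
kafka_log_log_size{topic="orders",partition="0"} 1048576.0
kafka_log_log_size{topic="orders",partition="1"} 524288.0
"""

samples = parse_exposition(SAMPLE)
print(total(samples, "kafka_log_log_size"))
```

Prometheus does this parsing itself; the sketch is only meant to show what a scrape of one broker contains.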

Consumer Lag Monitoring

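Consumer lag is the single most important metric for Kafka applications. For each partition it is the gap between the log end offset and the consumer group's committed offset, so it tells you how far behind your consumers are: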

# Check lag via CLI
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-app

# Monitor continuously with Burrow (LinkedIn's consumer lag monitoring tool)
# Or use kafka-lag-exporter for Prometheus integration
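
For quick checks, the CLI output can also be summarized per topic. The sketch below assumes the `--describe` output has a header row containing `TOPIC` and `LAG` columns and that partitions without a committed offset show `-` in the `LAG` column; the exact set of columns varies between Kafka versions, so columns are located by header name. The sample text is made up.

```python
from collections import defaultdict
from typing import Dict, Optional


def lag_by_topic(describe_output: str) -> Dict[str, int]:
    """Sum the LAG column per topic from `kafka-consumer-groups.sh --describe` output."""
    totals: Dict[str, int] = defaultdict(int)
    columns: Optional[Dict[str, int]] = None
    for line in describe_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "TOPIC" in fields and "LAG" in fields:
            # Header row: remember where each column sits.
            columns = {name: i for i, name in enumerate(fields)}
            continue
        if columns is None or len(fields) <= columns["LAG"]:
            continue  # warnings or other lines that are not data rows
        lag = fields[columns["LAG"]]
        if lag.isdigit():  # "-" means no committed offset, so no lag value
            totals[fields[columns["TOPIC"]]] += int(lag)
    return dict(totals)


# Made-up output for a group named my-app.
SAMPLE_OUTPUT = """
GROUP   TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID
my-app  orders    0          1500            1520            20   consumer-1   /10.0.0.5  client-1
my-app  orders    1          980             1000            20   consumer-1   /10.0.0.5  client-1
my-app  payments  0          250             300             50   consumer-2   /10.0.0.6  client-2
my-app  payments  1          -               120             -    -            -          -
"""

print(lag_by_topic(SAMPLE_OUTPUT))
```

Parsing CLI text is fragile, so for continuous monitoring a dedicated tool such as Burrow or kafka-lag-exporter remains the better fit.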

Key Takeaway: Alert on UnderReplicatedPartitions > 0 (data safety), OfflinePartitionsCount > 0 (availability), and steadily growing consumer_lag (processing bottleneck). Together these three alerts cover the most common Kafka failure modes.

⚠️ Common Mistake: Only monitoring broker health

Brokers can be 100% healthy while your consumers are falling behind. Always monitor consumer lag separately. A growing lag means your application is losing real-time capability.
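
One way to tell steadily growing lag from ordinary fluctuation is to fit a trend line over recent samples. This is a sketch under assumptions: the samples are total lag values taken at equal intervals, and `min_slope` is a hypothetical threshold (messages gained per interval) that would need tuning for a real workload.

```python
from typing import Sequence


def lag_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of lag against sample index (messages per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to estimate a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_lag_growing(samples: Sequence[float], min_slope: float = 1.0) -> bool:
    """True when the fitted trend rises faster than `min_slope` per interval."""
    return lag_slope(samples) > min_slope


# Made-up samples: one group falling behind, one merely fluctuating.
FALLING_BEHIND = [100, 200, 300, 400]
FLUCTUATING = [120, 80, 130, 90, 110, 70]

print(is_lag_growing(FALLING_BEHIND), is_lag_growing(FLUCTUATING))
```

A fitted slope is less sensitive to single spikes than comparing only the first and last sample, which is why it is used here instead of a simple difference.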

[Figure: Critical Kafka Metrics. Panels: UnderReplicated (partitions at risk), Consumer Lag (messages behind), Request Rate (throughput), ISR Shrink (replica health), Log Size (disk usage), Network I/O (bandwidth).]

Practice Exercises

Medium: Build a Mini Project

Combine concepts from this tutorial to build a small utility or tool.

Medium: Debug Challenge

Introduce a bug in one of the code examples and practice finding and fixing it.

Hard: Refactoring Exercise

Rewrite one example using a different approach and compare the tradeoffs.
