Kafka Recovery
This runbook covers Kafka broker recovery, consumer group issues, topic partition problems, and message backlog resolution for the MATIH platform.
Symptoms
- `KafkaBrokerDown` alert firing
- Consumer lag alerts indicating growing backlogs
- Services unable to produce or consume messages
- Partition leader election failures
Impact
Kafka issues affect asynchronous event processing, including:
- Context graph event streaming
- Agent thinking trace ingestion
- Feedback event processing
- Session analytics pipelines
Broker Recovery
1. Check Broker Status
```shell
./scripts/tools/platform-status.sh
```
Verify that the Kafka broker pods are running.
2. Check Broker Logs
Review broker logs for errors such as:
- Out of memory
- Disk full
- Network partitions
- ZooKeeper/KRaft connection issues
3. Restart Broker
If a broker is unhealthy, restart it. Partition leadership fails over to in-sync replicas while the broker is down, and its replicas resync once it rejoins the cluster.
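As a sketch of the restart step, assuming the brokers run as a Kubernetes StatefulSet (the namespace `kafka` and pod `kafka-0` are placeholders for this platform's actual resources). The command is printed rather than executed so the snippet stands alone; run the printed command against the cluster:

```shell
# Sketch: restart one broker pod. Assumes a StatefulSet deployment, so the
# controller recreates the pod after deletion. Names are placeholders.
NAMESPACE="kafka"
POD="kafka-0"
CMD="kubectl delete pod $POD -n $NAMESPACE"
echo "$CMD"   # run this on a host with cluster access
```

Restart one broker at a time and wait for it to rejoin the ISR before touching the next, so the cluster never loses quorum for a partition.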
Consumer Group Issues
1. Check Consumer Lag
Monitor consumer group lag in the Grafana Kafka dashboard. High lag indicates consumers are falling behind.
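Lag can also be inspected directly with the stock Kafka CLI. A sketch, assuming `kafka-consumer-groups.sh` is on the PATH of a broker pod and using `context-graph-consumers` as a placeholder group name (the command is printed so the snippet stands alone):

```shell
# Sketch: describe a consumer group to see per-partition CURRENT-OFFSET,
# LOG-END-OFFSET, and LAG columns. Broker address and group are placeholders.
BOOTSTRAP="localhost:9092"
GROUP="context-graph-consumers"
CMD="kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP --describe --group $GROUP"
echo "$CMD"   # run this from a broker pod
```

The LAG column is the gap between the last committed offset and the log end offset; a value that keeps growing across runs means consumers are falling behind.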
2. Common Causes
| Cause | Resolution |
|---|---|
| Consumer crash loop | Fix the consumer error and restart |
| Processing too slow | Scale consumer replicas |
| Deserialization errors | Check message schema compatibility |
| Rebalance storm | Increase session.timeout.ms and max.poll.interval.ms |
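For the rebalance-storm case, the relevant consumer settings look like this (a sketch; the values are illustrative starting points, not tuned recommendations for this platform):

```properties
# Give the broker more time before declaring a consumer dead.
session.timeout.ms=45000
# Allow longer gaps between poll() calls for slow message processing.
max.poll.interval.ms=600000
# Optionally shrink the batch per poll so each cycle finishes sooner.
max.poll.records=200
```

If a consumer regularly exceeds `max.poll.interval.ms`, reducing per-poll work (via `max.poll.records`) is usually safer than raising the timeout indefinitely.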
3. Reset Consumer Offset
If a consumer is stuck and messages need to be skipped:
Use the appropriate Kafka admin scripts to reset the consumer group offset. The specific procedure depends on whether you want to skip to the latest offset or to a specific timestamp.
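A sketch of both variants with the stock `kafka-consumer-groups.sh` tool, using placeholder group/topic names (the commands are printed so the snippet stands alone; note the group must have no active members for a reset to succeed, and omitting `--execute` gives a dry-run preview):

```shell
# Sketch: reset a stuck consumer group's offsets. Names are placeholders.
BOOTSTRAP="localhost:9092"
GROUP="context-graph-consumers"
TOPIC="context-graph-events"

# Skip to the latest offset, abandoning the backlog:
CMD_LATEST="kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP \
  --group $GROUP --topic $TOPIC --reset-offsets --to-latest --execute"

# Or move to a specific timestamp:
CMD_TS="kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP \
  --group $GROUP --topic $TOPIC --reset-offsets \
  --to-datetime 2024-01-01T00:00:00.000 --execute"

echo "$CMD_LATEST"
echo "$CMD_TS"
```

Run the dry-run form first and confirm the proposed offsets before adding `--execute`, since skipped messages are not recoverable by the consumer group.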
Partition Issues
Under-Replicated Partitions
Under-replicated partitions indicate that replicas are not in sync with the leader:
- Check if any broker is down or overloaded
- Verify disk space on all brokers
- Wait for ISR (in-sync replicas) to catch up after broker recovery
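The checks above can start with the stock CLI's built-in filter (a sketch; broker address is a placeholder, and the command is printed so the snippet stands alone):

```shell
# Sketch: list under-replicated partitions. Empty output from the real
# command means all replicas are back in the ISR.
BOOTSTRAP="localhost:9092"
CMD="kafka-topics.sh --bootstrap-server $BOOTSTRAP --describe --under-replicated-partitions"
echo "$CMD"   # run this from a broker pod
```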
Leader Election Failure
If a partition has no leader:
- Check if the preferred leader broker is available
- Trigger a preferred leader election
- If the broker is permanently lost, reassign partitions
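A preferred leader election can be triggered with `kafka-leader-election.sh` (available in Kafka 2.4+). A sketch, with the broker address as a placeholder and the command printed so the snippet stands alone; scope it to a single partition with `--topic`/`--partition` instead of `--all-topic-partitions` when only one partition is leaderless:

```shell
# Sketch: trigger a preferred leader election across all topic partitions.
BOOTSTRAP="localhost:9092"
CMD="kafka-leader-election.sh --bootstrap-server $BOOTSTRAP \
  --election-type PREFERRED --all-topic-partitions"
echo "$CMD"   # run this from a broker pod
```

For a permanently lost broker, `kafka-reassign-partitions.sh` with a reassignment JSON file moves the affected replicas onto surviving brokers.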
Verification
After recovery:
- Verify all consumer groups are consuming without lag
- Check that event processing is working end-to-end
- Verify context graph events are flowing to Dgraph
- Monitor for 30 minutes to ensure stability
Escalation
If Kafka recovery takes longer than 30 minutes or data loss is suspected, escalate to the platform team immediately. Consider the impact on downstream analytics and context graph consistency.