Kafka Recovery
This runbook covers Kafka broker recovery, consumer group issues, topic partition problems, and message backlog resolution for the MATIH platform.
Symptoms
- `KafkaBrokerDown` alert firing
- Consumer lag alerts indicating growing backlogs
- Services unable to produce or consume messages
- Partition leader election failures
Impact
Kafka issues affect asynchronous event processing, including:
- Context graph event streaming
- Agent thinking trace ingestion
- Feedback event processing
- Session analytics pipelines
Broker Recovery
1. Check Broker Status
```shell
./scripts/tools/platform-status.sh
```
Verify that the Kafka broker pods are running.
2. Check Broker Logs
Review broker logs for errors such as:
- Out of memory
- Disk full
- Network partitions
- ZooKeeper/KRaft connection issues
3. Restart Broker
If a broker is unhealthy, restart it. Partition leadership fails over to in-sync replicas while the broker is down, and its replicas resync once it rejoins the cluster.
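As a sketch of the restart step, assuming the brokers run as a Kubernetes StatefulSet (the namespace `kafka` and pod `kafka-0` are placeholders for this platform's actual resources). The command is printed rather than executed so the snippet stands alone; run the printed command against the cluster:

```shell
# Sketch: restart one broker pod. Assumes a StatefulSet deployment, so the
# controller recreates the pod after deletion. Names are placeholders.
NAMESPACE="kafka"
POD="kafka-0"
CMD="kubectl delete pod $POD -n $NAMESPACE"
echo "$CMD"   # run this on a host with cluster access
```

Restart one broker at a time and wait for it to rejoin the ISR before touching the next, so the cluster never loses quorum for a partition.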
Consumer Group Issues
1. Check Consumer Lag
Monitor consumer group lag in the Grafana Kafka dashboard. High lag indicates consumers are falling behind.
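Lag can also be inspected directly with the stock Kafka CLI. A sketch, assuming `kafka-consumer-groups.sh` is on the PATH of a broker pod and using `context-graph-consumers` as a placeholder group name (the command is printed so the snippet stands alone):

```shell
# Sketch: describe a consumer group to see per-partition CURRENT-OFFSET,
# LOG-END-OFFSET, and LAG columns. Broker address and group are placeholders.
BOOTSTRAP="localhost:9092"
GROUP="context-graph-consumers"
CMD="kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP --describe --group $GROUP"
echo "$CMD"   # run this from a broker pod
```

The LAG column is the gap between the last committed offset and the log end offset; a value that keeps growing across runs means consumers are falling behind.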
2. Common Causes
| Cause | Resolution |
|---|---|
| Consumer crash loop | Fix the consumer error and restart |
| Processing too slow | Scale consumer replicas |
| Deserialization errors | Check message schema compatibility |
| Rebalance storm | Increase session.timeout.ms and max.poll.interval.ms |
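For the rebalance-storm case, the relevant consumer settings look like this (a sketch; the values are illustrative starting points, not tuned recommendations for this platform):

```properties
# Give the broker more time before declaring a consumer dead.
session.timeout.ms=45000
# Allow longer gaps between poll() calls for slow message processing.
max.poll.interval.ms=600000
# Optionally shrink the batch per poll so each cycle finishes sooner.
max.poll.records=200
```

If a consumer regularly exceeds `max.poll.interval.ms`, reducing per-poll work (via `max.poll.records`) is usually safer than raising the timeout indefinitely.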
3. Reset Consumer Offset
If a consumer is stuck and messages need to be skipped:
Use the appropriate Kafka admin scripts to reset the consumer group offset. The specific procedure depends on whether you want to skip to the latest offset or to a specific timestamp.
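A sketch of both variants with the stock `kafka-consumer-groups.sh` tool, using placeholder group/topic names (the commands are printed so the snippet stands alone; note the group must have no active members for a reset to succeed, and omitting `--execute` gives a dry-run preview):

```shell
# Sketch: reset a stuck consumer group's offsets. Names are placeholders.
BOOTSTRAP="localhost:9092"
GROUP="context-graph-consumers"
TOPIC="context-graph-events"

# Skip to the latest offset, abandoning the backlog:
CMD_LATEST="kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP \
  --group $GROUP --topic $TOPIC --reset-offsets --to-latest --execute"

# Or move to a specific timestamp:
CMD_TS="kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP \
  --group $GROUP --topic $TOPIC --reset-offsets \
  --to-datetime 2024-01-01T00:00:00.000 --execute"

echo "$CMD_LATEST"
echo "$CMD_TS"
```

Run the dry-run form first and confirm the proposed offsets before adding `--execute`, since skipped messages are not recoverable by the consumer group.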
Partition Issues
Under-Replicated Partitions
Under-replicated partitions indicate that replicas are not in sync with the leader:
- Check if any broker is down or overloaded
- Verify disk space on all brokers
- Wait for ISR (in-sync replicas) to catch up after broker recovery
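The checks above can start with the stock CLI's built-in filter (a sketch; broker address is a placeholder, and the command is printed so the snippet stands alone):

```shell
# Sketch: list under-replicated partitions. Empty output from the real
# command means all replicas are back in the ISR.
BOOTSTRAP="localhost:9092"
CMD="kafka-topics.sh --bootstrap-server $BOOTSTRAP --describe --under-replicated-partitions"
echo "$CMD"   # run this from a broker pod
```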
Leader Election Failure
If a partition has no leader:
- Check if the preferred leader broker is available
- Trigger a preferred leader election
- If the broker is permanently lost, reassign partitions
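A preferred leader election can be triggered with `kafka-leader-election.sh` (available in Kafka 2.4+). A sketch, with the broker address as a placeholder and the command printed so the snippet stands alone; scope it to a single partition with `--topic`/`--partition` instead of `--all-topic-partitions` when only one partition is leaderless:

```shell
# Sketch: trigger a preferred leader election across all topic partitions.
BOOTSTRAP="localhost:9092"
CMD="kafka-leader-election.sh --bootstrap-server $BOOTSTRAP \
  --election-type PREFERRED --all-topic-partitions"
echo "$CMD"   # run this from a broker pod
```

For a permanently lost broker, `kafka-reassign-partitions.sh` with a reassignment JSON file moves the affected replicas onto surviving brokers.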
Verification
After recovery:
- Verify all consumer groups are consuming without lag
- Check that event processing is working end-to-end
- Verify context graph events are flowing to Dgraph
- Monitor for 30 minutes to ensure stability
Escalation
If Kafka recovery takes longer than 30 minutes or data loss is suspected, escalate to the platform team immediately. Consider the impact on downstream analytics and context graph consistency.