MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Kafka Recovery

This runbook covers Kafka broker recovery, consumer group issues, topic partition problems, and message backlog resolution for the MATIH platform.


Symptoms

  • KafkaBrokerDown alert firing
  • Consumer lag alerts indicating growing backlogs
  • Services unable to produce or consume messages
  • Partition leader election failures

Impact

Kafka issues affect asynchronous event processing, including:

  • Context graph event streaming
  • Agent thinking trace ingestion
  • Feedback event processing
  • Session analytics pipelines

Broker Recovery

1. Check Broker Status

./scripts/tools/platform-status.sh

Verify the Kafka broker pods are running.
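A quick sketch of the pod check, assuming Kafka runs on Kubernetes as a StatefulSet; the namespace and label selector below are placeholders for whatever the MATIH deployment actually uses:

```shell
# Placeholder namespace (kafka) and label (app=kafka) -- substitute the
# real values from the platform manifests.
kubectl get pods -n kafka -l app=kafka -o wide

# A healthy broker pod shows STATUS=Running and READY=1/1; pods in
# CrashLoopBackOff or Pending need the log review in the next step.
```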

2. Check Broker Logs

Review broker logs for errors such as:

  • Out of memory
  • Disk full
  • Network partitions
  • ZooKeeper/KRaft connection issues
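The error patterns above can be pulled out of the broker logs with a simple filter. A sketch: in production the input would come from `kubectl logs <broker-pod>`; a sample log is inlined here so the filter itself is runnable:

```shell
# Match the common fatal patterns: heap exhaustion, full disk,
# partition leadership errors, and ZooKeeper/controller session loss.
filter_broker_errors() {
  grep -E 'OutOfMemoryError|No space left on device|NotLeaderOrFollower|KeeperException'
}

# Inlined sample log standing in for `kubectl logs <broker-pod>`:
filter_broker_errors <<'EOF'
[2024-05-01 10:00:01] INFO Starting log cleaner
[2024-05-01 10:00:02] ERROR java.lang.OutOfMemoryError: Java heap space
[2024-05-01 10:00:03] ERROR java.io.IOException: No space left on device
EOF
```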

3. Restart Broker

If a broker is unhealthy, restart it. Kafka will automatically re-elect partition leaders among the remaining in-sync replicas, and the restarted broker rejoins the ISR once its replicas catch up.
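A restart sketch, again assuming a Kubernetes StatefulSet (the pod and namespace names are placeholders). Deleting the pod lets the StatefulSet controller recreate it with the same identity and persistent volume:

```shell
# Restart a single unhealthy broker; restart brokers one at a time.
kubectl delete pod kafka-1 -n kafka

# Watch the pod come back and confirm it is Running before touching
# any other broker, so the ISR never shrinks by more than one replica.
kubectl get pods -n kafka -w
```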


Consumer Group Issues

1. Check Consumer Lag

Monitor consumer group lag in the Grafana Kafka dashboard. High lag indicates consumers are falling behind.
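Alongside the dashboard, lag can be checked from the CLI with the stock kafka-consumer-groups.sh tool (the group name and broker address below are placeholders). The `awk` step sums the LAG column; sample `--describe` output is inlined so the calculation itself is runnable:

```shell
# Real invocation (requires a reachable broker):
#   kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
#     --describe --group context-graph-consumer

# Sum the LAG column (field 6) of --describe output, skipping the header.
total_lag() {
  awk 'NR > 1 { sum += $6 } END { print sum + 0 }'
}

total_lag <<'EOF'
GROUP                  TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
context-graph-consumer context-events  0          1500            1520            20
context-graph-consumer context-events  1          900             905             5
EOF
# prints 25
```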

2. Common Causes

Cause                   Resolution
Consumer crash loop     Fix the consumer error and restart
Processing too slow     Scale consumer replicas
Deserialization errors  Check message schema compatibility
Rebalance storm         Increase session.timeout.ms and max.poll.interval.ms
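For the rebalance-storm case, the timeouts can be raised via consumer properties. A sketch using the console consumer's `--consumer-property` flag; the topic, group, broker address, and values are examples only, and application consumers would set the same properties in their client configuration instead:

```shell
# Example values: tolerate longer heartbeat gaps (45 s) and longer
# per-batch processing time (10 min) before a rebalance is triggered.
kafka-console-consumer.sh --bootstrap-server kafka:9092 \
  --topic context-events --group context-graph-consumer \
  --consumer-property session.timeout.ms=45000 \
  --consumer-property max.poll.interval.ms=600000
```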

3. Reset Consumer Offset

If a consumer is stuck and messages need to be skipped:

Use the appropriate Kafka admin scripts to reset the consumer group offset. The specific procedure depends on whether you want to skip to the latest offset or to a specific timestamp.
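Both variants can be sketched with the stock kafka-consumer-groups.sh tool (group, topic, and broker address are placeholders). The consumer group must be stopped before its offsets can be reset:

```shell
# Skip to the latest offset, dropping the backlog entirely:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group context-graph-consumer --topic context-events \
  --reset-offsets --to-latest --execute

# Or move to a specific point in time:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group context-graph-consumer --topic context-events \
  --reset-offsets --to-datetime 2024-05-01T00:00:00.000 --execute

# Run with --dry-run instead of --execute first to preview the change.
```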


Partition Issues

Under-Replicated Partitions

Under-replicated partitions indicate that replicas are not in sync with the leader:

  1. Check if any broker is down or overloaded
  2. Verify disk space on all brokers
  3. Wait for ISR (in-sync replicas) to catch up after broker recovery
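Under-replicated partitions can be listed directly with kafka-topics.sh (broker address is a placeholder); empty output means every partition is fully replicated:

```shell
# Lists partitions whose ISR is smaller than the configured replica set.
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions
```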

Leader Election Failure

If a partition has no leader:

  1. Check if the preferred leader broker is available
  2. Trigger a preferred leader election
  3. If the broker is permanently lost, reassign partitions
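Steps 2 and 3 above map onto the stock Kafka tools as follows (broker address and reassignment file are placeholders; kafka-leader-election.sh ships with Kafka 2.4+):

```shell
# Step 2: trigger a preferred leader election for all partitions.
kafka-leader-election.sh --bootstrap-server kafka:9092 \
  --election-type PREFERRED --all-topic-partitions

# Step 3: if a broker is permanently lost, move its replicas to the
# surviving brokers using a prepared reassignment plan.
kafka-reassign-partitions.sh --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json --execute
```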

Verification

After recovery:

  1. Verify all consumer groups are consuming without lag
  2. Check that event processing is working end-to-end
  3. Verify context graph events are flowing to Dgraph
  4. Monitor for 30 minutes to ensure stability
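Step 1 of the checklist can be scripted. A sketch: in production the input would come from `kafka-consumer-groups.sh --describe --all-groups`; zero-lag sample output is inlined here so the check itself is runnable:

```shell
# Succeed only if no row reports lag (field 6 of --describe output).
lag_is_zero() {
  awk 'NR > 1 && $6 > 0 { exit 1 }'
}

if lag_is_zero <<'EOF'
GROUP                  TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
context-graph-consumer context-events  0          1520            1520            0
context-graph-consumer context-events  1          905             905             0
EOF
then
  echo "all consumer groups caught up"
else
  echo "lag remains"
fi
```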

Escalation

If Kafka recovery takes longer than 30 minutes or data loss is suspected, escalate to the platform team immediately. Consider the impact on downstream analytics and context graph consistency.