MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Database Recovery

This runbook covers PostgreSQL recovery procedures for the MATIH platform, including connection issues, replication lag, data corruption, and backup restoration.


Symptoms

  • PostgreSQLDown alert firing
  • Services reporting database connection errors
  • Elevated query latency or timeouts
  • Replication lag alerts

Impact

Database issues affect all services that depend on PostgreSQL, including the control plane services, AI service bi-temporal store, and session storage.


Connection Issues

1. Verify Database Pod Status

./scripts/tools/platform-status.sh

Check that the PostgreSQL pod is running and not in a crash loop.
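If the status script flags the database, the pod state can be inspected directly with kubectl. A minimal sketch, assuming PostgreSQL runs as a StatefulSet pod named `postgresql-0` in a `matih` namespace with an `app=postgresql` label (all placeholder names; adjust to your deployment):

```shell
# List PostgreSQL pods and their restart counts.
kubectl get pods -n matih -l app=postgresql

# For a crash-looping pod, check recent events and the logs
# from the previous (crashed) container instance.
kubectl describe pod -n matih postgresql-0
kubectl logs -n matih postgresql-0 --previous
```

A high `RESTARTS` count combined with `CrashLoopBackOff` in the pod status usually points at a startup failure visible in the previous container's logs.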

2. Check Connection Pool

Review application logs for connection pool exhaustion:

  • Look for "too many connections" errors
  • Check pg_stat_activity for idle connections
  • Verify connection pool settings in the Helm values
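The idle-connection check can be run directly against the database. A sketch, assuming `psql` access to the primary; the 10-minute cutoff is illustrative, not a platform setting:

```shell
# Count sessions per state; a large "idle" count suggests pool leakage.
psql -c "SELECT state, count(*) FROM pg_stat_activity
         GROUP BY state ORDER BY count(*) DESC;"

# List sessions that have been idle for more than 10 minutes.
psql -c "SELECT pid, usename, application_name, state_change
         FROM pg_stat_activity
         WHERE state = 'idle'
           AND state_change < now() - interval '10 minutes';"
```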

3. Restart Connection Pools

If connection pools are exhausted, restart the affected application services:

./scripts/tools/service-build-deploy.sh <service-name>

Replication Lag

1. Check Replication Status

Monitor the replication lag metric in Grafana. If lag exceeds the alert threshold, work through the steps below.
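The Grafana metric can be cross-checked on the database itself. A sketch, assuming `psql` access to both the primary and the replica:

```shell
# On the primary: per-replica lag as tracked by pg_stat_replication,
# measured in bytes of WAL not yet replayed.
psql -c "SELECT application_name, state,
                pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
                  AS replay_lag_bytes
         FROM pg_stat_replication;"

# On the replica: time since the last replayed transaction.
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"
```

Note that `replay_delay` on an idle primary grows even with a healthy replica, since no new transactions arrive to replay; read it together with the byte-based lag.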

2. Investigate Causes

Common causes of replication lag:

  • Heavy write workload
  • Network issues between primary and replica
  • Replica disk I/O bottleneck

3. Recovery

If replication lag is severe, consider:

  • Scaling replica resources
  • Temporarily reducing write-heavy operations
  • Rebuilding the replica from a fresh base backup
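The rebuild option can be sketched with `pg_basebackup`. The host, replication role, and data directory below are placeholders for your deployment's values, and the replica must be stopped (and its data directory emptied) first:

```shell
# Stream a fresh base backup from the primary into the replica's
# empty data directory. -X stream includes the WAL needed to start;
# -R writes the standby configuration (standby.signal on PG 12+).
pg_basebackup \
  -h <primary-host> \
  -U replication \
  -D /var/lib/postgresql/data \
  -X stream -R -P
```

Once the backup completes, restart the replica and confirm it appears again in `pg_stat_replication` on the primary.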

Backup Restoration

1. Identify the Backup

Locate the most recent backup. Backups are managed by the disaster recovery scripts.

2. Restore from Backup

Follow the PostgreSQL backup procedures in the Disaster Recovery section.
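As a fallback, a manual restore can be sketched as follows, assuming the backups are custom-format `pg_dump` archives and a `matih` database name (both assumptions; the authoritative procedure is the Disaster Recovery section):

```shell
# Restore a custom-format dump, dropping existing objects first.
# The archive path and database name are illustrative only.
pg_restore --clean --if-exists -d matih /backups/matih-latest.dump
```

Run the restore against a stopped or quiesced application tier so services do not write into a partially restored schema.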

3. Verify Data Integrity

After restoration, verify:

  • All expected tables exist
  • Row counts match expectations
  • Application health checks pass
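The first two checks can be scripted against the restored database. A sketch, assuming `psql` access and a `matih` database name; note that `n_live_tup` is a statistics estimate, not an exact count:

```shell
# Confirm the expected tables exist.
psql -d matih -c "\dt"

# Estimated row counts per table, largest first, for a plausibility check
# against the pre-incident figures.
psql -d matih -c "SELECT relname, n_live_tup
                  FROM pg_stat_user_tables
                  ORDER BY n_live_tup DESC LIMIT 20;"
```

If estimates look stale after a restore, run `ANALYZE` before comparing counts.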

Escalation

If database recovery is not successful within 30 minutes:

  1. Escalate to the platform team
  2. Consider failing over to a standby database if available
  3. Communicate impact to affected tenant owners