Database Recovery
This runbook covers PostgreSQL recovery procedures for the MATIH platform, including connection issues, replication lag, data corruption, and backup restoration.
Symptoms
- PostgreSQLDown alert firing
- Services reporting database connection errors
- Elevated query latency or timeouts
- Replication lag alerts
Impact
Database issues affect all services that depend on PostgreSQL, including the control plane services, AI service bi-temporal store, and session storage.
Connection Issues
1. Verify Database Pod Status
Run the platform status script:

```shell
./scripts/tools/platform-status.sh
```

Check that the PostgreSQL pod is running and not in a crash loop.
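The pod status can also be inspected directly with kubectl. The namespace and label selector below are assumptions and may differ from the actual Helm release:

```shell
# Assumed namespace and labels; adjust to match the PostgreSQL deployment.
kubectl get pods -n matih -l app=postgresql          # STATUS should be Running, RESTARTS low
kubectl describe pod -n matih -l app=postgresql      # check Events for crash-loop causes
kubectl logs -n matih -l app=postgresql --tail=100   # recent PostgreSQL log output
```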
2. Check Connection Pool
Review application logs for connection pool exhaustion:
- Look for "too many connections" errors
- Check `pg_stat_activity` for idle connections
- Verify connection pool settings in the Helm values
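One way to inspect connection usage, assuming direct psql access to the database (the connection parameters are placeholders):

```shell
# Summarise connections by state; a large "idle" count suggests pool leakage.
psql -h <db-host> -U postgres -c "
  SELECT state, count(*) AS connections
  FROM pg_stat_activity
  GROUP BY state
  ORDER BY connections DESC;"

# Compare the totals above against the configured server limit.
psql -h <db-host> -U postgres -c "SHOW max_connections;"
```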
3. Restart Connection Pools
If connection pools are exhausted, restart the affected application services:
```shell
./scripts/tools/service-build-deploy.sh <service-name>
```

Replication Lag
1. Check Replication Status
Monitor the replication lag metric in Grafana. If lag exceeds the alert threshold, work through the steps below.
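As a sketch, the raw lag can also be read from the replica and compared against a threshold in a script. The psql invocation in the comment and the 30-second threshold are assumptions, not the platform's actual alert configuration:

```shell
# A lag reading in whole seconds would come from the replica, e.g.:
#   psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int;"
# check_lag compares a lag reading (arg 1) against a threshold (arg 2), both in seconds.
check_lag() {
  lag=$1
  threshold=$2
  if [ "$lag" -gt "$threshold" ]; then
    echo "ALERT: replication lag ${lag}s exceeds ${threshold}s"
  else
    echo "OK: replication lag ${lag}s within ${threshold}s"
  fi
}

check_lag 5 30   # prints "OK: replication lag 5s within 30s"
```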
2. Investigate Causes
Common causes of replication lag:
- Heavy write workload
- Network issues between primary and replica
- Replica disk I/O bottleneck
3. Recovery
If replication lag is severe, consider:
- Scaling replica resources
- Temporarily reducing write-heavy operations
- Rebuilding the replica from a fresh base backup
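Rebuilding a replica from a fresh base backup typically uses pg_basebackup; the hostname, data directory, and replication role below are placeholders and must match the actual setup:

```shell
# Run on the (stopped, emptied) replica. -R writes standby.signal and
# primary_conninfo (PostgreSQL 12+); WAL is streamed during the backup.
pg_basebackup \
  -h <primary-host> \
  -U replication \
  -D /var/lib/postgresql/data \
  -P -R --wal-method=stream
```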
Backup Restoration
1. Identify the Backup
Locate the most recent backup. Backups are managed by the disaster recovery scripts.
2. Restore from Backup
Follow the PostgreSQL backup procedures in the Disaster Recovery section.
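If the backups are pg_dump custom-format archives (an assumption — the disaster recovery scripts may use a different mechanism), a restore could look like:

```shell
# Restore into the target database; --clean drops existing objects first,
# --if-exists suppresses errors for objects that are not present.
pg_restore -h <db-host> -U postgres -d <database> --clean --if-exists <backup-file>.dump
```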
3. Verify Data Integrity
After restoration, verify:
- All expected tables exist
- Row counts match expectations
- Application health checks pass
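The table and row-count checks above can be run with queries like these; psql access is assumed, and the critical table name is a placeholder:

```shell
# List user tables with approximate row counts from the statistics collector.
psql -h <db-host> -U postgres -d <database> -c "
  SELECT schemaname, relname, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_live_tup DESC;"

# Exact count for a specific critical table (placeholder name).
psql -h <db-host> -U postgres -d <database> -c "SELECT count(*) FROM <critical-table>;"
```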
Escalation
If database recovery is not successful within 30 minutes:
- Escalate to the platform team
- Consider failing over to a standby database if available
- Communicate impact to affected tenant owners