MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Database Recovery

This runbook covers PostgreSQL recovery procedures for the MATIH platform, including connection issues, replication lag, data corruption, and backup restoration.


Symptoms

  • PostgreSQLDown alert firing
  • Services reporting database connection errors
  • Elevated query latency or timeouts
  • Replication lag alerts

Impact

Database issues affect all services that depend on PostgreSQL, including the control plane services, AI service bi-temporal store, and session storage.


Connection Issues

1. Verify Database Pod Status

./scripts/tools/platform-status.sh

Check that the PostgreSQL pod is running and not in a crash loop.
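If the status script flags the database, the pod state can be inspected directly with kubectl. A minimal sketch, assuming PostgreSQL runs as a StatefulSet pod named `postgresql-0` in a `matih` namespace with an `app=postgresql` label (all placeholder names; adjust to your deployment):

```shell
# List PostgreSQL pods and their restart counts.
kubectl get pods -n matih -l app=postgresql

# For a crash-looping pod, check recent events and the logs
# from the previous (crashed) container instance.
kubectl describe pod -n matih postgresql-0
kubectl logs -n matih postgresql-0 --previous
```

A high `RESTARTS` count combined with `CrashLoopBackOff` in the pod status usually points at a startup failure visible in the previous container's logs.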

2. Check Connection Pool

Review application logs for connection pool exhaustion:

  • Look for "too many connections" errors
  • Check pg_stat_activity for idle connections
  • Verify connection pool settings in the Helm values
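The idle-connection check can be run directly against the database. A sketch, assuming `psql` access to the primary; the 10-minute cutoff is illustrative, not a platform setting:

```shell
# Count sessions per state; a large "idle" count suggests pool leakage.
psql -c "SELECT state, count(*) FROM pg_stat_activity
         GROUP BY state ORDER BY count(*) DESC;"

# List sessions that have been idle for more than 10 minutes.
psql -c "SELECT pid, usename, application_name, state_change
         FROM pg_stat_activity
         WHERE state = 'idle'
           AND state_change < now() - interval '10 minutes';"
```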

3. Restart Connection Pools

If connection pools are exhausted, restart the affected application services:

./scripts/tools/service-build-deploy.sh <service-name>

Replication Lag

1. Check Replication Status

Monitor the replication lag metric in Grafana. If lag exceeds the alert threshold, work through the steps below.
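The Grafana metric can be cross-checked on the database itself. A sketch, assuming `psql` access to both the primary and the replica:

```shell
# On the primary: per-replica lag as tracked by pg_stat_replication,
# measured in bytes of WAL not yet replayed.
psql -c "SELECT application_name, state,
                pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
                  AS replay_lag_bytes
         FROM pg_stat_replication;"

# On the replica: time since the last replayed transaction.
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"
```

Note that `replay_delay` on an idle primary grows even with a healthy replica, since no new transactions arrive to replay; read it together with the byte-based lag.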

2. Investigate Causes

Common causes of replication lag:

  • Heavy write workload
  • Network issues between primary and replica
  • Replica disk I/O bottleneck

3. Recovery

If replication lag is severe, consider:

  • Scaling replica resources
  • Temporarily reducing write-heavy operations
  • Rebuilding the replica from a fresh base backup
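The rebuild option can be sketched with `pg_basebackup`. The host, replication role, and data directory below are placeholders for your deployment's values, and the replica must be stopped (and its data directory emptied) first:

```shell
# Stream a fresh base backup from the primary into the replica's
# empty data directory. -X stream includes the WAL needed to start;
# -R writes the standby configuration (standby.signal on PG 12+).
pg_basebackup \
  -h <primary-host> \
  -U replication \
  -D /var/lib/postgresql/data \
  -X stream -R -P
```

Once the backup completes, restart the replica and confirm it appears again in `pg_stat_replication` on the primary.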

Backup Restoration

1. Identify the Backup

Locate the most recent backup. Backups are managed by the disaster recovery scripts.

2. Restore from Backup

Follow the PostgreSQL backup procedures in the Disaster Recovery section.
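As a fallback, a manual restore can be sketched as follows, assuming the backups are custom-format `pg_dump` archives and a `matih` database name (both assumptions; the authoritative procedure is the Disaster Recovery section):

```shell
# Restore a custom-format dump, dropping existing objects first.
# The archive path and database name are illustrative only.
pg_restore --clean --if-exists -d matih /backups/matih-latest.dump
```

Run the restore against a stopped or quiesced application tier so services do not write into a partially restored schema.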

3. Verify Data Integrity

After restoration, verify:

  • All expected tables exist
  • Row counts match expectations
  • Application health checks pass
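The first two checks can be scripted against the restored database. A sketch, assuming `psql` access and a `matih` database name; note that `n_live_tup` is a statistics estimate, not an exact count:

```shell
# Confirm the expected tables exist.
psql -d matih -c "\dt"

# Estimated row counts per table, largest first, for a plausibility check
# against the pre-incident figures.
psql -d matih -c "SELECT relname, n_live_tup
                  FROM pg_stat_user_tables
                  ORDER BY n_live_tup DESC LIMIT 20;"
```

If estimates look stale after a restore, run `ANALYZE` before comparing counts.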

Escalation

If database recovery is not successful within 30 minutes:

  1. Escalate to the platform team
  2. Consider failing over to a standby database if available
  3. Communicate impact to affected tenant owners