PostgreSQL Backup
PostgreSQL is the primary relational database for MATIH, storing control plane data, bi-temporal context graph data, session state, and tenant configurations. Backup procedures cover logical backups (pg_dump), continuous archiving (WAL), and point-in-time recovery (PITR).
Backup Strategy
| Method | Frequency | RPO | Use Case |
|---|---|---|---|
| WAL Archiving | Continuous | Minutes | Point-in-time recovery |
| pg_dump (logical) | Hourly | 1 hour | Table-level restore, cross-version migration |
| Full base backup | Daily | 24 hours | Complete cluster restore |
WAL Archiving
WAL (Write-Ahead Log) archiving provides continuous backup with an RPO of minutes:
Configuration
archive_mode = on
archive_command = 'upload_to_storage %p %f'
wal_level = replica
WAL files are archived to object storage (Azure Blob Storage or AWS S3) as they are completed.
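To illustrate what the `upload_to_storage` placeholder in `archive_command` has to do, here is a minimal sketch that archives to a local directory. The function name `archive_wal` and the `ARCHIVE_DIR` default are assumptions made for this example, not part of the MATIH scripts; a real deployment would replace the `cp` with an upload to Azure Blob Storage or S3. The contract with PostgreSQL is that the command exits with status 0 only once the segment is safely stored.

```shell
# Hypothetical stand-in for upload_to_storage: archives one WAL segment.
# Called with the same arguments as in archive_command:
#   $1 = path to the WAL file (%p), $2 = file name only (%f)
ARCHIVE_DIR="${ARCHIVE_DIR:-/var/lib/postgresql/wal_archive}"  # assumed location

archive_wal() {
    src="$1"
    name="$2"
    dest="$ARCHIVE_DIR/$name"

    # Never overwrite an existing segment; a non-zero exit makes
    # PostgreSQL keep the file and retry the archive step later.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi

    # Copy under a temporary name first so a partial file is never
    # visible as a complete segment.
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}
```

A wrapper script would end with `archive_wal "$1" "$2"` and be referenced from `archive_command` in place of `upload_to_storage`.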
Logical Backup (pg_dump)
Scheduled Backup
Hourly logical backups are taken of all databases and uploaded to object storage. The backup process is managed by the disaster recovery scripts.
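The hourly job could look roughly like the sketch below. It is not the actual disaster recovery script: `BACKUP_DIR`, the file naming scheme, and the function names are assumptions, and the upload to object storage is left out because the source does not specify the tool. The custom format (`--format=custom`) is chosen here because `pg_restore` can restore individual objects from it.

```shell
# Sketch of an hourly logical backup; names and paths are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/postgres}"   # assumed staging directory

# Build a timestamped file name, e.g. <dir>/<db>_20250101T000000Z.dump
backup_path() {
    printf '%s/%s_%s.dump\n' "$BACKUP_DIR" "$1" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Dump one database in custom format and print the resulting file path.
dump_database() {
    db="$1"
    out="$(backup_path "$db")"
    pg_dump --format=custom --file="$out" "$db" || return 1
    echo "$out"
}

# Dump the three databases listed under Backup Contents;
# stop at the first failure so the scheduler sees a non-zero exit.
dump_all() {
    for name in matih_control_plane matih_data_plane context_graph; do
        dump_database "$name" || return 1
    done
}
```

Run from cron or a Kubernetes CronJob, `dump_all` prints one path per database, which a later step can hand to the upload tool.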
Backup Contents
| Database | Tables | Size Estimate |
|---|---|---|
| matih_control_plane | IAM, tenant, config, audit | 1-10 GB |
| matih_data_plane | Sessions, bi-temporal data | 5-50 GB |
| context_graph | Events, decisions, entities | 10-100 GB |
Restore Procedures
Point-in-Time Recovery
- Stop the PostgreSQL instance
- Restore the most recent base backup
- Replay WAL files up to the target recovery point
- Start PostgreSQL and verify data integrity
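The WAL replay step above is driven by recovery settings rather than a manual command. The sketch below shows one way to prepare a restored data directory, assuming PostgreSQL 12 or later (where recovery is triggered by a `recovery.signal` file). `fetch_from_storage` is a hypothetical counterpart to `upload_to_storage`, and `prepare_pitr` is a name invented for this example.

```shell
# Sketch: configure a restored base backup for point-in-time recovery.
# Assumes PostgreSQL 12+; fetch_from_storage is a hypothetical helper.
#   $1 = data directory holding the restored base backup
#   $2 = target time, e.g. '2025-01-15 09:30:00+00'
prepare_pitr() {
    pgdata="$1"
    target="$2"

    # Append recovery settings. In restore_command, %f is the WAL file
    # name to fetch and %p is the path PostgreSQL wants it copied to.
    cat >> "$pgdata/postgresql.conf" <<EOF
restore_command = 'fetch_from_storage %f %p'
recovery_target_time = '$target'
recovery_target_action = 'promote'
EOF

    # The presence of recovery.signal makes the server start in
    # targeted recovery mode on its next start.
    touch "$pgdata/recovery.signal"
}
```

After calling `prepare_pitr`, starting the server (for example with `pg_ctl -D "$PGDATA" start`) replays archived WAL up to the target time and then promotes the instance.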
Logical Restore
- Create a new database
- Restore from the pg_dump file
- Verify table counts and data integrity
- Update application connection strings if needed
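The first three steps above can be sketched as follows, assuming the dump was taken in custom format so that `pg_restore` applies. The function names are illustrative only, and connection options (host, user) are omitted.

```shell
# Sketch of a logical restore into a new database.
#   $1 = dump file created by pg_dump in custom format (assumed)
#   $2 = name of the new database
restore_database() {
    dump="$1"
    newdb="$2"

    createdb "$newdb" || return 1

    # --exit-on-error stops at the first failed statement instead of
    # continuing and reporting a count of errors at the end.
    pg_restore --dbname="$newdb" --exit-on-error "$dump"
}

# Print the number of user tables in a database, for comparison with
# the count taken from the source database.
count_tables() {
    psql -At --dbname="$1" -c "SELECT count(*) FROM information_schema.tables
        WHERE table_type = 'BASE TABLE'
          AND table_schema NOT IN ('pg_catalog', 'information_schema')"
}
```

Restoring into a fresh database rather than over the existing one keeps the original available for comparison until the connection strings are switched.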
Single Table Restore
- Restore the pg_dump to a temporary database
- Copy the specific table to the production database
- Verify data integrity
- Drop the temporary database
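A possible shape for the single-table procedure above is sketched here. It assumes a custom-format dump and assumes the damaged table has already been dropped or renamed in production, since the copied SQL includes the `CREATE TABLE` statement. The temporary database name and the function name are made up for this example.

```shell
# Sketch of a single-table restore through a temporary database.
#   $1 = dump file (custom format assumed), $2 = table name,
#   $3 = production database
restore_single_table() {
    dump="$1"
    table="$2"
    prod_db="$3"
    tmp_db="restore_tmp_$$"      # hypothetical temporary database name
    table_sql="$(mktemp)"

    createdb "$tmp_db" || return 1
    if ! pg_restore --dbname="$tmp_db" --exit-on-error "$dump"; then
        dropdb "$tmp_db"
        rm -f "$table_sql"
        return 1
    fi

    # Export only the requested table, then load it into production.
    # ON_ERROR_STOP makes psql exit non-zero on the first SQL error.
    pg_dump --table="$table" --file="$table_sql" "$tmp_db" &&
        psql -v ON_ERROR_STOP=1 --dbname="$prod_db" --file="$table_sql"
    status=$?

    # Clean up whether or not the copy succeeded.
    rm -f "$table_sql"
    dropdb "$tmp_db"
    return "$status"
}
```

Writing the table to an intermediate file instead of piping `pg_dump` into `psql` means a failure in either command is reflected in the exit status.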
Verification
After any restore:
- Run application health checks
- Verify row counts for critical tables
- Test application connectivity and basic queries
- Monitor for errors in application logs
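The row-count check in the list above could be automated along these lines. This is a sketch, not part of `health-check.sh`: the function names are hypothetical, and the minimum counts would have to come from a known-good baseline. Table names are interpolated directly into SQL, so they must come from a trusted list.

```shell
# Print the row count of one table. $1 = database, $2 = table (trusted input)
row_count() {
    # -A: unaligned output, -t: tuples only, so just the number is printed
    psql -At --dbname="$1" -c "SELECT count(*) FROM $2"
}

# Succeed only if the table holds at least the expected minimum rows.
#   $1 = database, $2 = table, $3 = minimum acceptable row count
check_min_rows() {
    db="$1"
    table="$2"
    min="$3"

    actual="$(row_count "$db" "$table")" || return 1
    if [ "$actual" -lt "$min" ]; then
        echo "FAIL $db.$table: $actual rows, expected at least $min" >&2
        return 1
    fi
    echo "OK $db.$table: $actual rows"
}
```

For example, `check_min_rows matih_control_plane tenants 1` would fail if a table named `tenants` (an illustrative name) came back empty after the restore.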
./scripts/disaster-recovery/health-check.sh
Backup Monitoring
| Metric | Alert Threshold | Description |
|---|---|---|
| Backup age | Over 2 hours | Last successful backup is too old |
| Backup size | Deviation over 50% | Unexpected backup size change |
| WAL archive lag | Over 100 segments | WAL archiving is falling behind |
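The three thresholds in the table can be expressed as simple checks. The sketch below is illustrative and makes two assumptions the table does not state: size deviation is measured against the previous backup, and WAL lag is counted as the number of `.ready` files in `pg_wal/archive_status` (segments waiting to be archived). Each function returns success when the alert condition is met.

```shell
# Each function exits 0 when the alert should fire.

# Backup age: newest backup is more than 2 hours (7200 s) old.
#   $1 = backup timestamp, $2 = current time (both epoch seconds)
backup_too_old() {
    [ $(( $2 - $1 )) -gt 7200 ]
}

# Backup size: differs from the previous backup by more than 50%.
#   $1 = previous size in bytes, $2 = current size in bytes
backup_size_anomalous() {
    prev="$1"
    cur="$2"
    diff=$(( cur - prev ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    # Integer form of diff / prev > 0.5
    [ $(( diff * 2 )) -gt "$prev" ]
}

# WAL archive lag: more than 100 segments are waiting to be archived.
#   $1 = PostgreSQL data directory
wal_archive_lagging() {
    pending=$(( $(find "$1/pg_wal/archive_status" -name '*.ready' | wc -l) ))
    [ "$pending" -gt 100 ]
}
```

A monitoring job could call these on a schedule and raise an alert whenever one of them succeeds, for instance `backup_too_old "$last_backup_epoch" "$(date +%s)"`.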