MATIH Platform is in active MVP development. Documentation reflects current implementation status.
PostgreSQL Backup

PostgreSQL is the primary relational database for MATIH, storing control plane data, bi-temporal context graph data, session state, and tenant configurations. Backup procedures cover logical backups (pg_dump), continuous archiving (WAL), and point-in-time recovery (PITR).


Backup Strategy

Method             Frequency   RPO       Use Case
WAL Archiving      Continuous  Minutes   Point-in-time recovery
pg_dump (logical)  Hourly      1 hour    Table-level restore, cross-version migration
Full base backup   Daily       24 hours  Complete cluster restore

WAL Archiving

WAL (Write-Ahead Log) archiving provides continuous backup. Because a segment is archived only once it is completed, the RPO is on the order of minutes rather than zero.

Configuration

archive_mode = on
archive_command = 'upload_to_storage %p %f'
wal_level = replica

WAL files are archived to object storage (Azure Blob Storage or AWS S3) as they are completed.
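
As a rough sketch of what the upload_to_storage placeholder might do, the helper below archives one segment and refuses to overwrite a segment that is already archived. It copies into a local directory as a stand-in for object storage; WAL_ARCHIVE_DIR and archive_wal are hypothetical names, and a real deployment would replace the cp with the Azure or AWS upload tool.

```shell
#!/bin/sh
# Sketch of an archive_command helper. WAL_ARCHIVE_DIR and archive_wal are
# hypothetical names; a local directory stands in for object storage.
WAL_ARCHIVE_DIR="${WAL_ARCHIVE_DIR:-/var/lib/wal-archive}"

archive_wal() {
    src="$1"    # %p: path of the completed WAL segment
    name="$2"   # %f: file name of the segment
    dest="$WAL_ARCHIVE_DIR/$name"

    # Never overwrite an archived segment. A non-zero exit makes
    # PostgreSQL keep the segment and retry later.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi

    # Copy under a temporary name, then rename, so a partial copy is
    # never mistaken for a complete segment.
    cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}
```

The function returns zero only when the segment has been stored, which is the contract PostgreSQL expects from archive_command.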


Logical Backup (pg_dump)

Scheduled Backup

Hourly logical backups are taken of all databases and uploaded to object storage. The backup process is managed by the disaster recovery scripts.
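
The actual process lives in the disaster recovery scripts, which are not shown here. As an illustration only, an hourly dump of each database could look like the sketch below. BACKUP_DIR and the function names are assumptions, connection settings are assumed to come from the standard PG* environment variables, and the upload to object storage is left out.

```shell
#!/bin/sh
# Sketch of an hourly logical backup. BACKUP_DIR and the function names are
# hypothetical; connection settings come from PGHOST, PGUSER, and so on.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/postgres}"

backup_path() {
    # $1 = database name, $2 = UTC timestamp, e.g. 20240101T000000Z
    printf '%s/%s_%s.dump\n' "$BACKUP_DIR" "$1" "$2"
}

backup_database() {
    db="$1"
    ts="$(date -u +%Y%m%dT%H%M%SZ)"
    out="$(backup_path "$db" "$ts")"
    # Custom format is compressed and can be restored selectively
    # with pg_restore.
    pg_dump --format=custom --file="$out" "$db" || return 1
    printf '%s\n' "$out"
}

backup_all() {
    # Database names as listed under Backup Contents.
    for db in matih_control_plane matih_data_plane context_graph; do
        backup_database "$db" || return 1
    done
}
```

Timestamped file names keep every hourly dump distinct, so an older dump remains available if the latest one turns out to be unusable.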

Backup Contents

Database             Tables                       Size Estimate
matih_control_plane  IAM, tenant, config, audit   1-10 GB
matih_data_plane     Sessions, bi-temporal data   5-50 GB
context_graph        Events, decisions, entities  10-100 GB

Restore Procedures

Point-in-Time Recovery

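  1. Stop the PostgreSQL instance
  2. Restore the most recent base backup taken before the target recovery point
  3. Set restore_command and the recovery target, and create recovery.signal in the data directory (PostgreSQL 12 and later)
  4. Start PostgreSQL, which replays archived WAL files up to the target recovery point
  5. Verify data integrity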
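
WAL replay to a chosen point is driven by recovery settings in the restored data directory. The sketch below shows one way to write them on PostgreSQL 12 or later. prepare_pitr is a hypothetical helper, and download_from_storage is a hypothetical counterpart to the upload_to_storage placeholder used for archiving.

```shell
#!/bin/sh
# Sketch of preparing a restored data directory for point-in-time recovery
# (PostgreSQL 12 or later). prepare_pitr and download_from_storage are
# hypothetical names.

prepare_pitr() {
    pgdata="$1"   # data directory holding the restored base backup
    target="$2"   # recovery target, e.g. '2024-01-01 12:00:00+00'

    # Recovery settings: where to fetch archived WAL and where to stop.
    cat >> "$pgdata/postgresql.conf" <<EOF
restore_command = 'download_from_storage %f %p'
recovery_target_time = '$target'
recovery_target_action = 'promote'
EOF

    # recovery.signal tells PostgreSQL to start in targeted recovery
    # mode on the next startup.
    touch "$pgdata/recovery.signal"
}
```

With recovery_target_action set to promote, the server begins accepting writes as soon as the target is reached. The value pause would instead allow inspection before the recovery is made final.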

Logical Restore

  1. Create a new database
  2. Restore from the pg_dump file
  3. Verify table counts and data integrity
  4. Update application connection strings if needed
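
A minimal sketch of the first two steps follows, assuming the dump was written in pg_dump custom format. restore_logical, run, and DRY_RUN are hypothetical names; setting DRY_RUN prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of a logical restore. restore_logical, run, and DRY_RUN are
# hypothetical names; the dump is assumed to be in pg_dump custom format.

run() {
    # Print the command instead of executing it when DRY_RUN is set.
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

restore_logical() {
    dump_file="$1"   # path to the pg_dump file
    new_db="$2"      # name of the database to create

    run createdb "$new_db" || return 1
    # --exit-on-error stops at the first failure instead of continuing
    # with a partially restored database.
    run pg_restore --exit-on-error --no-owner --dbname="$new_db" "$dump_file" || return 1
}
```

Restoring into a freshly created database leaves the original untouched until the restored copy has been verified.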

Single Table Restore

  1. Restore the pg_dump to a temporary database
  2. Copy the specific table to the production database
  3. Verify data integrity
  4. Drop the temporary database
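
The single-table procedure could be scripted roughly as below. All function and variable names are hypothetical. The --data-only copy assumes that the table definition still exists in production and that the rows being restored are no longer present there. Dropping the temporary database is a separate function so that verification can happen first.

```shell
#!/bin/sh
# Sketch of a single-table restore through a temporary database.
# All names are hypothetical.
TMP_DB="${TMP_DB:-restore_tmp}"

run() {
    # Print the command instead of executing it when DRY_RUN is set.
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

restore_table() {
    dump_file="$1"   # pg_dump custom-format file
    table="$2"       # table to recover
    prod_db="$3"     # production database to copy into
    table_sql="/tmp/${table}.sql"

    run createdb "$TMP_DB" || return 1
    run pg_restore --exit-on-error --no-owner --dbname="$TMP_DB" "$dump_file" || return 1
    # Extract only this table's rows from the temporary database.
    run pg_dump --data-only --table="$table" --file="$table_sql" "$TMP_DB" || return 1
    # Load them into production in one transaction, aborting on any error.
    run psql --set=ON_ERROR_STOP=1 --single-transaction --dbname="$prod_db" --file="$table_sql" || return 1
}

drop_temp_db() {
    # Call only after the restored table has been verified.
    run dropdb "$TMP_DB"
}
```

Because the load runs in a single transaction with ON_ERROR_STOP set, a failure partway through leaves the production table unchanged.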

Verification

After any restore:

  1. Run application health checks
  2. Verify row counts for critical tables
  3. Test application connectivity and basic queries
  4. Monitor for errors in application logs

The health checks can be run with the disaster recovery script:

./scripts/disaster-recovery/health-check.sh
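
Row counts for critical tables can be compared against known values with a small helper such as the sketch below. It is separate from health-check.sh, whose contents are not shown here. The function names are hypothetical, and the expected counts are assumed to have been recorded before the restore.

```shell
#!/bin/sh
# Sketch of a row-count check after a restore. Function names are
# hypothetical; expected counts are assumed to be known in advance.

row_count() {
    # $1 = database, $2 = table; prints the table's row count.
    psql --no-align --tuples-only --dbname="$1" --command="SELECT count(*) FROM $2"
}

check_count() {
    # $1 = label, $2 = actual count, $3 = expected count
    if [ "$2" -eq "$3" ]; then
        printf 'OK   %s: %s rows\n' "$1" "$2"
    else
        printf 'FAIL %s: %s rows, expected %s\n' "$1" "$2" "$3"
        return 1
    fi
}
```

A call might look like check_count tenants "$(row_count matih_control_plane tenants)" 42, where the table name tenants and the count 42 are placeholders.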

Backup Monitoring

Metric           Alert Threshold     Description
Backup age       Over 2 hours        Last successful backup is too old
Backup size      Deviation over 50%  Unexpected backup size change
WAL archive lag  Over 100 segments   WAL archiving is falling behind
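
The three thresholds can be expressed as simple checks, sketched below. The function names are hypothetical. The size deviation is assumed to be measured against the previous backup, and WAL archive lag is assumed to be the number of .ready markers in pg_wal/archive_status, which PostgreSQL creates for segments still waiting to be archived.

```shell
#!/bin/sh
# Sketch of the three alert checks. Function names are hypothetical.
# Each check returns success (0) when the alert should fire.

backup_too_old() {
    # $1 = age of the last successful backup in seconds (2 hours = 7200).
    [ "$1" -gt 7200 ]
}

size_deviates() {
    # $1 = current size, $2 = previous size (same unit, previous > 0).
    diff=$(( $1 - $2 ))
    [ "$diff" -lt 0 ] && diff=$(( 0 - diff ))
    # diff / previous > 50% rewritten as integer arithmetic.
    [ $(( diff * 100 )) -gt $(( $2 * 50 )) ]
}

pending_wal_segments() {
    # $1 = data directory; counts segments not yet archived.
    find "$1/pg_wal/archive_status" -name '*.ready' | wc -l | tr -d ' '
}

wal_archive_lagging() {
    # $1 = number of pending segments.
    [ "$1" -gt 100 ]
}
```

For example, wal_archive_lagging "$(pending_wal_segments "$PGDATA")" succeeds once more than 100 segments are waiting to be archived.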