MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Cross-Region DR

Cross-region disaster recovery lets MATIH survive a complete regional failure by maintaining standby infrastructure in a secondary region. This page covers active-passive failover for databases, cross-region replication for object storage, and DNS-based traffic routing.


Architecture

+---------------------+         +---------------------+
|  Primary Region     |         |  Secondary Region   |
|  (Active)           |         |  (Passive Standby)  |
|                     |         |                     |
|  +-- K8s Cluster    |  Sync   |  +-- K8s Cluster    |
|  +-- PostgreSQL  ---|-------->|  +-- PostgreSQL     |
|  +-- Redis       ---|-------->|  +-- Redis          |
|  +-- Object Store---|-------->|  +-- Object Store   |
+---------------------+         +---------------------+
          |                              |
          +------------ DNS -------------+
          (Traffic Manager / Route 53)

Replication Strategy

Component          Replication Method                RPO
-----------------  --------------------------------  --------------
PostgreSQL         Streaming replication             Minutes
Redis              Redis replication                 Minutes
Object Storage     Cross-region replication          Near real-time
Kubernetes State   Velero to cross-region storage    Daily
Terraform State    Remote backend in both regions    On every apply

Failover Procedure

1. Detect Regional Failure

Regional failure is detected when:

  • Multiple ServiceDown alerts fire simultaneously
  • Health check endpoint is unreachable
  • Cloud provider status page reports regional issues
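
The signals above can be combined into a single decision. The sketch below is illustrative only: the RegionSignals type, the threshold of three simultaneous alerts, and the rule that at least two independent signals must agree are assumptions, not MATIH's actual alerting logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionSignals:
    """Snapshot of the three failure signals (hypothetical structure)."""
    service_down_alerts: int          # ServiceDown alerts firing right now
    health_check_reachable: bool      # result of probing the health endpoint
    provider_reports_incident: bool   # status page shows a regional issue


# Assumed threshold: how many simultaneous alerts count as "multiple".
MIN_SIMULTANEOUS_ALERTS = 3


def is_regional_failure(signals: RegionSignals) -> bool:
    """Return True when at least two of the three signals agree."""
    votes = [
        signals.service_down_alerts >= MIN_SIMULTANEOUS_ALERTS,
        not signals.health_check_reachable,
        signals.provider_reports_incident,
    ]
    return sum(votes) >= 2
```

Requiring two signals rather than one is a design choice that trades a slightly slower declaration for fewer false failovers caused by a single noisy alert or a transient probe error.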

2. Promote Secondary Database

Promote the PostgreSQL standby in the secondary region to primary.
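
One possible way to script the promotion is through PostgreSQL's built-in pg_is_in_recovery() and pg_promote() functions (pg_promote requires PostgreSQL 12 or later and sufficient privileges). The sketch assumes a DB-API cursor already connected to the standby and a driver that uses the %s parameter style, as psycopg does; the function name and return values are hypothetical.

```python
def promote_standby(cursor, wait_seconds: int = 60) -> str:
    """Promote the standby behind `cursor`; return what happened.

    Returns "already-primary" if the server is not in recovery,
    "promoted" on success, and raises RuntimeError on timeout.
    """
    # pg_is_in_recovery() is true on a standby and false on a primary.
    cursor.execute("SELECT pg_is_in_recovery()")
    (in_recovery,) = cursor.fetchone()
    if not in_recovery:
        return "already-primary"

    # pg_promote(wait, wait_seconds) returns true once promotion has
    # completed within the timeout, false otherwise.
    cursor.execute("SELECT pg_promote(true, %s)", (wait_seconds,))
    (promoted,) = cursor.fetchone()
    if not promoted:
        raise RuntimeError("promotion did not complete within timeout")
    return "promoted"
```

Before promoting, confirm that the old primary can no longer accept writes; otherwise both regions may diverge.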

3. Update DNS

Switch DNS traffic routing from the primary to the secondary region using the DNS provider's traffic management:

Provider   Method
---------  -------------------------------
Azure      Traffic Manager failover
AWS        Route 53 health check failover
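
For the AWS case, failover routing is configured as a pair of record sets marked PRIMARY and SECONDARY, with a health check attached to the primary. The sketch below only builds the change batch in the shape accepted by Route 53's ChangeResourceRecordSets API; it does not call AWS. The record name, targets, and health check ID are hypothetical placeholders.

```python
def failover_change_batch(name, primary_target, secondary_target,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 change batch for an active-passive CNAME pair."""

    def record(role, target, health_check_id=None):
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        # Only the primary carries a health check in this sketch.
        if health_check_id is not None:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ],
    }
```

With records like these in place, Route 53 answers with the secondary target once the primary's health check fails. A short TTL keeps clients from caching the old answer for long.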

4. Scale Secondary Services

Scale up services in the secondary region to handle production traffic.
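
As a rough sketch, the scale-up can be expressed as a set of kubectl scale commands generated from a table of target replica counts. The deployment names, namespace, and kubectl context used here are hypothetical; the real values depend on how the secondary cluster is configured.

```python
def scale_commands(replicas, namespace, context):
    """Return one `kubectl scale` argv list per deployment.

    `replicas` maps deployment name to the desired replica count.
    """
    return [
        ["kubectl", "scale", f"deployment/{name}",
         f"--replicas={count}",
         "--namespace", namespace,
         "--context", context]
        for name, count in sorted(replicas.items())
    ]
```

For example, scale_commands({"api-gateway": 3}, "matih", "secondary") yields a single command that can be passed to subprocess.run. Generating the commands separately from running them makes it easy to review them during a drill.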

5. Verify

Run health checks against the secondary region to confirm all services are operational.
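
A minimal verification sketch, assuming each service exposes an HTTP health endpoint that answers with a 2xx status when healthy. The endpoint URLs are supplied by the caller; none are defined by this page.

```python
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and timeouts are all OSError subclasses.
        return False


def failing_endpoints(urls, probe=http_ok):
    """Return the URLs whose probe failed, in the order given."""
    return [url for url in urls if not probe(url)]
```

An empty result means every endpoint responded. The probe parameter allows the check to be replaced in tests or extended with service-specific checks.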


Failback Procedure

Once the primary region is restored:

  1. Rebuild or verify the primary region infrastructure
  2. Resync databases from the secondary (now primary) region
  3. Verify data consistency
  4. Switch DNS back to the primary region
  5. Demote the secondary database back to standby
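
The consistency check in step 3 could, for instance, compare a per-table fingerprint collected from each region. The sketch below assumes the fingerprints (for example a row count paired with a checksum) have already been gathered into dictionaries keyed by table name; how they are computed is left open, and the function name is hypothetical.

```python
def diff_table_fingerprints(source, target):
    """Compare {table: fingerprint} maps from two regions.

    `source` is the region currently serving traffic, `target` the
    region being resynced. Returns {table: problem} for every
    discrepancy; an empty dict means the two maps agree.
    """
    problems = {}
    for table in sorted(set(source) | set(target)):
        if table not in target:
            problems[table] = "missing in target"
        elif table not in source:
            problems[table] = "unexpected in target"
        elif source[table] != target[table]:
            problems[table] = "fingerprint mismatch"
    return problems
```

Run the comparison while writes are paused or against a consistent snapshot; otherwise in-flight changes will show up as mismatches.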

Testing

Cross-region failover should be tested quarterly:

  1. Simulate primary region failure
  2. Execute failover procedure
  3. Run integration tests against the secondary region
  4. Verify RTO and RPO targets are met
  5. Execute failback procedure
  6. Document results and update procedures
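
Step 4 can be made concrete by recording three timestamps during the drill and deriving the achieved RTO and RPO from them. This is a sketch under stated assumptions: the DrillTimeline type is hypothetical, and the numeric targets are passed in by the caller because this page does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillTimeline:
    """Timestamps captured during a failover drill (hypothetical type)."""
    last_replicated_write: datetime  # newest write present in the secondary
    failure_start: datetime          # when the primary was taken down
    service_restored: datetime       # when the secondary passed health checks


def measure(timeline):
    """Return (achieved_rto, achieved_rpo) as timedeltas."""
    rto = timeline.service_restored - timeline.failure_start
    rpo = timeline.failure_start - timeline.last_replicated_write
    return rto, rpo


def meets_targets(timeline, rto_target, rpo_target):
    """Return True if both achieved values are within their targets."""
    rto, rpo = measure(timeline)
    return rto <= rto_target and rpo <= rpo_target
```

Recording the measured values alongside the targets in the drill report makes quarter-over-quarter regressions visible.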

Cost Considerations

Component               Running Cost                 Description
----------------------  ---------------------------  -----------------------------------
Standby K8s cluster     Reduced (minimal replicas)   Scale to minimum in standby
PostgreSQL replica      Full                         Must maintain streaming replication
Object storage replica  Storage cost only            Cross-region replication fees
DNS Traffic Manager     Low                          Per-query pricing