Cross-Region DR

Cross-region disaster recovery lets MATIH survive a complete regional failure by maintaining standby infrastructure in a secondary region. This section covers active-passive failover for databases, cross-region replication for object storage, and DNS-based traffic routing.

Architecture

+---------------------+         +---------------------+
|  Primary Region     |         |  Secondary Region   |
|  (Active)           |         |  (Passive Standby)  |
|                     |         |                     |
|  +-- K8s Cluster    |  Sync  |  +-- K8s Cluster    |
|  +-- PostgreSQL  ---|-------->|  +-- PostgreSQL     |
|  +-- Redis       ---|-------->|  +-- Redis          |
|  +-- Object Store---|-------->|  +-- Object Store   |
+---------------------+         +---------------------+
          |                              |
          +--------- DNS ---------------+
          (Traffic Manager / Route 53)

Replication Strategy

Component	Replication Method	RPO
PostgreSQL	Streaming replication	Minutes
Redis	Redis replication	Minutes
Object Storage	Cross-region replication	Near real-time
Kubernetes State	Velero backups to cross-region storage	Daily
Terraform State	Remote backend in both regions	On every apply
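
As an illustration of how the PostgreSQL RPO in the table could be monitored, the sketch below measures replay lag on the standby and compares it with a threshold. It assumes a Python DB-API connection to the standby (for example, one created with psycopg2) and a hypothetical 300-second threshold standing in for "Minutes"; neither is specified by this document.

```python
# Run on the standby: seconds since the last replayed transaction.
# Caveat: when the primary is idle, this value grows even if the standby
# is fully caught up, so treat it as an upper bound on data loss.
LAG_QUERY = "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"

RPO_SECONDS = 300  # hypothetical threshold; the document only says "Minutes"


def replica_lag_seconds(conn):
    """Return replay lag in seconds, or None if nothing has been replayed yet."""
    cur = conn.cursor()
    try:
        cur.execute(LAG_QUERY)
        row = cur.fetchone()
    finally:
        cur.close()
    if row is None or row[0] is None:
        return None
    return float(row[0])


def within_rpo(lag_seconds, rpo_seconds=RPO_SECONDS):
    """A missing measurement counts as a violation so it is never ignored."""
    return lag_seconds is not None and lag_seconds <= rpo_seconds
```

A monitoring job could call replica_lag_seconds on a schedule and raise an alert whenever within_rpo returns False.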

Failover Procedure

1. Detect Regional Failure

Regional failure is detected when:

Multiple ServiceDown alerts fire simultaneously
The health check endpoint is unreachable
The cloud provider's status page reports regional issues
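
The indicators above can be combined into a single decision, as in the sketch below. The document does not say whether one indicator is enough or all three are required, so the "at least two of three" policy and the MIN_ALERTS value are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RegionSignals:
    service_down_alerts: int       # ServiceDown alerts currently firing
    health_check_reachable: bool   # result of probing the health check endpoint
    provider_reports_issue: bool   # provider status page shows a regional issue


MIN_ALERTS = 3  # hypothetical reading of "multiple"


def regional_failure_suspected(signals, min_alerts=MIN_ALERTS):
    """Assumed policy: at least two of the three indicators must agree."""
    indicators = [
        signals.service_down_alerts >= min_alerts,
        not signals.health_check_reachable,
        signals.provider_reports_issue,
    ]
    return sum(indicators) >= 2
```

Requiring agreement between indicators reduces the chance of failing over because of a single noisy alert or a monitoring outage; a stricter or looser policy may suit a given deployment better.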

2. Promote Secondary Database

Promote the PostgreSQL standby in the secondary region to primary.
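
For a self-managed PostgreSQL standby (version 12 or later), promotion can be triggered over SQL with the built-in pg_promote() function. The sketch below assumes a DB-API connection that uses the %s parameter style (as psycopg2 does) and a role allowed to execute pg_promote(); managed database services usually expose promotion through their own API instead, which is not shown here.

```python
def promote_standby(conn, wait_seconds=60):
    """Promote a PostgreSQL standby to primary.

    Returns True if the server is already a primary or the promotion
    completed within wait_seconds, and False otherwise.
    """
    cur = conn.cursor()
    try:
        # pg_is_in_recovery() is true on a standby and false on a primary.
        cur.execute("SELECT pg_is_in_recovery()")
        if not cur.fetchone()[0]:
            return True  # already a primary; nothing to do
        # pg_promote(wait, wait_seconds) returns true once promotion completes.
        cur.execute("SELECT pg_promote(true, %s)", (wait_seconds,))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()
```

Checking pg_is_in_recovery() first keeps the function safe to rerun if the failover runbook is executed more than once.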

3. Update DNS

Route DNS traffic from the primary region to the secondary region using the provider's traffic management service:

Provider	Method
Azure	Traffic Manager failover
AWS	Route 53 health check failover
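
For the AWS row, a failover routing policy is expressed as a pair of record sets, one marked PRIMARY with a health check attached and one marked SECONDARY. The sketch below only builds the change batch in the shape accepted by the Route 53 ChangeResourceRecordSets API (for example via boto3's change_resource_record_sets); the function names are illustrative, and the record name, targets, and health check ID must be supplied by the caller. Azure Traffic Manager is not covered.

```python
def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        # Route 53 serves the secondary when this check reports unhealthy.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_target, secondary_target, health_check_id):
    """Build a change batch with a primary and a secondary record."""
    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", "primary",
                            health_check_id),
            failover_record(name, secondary_target, "SECONDARY", "secondary"),
        ],
    }
```

A short TTL limits how long clients keep resolving to the failed region after the switch; the value of 60 seconds used here is an assumption, not a documented setting.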

4. Scale Secondary Services

Scale up services in the secondary region to handle production traffic.
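
One way to script this step is to generate kubectl scale commands for each deployment in the secondary cluster. The sketch below assumes the workloads are Deployments; the context, namespace, deployment names, and replica counts are placeholders. It defaults to a dry run that only returns the commands.

```python
import subprocess


def scale_commands(context, namespace, replicas_by_deployment):
    """Build one `kubectl scale` command per deployment, in name order."""
    return [
        ["kubectl", "--context", context, "--namespace", namespace,
         "scale", f"deployment/{name}", f"--replicas={count}"]
        for name, count in sorted(replicas_by_deployment.items())
    ]


def scale_secondary(context, namespace, replicas_by_deployment, dry_run=True):
    """Return the commands; execute them only when dry_run is False."""
    commands = scale_commands(context, namespace, replicas_by_deployment)
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
    return commands
```

If autoscalers manage the deployments, raising their minimum replica count may be more appropriate than setting replicas directly.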

5. Verify

Run health checks against the secondary region to confirm all services are operational.
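
A minimal verification sketch, using only the Python standard library, is shown below. It treats any 2xx response as healthy and reports the services that failed; the service names and URLs are supplied by the caller and are not defined by this document.

```python
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts, and bad URLs.
        return False


def failed_services(endpoints, probe=http_ok):
    """Probe each named endpoint and return the sorted names that failed."""
    return sorted(name for name, url in endpoints.items() if not probe(url))
```

An empty list from failed_services indicates that every probed endpoint responded successfully. The probe parameter allows a different check, such as one that inspects the response body, to be substituted.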

Failback Procedure

Once the primary region is restored:

1. Rebuild or verify the primary region infrastructure
2. Resync databases from the secondary (now primary) region
3. Verify data consistency
4. Switch DNS back to the primary region
5. Demote the secondary database back to standby

Testing

Cross-region failover should be tested quarterly:

1. Simulate primary region failure
2. Execute the failover procedure
3. Run integration tests against the secondary region
4. Verify that recovery time objective (RTO) and recovery point objective (RPO) targets are met
5. Execute the failback procedure
6. Document results and update procedures
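
To make the RTO and RPO verification concrete, the sketch below computes both values from timestamps recorded during a drill. Here RTO is measured from the simulated failure until service is restored in the secondary region, and RPO as the age of the last replicated write at the moment of failure. The target values are passed in by the caller because this document does not state them.

```python
from datetime import datetime, timedelta


def drill_report(failure_at, last_replicated_at, service_restored_at,
                 rto_target, rpo_target):
    """Compare achieved recovery time and data-loss window with targets."""
    achieved_rto = service_restored_at - failure_at
    achieved_rpo = failure_at - last_replicated_at
    return {
        "rto": achieved_rto,
        "rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto_target,
        "rpo_met": achieved_rpo <= rpo_target,
    }
```

For example, a drill in which the last replicated write was 2 minutes before the failure and service returned 45 minutes after it would yield an RPO of 2 minutes and an RTO of 45 minutes, which can then be recorded in the drill results.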

Cost Considerations

Component	Running Cost	Description
Standby K8s cluster	Reduced (minimal replicas)	Scale to minimum in standby
PostgreSQL replica	Full	Must maintain streaming replication
Object storage replica	Storage plus transfer	Replica storage and cross-region replication fees
DNS Traffic Manager	Low	Per-query pricing
