Cross-Region DR
Cross-region disaster recovery lets MATIH survive a complete regional failure by maintaining standby infrastructure in a secondary region. The setup covers active-passive failover for databases, cross-region replication for object storage, and DNS-based traffic routing.
Architecture
+---------------------+         +---------------------+
|   Primary Region    |         |  Secondary Region   |
|      (Active)       |         |  (Passive Standby)  |
|                     |         |                     |
|  +-- K8s Cluster    |  Sync   |  +-- K8s Cluster    |
|  +-- PostgreSQL  ---|-------->|  +-- PostgreSQL     |
|  +-- Redis       ---|-------->|  +-- Redis          |
|  +-- Object Store---|-------->|  +-- Object Store   |
+---------------------+         +---------------------+
           |                               |
           +------------- DNS -------------+
             (Traffic Manager / Route 53)
Replication Strategy
| Component | Replication Method | RPO |
|---|---|---|
| PostgreSQL | Streaming replication | Minutes |
| Redis | Redis replication | Minutes |
| Object Storage | Cross-region replication | Near real-time |
| Kubernetes State | Velero to cross-region storage | Daily |
| Terraform State | Remote backend in both regions | On every apply |
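One way to read the RPO column is as a maximum acceptable replication lag per component. The sketch below checks observed lag against such limits. The numeric targets are hypothetical placeholders (the table only says "Minutes" or "Daily"), and the component keys and the `rpo_violations` helper are illustrative names, not part of MATIH.

```python
from datetime import timedelta

# Hypothetical numeric targets; the table above only gives "Minutes" or "Daily".
RPO_TARGETS = {
    "postgresql": timedelta(minutes=5),
    "redis": timedelta(minutes=5),
    "object_storage": timedelta(minutes=1),
    "kubernetes_state": timedelta(days=1),
}


def rpo_violations(observed_lag, targets=RPO_TARGETS):
    """Return the components whose observed replication lag exceeds their target."""
    return sorted(
        name
        for name, lag in observed_lag.items()
        if name in targets and lag > targets[name]
    )
```

How the lag is measured is left open here; for a PostgreSQL standby, `now() - pg_last_xact_replay_timestamp()` is one commonly used approximation.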
Failover Procedure
1. Detect Regional Failure
Regional failure is detected when:
- Multiple `ServiceDown` alerts fire simultaneously
- Health check endpoint is unreachable
- Cloud provider status page reports regional issues
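The three signals above could be combined into a single decision, for example by requiring agreement between them so that one noisy indicator does not trigger a failover. The following is a minimal sketch under that assumption. The `RegionSignals` type, the alert threshold, and the two-of-three rule are illustrative choices, not documented MATIH behavior.

```python
from dataclasses import dataclass


@dataclass
class RegionSignals:
    service_down_alerts: int         # number of ServiceDown alerts currently firing
    health_check_reachable: bool     # result of probing the health check endpoint
    provider_reports_incident: bool  # taken from the cloud provider status page


def is_regional_failure(signals, min_alerts=2, min_signals=2):
    """Treat the region as failed when at least `min_signals` indicators agree."""
    indicators = [
        signals.service_down_alerts >= min_alerts,
        not signals.health_check_reachable,
        signals.provider_reports_incident,
    ]
    return sum(indicators) >= min_signals
```

Lowering `min_signals` to 1 makes detection faster at the cost of more false positives, which matters because failover and failback are both expensive operations.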
2. Promote Secondary Database
Promote the PostgreSQL standby in the secondary region to primary.
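For a self-managed PostgreSQL standby, promotion can be done with `pg_ctl promote`. The sketch below assumes `pg_ctl` is on the `PATH` of the standby host and that the data directory path is known; both are assumptions, and managed database services expose their own promotion operation instead. The helper names are hypothetical.

```python
import subprocess


def promote_command(data_dir):
    """Build the `pg_ctl promote` invocation for a standby's data directory."""
    return ["pg_ctl", "promote", "-D", data_dir]


def promote_standby(data_dir, dry_run=True):
    """Promote the standby; with dry_run=True only return the command."""
    cmd = promote_command(data_dir)
    if not dry_run:
        # check=True raises CalledProcessError if pg_ctl exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd
```

After a successful promotion, `SELECT pg_is_in_recovery();` on the promoted server should return `false`.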
3. Update DNS
Switch DNS traffic routing from the primary to the secondary region using the DNS provider's traffic management:
| Provider | Method |
|---|---|
| Azure | Traffic Manager failover |
| AWS | Route 53 health check failover |
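For the AWS row, health check failover relies on a pair of failover records: a `PRIMARY` record tied to a health check and a `SECONDARY` record that answers when the primary is unhealthy. The sketch below only builds the change batch in the shape accepted by the Route 53 `ChangeResourceRecordSets` API (for example via boto3's `change_resource_record_sets`). The record name, IP addresses, set identifiers, and health check ID are placeholders, and the Azure Traffic Manager equivalent is not shown.

```python
def failover_record(name, target_ip, role, set_id, health_check_id=None, ttl=60):
    """Build one Route 53 failover record set; role is "PRIMARY" or "SECONDARY"."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Build a change batch that upserts the primary/secondary failover pair."""
    primary = failover_record(name, primary_ip, "PRIMARY", "primary", primary_health_check_id)
    secondary = failover_record(name, secondary_ip, "SECONDARY", "secondary")
    return {
        "Comment": "Active-passive failover records",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": primary},
            {"Action": "UPSERT", "ResourceRecordSet": secondary},
        ],
    }
```

A short TTL keeps resolvers from caching the primary address for long after a failover; the value of 60 seconds here is an illustrative default.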
4. Scale Secondary Services
Scale up services in the secondary region to handle production traffic.
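If the services run as Kubernetes Deployments, scaling up could be scripted as a series of `kubectl scale` calls. This sketch only generates the commands so they can be reviewed before execution. The deployment names, replica counts, and namespace passed in are hypothetical and would come from the production sizing of each service.

```python
def scale_commands(replicas_by_deployment, namespace):
    """Build one `kubectl scale` command per deployment, in name order."""
    return [
        [
            "kubectl", "scale", f"deployment/{name}",
            f"--replicas={count}", "--namespace", namespace,
        ]
        for name, count in sorted(replicas_by_deployment.items())
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the kubeconfig context points at the secondary cluster.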
5. Verify
Run health checks against the secondary region to confirm all services are operational.
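A verification pass might look like the sketch below, which probes a list of health endpoints and reports the ones that do not return HTTP 200. The endpoint URLs are not specified in this document, so any list passed in is an assumption, as is treating 200 as the only healthy status.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout.
        return None


def failed_health_checks(urls, fetch=http_status):
    """Return the URLs that did not answer with HTTP 200."""
    return [url for url in urls if fetch(url) != 200]
```

The `fetch` parameter exists so the check can be exercised without network access; an empty result means every endpoint responded as healthy.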
Failback Procedure
Once the primary region is restored:
- Rebuild or verify the primary region infrastructure
- Resync databases from the secondary (now primary) region
- Verify data consistency
- Switch DNS back to the primary region
- Demote the secondary database back to standby
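The data consistency step is not specified further here. One simple option, sketched below, is to compare per-table row counts collected from both regions after the resync. The table names and the way counts are gathered are assumptions, and matching row counts are only a coarse signal; per-table checksums would be a stricter check.

```python
def consistency_diff(source_counts, target_counts):
    """Return {table: (source_rows, target_rows)} for every table that differs."""
    tables = set(source_counts) | set(target_counts)
    return {
        table: (source_counts.get(table), target_counts.get(table))
        for table in sorted(tables)
        if source_counts.get(table) != target_counts.get(table)
    }
```

A table missing on one side shows up with `None` for that side, so an incomplete resync is reported rather than silently ignored.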
Testing
Cross-region failover should be tested quarterly:
- Simulate primary region failure
- Execute failover procedure
- Run integration tests against the secondary region
- Verify RTO and RPO targets are met
- Execute failback procedure
- Document results and update procedures
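To check the RTO and RPO step, the drill timeline can be reduced to two durations: time from failure to restored service (measured RTO) and time between the last replicated write and the failure (measured RPO). The sketch below assumes those three timestamps are recorded during the drill; the function names and any target values passed in are illustrative.

```python
from datetime import datetime, timedelta


def drill_metrics(failure_at, last_replicated_at, restored_at):
    """Return (measured RTO, measured RPO) for one failover drill."""
    measured_rto = restored_at - failure_at
    measured_rpo = failure_at - last_replicated_at
    return measured_rto, measured_rpo


def targets_met(measured_rto, measured_rpo, rto_target, rpo_target):
    """True when both measured durations are within their targets."""
    return measured_rto <= rto_target and measured_rpo <= rpo_target
```

Recording both values for every quarterly drill gives a trend that can be included in the documented results.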
Cost Considerations
| Component | Running Cost | Description |
|---|---|---|
| Standby K8s cluster | Reduced (minimal replicas) | Scale to minimum in standby |
| PostgreSQL replica | Full | Must maintain streaming replication |
| Object storage replica | Storage cost only | Cross-region replication fees |
| DNS Traffic Manager | Low | Per-query pricing |