DR Strategy Overview
MATIH's disaster recovery (DR) strategy covers database backups, configuration backups, cross-region failover, Kubernetes cluster-level backups with Velero, and chaos engineering for resilience testing. It targets a Recovery Point Objective (RPO) of 1 hour and a Recovery Time Objective (RTO) of 4 hours.
DR Objectives
| Metric | Target | Description |
|---|---|---|
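| RPO | 1 hour | Maximum acceptable data loss, measured as the age of the most recent recoverable backup |
| RTO | 4 hours | Maximum acceptable time to restore service; the Recovery Priority table lists per-tier targets |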
| Backup Frequency | Hourly (PostgreSQL), Daily (full) | How often backups are taken |
| Backup Retention | 30 days (7 days for Redis) | How long backups are kept |
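
One way to read the RPO target is as a limit on how old the newest usable backup may be. The sketch below illustrates that check. It is not MATIH's monitoring code: the function name `rpo_breached` and the example timestamps are hypothetical, and only the 1 hour target comes from the table above.

```python
from datetime import datetime, timedelta, timezone

# RPO target taken from the DR Objectives table.
RPO = timedelta(hours=1)


def rpo_breached(last_backup_at: datetime, now: datetime, rpo: timedelta = RPO) -> bool:
    """Return True when the newest backup is older than the RPO allows."""
    return now - last_backup_at > rpo


# Hypothetical timestamps, for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh_backup = now - timedelta(minutes=45)
stale_backup = now - timedelta(minutes=90)

print(rpo_breached(fresh_backup, now))  # False: within the 1 hour window
print(rpo_breached(stale_backup, now))  # True: 90 minutes of data would be lost
```

A check like this could feed an alert so that a stalled hourly PostgreSQL backup is noticed before a disaster exposes the gap.
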
Subsections
| Page | Description |
|---|---|
| PostgreSQL Backup | Database backup and restore procedures |
| Redis Backup | Redis cache and session backup |
| Config Backup | Configuration and secret backup |
| Cross-Region DR | Multi-region disaster recovery |
| Velero | Kubernetes cluster-level backup with Velero |
| Chaos Engineering | Resilience testing and failure injection |
Backup Components
| Component | Backup Method | Frequency | Retention |
|---|---|---|---|
| PostgreSQL | pg_dump / WAL archiving | Hourly | 30 days |
| Redis | RDB snapshots | Every 6 hours | 7 days |
| Kubernetes resources | Velero | Daily | 30 days |
| Helm values | Git repository | On every change | Indefinite |
| Secrets | External Secrets Operator | Synced from vault | As per vault policy |
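
The retention column implies a pruning rule: a backup becomes eligible for deletion once it is older than its component's retention window. The following is a minimal sketch of that rule, assuming backups are identified by their creation timestamps. The `RETENTION` mapping keys and the `expired_backups` helper are hypothetical names, and only the day counts come from the table above. Helm values and secrets are left out because their retention is governed by Git and the vault policy.

```python
from datetime import datetime, timedelta, timezone

# Retention windows copied from the Backup Components table.
RETENTION = {
    "postgresql": timedelta(days=30),
    "redis": timedelta(days=7),
    "kubernetes": timedelta(days=30),
}


def expired_backups(component, taken_at, now):
    """Return the backup timestamps that fall outside the component's retention window."""
    cutoff = now - RETENTION[component]
    return [ts for ts in taken_at if ts < cutoff]


# Hypothetical Redis snapshots taken 1, 8, and 10 days ago.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
snapshots = [now - timedelta(days=d) for d in (1, 8, 10)]

for ts in expired_backups("redis", snapshots, now):
    print("eligible for deletion:", ts.isoformat())
```

With the 7 day Redis window, the 8 and 10 day old snapshots are reported and the 1 day old snapshot is kept.
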
Recovery Priority
| Priority | Components | RTO |
|---|---|---|
| P1 | PostgreSQL (control plane), API Gateway | 1 hour |
| P2 | Redis, Kafka, AI Service | 2 hours |
| P3 | Monitoring stack, Dgraph | 4 hours |
| P4 | Analytics, historical data | 8 hours |
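
The priority table can be treated as data that drives a recovery runbook: restore tiers in priority order and compare each component's actual restore time against its tier's RTO. The sketch below assumes that reading. `RecoveryTier`, `recovery_order`, and `missed_rto` are hypothetical names, and the restore durations in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RecoveryTier:
    priority: int
    components: Tuple[str, ...]
    rto: timedelta


# Tiers copied from the Recovery Priority table.
TIERS = [
    RecoveryTier(1, ("PostgreSQL (control plane)", "API Gateway"), timedelta(hours=1)),
    RecoveryTier(2, ("Redis", "Kafka", "AI Service"), timedelta(hours=2)),
    RecoveryTier(3, ("Monitoring stack", "Dgraph"), timedelta(hours=4)),
    RecoveryTier(4, ("Analytics", "historical data"), timedelta(hours=8)),
]


def recovery_order(tiers: List[RecoveryTier]) -> List[str]:
    """List components in the order they should be restored (P1 first)."""
    ordered = sorted(tiers, key=lambda t: t.priority)
    return [name for tier in ordered for name in tier.components]


def missed_rto(tiers: List[RecoveryTier], restored_after: Dict[str, timedelta]) -> List[str]:
    """Return restored components whose time-to-restore exceeded their tier's RTO.

    restored_after maps a component name to the time elapsed between the start
    of the incident and the component's restoration. Components that are not in
    the mapping (not yet restored) are ignored.
    """
    ordered = sorted(tiers, key=lambda t: t.priority)
    return [
        name
        for tier in ordered
        for name in tier.components
        if name in restored_after and restored_after[name] > tier.rto
    ]


# Hypothetical drill results.
drill = {"API Gateway": timedelta(minutes=50), "Redis": timedelta(hours=3)}
print(recovery_order(TIERS)[:2])
print(missed_rto(TIERS, drill))
```

In this made-up drill the API Gateway meets its 1 hour target, while Redis takes 3 hours against a 2 hour target and is reported as a miss.
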