MATIH Platform is in active MVP development. Documentation reflects current implementation status.

DR Strategy Overview

MATIH's disaster recovery (DR) strategy covers database backups, configuration backups, cross-region failover, Kubernetes cluster-level backups with Velero, and chaos engineering for resilience testing. It targets a Recovery Point Objective (RPO) of 1 hour and a Recovery Time Objective (RTO) of 4 hours.


DR Objectives

| Metric | Target | Description |
| --- | --- | --- |
| RPO | 1 hour | Maximum data loss in a disaster |
| RTO | 4 hours | Maximum time to restore service |
| Backup Frequency | Hourly (PostgreSQL), Daily (full) | How often backups are taken |
| Backup Retention | 30 days | How long backups are kept |
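
As an illustration of how the RPO and RTO targets can be checked, the following is a minimal Python sketch. The function names `rpo_violated` and `rto_violated` are hypothetical and are not part of the MATIH codebase; only the 1-hour and 4-hour targets come from this page.

```python
from datetime import datetime, timedelta, timezone

# Targets taken from the DR Objectives on this page.
RPO = timedelta(hours=1)
RTO = timedelta(hours=4)


def rpo_violated(last_backup_at: datetime, now: datetime,
                 rpo: timedelta = RPO) -> bool:
    """Return True if a disaster at `now` would lose more data than the RPO allows.

    Hypothetical helper: the data at risk is everything written since the
    most recent successful backup.
    """
    return now - last_backup_at > rpo


def rto_violated(outage_started_at: datetime, service_restored_at: datetime,
                 rto: timedelta = RTO) -> bool:
    """Return True if restoring service took longer than the RTO allows."""
    return service_restored_at - outage_started_at > rto


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A backup taken 45 minutes ago is still inside the 1-hour RPO.
    print(rpo_violated(now - timedelta(minutes=45), now))
```

A check like this could run on a schedule and raise an alert when the newest backup is older than the RPO, rather than discovering the gap during an actual recovery.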

Subsections

| Page | Description |
| --- | --- |
| PostgreSQL Backup | Database backup and restore procedures |
| Redis Backup | Redis cache and session backup |
| Config Backup | Configuration and secret backup |
| Cross-Region DR | Multi-region disaster recovery |
| Velero | Kubernetes cluster-level backup with Velero |
| Chaos Engineering | Resilience testing and failure injection |

Backup Components

| Component | Backup Method | Frequency | Retention |
| --- | --- | --- | --- |
| PostgreSQL | pg_dump / WAL archiving | Hourly | 30 days |
| Redis | RDB snapshots | Every 6 hours | 7 days |
| Kubernetes resources | Velero | Daily | 30 days |
| Helm values | Git repository | On every change | Indefinite |
| Secrets | External Secrets Operator | Synced from vault | As per vault policy |
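
The schedule above can be expressed as data and used to flag overdue or expired backups. This is a minimal sketch under assumptions: `BackupPolicy`, `POLICIES`, `is_overdue`, and `expired_backups` are hypothetical names, and only the three components with a fixed interval and retention period are modelled. Helm values and Secrets are left out because their frequency and retention are event-driven or set by vault policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupPolicy:
    """Hypothetical record describing one backup component."""
    method: str
    frequency: timedelta   # maximum expected gap between backups
    retention: timedelta   # how long a backup is kept


# Values copied from the Backup Components on this page.
POLICIES = {
    "PostgreSQL": BackupPolicy("pg_dump / WAL archiving",
                               timedelta(hours=1), timedelta(days=30)),
    "Redis": BackupPolicy("RDB snapshots",
                          timedelta(hours=6), timedelta(days=7)),
    "Kubernetes resources": BackupPolicy("Velero",
                                         timedelta(days=1), timedelta(days=30)),
}


def is_overdue(component: str, last_backup_at: datetime, now: datetime) -> bool:
    """Return True if the newest backup is older than the component's frequency."""
    return now - last_backup_at > POLICIES[component].frequency


def expired_backups(component: str, backup_times: list[datetime],
                    now: datetime) -> list[datetime]:
    """Return the backup timestamps that have aged past the retention period."""
    cutoff = now - POLICIES[component].retention
    return [t for t in backup_times if t < cutoff]
```

Keeping the policy in one structure means a monitoring job and a pruning job can share the same numbers, so a change to the retention period is made in a single place.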

Recovery Priority

| Priority | Components | RTO |
| --- | --- | --- |
| P1 | PostgreSQL (control plane), API Gateway | 1 hour |
| P2 | Redis, Kafka, AI Service | 2 hours |
| P3 | Monitoring stack, Dgraph | 4 hours |
| P4 | Analytics, historical data | 8 hours |
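
The priority tiers above imply a restore order. The sketch below, with the hypothetical names `RECOVERY_TIERS`, `recovery_plan`, and `tiers_exceeding`, shows one way to turn the tiers into an ordered plan and to compare each tier's RTO against a target. It is an illustration of the ordering only, not MATIH's recovery tooling.

```python
from datetime import timedelta

# Tiers and RTOs copied from the Recovery Priority on this page.
RECOVERY_TIERS = {
    "P1": (["PostgreSQL (control plane)", "API Gateway"], timedelta(hours=1)),
    "P2": (["Redis", "Kafka", "AI Service"], timedelta(hours=2)),
    "P3": (["Monitoring stack", "Dgraph"], timedelta(hours=4)),
    "P4": (["Analytics", "historical data"], timedelta(hours=8)),
}


def recovery_plan(tiers=RECOVERY_TIERS):
    """Return (priority, component, rto) tuples, highest priority (P1) first."""
    plan = []
    # Sort by the numeric part of the label so "P10" would follow "P9".
    for priority in sorted(tiers, key=lambda p: int(p[1:])):
        components, rto = tiers[priority]
        for component in components:
            plan.append((priority, component, rto))
    return plan


def tiers_exceeding(target: timedelta, tiers=RECOVERY_TIERS):
    """Return the priority labels whose RTO is longer than `target`."""
    return [p for p, (_, rto) in tiers.items() if rto > target]


if __name__ == "__main__":
    for priority, component, rto in recovery_plan():
        print(priority, component, rto)
```

With the values listed here, `tiers_exceeding(timedelta(hours=4))` returns only `P4`, the one tier whose RTO is longer than the 4-hour overall target.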