MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Config Backup

Configuration backup covers Helm release values, Kubernetes resources such as ConfigMaps, Secrets, and CRDs, and Terraform state. Together these are what you need to rebuild the platform from scratch in a disaster recovery scenario.


What to Back Up

| Resource Type | Backup Method | Frequency |
| --- | --- | --- |
| Helm values files | Git repository | Every commit |
| Kubernetes ConfigMaps | Velero | Daily |
| Kubernetes Secrets | External Secrets Operator (synced from vault) | Continuous |
| CRDs (ServiceMonitors, Certificates) | Velero | Daily |
| Terraform state | Remote backend (Azure Storage / S3) | Every apply |

Helm Values

All Helm values files are stored in Git and represent the declarative state of the platform:

| Values File | Purpose | Path |
| --- | --- | --- |
| values.yaml | Base defaults | infrastructure/helm/{service}/values.yaml |
| values-dev.yaml | Dev overrides | infrastructure/helm/{service}/values-dev.yaml |
| values-prod.yaml | Prod overrides | infrastructure/helm/{service}/values-prod.yaml |

Because these files live in Git, every change is version-controlled and can be recovered from the remote repository without a separate backup job.
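The layering of base and environment values can be sketched as a small helper that builds a `helm upgrade --install` command without running it. This is a minimal sketch under assumptions: the release name is taken to equal the service name, the chart is assumed to live in the same `infrastructure/helm/{service}` directory as its values files, and `example-service` is a hypothetical service name.

```python
from typing import List

HELM_ROOT = "infrastructure/helm"  # path prefix from the values file table


def values_files(service: str, env: str) -> List[str]:
    """Return the values files for a service, base defaults first."""
    if env not in ("dev", "prod"):
        raise ValueError(f"unknown environment: {env}")
    chart_dir = f"{HELM_ROOT}/{service}"
    return [f"{chart_dir}/values.yaml", f"{chart_dir}/values-{env}.yaml"]


def helm_upgrade_command(service: str, env: str, namespace: str) -> List[str]:
    """Build (but do not run) a helm upgrade --install command.

    Assumption: release name == service name, chart path == values directory.
    """
    cmd = ["helm", "upgrade", "--install", service, f"{HELM_ROOT}/{service}",
           "--namespace", namespace]
    for path in values_files(service, env):
        # Helm gives precedence to the right-most -f file,
        # so the environment overrides come after the base defaults.
        cmd += ["-f", path]
    return cmd


if __name__ == "__main__":
    # "example-service" is a hypothetical service name.
    print(" ".join(helm_upgrade_command("example-service", "prod", "matih-control-plane")))
```

Passing the base file before the environment file matters: reversing the order would let the defaults overwrite the production overrides.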


ConfigMap Backup

ConfigMaps containing runtime configuration should be backed up via Velero (see Velero):

| ConfigMap | Namespace | Content |
| --- | --- | --- |
| Grafana dashboards | matih-monitoring | Dashboard JSON definitions |
| Prometheus rules | matih-monitoring | Alerting and recording rules |
| Tenant configurations | matih-control-plane | Per-tenant settings |
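A daily Velero schedule for these ConfigMaps might look like the command assembled below. This is a sketch, not the platform's actual schedule: the schedule name `matih-config-daily`, the 02:00 cron expression, and the 720h retention are assumptions chosen for illustration. Only the namespaces come from the table above.

```python
from typing import List, Optional


def velero_schedule_command(
    name: str,
    namespaces: List[str],
    resources: Optional[List[str]] = None,
    cron: str = "0 2 * * *",   # assumed: once a day at 02:00
    ttl: str = "720h",         # assumed: keep backups for 30 days
) -> List[str]:
    """Build (but do not run) a `velero schedule create` command."""
    resources = resources or ["configmaps"]
    return [
        "velero", "schedule", "create", name,
        "--schedule", cron,
        "--include-namespaces", ",".join(namespaces),
        "--include-resources", ",".join(resources),
        "--ttl", ttl,
    ]


if __name__ == "__main__":
    cmd = velero_schedule_command(
        "matih-config-daily",  # hypothetical schedule name
        ["matih-monitoring", "matih-control-plane"],
    )
    print(" ".join(cmd))
```

Restricting the schedule with `--include-resources` keeps the backup small. Dropping that flag would back up every resource in the listed namespaces instead.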

Secret Management

Secrets are never stored in Git. They are managed through:

| Method | Environment | Description |
| --- | --- | --- |
| External Secrets Operator | Production | Syncs from Azure Key Vault / AWS Secrets Manager |
| dev-secrets.sh | Development | Creates dev secrets from templates |
| Velero | All | Cluster-level backup includes encrypted secrets |
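When secrets come from External Secrets Operator, it helps to confirm that every ExternalSecret reports a `Ready` condition of `True`. The sketch below inspects the JSON that `kubectl get externalsecrets --all-namespaces -o json` would produce. The sample data and the secret names in it are hypothetical, and the helper is an illustration rather than part of the platform's tooling.

```python
from typing import Any, Dict, List


def unsynced_external_secrets(listing: Dict[str, Any]) -> List[str]:
    """Return "namespace/name" for each ExternalSecret that is not Ready."""
    failing = []
    for item in listing.get("items", []):
        meta = item.get("metadata", {})
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(f"{meta.get('namespace')}/{meta.get('name')}")
    return failing


# Hypothetical sample shaped like `kubectl get externalsecrets -A -o json`.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "matih-control-plane", "name": "api-keys"},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"namespace": "matih-control-plane", "name": "db-credentials"},
            "status": {"conditions": [{"type": "Ready", "status": "False"}]},
        },
    ]
}

if __name__ == "__main__":
    for ref in unsynced_external_secrets(SAMPLE):
        print(f"not synced: {ref}")
```

An ExternalSecret with no status yet, for example one created moments ago, is also reported as not synced, which is the conservative choice during a recovery.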

Terraform State

Terraform state is stored in a remote backend:

| Backend | Environment | Path |
| --- | --- | --- |
| Azure Storage | Azure | matih-tfstate container |
| S3 | AWS | matih-terraform-state bucket |
State locking prevents concurrent modifications.
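Pointing Terraform at the right backend during a rebuild could be sketched as follows, using `terraform init -backend-config=KEY=VALUE`. Only the container and bucket names come from the table above. Every other setting, such as the state key, storage account, or region, is deployment-specific and must be supplied by the caller, and the key shown in the example is hypothetical.

```python
from typing import Dict, List, Optional

# Known settings from the backend table; everything else is supplied by the caller.
BACKENDS: Dict[str, Dict[str, str]] = {
    "azure": {"container_name": "matih-tfstate"},
    "aws": {"bucket": "matih-terraform-state"},
}


def terraform_init_command(cloud: str, extra: Optional[Dict[str, str]] = None) -> List[str]:
    """Build (but do not run) a `terraform init` command with backend settings."""
    if cloud not in BACKENDS:
        raise ValueError(f"unknown cloud: {cloud}")
    settings = {**BACKENDS[cloud], **(extra or {})}
    cmd = ["terraform", "init"]
    for key, value in settings.items():
        cmd.append(f"-backend-config={key}={value}")
    return cmd


if __name__ == "__main__":
    # The state key below is a hypothetical example value.
    print(" ".join(terraform_init_command("aws", {"key": "prod/terraform.tfstate"})))
```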


Recovery Procedure

To rebuild the platform from backups:

  1. Provision infrastructure using Terraform
  2. Restore Kubernetes cluster resources from Velero backup
  3. Verify secrets are synced from the vault via ESO
  4. Deploy services using Helm with values from Git
  5. Restore databases from PostgreSQL backups
  6. Verify all health checks pass:
     ./scripts/disaster-recovery/health-check.sh
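The ordering of the steps above can be sketched as a small runbook driver that stops at the first failing step. This is an illustration under assumptions: the Terraform, Velero, and kubectl commands are plausible choices rather than the platform's documented ones, and the Helm deployment and database restore are left as manual steps because their commands are not specified here. Only the health-check script path comes from this page.

```python
from typing import Callable, List, Optional, Tuple

# A step is a label plus a command; None marks a manual step.
Step = Tuple[str, Optional[List[str]]]


def recovery_plan(backup_name: str) -> List[Step]:
    """Return the recovery steps in the order they should run."""
    return [
        ("Provision infrastructure", ["terraform", "apply"]),
        ("Restore cluster resources",
         ["velero", "restore", "create", "--from-backup", backup_name, "--wait"]),
        ("Verify secret sync",
         ["kubectl", "get", "externalsecrets", "--all-namespaces"]),
        ("Deploy services with Helm", None),            # manual: command not specified
        ("Restore databases from PostgreSQL backups", None),  # manual
        ("Run health checks", ["./scripts/disaster-recovery/health-check.sh"]),
    ]


def run_plan(steps: List[Step], execute: Callable[[List[str]], int]) -> List[str]:
    """Run steps in order and stop at the first non-zero exit code.

    Manual steps are only announced, never executed.
    Returns the labels of the steps that were passed.
    """
    passed = []
    for label, cmd in steps:
        if cmd is None:
            print(f"MANUAL: {label}")
        elif execute(cmd) != 0:
            print(f"FAILED: {label}")
            break
        passed.append(label)
    return passed


def dry_run(cmd: List[str]) -> int:
    """Print the command instead of running it."""
    print("would run:", " ".join(cmd))
    return 0


if __name__ == "__main__":
    # "daily-config" is a hypothetical Velero backup name.
    run_plan(recovery_plan("daily-config"), dry_run)
```

To execute for real, `dry_run` could be swapped for a callable such as `lambda cmd: subprocess.run(cmd).returncode`. Starting with the dry run makes it easy to review the command sequence before touching a cluster.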