MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Velero Backup Operator

Velero provides cluster-level backup and restore for Kubernetes resources and persistent volumes. MATIH uses Velero for daily backups of the Kubernetes resources in its platform namespaces, including Deployments, Services, ConfigMaps, Secrets, PersistentVolumeClaims, and custom resources such as Certificates and ServiceMonitors.


Installation

# Install Velero CLI (Homebrew)
brew install velero

# Install Velero server with Azure plugin and the node agent for file-system volume backups
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \
  --bucket matih-velero-backups \
  --backup-location-config resourceGroup=matih-rg,storageAccount=matihvelero \
  --secret-file ./credentials-velero \
  --use-node-agent

Backup Schedule

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: matih-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM UTC
  template:
    includedNamespaces:
      - matih-control-plane
      - matih-data-plane
      - matih-monitoring
    includedResources:
      - deployments
      - services
      - configmaps
      - secrets
      - persistentvolumeclaims
      - serviceaccounts
      - ingresses
      - certificates
      - servicemonitors
    ttl: 720h  # 30 days retention
    snapshotVolumes: true

Backup Types

| Type | Description | When to Use |
| --- | --- | --- |
| Full cluster | All namespaces and resources | Daily scheduled backup |
| Namespace | Specific namespace only | Before risky changes |
| Resource | Specific resource types | Targeted backup |
| Volume snapshot | PersistentVolume snapshots | Database volumes |
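
As an illustration of the "Resource" row, a targeted backup can also be declared as a Velero `Backup` resource instead of using the CLI. The sketch below is an assumption, not part of the MATIH configuration: the name `config-only-backup`, the choice of namespace and resource types, and the 7-day TTL are all hypothetical.

```yaml
# Hypothetical resource-scoped backup: only ConfigMaps and Secrets
# from one namespace, with no volume snapshots.
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: config-only-backup        # assumed name
  namespace: velero
spec:
  includedNamespaces:
    - matih-control-plane
  includedResources:
    - configmaps
    - secrets
  snapshotVolumes: false          # manifests only
  ttl: 168h                       # assumed 7-day retention
```

Applying this manifest with `kubectl apply -f` should start the backup once the Velero server picks it up.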

Manual Backup

Create a manual backup before risky operations:

velero backup create pre-upgrade-backup \
  --include-namespaces matih-control-plane,matih-data-plane \
  --snapshot-volumes \
  --wait

Restore Procedures

Full Cluster Restore

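# Scheduled backups are named <schedule>-<YYYYMMDDhhmmss>; confirm the exact name with `velero backup get`.
# Restores the control and data planes; add matih-monitoring to restore every backed-up namespace.
velero restore create --from-backup matih-daily-backup-20250615020000 \
  --include-namespaces matih-control-plane,matih-data-plane \
  --wait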

Namespace Restore

# Restore only matih-data-plane (exact backup name: see `velero backup get`)
velero restore create --from-backup matih-daily-backup-20250615020000 \
  --include-namespaces matih-data-plane \
  --wait

Selective Resource Restore

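# Restore only ConfigMaps and Secrets in matih-monitoring (exact backup name: see `velero backup get`).
# Objects that already exist are skipped by default; pass --existing-resource-policy update to overwrite them.
velero restore create --from-backup matih-daily-backup-20250615020000 \
  --include-resources configmaps,secrets \
  --include-namespaces matih-monitoring \
  --wait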

Backup Storage

| Provider | Storage | Encryption |
| --- | --- | --- |
| Azure | Blob Storage | SSE with customer-managed keys |
| AWS | S3 | SSE-KMS |
| GCP | Cloud Storage | CMEK |
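
For the Azure case, the storage target is represented in the cluster as a `BackupStorageLocation`. The sketch below shows roughly what the install command in the Installation section would be expected to produce; the values are taken from that command, and the resource name `default` is an assumption based on Velero's usual naming. Note that encryption with customer-managed keys is normally configured on the Azure storage account itself, not in this resource.

```yaml
# Sketch of the Azure backup storage location (values from the install command).
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default                   # assumed: name Velero gives the install-time location
  namespace: velero
spec:
  provider: azure
  default: true
  objectStorage:
    bucket: matih-velero-backups  # Azure blob container
  config:
    resourceGroup: matih-rg
    storageAccount: matihvelero
```

Running `velero backup-location get` should show whether the location is reported as available.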

Monitoring

| Metric | Alert Condition | Description |
| --- | --- | --- |
| Backup age | Over 48 hours | Last successful backup is too old |
| Backup size | Unexpected change | May indicate data loss or corruption |
| Backup failures | Any failure | Backup job did not complete |
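
One way to express these conditions is a Prometheus Operator `PrometheusRule` built on the metrics Velero exports. This is a minimal sketch under several assumptions: that Velero's metrics endpoint is already scraped, that rules in `matih-monitoring` are picked up by the Prometheus instance, and that a 50% day-over-day size change counts as "unexpected". The rule and alert names are hypothetical.

```yaml
# Hypothetical alert rules for the matih-daily-backup schedule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts      # assumed name
  namespace: matih-monitoring
spec:
  groups:
    - name: velero-backups
      rules:
        # Backup age: last successful backup is older than 48 hours.
        - alert: VeleroBackupTooOld
          expr: time() - velero_backup_last_successful_timestamp{schedule="matih-daily-backup"} > 48 * 3600
          for: 15m
          labels:
            severity: critical
        # Backup failures: any failed backup in the last 24 hours.
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total{schedule="matih-daily-backup"}[24h]) > 0
          labels:
            severity: critical
        # Backup size: more than 50% change versus one day earlier (assumed threshold).
        - alert: VeleroBackupSizeChanged
          expr: |
            abs(
              velero_backup_tarball_size_bytes{schedule="matih-daily-backup"}
              - (velero_backup_tarball_size_bytes{schedule="matih-daily-backup"} offset 1d)
            )
            / (velero_backup_tarball_size_bytes{schedule="matih-daily-backup"} offset 1d) > 0.5
          for: 30m
          labels:
            severity: warning
```

The size metric covers the backup tarball of resource manifests, so it is unlikely to reflect changes in volume snapshot contents; treat it as a coarse signal.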

Verification

Regularly test restores to verify backup integrity:

  1. Create a test namespace
  2. Restore a backup into the test namespace
  3. Verify resource counts and configurations
  4. Clean up the test namespace
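
Step 2 above can be sketched as a Velero `Restore` resource that maps a backed-up namespace onto a scratch namespace. The restore name, the target namespace `matih-restore-test`, and the backup name are assumptions for illustration; list the real backup names with `velero backup get`.

```yaml
# Hypothetical verification restore into a scratch namespace.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-verification      # assumed name
  namespace: velero
spec:
  backupName: matih-daily-backup-20250615020000  # assumed; use a real backup name
  includedNamespaces:
    - matih-data-plane
  namespaceMapping:
    matih-data-plane: matih-restore-test         # source: target
```

After the restore completes, `velero restore describe restore-verification` reports warnings and errors, and deleting the `matih-restore-test` namespace handles step 4. Resources that reference external names, such as Ingress hostnames, may conflict with the live environment, so a verification restore is safer in a non-production cluster.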