MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Runbooks Overview

Operational runbooks provide step-by-step procedures for common operational scenarios in the MATIH platform. Each runbook is linked from alert annotations so that on-call engineers can quickly find the relevant procedure when an alert fires.


Subsections

Page                 Description
Service Restart      Safely restarting services with pre and post checks
Database Recovery    PostgreSQL recovery procedures
Kafka Recovery       Kafka broker and consumer group recovery
Scaling Procedures   Horizontal and vertical scaling of services
Certificate Renewal  TLS certificate renewal procedures

Runbook Format

Each runbook follows a standard format:

  1. Symptoms -- What alerts or behaviors trigger this runbook
  2. Impact -- What is affected and to what degree
  3. Prerequisites -- Tools, access, and context needed
  4. Steps -- Numbered step-by-step procedure
  5. Verification -- How to confirm the issue is resolved
  6. Escalation -- When and how to escalate
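
As an illustration of the six-section layout above, the following is a minimal sketch of a helper that prints an empty runbook skeleton. The function name `runbook_skeleton`, the `TODO` placeholders, and the plain-text layout are assumptions made for this example, not part of the MATIH tooling.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MATIH scripts): print an empty
# runbook skeleton with the six standard sections in their fixed order.
runbook_skeleton() {
  # First argument is the runbook title; fall back to a placeholder.
  printf '%s\n\n' "${1:-Untitled runbook}"
  n=1
  for section in Symptoms Impact Prerequisites Steps Verification Escalation; do
    # One numbered heading per section, followed by a TODO placeholder.
    printf '%d. %s\n   TODO\n\n' "$n" "$section"
    n=$((n + 1))
  done
}
```

Calling `runbook_skeleton "Kafka Recovery"` prints the title followed by the numbered sections, each ready to be filled in.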

Mandatory Scripts

All operational actions must be performed through the approved scripts:

Operation                   Script
Check platform status       ./scripts/tools/platform-status.sh
Check service health        ./scripts/disaster-recovery/health-check.sh
Build and deploy a service  ./scripts/tools/service-build-deploy.sh
Full service rebuild        ./scripts/tools/full-service-rebuild.sh
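
One way to enforce the approved-scripts rule in wrapper tooling is a small guard that refuses to run anything outside the list above. The sketch below is illustrative only: the function names `is_approved_script` and `run_approved` and the exit codes are assumptions, and it assumes the scripts are invoked from the repository root using exactly the paths shown. Arguments are passed through unchanged because the scripts' own options are not documented here.

```shell
#!/bin/sh
# Hypothetical guard (not part of the MATIH scripts): only allow the
# approved script paths listed in this section to be executed.

# Return 0 if the given path is one of the approved scripts, 1 otherwise.
is_approved_script() {
  case "$1" in
    ./scripts/tools/platform-status.sh|\
    ./scripts/disaster-recovery/health-check.sh|\
    ./scripts/tools/service-build-deploy.sh|\
    ./scripts/tools/full-service-rebuild.sh)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Run an approved script, passing any remaining arguments through as-is.
# Exit codes 64 and 66 are arbitrary choices for this sketch.
run_approved() {
  if ! is_approved_script "$1"; then
    echo "refusing to run unapproved script: $1" >&2
    return 64
  fi
  if [ ! -x "$1" ]; then
    echo "approved script not found or not executable: $1" >&2
    return 66
  fi
  "$@"
}
```

For example, `run_approved ./scripts/tools/platform-status.sh` runs the status check, while `run_approved ./some-other-script.sh` is rejected before anything executes.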