Runbooks Overview
Runbooks provide step-by-step procedures for common operational scenarios in the MATIH platform. Each runbook is linked from alert annotations so that on-call engineers can quickly find the relevant procedure when an alert fires.
Subsections
| Page | Description |
|---|---|
| Service Restart | Safely restarting services with pre- and post-restart checks |
| Database Recovery | PostgreSQL recovery procedures |
| Kafka Recovery | Kafka broker and consumer group recovery |
| Scaling Procedures | Horizontal and vertical scaling of services |
| Certificate Renewal | TLS certificate renewal procedures |
Runbook Format
Each runbook follows a standard format:
- Symptoms -- What alerts or behaviors trigger this runbook
- Impact -- What is affected and to what degree
- Prerequisites -- Tools, access, and context needed
- Steps -- Numbered step-by-step procedure
- Verification -- How to confirm the issue is resolved
- Escalation -- When and how to escalate
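The standard format above lends itself to a simple lint check. The following is a minimal sketch, not part of the platform tooling: the function name `missing_sections` is hypothetical, and it assumes each section name appears at the start of a line in the runbook file, optionally preceded by heading or bullet markers. Adjust the pattern to match how runbooks are actually laid out.

```shell
#!/bin/sh
# Hypothetical lint helper: list the standard sections a runbook file lacks.
# Assumption: a section counts as present when its name starts a line,
# optionally after markdown heading/bullet markers ("## Steps", "- Steps --").

REQUIRED_SECTIONS="Symptoms Impact Prerequisites Steps Verification Escalation"

missing_sections() {
  local file="$1" section missing=""
  for section in $REQUIRED_SECTIONS; do
    # The trailing group ensures "Steps" does not match a longer word.
    if ! grep -qE "^[#*[:space:]-]*${section}([^[:alnum:]_]|$)" "$file"; then
      missing="${missing:+$missing }$section"
    fi
  done
  # Prints a space-separated list; an empty line means nothing is missing.
  printf '%s\n' "$missing"
}

# Usage (path is illustrative):
#   missing_sections docs/runbooks/service-restart.md
```

An empty result means the file contains all six sections. The check only confirms that the sections are present, not that their content is adequate.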
Mandatory Scripts
All operational actions must be performed through the approved scripts:
| Operation | Script |
|---|---|
| Check platform status | ./scripts/tools/platform-status.sh |
| Check service health | ./scripts/disaster-recovery/health-check.sh |
| Build and deploy a service | ./scripts/tools/service-build-deploy.sh |
| Full service rebuild | ./scripts/tools/full-service-rebuild.sh |
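One way to enforce the approved-scripts rule in automation is a small guard that refuses anything not in the table above. This is a sketch under stated assumptions: `is_approved` and `run_approved` are hypothetical helper names, the paths are taken verbatim from the table and assumed to be relative to the repository root, and no script arguments are shown because the source does not document any.

```shell
#!/bin/sh
# Hypothetical guard: only the script paths listed in the table may be run.

is_approved() {
  case "$1" in
    ./scripts/tools/platform-status.sh | \
    ./scripts/disaster-recovery/health-check.sh | \
    ./scripts/tools/service-build-deploy.sh | \
    ./scripts/tools/full-service-rebuild.sh)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

run_approved() {
  script="$1"
  shift
  if ! is_approved "$script"; then
    echo "refusing to run unapproved script: $script" >&2
    return 2
  fi
  # Remaining arguments are passed through unchanged.
  "$script" "$@"
}

# Usage (run from the repository root):
#   run_approved ./scripts/tools/platform-status.sh
```

The guard matches paths as exact strings, so an equivalent path written differently (for example an absolute path) is rejected. That strictness is deliberate in this sketch, since it keeps the approved list unambiguous.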