# Incident Response
MATIH follows a structured incident response process triggered by alerts. The process covers detection, triage, investigation, mitigation, resolution, and post-mortem review. All incidents are tracked with a severity classification and a documented timeline.
## Incident Severity
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Platform-wide outage affecting all tenants | Under 15 minutes | All services down, data loss |
| SEV-2 | Major feature unavailable for some tenants | Under 30 minutes | Query engine down, provisioning broken |
| SEV-3 | Minor feature degradation | Under 2 hours | Slow search, elevated error rate |
| SEV-4 | Cosmetic or non-impacting issue | Next business day | Dashboard rendering glitch |
## Response Process

### 1. Detection
Alerts fire through Prometheus rules and are routed to the appropriate channels. Critical alerts page the on-call engineer via PagerDuty.
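For reference, a page sent to PagerDuty goes through its Events API v2 as a JSON event. The sketch below builds such a payload; the routing key is a placeholder that would come from the PagerDuty service integration, and the exact wiring from Prometheus (usually via Alertmanager) is assumed:

```python
import json

# Public PagerDuty Events API v2 endpoint; events are POSTed here.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key: str, summary: str, source: str) -> str:
    """Build a PagerDuty Events API v2 'trigger' payload as JSON.
    `routing_key` is the integration key from the PagerDuty service."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            # PagerDuty's own severity scale (critical/error/warning/info),
            # distinct from the SEV-N classification in this document.
            "severity": "critical",
        },
    }
    return json.dumps(event)
```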
### 2. Acknowledgment
The on-call engineer acknowledges the alert within the response time SLA:
- PagerDuty: Acknowledge in the PagerDuty app
- Slack: React with a checkmark emoji and post "Investigating"
### 3. Triage
Assess the impact and assign severity:
- How many tenants are affected?
- Is data integrity at risk?
- Is there a workaround?
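The triage questions above can be reduced to a rough decision rule. This is only a sketch with illustrative thresholds; real triage remains a judgment call by the on-call engineer:

```python
def triage_severity(tenants_affected: int, total_tenants: int,
                    data_at_risk: bool, feature_down: bool) -> str:
    """Map the triage questions to a SEV level.
    Thresholds are illustrative, not an official policy."""
    if data_at_risk or tenants_affected == total_tenants:
        return "SEV-1"   # platform-wide impact or data integrity at risk
    if feature_down and tenants_affected > 0:
        return "SEV-2"   # major feature unavailable for some tenants
    if tenants_affected > 0:
        return "SEV-3"   # degradation with a workaround
    return "SEV-4"       # no tenant impact
```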
### 4. Investigation
Use the observability stack to diagnose the root cause:
- Check the Grafana dashboard linked in the alert annotation
- Examine relevant traces in Tempo
- Search logs in Loki for error details
- Check recent deployments or configuration changes
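Log searches in Loki go through its `query_range` HTTP API. The sketch below only builds the request URL; the `service` label in the selector is an assumption about how MATIH's log pipeline tags streams:

```python
from urllib.parse import urlencode

def loki_error_query_url(base_url: str, service: str,
                         start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a Loki /loki/api/v1/query_range URL that searches one
    service's logs for lines containing "error". Timestamps are in
    nanoseconds, as Loki expects."""
    params = {
        "query": f'{{service="{service}"}} |= "error"',
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"
```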
### 5. Mitigation
Apply immediate mitigation to restore service:
- Roll back recent deployment
- Scale up resources
- Restart affected services
- Enable circuit breakers or rate limiting
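As a sketch of the last mitigation, a circuit breaker opens after repeated failures and rejects calls until a cooldown has passed. The version below is minimal and illustrative, not MATIH's actual implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    failures the circuit opens, and calls are rejected until
    `reset_after` seconds have elapsed."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one attempt through after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```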
### 6. Resolution
Fully resolve the underlying issue and verify recovery:
- Deploy fix or configuration change
- Verify all health checks pass
- Monitor for recurrence
### 7. Post-Mortem
Conduct a blameless post-mortem within 48 hours:
- Timeline of events
- Root cause analysis
- Impact assessment
- Action items to prevent recurrence
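The four sections above could be stamped out as a document skeleton so every post-mortem starts from the same shape. The generator below is a suggestion, not a mandated template:

```python
def postmortem_skeleton(title: str, severity: str) -> str:
    """Generate a blameless post-mortem skeleton covering the four
    sections listed above. Layout is illustrative."""
    sections = [
        "Timeline of events",
        "Root cause analysis",
        "Impact assessment",
        "Action items to prevent recurrence",
    ]
    lines = [f"# Post-Mortem: {title}", f"Severity: {severity}", ""]
    for section in sections:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```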
## On-Call Rotation
| Team | Coverage | Escalation |
|---|---|---|
| Platform Team | 24/7 | Primary on-call, then secondary, then engineering manager |
| Data Plane Team | Business hours + pager | Primary on-call, then team lead |
| Control Plane Team | Business hours + pager | Primary on-call, then team lead |
## Runbook Integration
Every alert includes a `runbook_url` annotation linking to the relevant operational runbook. On-call engineers should follow the runbook as the first step of the investigation.
## Communication Template
For SEV-1 and SEV-2 incidents, post updates to the #matih-incidents Slack channel:
**Incident: [Brief description]**
**Severity:** SEV-[1/2]
**Status:** Investigating / Mitigating / Resolved
**Impact:** [Description of user impact]
**Started:** [Timestamp]
**Next update:** [Time of next update]
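The template above can be filled in programmatically before posting, so updates stay consistently formatted. The helper below mirrors the template's fields verbatim; the function name and signature are hypothetical:

```python
def incident_update(description: str, severity: int, status: str,
                    impact: str, started: str, next_update: str) -> str:
    """Render the incident communication template above.
    Field order and bold markers mirror the template exactly."""
    return (
        f"**Incident: {description}**\n"
        f"**Severity:** SEV-{severity}\n"
        f"**Status:** {status}\n"
        f"**Impact:** {impact}\n"
        f"**Started:** {started}\n"
        f"**Next update:** {next_update}"
    )
```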