# Incident Response
MATIH follows a structured incident response process triggered by alerts. The process covers detection, triage, investigation, mitigation, resolution, and post-mortem review. All incidents are tracked with a severity classification and a documented timeline.
## Incident Severity
| Severity | Definition | Response Time | Example |
|---|---|---|---|
| SEV-1 | Platform-wide outage affecting all tenants | Under 15 minutes | All services down, data loss |
| SEV-2 | Major feature unavailable for some tenants | Under 30 minutes | Query engine down, provisioning broken |
| SEV-3 | Minor feature degradation | Under 2 hours | Slow search, elevated error rate |
| SEV-4 | Cosmetic or non-impacting issue | Next business day | Dashboard rendering glitch |
## Response Process

### 1. Detection
Alerts fire through Prometheus rules and are routed to the appropriate channels. Critical alerts page the on-call engineer via PagerDuty.
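For reference, a page sent to PagerDuty goes through its Events API v2 as a JSON event. The sketch below builds such a payload; the routing key is a placeholder that would come from the PagerDuty service integration, and the exact wiring from Prometheus (usually via Alertmanager) is assumed:

```python
import json

# Public PagerDuty Events API v2 endpoint; events are POSTed here.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key: str, summary: str, source: str) -> str:
    """Build a PagerDuty Events API v2 'trigger' payload as JSON.
    `routing_key` is the integration key from the PagerDuty service."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            # PagerDuty's own severity scale (critical/error/warning/info),
            # distinct from the SEV-N classification in this document.
            "severity": "critical",
        },
    }
    return json.dumps(event)
```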
### 2. Acknowledgment
The on-call engineer acknowledges the alert within the response time SLA:
- PagerDuty: Acknowledge in the PagerDuty app
- Slack: React with a checkmark emoji and post "Investigating"
### 3. Triage
Assess the impact and assign severity:
- How many tenants are affected?
- Is data integrity at risk?
- Is there a workaround?
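The triage questions above can be reduced to a rough decision rule. This is only a sketch with illustrative thresholds; real triage remains a judgment call by the on-call engineer:

```python
def triage_severity(tenants_affected: int, total_tenants: int,
                    data_at_risk: bool, feature_down: bool) -> str:
    """Map the triage questions to a SEV level.
    Thresholds are illustrative, not an official policy."""
    if data_at_risk or tenants_affected == total_tenants:
        return "SEV-1"   # platform-wide impact or data integrity at risk
    if feature_down and tenants_affected > 0:
        return "SEV-2"   # major feature unavailable for some tenants
    if tenants_affected > 0:
        return "SEV-3"   # degradation with a workaround
    return "SEV-4"       # no tenant impact
```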
### 4. Investigation
Use the observability stack to diagnose the root cause:
- Check the Grafana dashboard linked in the alert annotation
- Examine relevant traces in Tempo
- Search logs in Loki for error details
- Check recent deployments or configuration changes
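Log searches in Loki go through its `query_range` HTTP API. The sketch below only builds the request URL; the `service` label in the selector is an assumption about how MATIH's log pipeline tags streams:

```python
from urllib.parse import urlencode

def loki_error_query_url(base_url: str, service: str,
                         start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Build a Loki /loki/api/v1/query_range URL that searches one
    service's logs for lines containing "error". Timestamps are in
    nanoseconds, as Loki expects."""
    params = {
        "query": f'{{service="{service}"}} |= "error"',
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"
```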
### 5. Mitigation
Apply immediate mitigation to restore service:
- Roll back recent deployment
- Scale up resources
- Restart affected services
- Enable circuit breakers or rate limiting
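As a sketch of the last mitigation, a circuit breaker opens after repeated failures and rejects calls until a cooldown has passed. The version below is minimal and illustrative, not MATIH's actual implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    failures the circuit opens, and calls are rejected until
    `reset_after` seconds have elapsed."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one attempt through after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```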
### 6. Resolution
Fully resolve the underlying issue and verify recovery:
- Deploy fix or configuration change
- Verify all health checks pass
- Monitor for recurrence
### 7. Post-Mortem
Conduct a blameless post-mortem within 48 hours:
- Timeline of events
- Root cause analysis
- Impact assessment
- Action items to prevent recurrence
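The four sections above could be stamped out as a document skeleton so every post-mortem starts from the same shape. The generator below is a suggestion, not a mandated template:

```python
def postmortem_skeleton(title: str, severity: str) -> str:
    """Generate a blameless post-mortem skeleton covering the four
    sections listed above. Layout is illustrative."""
    sections = [
        "Timeline of events",
        "Root cause analysis",
        "Impact assessment",
        "Action items to prevent recurrence",
    ]
    lines = [f"# Post-Mortem: {title}", f"Severity: {severity}", ""]
    for section in sections:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```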
## On-Call Rotation
| Team | Coverage | Escalation |
|---|---|---|
| Platform Team | 24/7 | Primary on-call, then secondary, then engineering manager |
| Data Plane Team | Business hours + pager | Primary on-call, then team lead |
| Control Plane Team | Business hours + pager | Primary on-call, then team lead |
## Runbook Integration
Every alert includes a `runbook_url` annotation linking to the relevant operational runbook. On-call engineers should follow the runbook as the first step of the investigation.
## Communication Template
For SEV-1 and SEV-2 incidents, post updates to the #matih-incidents Slack channel:
**Incident: [Brief description]**
**Severity:** SEV-[1/2]
**Status:** Investigating / Mitigating / Resolved
**Impact:** [Description of user impact]
**Started:** [Timestamp]
**Next update:** [Time of next update]
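The template above can be filled in programmatically before posting, so updates stay consistently formatted. The helper below mirrors the template's fields verbatim; the function name and signature are hypothetical:

```python
def incident_update(description: str, severity: int, status: str,
                    impact: str, started: str, next_update: str) -> str:
    """Render the incident communication template above.
    Field order and bold markers mirror the template exactly."""
    return (
        f"**Incident: {description}**\n"
        f"**Severity:** SEV-{severity}\n"
        f"**Status:** {status}\n"
        f"**Impact:** {impact}\n"
        f"**Started:** {started}\n"
        f"**Next update:** {next_update}"
    )
```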