MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Incident Response

MATIH follows a structured incident response process triggered by alerts. The process covers detection, triage, investigation, mitigation, resolution, and post-mortem review. All incidents are tracked with a severity classification and documented timeline.


Incident Severity

| Severity | Definition | Response Time | Example |
|----------|------------|---------------|---------|
| SEV-1 | Platform-wide outage affecting all tenants | Under 15 minutes | All services down, data loss |
| SEV-2 | Major feature unavailable for some tenants | Under 30 minutes | Query engine down, provisioning broken |
| SEV-3 | Minor feature degradation | Under 2 hours | Slow search, elevated error rate |
| SEV-4 | Cosmetic or non-impacting issue | Next business day | Dashboard rendering glitch |
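The classification rules in the table can be sketched as a small triage helper. This is an illustrative sketch only; the function and dictionary names are hypothetical, not part of the MATIH codebase.

```python
def classify_severity(all_tenants_affected: bool,
                      major_feature_down: bool,
                      degraded: bool) -> str:
    """Map observed impact to a severity level per the table above.

    Checks are ordered from most to least severe, so the worst
    matching condition wins.
    """
    if all_tenants_affected:
        return "SEV-1"
    if major_feature_down:
        return "SEV-2"
    if degraded:
        return "SEV-3"
    return "SEV-4"


# Response-time SLA per severity, mirroring the table above.
RESPONSE_TIME = {
    "SEV-1": "under 15 minutes",
    "SEV-2": "under 30 minutes",
    "SEV-3": "under 2 hours",
    "SEV-4": "next business day",
}
```

Ordering the checks from most to least severe means an incident that matches several rows (e.g. a platform-wide outage also degrades features) is always assigned the highest applicable severity.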

Response Process

1. Detection

Alerts fire through Prometheus rules and are routed to the appropriate channels. Critical alerts page the on-call engineer via PagerDuty.

2. Acknowledgment

The on-call engineer acknowledges the alert within the response time SLA:

  • PagerDuty: Acknowledge in the PagerDuty app
  • Slack: React with a checkmark emoji and post "Investigating"

3. Triage

Assess the impact and assign severity:

  • How many tenants are affected?
  • Is data integrity at risk?
  • Is there a workaround?

4. Investigation

Use the observability stack to diagnose the root cause:

  1. Check the Grafana dashboard linked in the alert annotation
  2. Examine relevant traces in Tempo
  3. Search logs in Loki for error details
  4. Check recent deployments or configuration changes

5. Mitigation

Apply immediate mitigation to restore service:

  • Roll back recent deployment
  • Scale up resources
  • Restart affected services
  • Enable circuit breakers or rate limiting

6. Resolution

Fully resolve the underlying issue and verify recovery:

  • Deploy fix or configuration change
  • Verify all health checks pass
  • Monitor for recurrence

7. Post-Mortem

Conduct a blameless post-mortem within 48 hours:

  • Timeline of events
  • Root cause analysis
  • Impact assessment
  • Action items to prevent recurrence

On-Call Rotation

| Team | Coverage | Escalation |
|------|----------|------------|
| Platform Team | 24/7 | Primary on-call, then secondary, then engineering manager |
| Data Plane Team | Business hours + pager | Primary on-call, then team lead |
| Control Plane Team | Business hours + pager | Primary on-call, then team lead |
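The escalation chains in the table can be expressed as simple ordered lists. The team keys and function below are hypothetical, shown only to make the escalation order concrete.

```python
# Escalation chains mirroring the on-call table above; keys are illustrative.
ESCALATION = {
    "platform": ["primary on-call", "secondary on-call", "engineering manager"],
    "data-plane": ["primary on-call", "team lead"],
    "control-plane": ["primary on-call", "team lead"],
}


def next_contact(team: str, unanswered_pages: int) -> str:
    """Return who to page after a number of unanswered pages.

    Escalation stops at the last person in the chain rather than
    wrapping around.
    """
    chain = ESCALATION[team]
    return chain[min(unanswered_pages, len(chain) - 1)]
```

Capping the index at the end of the chain means repeated unanswered pages keep targeting the final escalation contact rather than cycling back to the primary.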

Runbook Integration

Every alert includes a `runbook_url` annotation linking to the relevant operational runbook. On-call engineers should follow the runbook as the first step in investigation.
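A Prometheus alerting rule carrying such an annotation might look like the sketch below. The alert name, job label, severity value, and URL are illustrative assumptions, not actual MATIH configuration.

```yaml
groups:
  - name: matih-platform
    rules:
      - alert: QueryEngineDown            # hypothetical alert name
        expr: up{job="query-engine"} == 0 # job label is an assumption
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Query engine is unreachable"
          runbook_url: "https://runbooks.example.com/query-engine-down"
```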


Communication Template

For SEV-1 and SEV-2 incidents, post updates to the #matih-incidents Slack channel:

**Incident: [Brief description]**
**Severity:** SEV-[1/2]
**Status:** Investigating / Mitigating / Resolved
**Impact:** [Description of user impact]
**Started:** [Timestamp]
**Next update:** [Time of next update]
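For posting consistent updates, the template above can be rendered by a small helper. This is a convenience sketch; the function name and parameters are hypothetical.

```python
def format_incident_update(description: str, severity: int, status: str,
                           impact: str, started: str, next_update: str) -> str:
    """Render the Slack incident-update template shown above."""
    return (
        f"**Incident: {description}**\n"
        f"**Severity:** SEV-{severity}\n"
        f"**Status:** {status}\n"
        f"**Impact:** {impact}\n"
        f"**Started:** {started}\n"
        f"**Next update:** {next_update}"
    )
```

Keeping the template in one place helps updates stay uniform across responders during a fast-moving incident.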