Incident Management

The Incident Management page provides tools for tracking, responding to, and resolving platform incidents. It supports the full incident lifecycle, from detection through resolution and postmortem, and integrates with alerting, on-call schedules, and runbook automation.

Incident Lifecycle

Stage	Status	Actions Available
Detection	`triggered`	Created automatically from an alert, or created manually
Acknowledgement	`acknowledged`	Assign responder, set severity
Investigation	`investigating`	Add notes, link evidence, run diagnostics
Mitigation	`mitigating`	Apply fix, scale resources, rollback
Resolution	`resolved`	Mark resolved, set root cause
Postmortem	`postmortem`	Write postmortem, record action items
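
The statuses above form an ordered sequence. The following is a minimal sketch of how a client might check a status change against that order. It assumes incidents advance strictly one stage at a time, which the table does not state (skipping or reopening may well be allowed), and `nextStatus` and `canTransition` are hypothetical helper names rather than part of the platform API.

```typescript
// Lifecycle statuses, in the order listed in the table above.
const LIFECYCLE = [
  'triggered',
  'acknowledged',
  'investigating',
  'mitigating',
  'resolved',
  'postmortem',
] as const;

type IncidentStatus = (typeof LIFECYCLE)[number];

// Assumption: an incident only moves forward, one stage at a time.
// Returns null when the incident is already in the final stage.
function nextStatus(current: IncidentStatus): IncidentStatus | null {
  const index = LIFECYCLE.indexOf(current);
  return LIFECYCLE[index + 1] ?? null;
}

// True only when `to` is the stage immediately after `from`.
function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return nextStatus(from) === to;
}
```

Under this assumption, `canTransition('triggered', 'acknowledged')` is accepted while `canTransition('triggered', 'resolved')` is rejected.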

Incident List

The main view shows active and recent incidents:

Column	Description	Sortable
ID	Incident identifier	Yes
Title	Brief incident description	Yes
Severity	Critical, High, Medium, Low	Yes
Status	Current lifecycle stage	Yes
Assignee	Current responder	Yes
Duration	Time since creation (open incidents) or time to resolution (resolved incidents)	Yes
Affected Services	Services impacted	No
Created	Incident creation time	Yes
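
Two of the columns above need more than a plain string comparison: Severity sorts by rank rather than alphabetically, and Duration depends on whether the incident is resolved. The sketch below illustrates both. The rank order (Critical first) and the `resolvedAt` parameter are assumptions for illustration, and the function names are hypothetical.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Assumed rank order: most severe first.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// Comparator that sorts the most severe incidents to the top.
function compareSeverity(a: Severity, b: Severity): number {
  return SEVERITY_RANK[a] - SEVERITY_RANK[b];
}

// Duration in milliseconds: time to resolution when resolved,
// otherwise time elapsed since creation.
function incidentDurationMs(
  createdAt: string,
  resolvedAt?: string,
  now: Date = new Date(),
): number {
  const end = resolvedAt !== undefined ? Date.parse(resolvedAt) : now.getTime();
  return end - Date.parse(createdAt);
}
```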

Create Incident

interface CreateIncidentRequest {
  title: string;
  description: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  affected_services: string[];
  assignee?: string;
  related_alerts?: string[];
}
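
As an illustration of the request shape above, the sketch below builds a request and runs a simple client-side check before submission. The validation rules (non-empty title, at least one affected service) are assumptions rather than documented constraints, `validateCreateIncident` is a hypothetical helper, and the service and alert identifiers are made up.

```typescript
interface CreateIncidentRequest {
  title: string;
  description: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  affected_services: string[];
  assignee?: string;
  related_alerts?: string[];
}

// Hypothetical pre-submit check. Returns a list of problems,
// which is empty when the request passes the assumed rules.
function validateCreateIncident(req: CreateIncidentRequest): string[] {
  const problems: string[] = [];
  if (req.title.trim() === '') {
    problems.push('title must not be empty');
  }
  if (req.affected_services.length === 0) {
    problems.push('at least one affected service is required');
  }
  return problems;
}

// Example request; assignee is omitted because it is optional.
const exampleRequest: CreateIncidentRequest = {
  title: 'Query Engine connection pool exhausted',
  description: 'Queries are timing out while waiting for a connection.',
  severity: 'high',
  affected_services: ['query-engine'], // illustrative service ID
  related_alerts: ['alert-123'], // illustrative alert ID
};
```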

Incident Detail View

The detail view provides a timeline-based incident workspace:

Section	Content
Summary	Title, severity, status, assignee, duration
Timeline	Chronological event log with notes and actions
Evidence	Linked metrics, logs, traces, and screenshots
Affected Services	List of impacted services with health status
Communication	Status updates sent to stakeholders
Action Items	Tasks generated during investigation

Severity Definitions

Severity	Impact	Response Time	Notification
Critical	Platform-wide outage	Immediate	Page on-call, notify leadership
High	Major feature unavailable	15 minutes	Page on-call
Medium	Degraded performance	1 hour	Notify team channel
Low	Minor issue, workaround exists	Next business day	Team ticket
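
The severity table can be expressed as a lookup keyed by severity, which is one way an alert router might decide whether to page. This is a sketch only: `SeverityPolicy`, `SEVERITY_POLICY`, and `pagesOnCall` are hypothetical names, and modelling "Immediate" as 0 minutes is an assumption.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface SeverityPolicy {
  impact: string;
  responseTarget: string;
  // Target in minutes where the table gives a numeric value; null otherwise.
  responseMinutes: number | null;
  notifications: string[];
}

const SEVERITY_POLICY: Record<Severity, SeverityPolicy> = {
  critical: {
    impact: 'Platform-wide outage',
    responseTarget: 'Immediate',
    responseMinutes: 0, // assumption: "Immediate" modelled as 0 minutes
    notifications: ['Page on-call', 'Notify leadership'],
  },
  high: {
    impact: 'Major feature unavailable',
    responseTarget: '15 minutes',
    responseMinutes: 15,
    notifications: ['Page on-call'],
  },
  medium: {
    impact: 'Degraded performance',
    responseTarget: '1 hour',
    responseMinutes: 60,
    notifications: ['Notify team channel'],
  },
  low: {
    impact: 'Minor issue, workaround exists',
    responseTarget: 'Next business day',
    responseMinutes: null, // depends on the calendar, not a fixed interval
    notifications: ['Team ticket'],
  },
};

// True when the severity's notification policy includes paging on-call.
function pagesOnCall(severity: Severity): boolean {
  return SEVERITY_POLICY[severity].notifications.some((n) => n === 'Page on-call');
}
```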

Runbook Integration

Incidents can be linked to runbooks that provide step-by-step resolution guides:

Service	Common Runbooks
AI Service	LLM provider failover, high latency investigation
Query Engine	Slow query investigation, connection pool exhaustion
Kafka	Consumer lag resolution, partition rebalance
PostgreSQL	Connection limit reached, replication lag

Status Page Updates

Incident status updates are published to stakeholders:

interface StatusUpdate {
  incident_id: string;
  message: string;
  status: string;
  visibility: 'internal' | 'external';
  created_by: string;
  created_at: string;
}
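
A short sketch of how a `StatusUpdate` might be assembled and filtered by visibility. It assumes `created_at` holds an ISO 8601 timestamp and defaults `visibility` to `internal`; neither is stated above. `buildStatusUpdate` and `externalUpdates` are hypothetical helpers.

```typescript
interface StatusUpdate {
  incident_id: string;
  message: string;
  status: string;
  visibility: 'internal' | 'external';
  created_by: string;
  created_at: string;
}

// Builds an update. Assumption: created_at is an ISO 8601 string,
// and updates are internal unless explicitly marked external.
function buildStatusUpdate(
  incidentId: string,
  message: string,
  status: string,
  createdBy: string,
  visibility: StatusUpdate['visibility'] = 'internal',
  now: Date = new Date(),
): StatusUpdate {
  return {
    incident_id: incidentId,
    message,
    status,
    visibility,
    created_by: createdBy,
    created_at: now.toISOString(),
  };
}

// Keeps only the updates intended for external stakeholders.
function externalUpdates(updates: StatusUpdate[]): StatusUpdate[] {
  return updates.filter((u) => u.visibility === 'external');
}
```

Defaulting to `internal` is a deliberately conservative choice in this sketch, so that an update reaches external stakeholders only when the caller asks for it.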

Postmortem Template

After resolution, a structured postmortem is generated:

Section	Content
Summary	What happened, duration, impact
Timeline	Key events in chronological order
Root Cause	Technical root cause analysis
Impact	Users affected, data impact, financial impact
Resolution	Steps taken to resolve
Action Items	Preventive measures with owners and due dates
Lessons Learned	What went well, what could improve

Metrics

Metric	Description
MTTR	Mean time to resolution by severity
MTTA	Mean time to acknowledgement
Incident frequency	Incidents per week by severity
SLA compliance	Percentage of incidents meeting the response time SLA for their severity
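
To make the MTTA and MTTR definitions concrete, the sketch below computes both from incident timestamps. It assumes both intervals are measured from incident creation and that timestamps are ISO 8601 strings; the `IncidentTimes` shape and its field names are hypothetical.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Hypothetical record holding the timestamps the metrics need.
interface IncidentTimes {
  severity: Severity;
  created_at: string;
  acknowledged_at?: string;
  resolved_at?: string;
}

function minutesBetween(start: string, end: string): number {
  return (Date.parse(end) - Date.parse(start)) / 60000;
}

// Mean minutes from creation to the given timestamp field, skipping
// incidents that have not reached that point. Null when none qualify.
function meanMinutes(
  incidents: IncidentTimes[],
  endField: 'acknowledged_at' | 'resolved_at',
): number | null {
  const durations: number[] = [];
  for (const incident of incidents) {
    const end = incident[endField];
    if (end !== undefined) {
      durations.push(minutesBetween(incident.created_at, end));
    }
  }
  if (durations.length === 0) return null;
  return durations.reduce((sum, d) => sum + d, 0) / durations.length;
}

// MTTA across all incidents.
function mtta(incidents: IncidentTimes[]): number | null {
  return meanMinutes(incidents, 'acknowledged_at');
}

// MTTR restricted to a single severity.
function mttr(incidents: IncidentTimes[], severity: Severity): number | null {
  return meanMinutes(
    incidents.filter((i) => i.severity === severity),
    'resolved_at',
  );
}
```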
