Incident Management
The Incident Management page provides tools for tracking, responding to, and resolving platform incidents. It supports the full incident lifecycle, from detection through resolution and postmortem, and integrates with alerting, on-call schedules, and runbook automation.
Incident Lifecycle
| Stage | Status | Actions Available |
|---|---|---|
| Detection | triggered | Created automatically from an alert, or created manually |
| Acknowledgement | acknowledged | Assign responder, set severity |
| Investigation | investigating | Add notes, link evidence, run diagnostics |
| Mitigation | mitigating | Apply fix, scale resources, rollback |
| Resolution | resolved | Mark resolved, set root cause |
| Postmortem | postmortem | Write postmortem, action items |
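The lifecycle above can be modelled as a small state machine. The following is a minimal sketch, assuming stages advance strictly in table order with no skipping or reopening (the source does not say whether either is allowed). `LIFECYCLE_ORDER`, `nextStatus`, and `canTransition` are hypothetical names, not part of the platform's API.

```typescript
// Status values are taken from the lifecycle table.
type IncidentStatus =
  | 'triggered'
  | 'acknowledged'
  | 'investigating'
  | 'mitigating'
  | 'resolved'
  | 'postmortem';

// Assumption: incidents move through the stages in exactly this order.
const LIFECYCLE_ORDER: readonly IncidentStatus[] = [
  'triggered',
  'acknowledged',
  'investigating',
  'mitigating',
  'resolved',
  'postmortem',
];

// Returns the status that follows `current`, or null once the final stage is reached.
function nextStatus(current: IncidentStatus): IncidentStatus | null {
  const index = LIFECYCLE_ORDER.indexOf(current);
  return LIFECYCLE_ORDER[index + 1] ?? null;
}

// True only when `to` is the stage immediately after `from`.
function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return nextStatus(from) === to;
}
```

Under this assumption, `canTransition('investigating', 'mitigating')` holds, while a jump from `triggered` straight to `resolved` is rejected. A real deployment may well permit shortcuts or reopening, in which case an explicit transition map would be a better fit than a linear order.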
Incident List
The main view shows active and recent incidents:
| Column | Description | Sortable |
|---|---|---|
| ID | Incident identifier | Yes |
| Title | Brief incident description | Yes |
| Severity | Critical, High, Medium, Low | Yes |
| Status | Current lifecycle stage | Yes |
| Assignee | Current responder | Yes |
| Duration | Time since creation (open incidents) or time to resolution (resolved incidents) | Yes |
| Affected Services | Services impacted | No |
| Created | Incident creation time | Yes |
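Sorting the Severity column alphabetically would order the values as critical, high, low, medium, so a severity sort needs an explicit rank. The following is a minimal sketch, assuming the lowercase severity values used by `CreateIncidentRequest`. `IncidentRow`, `SEVERITY_RANK`, and `sortBySeverity` are hypothetical names for illustration.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Hypothetical row shape; only the fields needed for this example are included.
interface IncidentRow {
  id: string;
  title: string;
  severity: Severity;
}

// Lower rank sorts first, so the most severe incidents appear at the top.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// Returns a new array ordered from most to least severe; the input is not mutated.
function sortBySeverity(rows: readonly IncidentRow[]): IncidentRow[] {
  return rows
    .slice()
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}
```

Reversing the sign of the comparison gives the ascending variant for a column header that toggles sort direction.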
Create Incident

    interface CreateIncidentRequest {
      title: string;
      description: string;
      severity: 'critical' | 'high' | 'medium' | 'low';
      affected_services: string[];
      assignee?: string;
      related_alerts?: string[];
    }

Incident Detail View
The detail view provides a timeline-based incident workspace:
| Section | Content |
|---|---|
| Summary | Title, severity, status, assignee, duration |
| Timeline | Chronological event log with notes and actions |
| Evidence | Linked metrics, logs, traces, and screenshots |
| Affected Services | List of impacted services with health status |
| Communication | Status updates sent to stakeholders |
| Action Items | Tasks generated during investigation |
Severity Definitions
| Severity | Impact | Response Time | Notification |
|---|---|---|---|
| Critical | Platform-wide outage | Immediate | Page on-call, notify leadership |
| High | Major feature unavailable | 15 minutes | Page on-call |
| Medium | Degraded performance | 1 hour | Notify team channel |
| Low | Minor issue, workaround exists | Next business day | Team ticket |
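The severity table can be encoded as a lookup so that notification routing reads from a single place. The following is a minimal sketch: the impact, response time, and notification values come from the table, while `SeverityPolicy`, `SEVERITY_POLICIES`, and `shouldPage` are hypothetical names. Treating "Immediate" as 0 minutes and leaving "Next business day" without a numeric target are assumptions.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface SeverityPolicy {
  impact: string;
  // Response target as written in the table.
  responseLabel: string;
  // Target in minutes; null where the table gives no fixed number of minutes.
  responseMinutes: number | null;
  notifications: string[];
}

const SEVERITY_POLICIES: Record<Severity, SeverityPolicy> = {
  critical: {
    impact: 'Platform-wide outage',
    responseLabel: 'Immediate',
    responseMinutes: 0, // assumption: "Immediate" is modelled as 0 minutes
    notifications: ['Page on-call', 'Notify leadership'],
  },
  high: {
    impact: 'Major feature unavailable',
    responseLabel: '15 minutes',
    responseMinutes: 15,
    notifications: ['Page on-call'],
  },
  medium: {
    impact: 'Degraded performance',
    responseLabel: '1 hour',
    responseMinutes: 60,
    notifications: ['Notify team channel'],
  },
  low: {
    impact: 'Minor issue, workaround exists',
    responseLabel: 'Next business day',
    responseMinutes: null, // depends on a business calendar, which is not modelled here
    notifications: ['Team ticket'],
  },
};

// True when the policy for this severity includes paging the on-call responder.
function shouldPage(severity: Severity): boolean {
  return SEVERITY_POLICIES[severity].notifications.some((n) => n === 'Page on-call');
}
```

With this table, `shouldPage` is true for `critical` and `high` and false for `medium` and `low`.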
Runbook Integration
Incidents can be linked to runbooks that provide step-by-step resolution guides:
| Service | Common Runbooks |
|---|---|
| AI Service | LLM provider failover, high latency investigation |
| Query Engine | Slow query investigation, connection pool exhaustion |
| Kafka | Consumer lag resolution, partition rebalance |
| PostgreSQL | Connection limit reached, replication lag |
Status Page Updates
Incident status updates are published to stakeholders:

    interface StatusUpdate {
      incident_id: string;
      message: string;
      status: string;
      visibility: 'internal' | 'external';
      created_by: string;
      created_at: string;
    }

Postmortem Template
After resolution, a structured postmortem is generated:
| Section | Content |
|---|---|
| Summary | What happened, duration, impact |
| Timeline | Key events in chronological order |
| Root Cause | Technical root cause analysis |
| Impact | Users affected, data impact, financial impact |
| Resolution | Steps taken to resolve |
| Action Items | Preventive measures with owners and due dates |
| Lessons Learned | What went well, what could improve |
Metrics
| Metric | Description |
|---|---|
| MTTR | Mean time to resolution by severity |
| MTTA | Mean time to acknowledgement |
| Incident frequency | Incidents per week by severity |
| SLA compliance | Percentage of incidents meeting the response-time SLA |
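MTTA and MTTR can be derived from incident timestamps. The following is a minimal sketch, assuming each incident records ISO 8601 timestamps for creation, acknowledgement, and resolution, and that both metrics are measured from creation time. The `IncidentRecord` shape, its `acknowledged_at` and `resolved_at` fields, and the function names are hypothetical; the platform's actual definitions may differ.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Hypothetical record shape; the optional fields are absent until that stage is reached.
interface IncidentRecord {
  severity: Severity;
  created_at: string;
  acknowledged_at?: string;
  resolved_at?: string;
}

// Elapsed minutes between two ISO 8601 timestamps.
function minutesBetween(start: string, end: string): number {
  return (Date.parse(end) - Date.parse(start)) / 60000;
}

// Arithmetic mean, or null when there is nothing to average.
function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Mean time to acknowledgement, in minutes, over acknowledged incidents.
function mtta(incidents: IncidentRecord[]): number | null {
  const durations: number[] = [];
  for (const incident of incidents) {
    if (incident.acknowledged_at !== undefined) {
      durations.push(minutesBetween(incident.created_at, incident.acknowledged_at));
    }
  }
  return mean(durations);
}

// Mean time to resolution, in minutes, over resolved incidents of one severity.
function mttr(incidents: IncidentRecord[], severity: Severity): number | null {
  const durations: number[] = [];
  for (const incident of incidents) {
    if (incident.severity === severity && incident.resolved_at !== undefined) {
      durations.push(minutesBetween(incident.created_at, incident.resolved_at));
    }
  }
  return mean(durations);
}
```

Returning `null` for an empty sample keeps "no data" distinct from a genuine mean of zero, which matters when charting a severity level that had no incidents in the period.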