MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Incident Management

The Incident Management page provides tools for tracking, responding to, and resolving platform incidents. It supports the full incident lifecycle from detection through resolution and postmortem, with integration into alerting, on-call schedules, and runbook automation.


Incident Lifecycle

| Stage | Status | Actions Available |
|---|---|---|
| Detection | `triggered` | Auto-created from alert or manual creation |
| Acknowledgement | `acknowledged` | Assign responder, set severity |
| Investigation | `investigating` | Add notes, link evidence, run diagnostics |
| Mitigation | `mitigating` | Apply fix, scale resources, rollback |
| Resolution | `resolved` | Mark resolved, set root cause |
| Postmortem | `postmortem` | Write postmortem, action items |
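
The lifecycle can be modeled as an ordered list of status values. The following is a minimal sketch, not the platform's implementation: it assumes an incident may only move forward through the stages (skipping a stage is allowed, moving backwards is not). The names `LIFECYCLE`, `canTransition`, and `transition` are hypothetical.

```typescript
// Status values from the lifecycle table, in lifecycle order.
const LIFECYCLE = [
  "triggered",
  "acknowledged",
  "investigating",
  "mitigating",
  "resolved",
  "postmortem",
] as const;

type IncidentStatus = (typeof LIFECYCLE)[number];

// Assumption: forward-only transitions; stages may be skipped.
function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return LIFECYCLE.indexOf(to) > LIFECYCLE.indexOf(from);
}

// Returns the new status, or throws if the move is not allowed.
function transition(current: IncidentStatus, next: IncidentStatus): IncidentStatus {
  if (!canTransition(current, next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

If the platform allows reopening a resolved incident, the forward-only rule would need to be replaced with an explicit transition map.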

Incident List

The main view shows active and recent incidents:

| Column | Description | Sortable |
|---|---|---|
| ID | Incident identifier | Yes |
| Title | Brief incident description | Yes |
| Severity | Critical, High, Medium, Low | Yes |
| Status | Current lifecycle stage | Yes |
| Assignee | Current responder | Yes |
| Duration | Time since creation or time to resolution | Yes |
| Affected Services | Services impacted | No |
| Created | Incident creation time | Yes |
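
Two columns need logic beyond a plain field comparison: Severity sorts by rank rather than alphabetically, and Duration depends on whether the incident is resolved. The sketch below illustrates both. The `IncidentRow` shape and its field names are assumptions made for illustration, not the documented list API.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Hypothetical row shape; timestamps are assumed to be ISO 8601 strings.
interface IncidentRow {
  id: string;
  title: string;
  severity: Severity;
  created_at: string;
  resolved_at?: string;
}

// Lower rank sorts first, so critical incidents lead the list.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// Duration column: time to resolution if resolved, otherwise time since creation.
function durationMs(row: IncidentRow, now: Date): number {
  const end = row.resolved_at ? new Date(row.resolved_at) : now;
  return end.getTime() - new Date(row.created_at).getTime();
}

// Sort by severity rank, breaking ties by creation time (oldest first).
// Works on a copy so the caller's array is left untouched.
function sortBySeverity(rows: IncidentRow[]): IncidentRow[] {
  return rows.slice().sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
      Date.parse(a.created_at) - Date.parse(b.created_at)
  );
}
```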

Create Incident

```typescript
interface CreateIncidentRequest {
  title: string;
  description: string;
  severity: 'critical' | 'high' | 'medium' | 'low';
  affected_services: string[];
  assignee?: string;
  related_alerts?: string[];
}
```
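
As an illustration, the sketch below builds a request and checks it before submission. The validation rules (non-empty title and description, at least one affected service) and the service and alert identifiers are assumptions; the documentation does not specify server-side validation or an endpoint.

```typescript
interface CreateIncidentRequest {
  title: string;
  description: string;
  severity: "critical" | "high" | "medium" | "low";
  affected_services: string[];
  assignee?: string;
  related_alerts?: string[];
}

// Returns a list of problems; an empty list means the request looks valid.
// The rules here are illustrative assumptions, not documented constraints.
function validateCreateIncident(req: CreateIncidentRequest): string[] {
  const errors: string[] = [];
  if (req.title.trim() === "") {
    errors.push("title must not be empty");
  }
  if (req.description.trim() === "") {
    errors.push("description must not be empty");
  }
  if (req.affected_services.length === 0) {
    errors.push("at least one affected service is required");
  }
  return errors;
}

// Example request; "query-engine" and "alert-123" are hypothetical identifiers.
const request: CreateIncidentRequest = {
  title: "Query Engine connection pool exhaustion",
  description: "Queries are failing because no pooled connections are available.",
  severity: "high",
  affected_services: ["query-engine"],
  related_alerts: ["alert-123"],
};
```

The optional `assignee` is omitted in the example, which matches the lifecycle table: a responder is assigned at the acknowledgement stage.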

Incident Detail View

The detail view provides a timeline-based incident workspace:

| Section | Content |
|---|---|
| Summary | Title, severity, status, assignee, duration |
| Timeline | Chronological event log with notes and actions |
| Evidence | Linked metrics, logs, traces, and screenshots |
| Affected Services | List of impacted services with health status |
| Communication | Status updates sent to stakeholders |
| Action Items | Tasks generated during investigation |

Severity Definitions

| Severity | Impact | Response Time | Notification |
|---|---|---|---|
| Critical | Platform-wide outage | Immediate | Page on-call, notify leadership |
| High | Major feature unavailable | 15 minutes | Page on-call |
| Medium | Degraded performance | 1 hour | Notify team channel |
| Low | Minor issue, workaround exists | Next business day | Team ticket |
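
The response-time targets can be turned into a concrete deadline per incident. The sketch below is one possible interpretation: "Immediate" is treated as zero minutes, and "Next business day" is assumed to mean the end of the next Monday-to-Friday day in UTC, ignoring holidays. Neither interpretation is confirmed by the platform, and the function names are hypothetical.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Targets from the severity table, in minutes; null stands for "next business day".
const RESPONSE_TARGET_MINUTES: Record<Severity, number | null> = {
  critical: 0,
  high: 15,
  medium: 60,
  low: null,
};

// Assumption: end of the next weekday in UTC; holidays are not considered.
function endOfNextBusinessDayUtc(from: Date): Date {
  const d = new Date(
    Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate() + 1, 23, 59, 59, 999)
  );
  // Skip Saturday (6) and Sunday (0).
  while (d.getUTCDay() === 0 || d.getUTCDay() === 6) {
    d.setUTCDate(d.getUTCDate() + 1);
  }
  return d;
}

// Latest time by which a responder should have acknowledged the incident.
function responseDeadline(severity: Severity, createdAt: Date): Date {
  const minutes = RESPONSE_TARGET_MINUTES[severity];
  if (minutes === null) {
    return endOfNextBusinessDayUtc(createdAt);
  }
  return new Date(createdAt.getTime() + minutes * 60 * 1000);
}
```

For example, a low-severity incident created on a Friday gets a deadline at the end of the following Monday under these assumptions.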

Runbook Integration

Incidents can be linked to runbooks that provide step-by-step resolution guides:

| Service | Common Runbooks |
|---|---|
| AI Service | LLM provider failover, high latency investigation |
| Query Engine | Slow query investigation, connection pool exhaustion |
| Kafka | Consumer lag resolution, partition rebalance |
| PostgreSQL | Connection limit reached, replication lag |
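
One way to surface runbooks on an incident is a lookup keyed by affected service. The sketch below uses the service names and runbook titles from the table; the `RUNBOOKS` map and `suggestRunbooks` function are hypothetical and say nothing about how the platform stores or links runbooks.

```typescript
// Runbook titles per service, taken from the table above.
const RUNBOOKS: Record<string, string[]> = {
  "AI Service": ["LLM provider failover", "High latency investigation"],
  "Query Engine": ["Slow query investigation", "Connection pool exhaustion"],
  "Kafka": ["Consumer lag resolution", "Partition rebalance"],
  "PostgreSQL": ["Connection limit reached", "Replication lag"],
};

// Collects runbooks for every affected service, without duplicates.
// Services with no known runbooks contribute nothing.
function suggestRunbooks(affectedServices: string[]): string[] {
  const suggestions: string[] = [];
  for (const service of affectedServices) {
    for (const runbook of RUNBOOKS[service] ?? []) {
      if (suggestions.indexOf(runbook) === -1) {
        suggestions.push(runbook);
      }
    }
  }
  return suggestions;
}
```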

Status Page Updates

Incident status updates are published to stakeholders:

```typescript
interface StatusUpdate {
  incident_id: string;
  message: string;
  status: string;
  visibility: 'internal' | 'external';
  created_by: string;
  created_at: string;
}
```
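
The `visibility` field suggests that different audiences see different updates. The sketch below shows one plausible filtering rule, assuming that internal viewers see every update, external viewers see only `external` ones, and `created_at` is an ISO 8601 timestamp. The `visibleUpdates` function is hypothetical.

```typescript
interface StatusUpdate {
  incident_id: string;
  message: string;
  status: string;
  visibility: "internal" | "external";
  created_by: string;
  created_at: string; // assumed to be an ISO 8601 timestamp
}

// Returns the updates a given audience may see, newest first.
// filter() creates a new array, so the input is not reordered.
function visibleUpdates(
  updates: StatusUpdate[],
  audience: "internal" | "external"
): StatusUpdate[] {
  return updates
    .filter((u) => audience === "internal" || u.visibility === "external")
    .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at));
}
```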

Postmortem Template

After resolution, a structured postmortem is generated:

| Section | Content |
|---|---|
| Summary | What happened, duration, impact |
| Timeline | Key events in chronological order |
| Root Cause | Technical root cause analysis |
| Impact | Users affected, data impact, financial impact |
| Resolution | Steps taken to resolve |
| Action Items | Preventive measures with owners and due dates |
| Lessons Learned | What went well, what could improve |

Metrics

| Metric | Description |
|---|---|
| MTTR | Mean time to resolution by severity |
| MTTA | Mean time to acknowledgement |
| Incident frequency | Incidents per week by severity |
| SLA compliance | Percentage meeting response time SLA |
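
These metrics can be derived from incident timestamps. The sketch below is an illustration under stated assumptions: the `IncidentRecord` shape and its field names are hypothetical, timestamps are ISO 8601 strings, and an incident that was never acknowledged counts as missing the SLA.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Hypothetical record shape for metric calculations.
interface IncidentRecord {
  severity: Severity;
  created_at: string;
  acknowledged_at?: string;
  resolved_at?: string;
}

function minutesBetween(start: string, end: string): number {
  return (Date.parse(end) - Date.parse(start)) / 60000;
}

// Returns null when there is nothing to average.
function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// MTTA: mean minutes from creation to acknowledgement.
function mttaMinutes(incidents: IncidentRecord[]): number | null {
  const values: number[] = [];
  for (const i of incidents) {
    if (i.acknowledged_at) values.push(minutesBetween(i.created_at, i.acknowledged_at));
  }
  return mean(values);
}

// MTTR: mean minutes from creation to resolution for one severity.
function mttrMinutes(incidents: IncidentRecord[], severity: Severity): number | null {
  const values: number[] = [];
  for (const i of incidents) {
    if (i.severity === severity && i.resolved_at) {
      values.push(minutesBetween(i.created_at, i.resolved_at));
    }
  }
  return mean(values);
}

// SLA compliance: percentage of incidents of one severity acknowledged
// within targetMinutes. Unacknowledged incidents count as misses (assumption).
function slaCompliancePercent(
  incidents: IncidentRecord[],
  severity: Severity,
  targetMinutes: number
): number | null {
  const matching = incidents.filter((i) => i.severity === severity);
  if (matching.length === 0) return null;
  const met = matching.filter(
    (i) =>
      i.acknowledged_at !== undefined &&
      minutesBetween(i.created_at, i.acknowledged_at) <= targetMinutes
  );
  return (met.length / matching.length) * 100;
}
```

Each function returns `null` rather than `0` when no incidents qualify, so a dashboard can distinguish "no data" from a perfect or failing score.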