Ops Workbench
The Ops Workbench is the platform's operational monitoring interface, running on port 3006. It provides real-time service health dashboards, performance metrics visualization, alert management, incident tracking, deployment monitoring, infrastructure views, reliability analysis, and cost tracking. It serves SREs, DevOps teams, and platform operators who need to monitor, diagnose, and respond to operational issues across the MATIH platform.
Application Structure
frontend/ops-workbench/src/
app/ # Application-level configuration
components/
chat/ # Ops AI chat assistant
dashboard/ # Dashboard widgets and layouts
health/ # Health monitoring components
layout/ # Layout components
hooks/
use-mops-data.ts # Managed operations data hook
pages/
AlertsPage.tsx # Alert management
ChatPage.tsx # AI-assisted operations chat
CostPage.tsx # Cost analysis and tracking
DashboardPage.tsx # Main operations dashboard
DeploymentsPage.tsx # Deployment monitoring
IncidentsPage.tsx # Incident management
InfrastructurePage.tsx # Infrastructure overview
ReliabilityPage.tsx # Reliability metrics and SLOs
services/ # Ops-specific service utilities
stores/ # Ops workbench state management
styles/ # Ops-specific styles
types/ # Ops-specific TypeScript types
utils/ # Ops utility functions
App.tsx # Root application component
main.tsx # Vite entry point
module.tsx # Module federation entry
Key Numbers
| Metric | Value |
|---|---|
| Port | 3006 |
| Pages | 8 |
| Component groups | 4 |
| Custom hooks | 1 |
| Backend integrations | Ops Agent Service, Observability API, Infrastructure Service |
Page Architecture
The Ops Workbench organizes its functionality into eight dedicated pages:
| Page | Route | Purpose |
|---|---|---|
| Dashboard | /dashboard | Main operations overview with health widgets |
| Alerts | /alerts | Alert configuration, active alerts, alert history |
| Incidents | /incidents | Incident tracking, timeline, postmortems |
| Deployments | /deployments | Deployment status, rollback, version tracking |
| Infrastructure | /infrastructure | Node health, resource utilization, storage |
| Reliability | /reliability | SLOs, error budgets, availability metrics |
| Cost | /cost | Cost analysis, budget tracking, optimization |
| Chat | /chat | AI-assisted operational troubleshooting |
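The page table above can be sketched as a route manifest. The names below are illustrative; the actual route registration lives in App.tsx and may use React Router or the module-federation host's own routing.

```typescript
// Hypothetical route manifest mirroring the eight pages above.
interface OpsRoute {
  path: string;
  title: string;
}

const opsRoutes: OpsRoute[] = [
  { path: "/dashboard", title: "Dashboard" },
  { path: "/alerts", title: "Alerts" },
  { path: "/incidents", title: "Incidents" },
  { path: "/deployments", title: "Deployments" },
  { path: "/infrastructure", title: "Infrastructure" },
  { path: "/reliability", title: "Reliability" },
  { path: "/cost", title: "Cost" },
  { path: "/chat", title: "Chat" },
];

// Resolve a route by path, e.g. for page titles or breadcrumbs.
function routeFor(path: string): OpsRoute | undefined {
  return opsRoutes.find((r) => r.path === path);
}
```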
Operations Dashboard
The DashboardPage provides a single-pane-of-glass view of platform health:
Dashboard Layout
+-----------------------------------------------------------+
| Operations Dashboard [Time: Last 1h] |
+-----------------------------------------------------------+
| |
| +------ Platform Health ------+ +--- Active Alerts ---+ |
| | | | | |
| | Services: 24/24 healthy | | Critical: 0 | |
| | Pods: 48/48 running | | Warning: 3 | |
| | Uptime: 99.97% | | Info: 12 | |
| | | | | |
| +-----------------------------+ +---------------------+ |
| |
| +---- Request Rate ----------+ +--- Error Rate ------+ |
| | [Line chart: req/s over | | [Line chart: 5xx | |
| | time by service] | | rate over time] | |
| +-----------------------------+ +---------------------+ |
| |
| +---- P99 Latency -----------+ +--- Recent Events --+ |
| | [Line chart: p99 latency | | Deploy: ai-svc v3 | |
| | by endpoint] | | Alert: high CPU | |
| | | | Resolve: db conn | |
| +-----------------------------+ +---------------------+ |
| |
+-----------------------------------------------------------+
Dashboard Widgets
| Widget | Data Source | Refresh |
|---|---|---|
| Platform health summary | Observability API | 10s (WebSocket push) |
| Active alerts | Ops Agent Service | Real-time (WebSocket) |
| Request rate | Prometheus (via API) | 15s |
| Error rate | Prometheus (via API) | 15s |
| P99 latency | Prometheus (via API) | 15s |
| Recent events | Event stream | Real-time (WebSocket) |
| Resource utilization | Kubernetes API (via API) | 30s |
| Top slow queries | Query Engine metrics | 60s |
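The widget table above mixes WebSocket-pushed and interval-polled data. One way to encode that split, sketched with hypothetical widget keys rather than the actual store shape:

```typescript
// Illustrative refresh policy: "push" widgets update over
// WebSocket; "poll" widgets refetch on a fixed interval.
type RefreshPolicy = { kind: "push" } | { kind: "poll"; intervalMs: number };

const widgetRefresh: Record<string, RefreshPolicy> = {
  platformHealth: { kind: "push" },
  activeAlerts: { kind: "push" },
  recentEvents: { kind: "push" },
  requestRate: { kind: "poll", intervalMs: 15_000 },
  errorRate: { kind: "poll", intervalMs: 15_000 },
  p99Latency: { kind: "poll", intervalMs: 15_000 },
  resourceUtilization: { kind: "poll", intervalMs: 30_000 },
  topSlowQueries: { kind: "poll", intervalMs: 60_000 },
};

// Returns the polling interval for a widget, or null if the
// widget is push-driven (or unknown) and needs no timer.
function pollingInterval(widget: string): number | null {
  const policy = widgetRefresh[widget];
  return policy?.kind === "poll" ? policy.intervalMs : null;
}
```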
Alert Management
The AlertsPage provides comprehensive alert configuration and monitoring:
Alert List
| Column | Description |
|---|---|
| Severity | Critical, Warning, Info |
| Alert name | Human-readable alert name |
| Service | Affected service |
| Tenant | Affected tenant (if tenant-specific) |
| Status | Firing, Resolved, Silenced, Acknowledged |
| Started | When the alert first fired |
| Duration | How long the alert has been active |
| Last notified | When the last notification was sent |
Alert Configuration
| Configuration | Options |
|---|---|
| Metric | Prometheus metric to monitor |
| Condition | Threshold, rate of change, absence |
| Duration | How long the condition must hold before firing |
| Severity | Critical, Warning, Info |
| Notification channels | Slack, PagerDuty, email, webhook |
| Silencing | Time-based or condition-based silence rules |
| Runbook | Link to runbook for resolution steps |
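As a rough sketch, an alert rule built from the configuration options above might look like the following. This is a hypothetical client-side shape; the backend schema (Ops Agent Service / Alertmanager) may differ.

```typescript
// Hypothetical alert-rule shape assembled on the Alerts page.
interface AlertRule {
  metric: string;                        // Prometheus metric to monitor
  condition: "threshold" | "rate_of_change" | "absence";
  durationSeconds: number;               // how long the condition must hold
  severity: "critical" | "warning" | "info";
  channels: Array<"slack" | "pagerduty" | "email" | "webhook">;
  runbookUrl?: string;                   // link to resolution steps
}

// Minimal validation sketch: a rule needs a metric, a non-negative
// hold duration, and at least one notification channel.
function isValidRule(rule: AlertRule): boolean {
  return rule.metric.length > 0 && rule.durationSeconds >= 0 && rule.channels.length > 0;
}
```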
Alert Lifecycle
Condition met
|
v
[Pending] -- duration threshold met --> [Firing]
|
Notification sent
|
+-----------+----------+----------+
| | |
[Acknowledged] [Silenced] Condition cleared
| | |
v v v
Still firing Silence expires [Resolved]
| |
v v
[Firing] [Firing]
Incident Management
The IncidentsPage provides incident tracking from detection through resolution and postmortem:
Incident Workflow
| Status | Description | Actions Available |
|---|---|---|
| Open | Incident detected, not yet assigned | Assign, set severity, add to timeline |
| Investigating | Team is actively investigating | Add updates, link alerts, escalate |
| Identified | Root cause identified | Document cause, begin remediation |
| Mitigated | Impact reduced or eliminated | Confirm mitigation, monitor |
| Resolved | Incident fully resolved | Close incident, schedule postmortem |
| Postmortem | Learning from the incident | Document findings, action items |
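The workflow table above implies a linear status progression. A sketch of it as an allowed-transition map (illustrative; the actual service logic may permit more paths, e.g. reopening):

```typescript
// Incident statuses and the transitions the workflow above allows.
type IncidentStatus =
  | "open" | "investigating" | "identified"
  | "mitigated" | "resolved" | "postmortem";

const nextStatuses: Record<IncidentStatus, IncidentStatus[]> = {
  open: ["investigating"],
  investigating: ["identified"],
  identified: ["mitigated"],
  mitigated: ["resolved"],
  resolved: ["postmortem"],
  postmortem: [],
};

function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return nextStatuses[from].includes(to);
}
```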
Incident Timeline
+-----------------------------------------------------------+
| Incident: INC-2024-0042 |
| Title: Elevated query latency in tenant acme-corp |
| Severity: P2 |
| Duration: 47 minutes |
+-----------------------------------------------------------+
| Timeline: |
| |
| 09:15 [Alert] P99 latency > 500ms for query engine |
| 09:17 [Auto] Incident created from alert |
| 09:18 [User: alice] Acknowledged, investigating |
| 09:22 [User: alice] Root cause: Trino worker OOM |
| 09:25 [User: bob] Scaling Trino workers 3 -> 5 |
| 09:30 [Auto] Latency returning to normal |
| 09:35 [User: alice] Confirmed resolution |
| 09:40 [User: alice] Incident resolved |
| |
| Impact: 12 queries timed out, 3 dashboards affected |
| Root cause: Memory spike from concurrent large queries |
| Action items: |
| [ ] Implement query cost limits per tenant |
| [ ] Add Trino memory monitoring alert |
+-----------------------------------------------------------+
Deployment Monitoring
The DeploymentsPage tracks all service deployments across the platform:
Deployment List
| Column | Description |
|---|---|
| Service | Service being deployed |
| Version | New version being deployed |
| Previous version | Version being replaced |
| Environment | dev, staging, production |
| Status | Pending, In Progress, Completed, Failed, Rolled Back |
| Strategy | Rolling, Blue-Green, Canary |
| Started | Deployment start time |
| Duration | Time to complete |
| Initiated by | User or automation that triggered the deployment |
Deployment Detail
| Section | Content |
|---|---|
| Progress | Visual progress indicator showing deployment stages |
| Pod status | Real-time pod replacement progress |
| Health checks | Liveness and readiness probe results |
| Metrics comparison | Before/after comparison of error rate, latency, and throughput |
| Rollback | One-click rollback with confirmation |
| Logs | Deployment controller logs and pod startup logs |
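The before/after metrics comparison in the detail view lends itself to a simple regression check. The thresholds below are invented for the example and are not platform defaults:

```typescript
// Illustrative before/after comparison used to decide whether a
// deployment looks healthy enough to keep (vs. prompting rollback).
interface DeployMetrics {
  errorRate: number;    // fraction of 5xx responses, e.g. 0.002
  p99LatencyMs: number;
}

function looksHealthy(before: DeployMetrics, after: DeployMetrics): boolean {
  // Flag an error regression only if the rate both doubled and
  // crossed an absolute 1% floor, to avoid noise at tiny rates.
  const errorRegression =
    after.errorRate > before.errorRate * 2 && after.errorRate > 0.01;
  const latencyRegression = after.p99LatencyMs > before.p99LatencyMs * 1.5;
  return !errorRegression && !latencyRegression;
}
```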
Infrastructure Page
The InfrastructurePage provides a deep view of the underlying infrastructure:
Node Health
| Metric | Display |
|---|---|
| Node count | Total and by availability zone |
| CPU utilization | Per-node and cluster average |
| Memory utilization | Per-node with pressure indicators |
| Disk I/O | Read/write throughput and IOPS |
| Network | Ingress/egress bandwidth by node |
| Pod density | Pods scheduled per node vs. capacity |
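The memory "pressure indicators" in the table above could be derived from utilization bands. The 80%/90% thresholds here are illustrative, not the platform's actual settings:

```typescript
// Sketch of a per-node memory-pressure indicator.
interface NodeStats {
  name: string;
  memoryUsedPct: number; // 0-100
}

type Pressure = "ok" | "warning" | "critical";

function memoryPressure(node: NodeStats): Pressure {
  if (node.memoryUsedPct >= 90) return "critical";
  if (node.memoryUsedPct >= 80) return "warning";
  return "ok";
}
```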
Resource Quotas
+-----------------------------------------------------------+
| Resource Quotas by Namespace |
+-----------------------------------------------------------+
| Namespace CPU Memory Pods |
| matih-control-plane 4/8 cores 8/16 GiB 24/50 |
| matih-data-plane 12/24 cores 32/64 GiB 48/100 |
| matih-observability 2/4 cores 6/12 GiB 12/30 |
| tenant-acme-corp 3/8 cores 6/16 GiB 14/50 |
| tenant-globex 2/8 cores 4/16 GiB 14/50 |
+-----------------------------------------------------------+
Reliability Page
The ReliabilityPage displays Service Level Objectives (SLOs) and error budgets:
SLO Dashboard
| SLO | Target | Current | Budget Remaining | Status |
|---|---|---|---|---|
| API availability | 99.9% | 99.95% | 72% | On track |
| Query latency P99 | under 500ms | 380ms | 85% | On track |
| Dashboard load time | under 3s | 2.4s | 90% | On track |
| AI response latency | under 5s | 4.2s | 35% | At risk |
| Data freshness | under 15min | 8min | 65% | On track |
Error Budget
The error budget visualization shows how much unreliability is "allowed" before the SLO is breached:
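The arithmetic is straightforward: for an availability SLO over a rolling window, the budget is the window's total minutes times the allowed failure fraction. For 99.9% over 30 days that is 30 × 24 × 60 × 0.001 = 43.2 minutes. A minimal sketch:

```typescript
// Error-budget arithmetic for an availability SLO over a rolling
// window, e.g. budgetMinutes(99.9, 30) = 43.2 minutes.
function budgetMinutes(sloPct: number, windowDays: number): number {
  return windowDays * 24 * 60 * (1 - sloPct / 100);
}

// Fraction of the budget already consumed, as a percentage.
function budgetConsumedPct(
  downtimeMinutes: number,
  sloPct: number,
  windowDays: number
): number {
  return (downtimeMinutes / budgetMinutes(sloPct, windowDays)) * 100;
}
```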
API Availability (99.9% SLO, 30-day window)
Total budget: 43.2 minutes of downtime
Consumed: 12.1 minutes (28%)
Remaining: 31.1 minutes (72%)
Burn rate: 0.8x (sustainable)
[=========|...................] 28% consumed
Days remaining in window: 18
Cost Page
The CostPage provides cost analysis and optimization recommendations:
Cost Dashboard
| Widget | Content |
|---|---|
| Monthly cost | Current month running total with daily breakdown |
| Cost by service | Stacked area chart showing cost per service |
| Cost by tenant | Top tenants by resource consumption |
| Cost trends | 6-month trend with projection |
| Optimization recommendations | AI-generated cost reduction suggestions |
| Budget alerts | Tenants approaching budget limits |
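The "cost trends with projection" widget might, in its simplest form, extrapolate the month-to-date running total linearly. A naive sketch (the real widget may use a more sophisticated model):

```typescript
// Naive month-end projection from the month-to-date running total:
// average daily spend so far, scaled to the full month.
function projectMonthEnd(
  runningTotal: number,
  dayOfMonth: number,
  daysInMonth: number
): number {
  return (runningTotal / dayOfMonth) * daysInMonth;
}
```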
Ops Chat
The ChatPage provides an AI-assisted operations chat for troubleshooting:
| Capability | Example |
|---|---|
| Issue diagnosis | "Why is the AI Service showing high latency?" |
| Metric queries | "Show me the P99 latency for query engine over the last hour" |
| Runbook execution | "Run the database connection pool health check" |
| Impact analysis | "What would happen if we restart the Trino coordinator?" |
| Historical analysis | "Were there any similar incidents in the last 30 days?" |
Managed Operations Data Hook
// frontend/ops-workbench/src/hooks/use-mops-data.ts
export function useMopsData() {
// Aggregated operations data from multiple sources
// Combines metrics from Prometheus, logs from Loki,
// alerts from Alertmanager, and incidents from the ops service
return {
health, // Platform health summary
alerts, // Active alerts
incidents, // Open incidents
deployments, // Recent deployments
metrics, // Key platform metrics
isLoading,
error,
refresh,
};
}
Key Source Files
| Component | Location |
|---|---|
| Application entry | frontend/ops-workbench/src/App.tsx |
| Dashboard page | frontend/ops-workbench/src/pages/DashboardPage.tsx |
| Alerts page | frontend/ops-workbench/src/pages/AlertsPage.tsx |
| Incidents page | frontend/ops-workbench/src/pages/IncidentsPage.tsx |
| Deployments page | frontend/ops-workbench/src/pages/DeploymentsPage.tsx |
| Infrastructure page | frontend/ops-workbench/src/pages/InfrastructurePage.tsx |
| Reliability page | frontend/ops-workbench/src/pages/ReliabilityPage.tsx |
| Cost page | frontend/ops-workbench/src/pages/CostPage.tsx |
| Chat page | frontend/ops-workbench/src/pages/ChatPage.tsx |
| Health components | frontend/ops-workbench/src/components/health/ |
| Dashboard components | frontend/ops-workbench/src/components/dashboard/ |
| Ops chat | frontend/ops-workbench/src/components/chat/ |
| MOPS data hook | frontend/ops-workbench/src/hooks/use-mops-data.ts |