Ops Workbench
The Ops Workbench is the platform's operational monitoring interface, running on port 3006. It provides real-time service health dashboards, performance metrics visualization, alert management, incident tracking, deployment monitoring, infrastructure views, reliability analysis, and cost tracking. It serves SREs, DevOps teams, and platform operators who need to monitor, diagnose, and respond to operational issues across the MATIH platform.
Application Structure
frontend/ops-workbench/src/
app/ # Application-level configuration
components/
chat/ # Ops AI chat assistant
dashboard/ # Dashboard widgets and layouts
health/ # Health monitoring components
layout/ # Layout components
hooks/
use-mops-data.ts # Managed operations data hook
pages/
AlertsPage.tsx # Alert management
ChatPage.tsx # AI-assisted operations chat
CostPage.tsx # Cost analysis and tracking
DashboardPage.tsx # Main operations dashboard
DeploymentsPage.tsx # Deployment monitoring
IncidentsPage.tsx # Incident management
InfrastructurePage.tsx # Infrastructure overview
ReliabilityPage.tsx # Reliability metrics and SLOs
services/ # Ops-specific service utilities
stores/ # Ops workbench state management
styles/ # Ops-specific styles
types/ # Ops-specific TypeScript types
utils/ # Ops utility functions
App.tsx # Root application component
main.tsx # Vite entry point
module.tsx # Module federation entry
Key Numbers
| Metric | Value |
|---|---|
| Port | 3006 |
| Pages | 8 |
| Component groups | 4 |
| Custom hooks | 1 |
| Backend integrations | Ops Agent Service, Observability API, Infrastructure Service |
Page Architecture
The Ops Workbench organizes its functionality into eight dedicated pages:
| Page | Route | Purpose |
|---|---|---|
| Dashboard | /dashboard | Main operations overview with health widgets |
| Alerts | /alerts | Alert configuration, active alerts, alert history |
| Incidents | /incidents | Incident tracking, timeline, postmortems |
| Deployments | /deployments | Deployment status, rollback, version tracking |
| Infrastructure | /infrastructure | Node health, resource utilization, storage |
| Reliability | /reliability | SLOs, error budgets, availability metrics |
| Cost | /cost | Cost analysis, budget tracking, optimization |
| Chat | /chat | AI-assisted operational troubleshooting |
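The page table above can be sketched as a route manifest. The names below are illustrative; the actual route registration lives in App.tsx and may use React Router or the module-federation host's own routing.

```typescript
// Hypothetical route manifest mirroring the eight pages above.
interface OpsRoute {
  path: string;
  title: string;
}

const opsRoutes: OpsRoute[] = [
  { path: "/dashboard", title: "Dashboard" },
  { path: "/alerts", title: "Alerts" },
  { path: "/incidents", title: "Incidents" },
  { path: "/deployments", title: "Deployments" },
  { path: "/infrastructure", title: "Infrastructure" },
  { path: "/reliability", title: "Reliability" },
  { path: "/cost", title: "Cost" },
  { path: "/chat", title: "Chat" },
];

// Resolve a route by path, e.g. for page titles or breadcrumbs.
function routeFor(path: string): OpsRoute | undefined {
  return opsRoutes.find((r) => r.path === path);
}
```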
Operations Dashboard
The DashboardPage provides a single-pane-of-glass view of platform health:
Dashboard Layout
+-----------------------------------------------------------+
| Operations Dashboard [Time: Last 1h] |
+-----------------------------------------------------------+
| |
| +------ Platform Health ------+ +--- Active Alerts ---+ |
| | | | | |
| | Services: 24/24 healthy | | Critical: 0 | |
| | Pods: 48/48 running | | Warning: 3 | |
| | Uptime: 99.97% | | Info: 12 | |
| | | | | |
| +-----------------------------+ +---------------------+ |
| |
| +---- Request Rate ----------+ +--- Error Rate ------+ |
| | [Line chart: req/s over | | [Line chart: 5xx | |
| | time by service] | | rate over time] | |
| +-----------------------------+ +---------------------+ |
| |
| +---- P99 Latency -----------+ +--- Recent Events --+ |
| | [Line chart: p99 latency | | Deploy: ai-svc v3 | |
| | by endpoint] | | Alert: high CPU | |
| | | | Resolve: db conn | |
| +-----------------------------+ +---------------------+ |
| |
+-----------------------------------------------------------+
Dashboard Widgets
| Widget | Data Source | Refresh |
|---|---|---|
| Platform health summary | Observability API | 10s (WebSocket push) |
| Active alerts | Ops Agent Service | Real-time (WebSocket) |
| Request rate | Prometheus (via API) | 15s |
| Error rate | Prometheus (via API) | 15s |
| P99 latency | Prometheus (via API) | 15s |
| Recent events | Event stream | Real-time (WebSocket) |
| Resource utilization | Kubernetes API (via API) | 30s |
| Top slow queries | Query Engine metrics | 60s |
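The widget table above mixes WebSocket-pushed and interval-polled data. One way to encode that split, sketched with hypothetical widget keys rather than the actual store shape:

```typescript
// Illustrative refresh policy: "push" widgets update over
// WebSocket; "poll" widgets refetch on a fixed interval.
type RefreshPolicy = { kind: "push" } | { kind: "poll"; intervalMs: number };

const widgetRefresh: Record<string, RefreshPolicy> = {
  platformHealth: { kind: "push" },
  activeAlerts: { kind: "push" },
  recentEvents: { kind: "push" },
  requestRate: { kind: "poll", intervalMs: 15_000 },
  errorRate: { kind: "poll", intervalMs: 15_000 },
  p99Latency: { kind: "poll", intervalMs: 15_000 },
  resourceUtilization: { kind: "poll", intervalMs: 30_000 },
  topSlowQueries: { kind: "poll", intervalMs: 60_000 },
};

// Returns the polling interval for a widget, or null if the
// widget is push-driven (or unknown) and needs no timer.
function pollingInterval(widget: string): number | null {
  const policy = widgetRefresh[widget];
  return policy?.kind === "poll" ? policy.intervalMs : null;
}
```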
Alert Management
The AlertsPage provides comprehensive alert configuration and monitoring:
Alert List
| Column | Description |
|---|---|
| Severity | Critical, Warning, Info |
| Alert name | Human-readable alert name |
| Service | Affected service |
| Tenant | Affected tenant (if tenant-specific) |
| Status | Firing, Resolved, Silenced, Acknowledged |
| Started | When the alert first fired |
| Duration | How long the alert has been active |
| Last notified | When the last notification was sent |
Alert Configuration
| Configuration | Options |
|---|---|
| Metric | Prometheus metric to monitor |
| Condition | Threshold, rate of change, absence |
| Duration | How long the condition must hold before firing |
| Severity | Critical, Warning, Info |
| Notification channels | Slack, PagerDuty, email, webhook |
| Silencing | Time-based or condition-based silence rules |
| Runbook | Link to runbook for resolution steps |
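As a rough sketch, an alert rule built from the configuration options above might look like the following. This is a hypothetical client-side shape; the backend schema (Ops Agent Service / Alertmanager) may differ.

```typescript
// Hypothetical alert-rule shape assembled on the Alerts page.
interface AlertRule {
  metric: string;                        // Prometheus metric to monitor
  condition: "threshold" | "rate_of_change" | "absence";
  durationSeconds: number;               // how long the condition must hold
  severity: "critical" | "warning" | "info";
  channels: Array<"slack" | "pagerduty" | "email" | "webhook">;
  runbookUrl?: string;                   // link to resolution steps
}

// Minimal validation sketch: a rule needs a metric, a non-negative
// hold duration, and at least one notification channel.
function isValidRule(rule: AlertRule): boolean {
  return rule.metric.length > 0 && rule.durationSeconds >= 0 && rule.channels.length > 0;
}
```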
Alert Lifecycle
Condition met
|
v
[Pending] -- duration threshold met --> [Firing]
|
Notification sent
|
+-----------+----------+----------+
| | |
[Acknowledged] [Silenced] Condition cleared
| | |
v v v
Still firing Silence expires [Resolved]
| |
v v
[Firing] [Firing]
Incident Management
The IncidentsPage provides incident tracking from detection through resolution and postmortem:
Incident Workflow
| Status | Description | Actions Available |
|---|---|---|
| Open | Incident detected, not yet assigned | Assign, set severity, add to timeline |
| Investigating | Team is actively investigating | Add updates, link alerts, escalate |
| Identified | Root cause identified | Document cause, begin remediation |
| Mitigated | Impact reduced or eliminated | Confirm mitigation, monitor |
| Resolved | Incident fully resolved | Close incident, schedule postmortem |
| Postmortem | Learning from the incident | Document findings, action items |
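The workflow table above implies a linear status progression. A sketch of it as an allowed-transition map (illustrative; the actual service logic may permit more paths, e.g. reopening):

```typescript
// Incident statuses and the transitions the workflow above allows.
type IncidentStatus =
  | "open" | "investigating" | "identified"
  | "mitigated" | "resolved" | "postmortem";

const nextStatuses: Record<IncidentStatus, IncidentStatus[]> = {
  open: ["investigating"],
  investigating: ["identified"],
  identified: ["mitigated"],
  mitigated: ["resolved"],
  resolved: ["postmortem"],
  postmortem: [],
};

function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return nextStatuses[from].includes(to);
}
```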
Incident Timeline
+-----------------------------------------------------------+
| Incident: INC-2024-0042 |
| Title: Elevated query latency in tenant acme-corp |
| Severity: P2 |
| Duration: 47 minutes |
+-----------------------------------------------------------+
| Timeline: |
| |
| 09:15 [Alert] P99 latency > 500ms for query engine |
| 09:17 [Auto] Incident created from alert |
| 09:18 [User: alice] Acknowledged, investigating |
| 09:22 [User: alice] Root cause: Trino worker OOM |
| 09:25 [User: bob] Scaling Trino workers 3 -> 5 |
| 09:30 [Auto] Latency returning to normal |
| 09:35 [User: alice] Confirmed resolution |
| 09:40 [User: alice] Incident resolved |
| |
| Impact: 12 queries timed out, 3 dashboards affected |
| Root cause: Memory spike from concurrent large queries |
| Action items: |
| [ ] Implement query cost limits per tenant |
| [ ] Add Trino memory monitoring alert |
+-----------------------------------------------------------+
Deployment Monitoring
The DeploymentsPage tracks all service deployments across the platform:
Deployment List
| Column | Description |
|---|---|
| Service | Service being deployed |
| Version | New version being deployed |
| Previous version | Version being replaced |
| Environment | dev, staging, production |
| Status | Pending, In Progress, Completed, Failed, Rolled Back |
| Strategy | Rolling, Blue-Green, Canary |
| Started | Deployment start time |
| Duration | Time to complete |
| Initiated by | User or automation that triggered the deployment |
Deployment Detail
| Section | Content |
|---|---|
| Progress | Visual progress indicator showing deployment stages |
| Pod status | Real-time pod replacement progress |
| Health checks | Liveness and readiness probe results |
| Metrics comparison | Before/after comparison of error rate, latency, and throughput |
| Rollback | One-click rollback with confirmation |
| Logs | Deployment controller logs and pod startup logs |
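The before/after metrics comparison in the detail view lends itself to a simple regression check. The thresholds below are invented for the example and are not platform defaults:

```typescript
// Illustrative before/after comparison used to decide whether a
// deployment looks healthy enough to keep (vs. prompting rollback).
interface DeployMetrics {
  errorRate: number;    // fraction of 5xx responses, e.g. 0.002
  p99LatencyMs: number;
}

function looksHealthy(before: DeployMetrics, after: DeployMetrics): boolean {
  // Flag an error regression only if the rate both doubled and
  // crossed an absolute 1% floor, to avoid noise at tiny rates.
  const errorRegression =
    after.errorRate > before.errorRate * 2 && after.errorRate > 0.01;
  const latencyRegression = after.p99LatencyMs > before.p99LatencyMs * 1.5;
  return !errorRegression && !latencyRegression;
}
```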
Infrastructure Page
The InfrastructurePage provides a deep view of the underlying infrastructure:
Node Health
| Metric | Display |
|---|---|
| Node count | Total and by availability zone |
| CPU utilization | Per-node and cluster average |
| Memory utilization | Per-node with pressure indicators |
| Disk I/O | Read/write throughput and IOPS |
| Network | Ingress/egress bandwidth by node |
| Pod density | Pods scheduled per node vs. capacity |
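The memory "pressure indicators" in the table above could be derived from utilization bands. The 80%/90% thresholds here are illustrative, not the platform's actual settings:

```typescript
// Sketch of a per-node memory-pressure indicator.
interface NodeStats {
  name: string;
  memoryUsedPct: number; // 0-100
}

type Pressure = "ok" | "warning" | "critical";

function memoryPressure(node: NodeStats): Pressure {
  if (node.memoryUsedPct >= 90) return "critical";
  if (node.memoryUsedPct >= 80) return "warning";
  return "ok";
}
```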
Resource Quotas
+-----------------------------------------------------------+
| Resource Quotas by Namespace |
+-----------------------------------------------------------+
| Namespace CPU Memory Pods |
| matih-control-plane 4/8 cores 8/16 GiB 24/50 |
| matih-data-plane 12/24 cores 32/64 GiB 48/100 |
| matih-observability 2/4 cores 6/12 GiB 12/30 |
| tenant-acme-corp 3/8 cores 6/16 GiB 14/50 |
| tenant-globex 2/8 cores 4/16 GiB 14/50 |
+-----------------------------------------------------------+
Reliability Page
The ReliabilityPage displays Service Level Objectives (SLOs) and error budgets:
SLO Dashboard
| SLO | Target | Current | Budget Remaining | Status |
|---|---|---|---|---|
| API availability | 99.9% | 99.95% | 72% | On track |
| Query latency P99 | under 500ms | 380ms | 85% | On track |
| Dashboard load time | under 3s | 2.4s | 90% | On track |
| AI response latency | under 5s | 4.2s | 35% | At risk |
| Data freshness | under 15min | 8min | 65% | On track |
Error Budget
The error budget visualization shows how much unreliability is "allowed" before the SLO is breached:
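The arithmetic is straightforward: for an availability SLO over a rolling window, the budget is the window's total minutes times the allowed failure fraction. For 99.9% over 30 days that is 30 × 24 × 60 × 0.001 = 43.2 minutes. A minimal sketch:

```typescript
// Error-budget arithmetic for an availability SLO over a rolling
// window, e.g. budgetMinutes(99.9, 30) = 43.2 minutes.
function budgetMinutes(sloPct: number, windowDays: number): number {
  return windowDays * 24 * 60 * (1 - sloPct / 100);
}

// Fraction of the budget already consumed, as a percentage.
function budgetConsumedPct(
  downtimeMinutes: number,
  sloPct: number,
  windowDays: number
): number {
  return (downtimeMinutes / budgetMinutes(sloPct, windowDays)) * 100;
}
```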
API Availability (99.9% SLO, 30-day window)
Total budget: 43.2 minutes of downtime
Consumed: 12.1 minutes (28%)
Remaining: 31.1 minutes (72%)
Burn rate: 0.8x (sustainable)
[=========|...................] 28% consumed
Days remaining in window: 18
Cost Page
The CostPage provides cost analysis and optimization recommendations:
Cost Dashboard
| Widget | Content |
|---|---|
| Monthly cost | Current month running total with daily breakdown |
| Cost by service | Stacked area chart showing cost per service |
| Cost by tenant | Top tenants by resource consumption |
| Cost trends | 6-month trend with projection |
| Optimization recommendations | AI-generated cost reduction suggestions |
| Budget alerts | Tenants approaching budget limits |
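The "cost trends with projection" widget might, in its simplest form, extrapolate the month-to-date running total linearly. A naive sketch (the real widget may use a more sophisticated model):

```typescript
// Naive month-end projection from the month-to-date running total:
// average daily spend so far, scaled to the full month.
function projectMonthEnd(
  runningTotal: number,
  dayOfMonth: number,
  daysInMonth: number
): number {
  return (runningTotal / dayOfMonth) * daysInMonth;
}
```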
Ops Chat
The ChatPage provides an AI-assisted operations chat for troubleshooting:
| Capability | Example |
|---|---|
| Issue diagnosis | "Why is the AI Service showing high latency?" |
| Metric queries | "Show me the P99 latency for query engine over the last hour" |
| Runbook execution | "Run the database connection pool health check" |
| Impact analysis | "What would happen if we restart the Trino coordinator?" |
| Historical analysis | "Were there any similar incidents in the last 30 days?" |
Managed Operations Data Hook
// frontend/ops-workbench/src/hooks/use-mops-data.ts
export function useMopsData() {
// Aggregated operations data from multiple sources
// Combines metrics from Prometheus, logs from Loki,
// alerts from Alertmanager, and incidents from the ops service
return {
health, // Platform health summary
alerts, // Active alerts
incidents, // Open incidents
deployments, // Recent deployments
metrics, // Key platform metrics
isLoading,
error,
refresh,
};
}
Key Source Files
| Component | Location |
|---|---|
| Application entry | frontend/ops-workbench/src/App.tsx |
| Dashboard page | frontend/ops-workbench/src/pages/DashboardPage.tsx |
| Alerts page | frontend/ops-workbench/src/pages/AlertsPage.tsx |
| Incidents page | frontend/ops-workbench/src/pages/IncidentsPage.tsx |
| Deployments page | frontend/ops-workbench/src/pages/DeploymentsPage.tsx |
| Infrastructure page | frontend/ops-workbench/src/pages/InfrastructurePage.tsx |
| Reliability page | frontend/ops-workbench/src/pages/ReliabilityPage.tsx |
| Cost page | frontend/ops-workbench/src/pages/CostPage.tsx |
| Chat page | frontend/ops-workbench/src/pages/ChatPage.tsx |
| Health components | frontend/ops-workbench/src/components/health/ |
| Dashboard components | frontend/ops-workbench/src/components/dashboard/ |
| Ops chat | frontend/ops-workbench/src/components/chat/ |
| MOPS data hook | frontend/ops-workbench/src/hooks/use-mops-data.ts |