MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Ops Workbench

The Ops Workbench is the platform's operational monitoring interface and runs on port 3006. It provides real-time service health dashboards, performance metrics visualization, alert management, incident tracking, deployment monitoring, infrastructure views, reliability analysis, and cost tracking. It serves SREs, DevOps teams, and platform operators who need to monitor, diagnose, and respond to operational issues across the MATIH platform.


Application Structure

frontend/ops-workbench/src/
  app/                    # Application-level configuration
  components/
    chat/                 # Ops AI chat assistant
    dashboard/            # Dashboard widgets and layouts
    health/               # Health monitoring components
    layout/               # Layout components
  hooks/
    use-mops-data.ts      # Managed operations data hook
  pages/
    AlertsPage.tsx        # Alert management
    ChatPage.tsx          # AI-assisted operations chat
    CostPage.tsx          # Cost analysis and tracking
    DashboardPage.tsx     # Main operations dashboard
    DeploymentsPage.tsx   # Deployment monitoring
    IncidentsPage.tsx     # Incident management
    InfrastructurePage.tsx # Infrastructure overview
    ReliabilityPage.tsx   # Reliability metrics and SLOs
  services/               # Ops-specific service utilities
  stores/                 # Ops workbench state management
  styles/                 # Ops-specific styles
  types/                  # Ops-specific TypeScript types
  utils/                  # Ops utility functions
  App.tsx                 # Root application component
  main.tsx                # Vite entry point
  module.tsx              # Module federation entry

Key Numbers

| Metric | Value |
| --- | --- |
| Port | 3006 |
| Pages | 8 |
| Component groups | 4 |
| Custom hooks | 1 |
| Backend integrations | Ops Agent Service, Observability API, Infrastructure Service |

Page Architecture

The Ops Workbench organizes its functionality into eight dedicated pages:

| Page | Route | Purpose |
| --- | --- | --- |
| Dashboard | /dashboard | Main operations overview with health widgets |
| Alerts | /alerts | Alert configuration, active alerts, alert history |
| Incidents | /incidents | Incident tracking, timeline, postmortems |
| Deployments | /deployments | Deployment status, rollback, version tracking |
| Infrastructure | /infrastructure | Node health, resource utilization, storage |
| Reliability | /reliability | SLOs, error budgets, availability metrics |
| Cost | /cost | Cost analysis, budget tracking, optimization |
| Chat | /chat | AI-assisted operational troubleshooting |
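
The page-to-route mapping above could be captured in a typed route table. The sketch below is illustrative only: `OpsRoute`, `OPS_ROUTES`, and `findRoute` are hypothetical names that do not come from the Ops Workbench source, and the actual router setup in `App.tsx` may look different.

```typescript
// Hypothetical route table; paths and page names mirror the table above.
interface OpsRoute {
  path: string;    // URL path within the Ops Workbench
  page: string;    // Page component name under src/pages/
  purpose: string; // Short description of the page
}

const OPS_ROUTES: readonly OpsRoute[] = [
  { path: "/dashboard", page: "DashboardPage", purpose: "Main operations overview" },
  { path: "/alerts", page: "AlertsPage", purpose: "Alert management" },
  { path: "/incidents", page: "IncidentsPage", purpose: "Incident management" },
  { path: "/deployments", page: "DeploymentsPage", purpose: "Deployment monitoring" },
  { path: "/infrastructure", page: "InfrastructurePage", purpose: "Infrastructure overview" },
  { path: "/reliability", page: "ReliabilityPage", purpose: "Reliability metrics and SLOs" },
  { path: "/cost", page: "CostPage", purpose: "Cost analysis and tracking" },
  { path: "/chat", page: "ChatPage", purpose: "AI-assisted operations chat" },
];

// Resolve a URL path to its route entry, or undefined for unknown paths.
function findRoute(path: string): OpsRoute | undefined {
  return OPS_ROUTES.find((route) => route.path === path);
}
```

Keeping the routes in one typed list means navigation menus and the router can be generated from the same data.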

Operations Dashboard

The DashboardPage provides a single-pane-of-glass view of platform health:

Dashboard Layout

+-----------------------------------------------------------+
|  Operations Dashboard                    [Time: Last 1h]  |
+-----------------------------------------------------------+
|                                                           |
|  +------ Platform Health ------+  +--- Active Alerts ---+ |
|  |                             |  |                     | |
|  |  Services: 24/24 healthy    |  |  Critical: 0        | |
|  |  Pods: 48/48 running        |  |  Warning:  3        | |
|  |  Uptime: 99.97%             |  |  Info:     12       | |
|  |                             |  |                     | |
|  +-----------------------------+  +---------------------+ |
|                                                           |
|  +---- Request Rate -----------+  +--- Error Rate ------+ |
|  |  [Line chart: req/s over    |  |  [Line chart: 5xx   | |
|  |   time by service]          |  |   rate over time]   | |
|  +-----------------------------+  +---------------------+ |
|                                                           |
|  +---- P99 Latency ------------+  +--- Recent Events ---+ |
|  |  [Line chart: p99 latency   |  |  Deploy: ai-svc v3  | |
|  |   by endpoint]              |  |  Alert: high CPU    | |
|  |                             |  |  Resolve: db conn   | |
|  +-----------------------------+  +---------------------+ |
|                                                           |
+-----------------------------------------------------------+

Dashboard Widgets

| Widget | Data Source | Refresh |
| --- | --- | --- |
| Platform health summary | Observability API | 10s (WebSocket push) |
| Active alerts | Ops Agent Service | Real-time (WebSocket) |
| Request rate | Prometheus (via API) | 15s |
| Error rate | Prometheus (via API) | 15s |
| P99 latency | Prometheus (via API) | 15s |
| Recent events | Event stream | Real-time (WebSocket) |
| Resource utilization | Kubernetes API (via API) | 30s |
| Top slow queries | Query Engine metrics | 60s |
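
The widgets above fall into two groups: those that poll on a fixed interval and those that receive updates over a WebSocket. A minimal sketch of how that split might be modeled follows. The `RefreshPolicy` type, the `WIDGET_REFRESH` map, and `pollingIntervalMs` are hypothetical names introduced for illustration; only the intervals come from the table.

```typescript
// Hypothetical model of the refresh behavior listed in the table above.
type RefreshPolicy =
  | { kind: "poll"; intervalSeconds: number }
  | { kind: "push"; cadenceSeconds?: number }; // WebSocket-driven updates

const WIDGET_REFRESH: Record<string, RefreshPolicy> = {
  "Platform health summary": { kind: "push", cadenceSeconds: 10 },
  "Active alerts": { kind: "push" },
  "Request rate": { kind: "poll", intervalSeconds: 15 },
  "Error rate": { kind: "poll", intervalSeconds: 15 },
  "P99 latency": { kind: "poll", intervalSeconds: 15 },
  "Recent events": { kind: "push" },
  "Resource utilization": { kind: "poll", intervalSeconds: 30 },
  "Top slow queries": { kind: "poll", intervalSeconds: 60 },
};

// Polling interval in milliseconds, or null when the widget is push-driven
// (or unknown) and therefore needs no polling timer.
function pollingIntervalMs(widget: string): number | null {
  const policy: RefreshPolicy | undefined = WIDGET_REFRESH[widget];
  if (policy === undefined || policy.kind === "push") return null;
  return policy.intervalSeconds * 1000;
}
```

A widget component could pass the result to its data-fetching layer and skip timer setup entirely when the value is `null`.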

Alert Management

The AlertsPage lists active and historical alerts and lets operators configure alert rules:

Alert List

| Column | Description |
| --- | --- |
| Severity | Critical, Warning, Info |
| Alert name | Human-readable alert name |
| Service | Affected service |
| Tenant | Affected tenant (if tenant-specific) |
| Status | Firing, Resolved, Silenced, Acknowledged |
| Started | When the alert first fired |
| Duration | How long the alert has been active |
| Last notified | When the last notification was sent |

Alert Configuration

| Configuration | Options |
| --- | --- |
| Metric | Prometheus metric to monitor |
| Condition | Threshold, rate of change, absence |
| Duration | How long the condition must hold before firing |
| Severity | Critical, Warning, Info |
| Notification channels | Slack, PagerDuty, email, webhook |
| Silencing | Time-based or condition-based silence rules |
| Runbook | Link to runbook for resolution steps |
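
To make the configuration options concrete, the sketch below models an alert rule as a typed object. It is an assumption-laden illustration, not the Workbench's actual schema: the `AlertRule` and `Condition` types, the field names, and the metric name `query_engine_latency_p99_ms` are all hypothetical, and silencing rules are left out for brevity.

```typescript
// Hypothetical alert rule shape based on the configuration table above.
type Severity = "critical" | "warning" | "info";
type Channel = "slack" | "pagerduty" | "email" | "webhook";
type Condition =
  | { type: "threshold"; operator: ">" | "<"; value: number }
  | { type: "rate_of_change"; percentPerMinute: number }
  | { type: "absence" };

interface AlertRule {
  name: string;
  metric: string;      // Prometheus metric to monitor
  condition: Condition;
  forSeconds: number;  // How long the condition must hold before firing
  severity: Severity;
  channels: Channel[];
  runbookUrl?: string; // Optional link to resolution steps
}

const exampleRule: AlertRule = {
  name: "Query engine P99 latency high",
  metric: "query_engine_latency_p99_ms", // hypothetical metric name
  condition: { type: "threshold", operator: ">", value: 500 },
  forSeconds: 120,
  severity: "warning",
  channels: ["slack", "pagerduty"],
};

// Render a condition as a short human-readable phrase.
function describeCondition(c: Condition): string {
  switch (c.type) {
    case "threshold":
      return `${c.operator} ${c.value}`;
    case "rate_of_change":
      return `changes by more than ${c.percentPerMinute}%/min`;
    case "absence":
      return "is absent";
  }
}

// One-line summary suitable for an alert list row.
function describeRule(rule: AlertRule): string {
  return `[${rule.severity}] ${rule.metric} ${describeCondition(rule.condition)} for ${rule.forSeconds}s`;
}
```

The discriminated union for `Condition` keeps threshold-only fields, such as `operator`, from appearing on absence rules.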

Alert Lifecycle

Condition met
    |
    v
[Pending] -- duration threshold met --> [Firing]
                                           |
                                     Notification sent
                                           |
                    +-----------+----------+----------+
                    |           |                     |
              [Acknowledged]  [Silenced]        Condition cleared
                    |           |                     |
                    v           v                     v
               Still firing  Silence expires      [Resolved]
                    |           |
                    v           v
              [Firing]    [Firing]
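
The lifecycle diagram above can be read as a small state machine. The following sketch encodes only the transitions drawn in the diagram. The state and event names are hypothetical labels chosen for this example, and the real alerting backend may allow additional transitions (for instance, resolving an alert while it is silenced).

```typescript
// Hypothetical state machine mirroring the alert lifecycle diagram above.
type AlertState = "pending" | "firing" | "acknowledged" | "silenced" | "resolved";
type AlertEvent =
  | "duration_met"      // condition held for the configured duration
  | "acknowledge"       // operator acknowledges the alert
  | "silence"           // a silence rule is applied
  | "still_firing"      // acknowledged alert is still active
  | "silence_expired"   // the silence window ended
  | "condition_cleared"; // the underlying condition no longer holds

const TRANSITIONS: Record<AlertState, Partial<Record<AlertEvent, AlertState>>> = {
  pending: { duration_met: "firing" },
  firing: {
    acknowledge: "acknowledged",
    silence: "silenced",
    condition_cleared: "resolved",
  },
  acknowledged: { still_firing: "firing" },
  silenced: { silence_expired: "firing" },
  resolved: {},
};

// Return the next state, or the current state if the event does not apply.
function nextAlertState(state: AlertState, event: AlertEvent): AlertState {
  return TRANSITIONS[state][event] ?? state;
}
```

Events that are not valid for the current state are ignored here rather than rejected, which is one possible design choice; a stricter implementation could throw instead.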

Incident Management

The IncidentsPage provides incident tracking from detection through resolution and postmortem:

Incident Workflow

| Status | Description | Actions Available |
| --- | --- | --- |
| Open | Incident detected, not yet assigned | Assign, set severity, add to timeline |
| Investigating | Team is actively investigating | Add updates, link alerts, escalate |
| Identified | Root cause identified | Document cause, begin remediation |
| Mitigated | Impact reduced or eliminated | Confirm mitigation, monitor |
| Resolved | Incident fully resolved | Close incident, schedule postmortem |
| Postmortem | Learning from the incident | Document findings, action items |

Incident Timeline

+-----------------------------------------------------------+
|  Incident: INC-2024-0042                                   |
|  Title: Elevated query latency in tenant acme-corp         |
|  Severity: P2                                              |
|  Duration: 25 minutes                                      |
+-----------------------------------------------------------+
|  Timeline:                                                 |
|                                                            |
|  09:15  [Alert] P99 latency > 500ms for query engine       |
|  09:17  [Auto] Incident created from alert                 |
|  09:18  [User: alice] Acknowledged, investigating          |
|  09:22  [User: alice] Root cause: Trino worker OOM         |
|  09:25  [User: bob] Scaling Trino workers 3 -> 5           |
|  09:30  [Auto] Latency returning to normal                 |
|  09:35  [User: alice] Confirmed resolution                 |
|  09:40  [User: alice] Incident resolved                    |
|                                                            |
|  Impact: 12 queries timed out, 3 dashboards affected       |
|  Root cause: Memory spike from concurrent large queries     |
|  Action items:                                             |
|    [ ] Implement query cost limits per tenant               |
|    [ ] Add Trino memory monitoring alert                    |
+-----------------------------------------------------------+

Deployment Monitoring

The DeploymentsPage tracks all service deployments across the platform:

Deployment List

| Column | Description |
| --- | --- |
| Service | Service being deployed |
| Version | New version being deployed |
| Previous version | Version being replaced |
| Environment | dev, staging, production |
| Status | Pending, In Progress, Completed, Failed, Rolled Back |
| Strategy | Rolling, Blue-Green, Canary |
| Started | Deployment start time |
| Duration | Time to complete |
| Initiated by | User or automation that triggered the deployment |

Deployment Detail

| Section | Content |
| --- | --- |
| Progress | Visual progress indicator showing deployment stages |
| Pod status | Real-time pod replacement progress |
| Health checks | Liveness and readiness probe results |
| Metrics comparison | Before/after comparison of error rate, latency, and throughput |
| Rollback | One-click rollback with confirmation |
| Logs | Deployment controller logs and pod startup logs |
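
The metrics comparison section reports how error rate, latency, and throughput changed across a deployment. One way such a comparison might be computed is sketched below. The `ServiceMetrics` shape and both function names are assumptions made for this example, not part of the documented API.

```typescript
// Hypothetical metric snapshot taken before and after a deployment.
interface ServiceMetrics {
  errorRate: number;     // Fraction of failed requests, e.g. 0.002
  p99LatencyMs: number;  // P99 latency in milliseconds
  throughputRps: number; // Requests per second
}

// Relative change from `before` to `after`, in percent.
// Returns null when the baseline is zero, since the ratio is undefined.
function percentChange(before: number, after: number): number | null {
  if (before === 0) return null;
  return ((after - before) / before) * 100;
}

// Per-metric percentage change across a deployment.
function compareMetrics(before: ServiceMetrics, after: ServiceMetrics) {
  return {
    errorRate: percentChange(before.errorRate, after.errorRate),
    p99LatencyMs: percentChange(before.p99LatencyMs, after.p99LatencyMs),
    throughputRps: percentChange(before.throughputRps, after.throughputRps),
  };
}
```

A positive change in error rate or latency after a rollout is the kind of signal that would prompt an operator to consider the rollback action.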

Infrastructure Page

The InfrastructurePage shows node health and per-namespace resource quotas for the underlying infrastructure:

Node Health

| Metric | Display |
| --- | --- |
| Node count | Total and by availability zone |
| CPU utilization | Per-node and cluster average |
| Memory utilization | Per-node with pressure indicators |
| Disk I/O | Read/write throughput and IOPS |
| Network | Ingress/egress bandwidth by node |
| Pod density | Pods scheduled per node vs. capacity |

Resource Quotas

+-----------------------------------------------------------+
|  Resource Quotas by Namespace                              |
+-----------------------------------------------------------+
|  Namespace              CPU         Memory      Pods       |
|  matih-control-plane    4/8 cores   8/16 GiB   24/50     |
|  matih-data-plane       12/24 cores 32/64 GiB  48/100    |
|  matih-observability    2/4 cores   6/12 GiB   12/30     |
|  tenant-acme-corp       3/8 cores   6/16 GiB   14/50     |
|  tenant-globex          2/8 cores   4/16 GiB   14/50     |
+-----------------------------------------------------------+

Reliability Page

The ReliabilityPage displays Service Level Objectives (SLOs) and error budgets:

SLO Dashboard

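| SLO | Target | Current | Budget Remaining | Status |
| --- | --- | --- | --- | --- |
| API availability | 99.9% | 99.95% | 72% | On track |
| Query latency P99 | under 500ms | 380ms | 85% | On track |
| Dashboard load time | under 3s | 2.4s | 90% | On track |
| AI response latency | under 5s | 4.2s | 35% | At risk |
| Data freshness | under 15min | 8min | 65% | On track |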

Error Budget

The error budget visualization shows how much unreliability is "allowed" before the SLO is breached:

API Availability (99.9% SLO, 30-day window)
  Total budget: 43.2 minutes of downtime
  Consumed: 12.1 minutes (28%)
  Remaining: 31.1 minutes (72%)
  Burn rate: 0.8x (sustainable)

  [=========|...................] 28% consumed
  Days remaining in window: 18
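
The figures in the example above follow from simple arithmetic: a 30-day window contains 43,200 minutes, and a 99.9% target leaves 0.1% of that as budget. The sketch below reproduces the calculation. The `ErrorBudget` type and `errorBudget` function are hypothetical names for illustration, and burn rate is left out because the lookback window used to compute it is not specified here.

```typescript
// Hypothetical helper reproducing the error budget arithmetic shown above.
interface ErrorBudget {
  totalMinutes: number;     // Allowed downtime over the whole window
  consumedPercent: number;  // Share of the budget already used
  remainingMinutes: number; // Downtime still available
  remainingPercent: number; // Share of the budget still available
}

function errorBudget(
  sloTarget: number,       // e.g. 0.999 for a 99.9% availability SLO
  windowDays: number,      // e.g. 30
  downtimeMinutes: number, // Downtime observed so far in the window
): ErrorBudget {
  const windowMinutes = windowDays * 24 * 60;
  const totalMinutes = windowMinutes * (1 - sloTarget);
  const remainingMinutes = totalMinutes - downtimeMinutes;
  return {
    totalMinutes,
    consumedPercent: (downtimeMinutes / totalMinutes) * 100,
    remainingMinutes,
    remainingPercent: (remainingMinutes / totalMinutes) * 100,
  };
}

const budget = errorBudget(0.999, 30, 12.1);
console.log(budget.totalMinutes.toFixed(1));     // "43.2"
console.log(budget.consumedPercent.toFixed(0));  // "28"
console.log(budget.remainingMinutes.toFixed(1)); // "31.1"
console.log(budget.remainingPercent.toFixed(0)); // "72"
```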

Cost Page

The CostPage provides cost analysis and optimization recommendations:

Cost Dashboard

| Widget | Content |
| --- | --- |
| Monthly cost | Current month running total with daily breakdown |
| Cost by service | Stacked area chart showing cost per service |
| Cost by tenant | Top tenants by resource consumption |
| Cost trends | 6-month trend with projection |
| Optimization recommendations | AI-generated cost reduction suggestions |
| Budget alerts | Tenants approaching budget limits |

Ops Chat

The ChatPage provides an AI-assisted operations chat for troubleshooting:

| Capability | Example |
| --- | --- |
| Issue diagnosis | "Why is the AI Service showing high latency?" |
| Metric queries | "Show me the P99 latency for query engine over the last hour" |
| Runbook execution | "Run the database connection pool health check" |
| Impact analysis | "What would happen if we restart the Trino coordinator?" |
| Historical analysis | "Were there any similar incidents in the last 30 days?" |

Managed Operations Data Hook

// frontend/ops-workbench/src/hooks/use-mops-data.ts
export function useMopsData() {
  // Aggregates operations data from multiple sources:
  // metrics from Prometheus, logs from Loki, alerts from
  // Alertmanager, and incidents from the ops service.
  // (Data-fetching logic is omitted in this excerpt.)

  return {
    health,           // Platform health summary
    alerts,           // Active alerts
    incidents,        // Open incidents
    deployments,      // Recent deployments
    metrics,          // Key platform metrics
    isLoading,
    error,
    refresh,
  };
}
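
A component consuming this hook typically needs to turn the returned fields into a single status. The sketch below shows one way that derivation might look as a pure function. The `MopsSnapshot` shape, the `BannerState` values, and the precedence rules are all assumptions made for this example; the hook's real types live in the ops workbench source and may differ.

```typescript
// Hypothetical subset of the hook's return value; the real types may differ.
interface MopsSnapshot {
  alerts: { severity: "critical" | "warning" | "info" }[];
  incidents: { id: string }[];
  isLoading: boolean;
  error: Error | null;
}

type BannerState = "loading" | "error" | "critical" | "degraded" | "healthy";

// Collapse the snapshot into one status, checking the most urgent cases first.
function bannerState(data: MopsSnapshot): BannerState {
  if (data.isLoading) return "loading";
  if (data.error !== null) return "error";
  if (data.alerts.some((a) => a.severity === "critical")) return "critical";
  const hasWarning = data.alerts.some((a) => a.severity === "warning");
  if (hasWarning || data.incidents.length > 0) return "degraded";
  return "healthy";
}
```

Because the function is pure, it can be unit-tested without rendering a component or mocking the hook.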

Key Source Files

| Component | Location |
| --- | --- |
| Application entry | frontend/ops-workbench/src/App.tsx |
| Dashboard page | frontend/ops-workbench/src/pages/DashboardPage.tsx |
| Alerts page | frontend/ops-workbench/src/pages/AlertsPage.tsx |
| Incidents page | frontend/ops-workbench/src/pages/IncidentsPage.tsx |
| Deployments page | frontend/ops-workbench/src/pages/DeploymentsPage.tsx |
| Infrastructure page | frontend/ops-workbench/src/pages/InfrastructurePage.tsx |
| Reliability page | frontend/ops-workbench/src/pages/ReliabilityPage.tsx |
| Cost page | frontend/ops-workbench/src/pages/CostPage.tsx |
| Chat page | frontend/ops-workbench/src/pages/ChatPage.tsx |
| Health components | frontend/ops-workbench/src/components/health/ |
| Dashboard components | frontend/ops-workbench/src/components/dashboard/ |
| Ops chat | frontend/ops-workbench/src/components/chat/ |
| MOPS data hook | frontend/ops-workbench/src/hooks/use-mops-data.ts |