MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Operations Dashboard

The Operations Dashboard provides a real-time overview of platform health, service status, resource utilization, and active alerts. It is the primary landing page for operations teams and is designed for wall-display monitoring, with each panel refreshing automatically.


Dashboard Layout

The dashboard is organized into four quadrants:

| Quadrant | Position | Content |
| --- | --- | --- |
| Service Health Grid | Top left | Color-coded service status matrix |
| Alert Timeline | Top right | Active and recent alerts with severity |
| Resource Utilization | Bottom left | CPU, memory, network gauges by namespace |
| Key Metrics | Bottom right | Request rate, error rate, latency charts |

Service Health Grid

A matrix display showing health status for all platform services:

| Service Group | Services | Health Check |
| --- | --- | --- |
| Control Plane | IAM, Tenant, Config, API Gateway, Billing, Audit, Notification | HTTP /health |
| Data Plane | AI, ML, Query Engine, Catalog, Semantic Layer, Pipeline | HTTP /health |
| Data Infrastructure | PostgreSQL, Redis, Kafka, Qdrant, Dgraph, Trino, ClickHouse | TCP + protocol |
| ML Infrastructure | Ray, MLflow, Feast, Triton | HTTP health |
| Monitoring | Prometheus, Grafana, Loki, Tempo | HTTP health |

Status Indicators

| Status | Color | Criteria |
| --- | --- | --- |
| Healthy | Green | All pods running, health checks passing |
| Warning | Yellow | Pod restarts detected or partial health check failure |
| Critical | Red | Service unavailable or major health check failure |
| Unknown | Gray | Unable to determine status |
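
The criteria above can be read as a small decision function. The following is a minimal sketch, not the dashboard's actual implementation: the `ServiceSnapshot` shape and its field names are hypothetical, and the thresholds for "unavailable" and "major failure" are assumptions, since the table does not define them.

```typescript
type ServiceStatus = 'healthy' | 'warning' | 'critical' | 'unknown';

// Hypothetical input shape; the field names are illustrative only.
interface ServiceSnapshot {
  podsDesired: number;
  podsRunning: number;
  recentRestarts: number;
  checksPassed: number;
  checksTotal: number;
}

// Colors as listed in the status table.
const STATUS_COLOR: Record<ServiceStatus, string> = {
  healthy: 'green',
  warning: 'yellow',
  critical: 'red',
  unknown: 'gray',
};

function deriveStatus(s: ServiceSnapshot | undefined): ServiceStatus {
  // No snapshot or no health checks reported: status cannot be determined.
  if (!s || s.checksTotal === 0) return 'unknown';
  // Assumption: "unavailable" means no running pods,
  // and "major failure" means every health check is failing.
  if (s.podsRunning === 0 || s.checksPassed === 0) return 'critical';
  // Restarts, a partial check failure, or missing pods downgrade to warning.
  if (
    s.recentRestarts > 0 ||
    s.checksPassed < s.checksTotal ||
    s.podsRunning < s.podsDesired
  ) {
    return 'warning';
  }
  return 'healthy';
}
```

A grid cell could then look up its color with `STATUS_COLOR[deriveStatus(snapshot)]`.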

Alert Summary

Active alerts are displayed with their severity and age:

```typescript
interface Alert {
  id: string;
  severity: 'critical' | 'warning' | 'info';
  title: string;
  service: string;
  message: string;
  fired_at: string;
  acknowledged: boolean;
  assignee?: string;
}
```

| Severity | Display | Sound |
| --- | --- | --- |
| Critical | Red badge, pulsing | Optional audible alert |
| Warning | Yellow badge | None |
| Info | Blue badge | None |
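
As an illustration of how a list of alerts might be ordered and labeled with an age, here is a minimal sketch. It assumes that `fired_at` holds an ISO 8601 timestamp and that critical alerts sort first with the oldest alert leading within each severity; the source specifies neither. `sortAlerts` and `formatAge` are hypothetical helper names.

```typescript
// Same shape as the Alert interface above, repeated so the sketch is self-contained.
interface Alert {
  id: string;
  severity: 'critical' | 'warning' | 'info';
  title: string;
  service: string;
  message: string;
  fired_at: string; // assumed to be an ISO 8601 timestamp
  acknowledged: boolean;
  assignee?: string;
}

const SEVERITY_RANK: Record<Alert['severity'], number> = {
  critical: 0,
  warning: 1,
  info: 2,
};

// Returns a new array: most severe first, then oldest first within a severity.
function sortAlerts(alerts: Alert[]): Alert[] {
  return [...alerts].sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
      Date.parse(a.fired_at) - Date.parse(b.fired_at),
  );
}

// Formats the time elapsed since the alert fired, e.g. "12m", "1h 30m", "2d 3h".
function formatAge(firedAt: string, now: Date = new Date()): string {
  const minutes = Math.max(
    0,
    Math.floor((now.getTime() - Date.parse(firedAt)) / 60000),
  );
  if (minutes < 60) return `${minutes}m`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `${hours}h ${minutes % 60}m`;
  return `${Math.floor(hours / 24)}d ${hours % 24}h`;
}
```

Passing `now` explicitly keeps `formatAge` deterministic, which is convenient when the whole panel re-renders on a refresh tick.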

Resource Utilization

Real-time gauge charts for cluster resources:

| Metric | Display | Source |
| --- | --- | --- |
| CPU utilization | Gauge (0-100%) | Prometheus `node_cpu_seconds_total` |
| Memory utilization | Gauge (0-100%) | Prometheus `node_memory_MemAvailable_bytes` |
| Disk usage | Gauge (0-100%) | Prometheus `node_filesystem_avail_bytes` |
| Network I/O | Sparkline | Prometheus `node_network_receive_bytes_total` |
| Pod count | Counter | Kubernetes API |
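
The source metrics above are raw counters and byte values, so a 0-100% gauge needs a derived expression. The sketch below shows one plausible set of PromQL queries and a helper that builds a Prometheus instant-query URL. The 5-minute rate window, the cluster-wide aggregation, and the companion metrics `node_memory_MemTotal_bytes` and `node_filesystem_size_bytes` are assumptions, as are the names `GAUGE_QUERIES`, `instantQueryUrl`, and `clampPercent`; the dashboard's actual queries are not documented here.

```typescript
// Candidate PromQL expressions built on the node_exporter metrics in the table.
const GAUGE_QUERIES = {
  // Share of CPU time not spent idle, averaged over all cores.
  cpu: '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))',
  // Memory in use = total minus available.
  memory:
    '100 * (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes))',
  // Disk in use = size minus available.
  disk:
    '100 * (1 - sum(node_filesystem_avail_bytes) / sum(node_filesystem_size_bytes))',
  // Bytes received per second, for the network sparkline.
  networkRx: 'sum(rate(node_network_receive_bytes_total[5m]))',
} as const;

// Builds a URL for the Prometheus HTTP API instant-query endpoint.
function instantQueryUrl(baseUrl: string, promql: string): string {
  const trimmed = baseUrl.replace(/\/+$/, '');
  return `${trimmed}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

// Keeps a gauge value inside the 0-100 range; non-finite values become 0.
function clampPercent(value: number): number {
  if (!Number.isFinite(value)) return 0;
  return Math.min(100, Math.max(0, value));
}
```

For example, `instantQueryUrl('http://prometheus:9090', GAUGE_QUERIES.cpu)` yields a URL that can be fetched on each refresh tick; the host name is a placeholder.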

Key Metrics Charts

Time-series charts for platform-wide operational metrics:

| Chart | Metric | Time Range |
| --- | --- | --- |
| Request Rate | Total HTTP requests per second | Last 1 hour |
| Error Rate | 5xx responses as percentage | Last 1 hour |
| Latency (p95) | 95th percentile response time | Last 1 hour |
| Active Sessions | WebSocket + chat sessions | Last 1 hour |
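
To make the error-rate and p95 definitions concrete, the sketch below shows the arithmetic on raw numbers. In practice these values would more likely be computed by the metrics backend; the source does not name the underlying metrics, so this example covers only the calculation. The function names are hypothetical, and the nearest-rank percentile method is one common choice among several.

```typescript
// 5xx responses as a percentage of all requests; 0 when there is no traffic.
function errorRatePercent(totalRequests: number, serverErrors: number): number {
  if (totalRequests <= 0) return 0;
  return (serverErrors * 100) / totalRequests;
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Returns NaN for an empty input.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index] ?? NaN;
}
```

For instance, with latencies of 10, 20, 30, and 40 ms, `percentile(latencies, 95)` selects rank 4 and returns 40, while 5 errors out of 200 requests gives an error rate of 2.5.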

Auto-Refresh

The dashboard auto-refreshes on configurable intervals:

| Component | Refresh Interval | Configurable |
| --- | --- | --- |
| Service health | 10 seconds | Yes |
| Alerts | 15 seconds | Yes |
| Resource gauges | 15 seconds | Yes |
| Metric charts | 30 seconds | Yes |
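
One way to model these defaults with per-component overrides is sketched below. The defaults come from the table; the component keys, the one-second minimum, and the function names `resolveIntervals` and `startAutoRefresh` are assumptions made for illustration, not the platform's configuration API.

```typescript
type DashboardComponent =
  | 'serviceHealth'
  | 'alerts'
  | 'resourceGauges'
  | 'metricCharts';

type RefreshIntervals = Record<DashboardComponent, number>;

// Default refresh intervals in seconds, as listed in the table.
const DEFAULT_REFRESH_SECONDS: RefreshIntervals = {
  serviceHealth: 10,
  alerts: 15,
  resourceGauges: 15,
  metricCharts: 30,
};

// Merges user overrides onto the defaults, ignoring invalid values.
// Assumption: intervals shorter than 1 second are rejected.
function resolveIntervals(
  overrides: Partial<RefreshIntervals> = {},
): RefreshIntervals {
  const resolved: RefreshIntervals = { ...DEFAULT_REFRESH_SECONDS };
  for (const key of Object.keys(DEFAULT_REFRESH_SECONDS) as DashboardComponent[]) {
    const value = overrides[key];
    if (typeof value === 'number' && Number.isFinite(value) && value >= 1) {
      resolved[key] = value;
    }
  }
  return resolved;
}

// Starts one timer per component and returns a function that stops them all.
function startAutoRefresh(
  intervals: RefreshIntervals,
  refresh: (component: DashboardComponent) => void,
): () => void {
  const timers = (Object.keys(intervals) as DashboardComponent[]).map(
    (component) =>
      setInterval(() => refresh(component), intervals[component] * 1000),
  );
  return () => timers.forEach((timer) => clearInterval(timer));
}
```

Calling `startAutoRefresh(resolveIntervals({ alerts: 5 }), refetch)` would poll alerts every 5 seconds and leave the other components on their defaults; the returned function should be called when the dashboard unmounts so that no timers leak.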

Wall Display Mode

A full-screen mode optimized for wall-mounted displays:

| Feature | Description |
| --- | --- |
| Dark theme | Reduced eye strain for constant display |
| Large fonts | Readable from distance |
| No navigation | Dashboard only, no sidebar |
| Auto-rotate | Cycle between dashboard views |
| Alert flash | Screen flash on critical alerts |
Alert flashScreen flash on critical alerts