Operations Dashboard

The Operations Dashboard provides a real-time overview of platform health, service status, resource utilization, and active alerts. It is the primary landing page for operations teams and is designed for wall-display monitoring with auto-refresh.

Dashboard Layout

The dashboard is organized into four quadrants:

Quadrant	Position	Content
Service Health Grid	Top left	Color-coded service status matrix
Alert Timeline	Top right	Active and recent alerts with severity
Resource Utilization	Bottom left	CPU, memory, network gauges by namespace
Key Metrics	Bottom right	Request rate, error rate, latency charts

Service Health Grid

A matrix display showing health status for all platform services:

Service Group	Services	Health Check
Control Plane	IAM, Tenant, Config, API Gateway, Billing, Audit, Notification	HTTP `/health`
Data Plane	AI, ML, Query Engine, Catalog, Semantic Layer, Pipeline	HTTP `/health`
Data Infrastructure	PostgreSQL, Redis, Kafka, Qdrant, Dgraph, Trino, ClickHouse	TCP + protocol
ML Infrastructure	Ray, MLflow, Feast, Triton	HTTP health
Monitoring	Prometheus, Grafana, Loki, Tempo	HTTP health

Status Indicators

Status	Color	Criteria
Healthy	Green	All pods running, health checks passing
Warning	Yellow	Pod restarts detected or partial health check failure
Critical	Red	Service unavailable or major health check failure
Unknown	Gray	Unable to determine status
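
The criteria above can be read as a small decision function. The following is a minimal sketch, not the dashboard's actual logic: the `ServiceSnapshot` shape and its field names are hypothetical, and treating "all pods down or all checks failing" as the threshold for Critical is an assumption, since the source does not define what counts as a major failure.

```typescript
type ServiceStatus = 'healthy' | 'warning' | 'critical' | 'unknown';

// Hypothetical input shape; the dashboard's real data model is not documented here.
interface ServiceSnapshot {
  podsRunning: number;
  podsDesired: number;
  recentRestarts: number;
  checksPassed: number;
  checksTotal: number;
}

// Colors taken from the Status Indicators table.
const STATUS_COLOR: Record<ServiceStatus, string> = {
  healthy: 'green',
  warning: 'yellow',
  critical: 'red',
  unknown: 'gray',
};

function deriveStatus(s: ServiceSnapshot | undefined): ServiceStatus {
  // No data, or no health checks reported: status cannot be determined.
  if (!s || s.checksTotal === 0) return 'unknown';
  // Assumption: no running pods or no passing checks means the service is unavailable.
  if (s.podsRunning === 0 || s.checksPassed === 0) return 'critical';
  const allPodsUp = s.podsRunning >= s.podsDesired;
  const allChecksPass = s.checksPassed === s.checksTotal;
  // Restarts or a partial failure degrade the service without taking it down.
  if (s.recentRestarts > 0 || !allPodsUp || !allChecksPass) return 'warning';
  return 'healthy';
}
```

Checking for Unknown first keeps a missing data source from being reported as an outage.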

Alert Summary

The Alert Timeline quadrant lists active and recent alerts with their severity and age:

```typescript
interface Alert {
  id: string;
  severity: 'critical' | 'warning' | 'info';
  title: string;
  service: string;
  message: string;
  fired_at: string;
  acknowledged: boolean;
  assignee?: string;
}
```

Severity	Display	Sound
Critical	Red badge, pulsing	Optional audible alert
Warning	Yellow badge	None
Info	Blue badge	None
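
To show how severity and age might drive the display order, here is a hedged sketch that sorts alerts by severity and then newest first, and formats the age label. It assumes `fired_at` is an ISO 8601 timestamp, which the source does not state. `sortAlerts`, `formatAge`, and the newest-first tie-break are illustrative choices, not documented behavior.

```typescript
interface Alert {
  id: string;
  severity: 'critical' | 'warning' | 'info';
  title: string;
  service: string;
  message: string;
  fired_at: string; // assumed to be ISO 8601, e.g. "2024-01-01T00:00:00Z"
  acknowledged: boolean;
  assignee?: string;
}

// Lower rank sorts first.
const SEVERITY_RANK: Record<Alert['severity'], number> = {
  critical: 0,
  warning: 1,
  info: 2,
};

// Returns a new array: most severe first, then most recently fired first.
function sortAlerts(alerts: Alert[]): Alert[] {
  return [...alerts].sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
      Date.parse(b.fired_at) - Date.parse(a.fired_at),
  );
}

// Formats the time since the alert fired as a compact label such as "5m" or "2h".
function formatAge(firedAt: string, now: Date = new Date()): string {
  const firedMs = Date.parse(firedAt);
  if (Number.isNaN(firedMs)) return 'unknown';
  const seconds = Math.max(0, Math.floor((now.getTime() - firedMs) / 1000));
  if (seconds < 60) return `${seconds}s`;
  if (seconds < 3600) return `${Math.floor(seconds / 60)}m`;
  if (seconds < 86400) return `${Math.floor(seconds / 3600)}h`;
  return `${Math.floor(seconds / 86400)}d`;
}
```

Passing `now` explicitly makes the age label easy to test and keeps every row on one refresh tick consistent.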

Resource Utilization

Real-time gauge charts for cluster resources:

Metric	Display	Source
CPU utilization	Gauge (0-100%)	Prometheus `node_cpu_seconds_total`
Memory utilization	Gauge (0-100%)	Prometheus `node_memory_MemAvailable_bytes`
Disk usage	Gauge (0-100%)	Prometheus `node_filesystem_avail_bytes`
Network I/O	Sparkline	Prometheus `node_network_receive_bytes_total`
Pod count	Counter	Kubernetes API
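
The source metrics listed above are raw counters and byte values, so a 0-100% gauge needs a derived query. The sketch below shows one plausible set of PromQL expressions and a helper that reads a Prometheus instant-query response. The expressions, the 5-minute rate window, and the helper names are assumptions, not the dashboard's actual queries. `node_memory_MemTotal_bytes` and `node_filesystem_size_bytes` are standard node_exporter metrics that the table does not list.

```typescript
// Assumed PromQL for each gauge; cluster-wide aggregates, 5-minute rate window.
const GAUGE_QUERIES = {
  cpu: '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))',
  memory: '100 * (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes))',
  disk: '100 * (1 - sum(node_filesystem_avail_bytes) / sum(node_filesystem_size_bytes))',
  networkRx: 'sum(rate(node_network_receive_bytes_total[5m]))',
} as const;

// Builds a URL for the Prometheus HTTP API instant-query endpoint.
function instantQueryUrl(baseUrl: string, promql: string): string {
  const trimmed = baseUrl.replace(/\/+$/, '');
  return `${trimmed}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

// Subset of the instant-query response shape for a vector result.
interface PromInstantResponse {
  status: 'success' | 'error';
  data?: {
    resultType: string;
    result: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
}

// Reads the first sample and clamps it to the gauge range; null means "no data".
function gaugePercent(resp: PromInstantResponse): number | null {
  const sample = resp.data?.result[0];
  if (resp.status !== 'success' || !sample) return null;
  const value = Number.parseFloat(sample.value[1]);
  return Number.isNaN(value) ? null : Math.min(100, Math.max(0, value));
}
```

Returning `null` rather than 0 for a missing sample lets the gauge show "no data" instead of a misleading idle reading.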

Key Metrics Charts

Time-series charts for platform-wide operational metrics:

Chart	Metric	Time Range
Request Rate	Total HTTP requests per second	Last 1 hour
Error Rate	5xx responses as percentage	Last 1 hour
Latency (p95)	95th percentile response time	Last 1 hour
Active Sessions	WebSocket + chat sessions	Last 1 hour
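
The source does not say where these values are computed; in practice the aggregation likely happens in the metrics backend. As an illustration of the definitions only, the following sketch computes an error-rate percentage and a p95 using the nearest-rank method. The function names and the nearest-rank choice are assumptions.

```typescript
// 5xx responses as a percentage of all responses; 0 when there was no traffic.
function errorRatePercent(totalRequests: number, serverErrors: number): number {
  return totalRequests > 0 ? (serverErrors * 100) / totalRequests : 0;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError('percentile of empty sample set');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Example: 5 server errors out of 200 requests.
const exampleErrorRate = errorRatePercent(200, 5);
// Example: p95 latency over twenty samples of 1..20 ms.
const exampleP95 = percentile(Array.from({ length: 20 }, (_, i) => i + 1), 95);
```

A percentile is preferred over a mean for latency because a handful of slow requests can hide behind a healthy-looking average.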

Auto-Refresh

Each dashboard component refreshes on its own interval. The defaults below can be changed per component:

Component	Refresh Interval	Configurable
Service health	10 seconds	Yes
Alerts	15 seconds	Yes
Resource gauges	15 seconds	Yes
Metric charts	30 seconds	Yes
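
One way to model per-component intervals is a defaults map merged with user overrides, plus one timer per component. This is a sketch under assumptions: the component keys, `resolveIntervals`, `startAutoRefresh`, and the 1-second lower bound are hypothetical and are not the dashboard's configuration API.

```typescript
type DashboardComponent = 'serviceHealth' | 'alerts' | 'resourceGauges' | 'metricCharts';

// Defaults from the Auto-Refresh table, in milliseconds.
const DEFAULT_INTERVALS_MS: Record<DashboardComponent, number> = {
  serviceHealth: 10_000,
  alerts: 15_000,
  resourceGauges: 15_000,
  metricCharts: 30_000,
};

// Assumed floor so a bad setting cannot hammer the backends.
const MIN_INTERVAL_MS = 1_000;

// Merges overrides into the defaults, ignoring invalid or too-small values.
function resolveIntervals(
  overrides: Partial<Record<DashboardComponent, number>> = {},
): Record<DashboardComponent, number> {
  const resolved = { ...DEFAULT_INTERVALS_MS };
  for (const key of Object.keys(overrides) as DashboardComponent[]) {
    const value = overrides[key];
    if (value !== undefined && Number.isFinite(value) && value >= MIN_INTERVAL_MS) {
      resolved[key] = value;
    }
  }
  return resolved;
}

// Starts one timer per component and returns a function that stops them all.
function startAutoRefresh(
  refreshers: Record<DashboardComponent, () => void>,
  intervals: Record<DashboardComponent, number> = DEFAULT_INTERVALS_MS,
): () => void {
  const timers = (Object.keys(refreshers) as DashboardComponent[]).map((component) =>
    setInterval(refreshers[component], intervals[component]),
  );
  return () => timers.forEach((timer) => clearInterval(timer));
}
```

Separate timers let cheap checks such as service health run more often than the heavier chart queries. Calling the returned stop function when the view unmounts avoids leaking timers.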

Wall Display Mode

A full-screen mode optimized for wall-mounted displays:

Feature	Description
Dark theme	Reduced eye strain for constant display
Large fonts	Readable from distance
No navigation	Dashboard only, no sidebar
Auto-rotate	Cycle between dashboard views
Alert flash	Screen flash on critical alerts
