Operations Dashboard
The Operations Dashboard provides a real-time overview of platform health, service status, resource utilization, and active alerts. It is the primary landing page for operations teams and is designed for wall-display monitoring with auto-refresh.
Dashboard Layout
The dashboard is organized into four quadrants:
| Quadrant | Position | Content |
|---|---|---|
| Service Health Grid | Top left | Color-coded service status matrix |
| Alert Timeline | Top right | Active and recent alerts with severity |
| Resource Utilization | Bottom left | CPU, memory, network gauges by namespace |
| Key Metrics | Bottom right | Request rate, error rate, latency charts |
Service Health Grid
A matrix display showing health status for all platform services:
| Service Group | Services | Health Check |
|---|---|---|
| Control Plane | IAM, Tenant, Config, API Gateway, Billing, Audit, Notification | HTTP /health |
| Data Plane | AI, ML, Query Engine, Catalog, Semantic Layer, Pipeline | HTTP /health |
| Data Infrastructure | PostgreSQL, Redis, Kafka, Qdrant, Dgraph, Trino, ClickHouse | TCP + protocol |
| ML Infrastructure | Ray, MLflow, Feast, Triton | HTTP health |
| Monitoring | Prometheus, Grafana, Loki, Tempo | HTTP health |
Status Indicators
| Status | Color | Criteria |
|---|---|---|
| Healthy | Green | All pods running, health checks passing |
| Warning | Yellow | Pod restarts detected or partial health check failure |
| Critical | Red | Service unavailable or major health check failure |
| Unknown | Gray | Unable to determine status |
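
The criteria in the table could be turned into a status function along the following lines. This is a minimal sketch, not the dashboard's actual logic: the `ServiceSnapshot` shape and its field names are hypothetical, and treating "major health check failure" as "no checks passing" is an assumption made for illustration.

```typescript
type ServiceStatus = 'healthy' | 'warning' | 'critical' | 'unknown';

// Hypothetical snapshot of one service; field names are illustrative only.
interface ServiceSnapshot {
  podsDesired: number;
  podsRunning: number;
  recentRestarts: number;
  checksTotal: number;
  checksPassing: number;
}

function deriveStatus(s: ServiceSnapshot | undefined): ServiceStatus {
  // No data, or no health checks reported: status cannot be determined.
  if (!s || s.checksTotal === 0) return 'unknown';
  // Assumption: "unavailable" means no running pods, "major failure" means no passing checks.
  if (s.podsRunning === 0 || s.checksPassing === 0) return 'critical';
  const allPodsRunning = s.podsRunning >= s.podsDesired;
  const allChecksPassing = s.checksPassing === s.checksTotal;
  if (allPodsRunning && allChecksPassing && s.recentRestarts === 0) return 'healthy';
  // Restarts, missing pods, or partially failing checks.
  return 'warning';
}

// Colors as listed in the Status Indicators table.
const statusColor: Record<ServiceStatus, string> = {
  healthy: 'green',
  warning: 'yellow',
  critical: 'red',
  unknown: 'gray',
};
```

Checking the critical conditions before the healthy ones means a service with zero passing checks is never reported as merely degraded.
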
Alert Summary
Active alerts displayed with severity and age:
```typescript
interface Alert {
  id: string;
  severity: 'critical' | 'warning' | 'info';
  title: string;
  service: string;
  message: string;
  fired_at: string;
  acknowledged: boolean;
  assignee?: string;
}
```
| Severity | Display | Sound |
|---|---|---|
| Critical | Red badge, pulsing | Optional audible alert |
| Warning | Yellow badge | None |
| Info | Blue badge | None |
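
To show how severity and age might drive the alert list, here is a small sketch that orders alerts and formats their age. It assumes `fired_at` holds an ISO 8601 timestamp and that alerts are listed most severe first, then oldest first; neither is stated above. `severityDisplay`, `sortAlerts`, and `formatAge` are hypothetical names.

```typescript
interface Alert {
  id: string;
  severity: 'critical' | 'warning' | 'info';
  title: string;
  service: string;
  message: string;
  fired_at: string; // assumed to be an ISO 8601 timestamp
  acknowledged: boolean;
  assignee?: string;
}

// Badge styling from the severity table; `rank` is an added sort key.
const severityDisplay = {
  critical: { badge: 'red', pulsing: true, rank: 0 },
  warning: { badge: 'yellow', pulsing: false, rank: 1 },
  info: { badge: 'blue', pulsing: false, rank: 2 },
} as const;

// Returns a new array: most severe first, then oldest first within a severity.
function sortAlerts(alerts: Alert[]): Alert[] {
  return [...alerts].sort(
    (a, b) =>
      severityDisplay[a.severity].rank - severityDisplay[b.severity].rank ||
      Date.parse(a.fired_at) - Date.parse(b.fired_at),
  );
}

// Formats the alert age relative to `now` as a compact string such as "5m" or "2h".
function formatAge(firedAt: string, now: Date): string {
  const minutes = Math.max(0, Math.floor((now.getTime() - Date.parse(firedAt)) / 60000));
  if (minutes < 60) return `${minutes}m`;
  const hours = Math.floor(minutes / 60);
  return hours < 24 ? `${hours}h` : `${Math.floor(hours / 24)}d`;
}
```

Passing `now` explicitly rather than reading the clock inside `formatAge` keeps the function deterministic, which helps when every badge on screen is re-rendered on the same refresh tick.
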
Resource Utilization
Real-time gauge charts for cluster resources:
| Metric | Display | Source |
|---|---|---|
| CPU utilization | Gauge (0-100%) | Prometheus node_cpu_seconds_total |
| Memory utilization | Gauge (0-100%) | Prometheus node_memory_MemAvailable_bytes |
| Disk usage | Gauge (0-100%) | Prometheus node_filesystem_avail_bytes |
| Network I/O | Sparkline | Prometheus node_network_receive_bytes_total |
| Pod count | Counter | Kubernetes API |
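
The Prometheus metrics in the table are raw counters and byte values, so a 0-100% gauge needs a derived expression. The sketch below shows PromQL one might use. The 5-minute rate window and the companion metrics `node_memory_MemTotal_bytes` and `node_filesystem_size_bytes` (standard node_exporter metrics, but not listed above) are assumptions, as are the `gaugeQueries` and `utilizationPercent` names.

```typescript
// Candidate PromQL per gauge. The 5m window and the "total" companion metrics
// are assumptions; they are not taken from the table above.
const gaugeQueries = {
  cpu: '100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))',
  memory: '100 * (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes))',
  disk: '100 * (1 - sum(node_filesystem_avail_bytes) / sum(node_filesystem_size_bytes))',
  networkRx: 'sum(rate(node_network_receive_bytes_total[5m]))',
} as const;

// Converts an "available out of total" pair into a utilization percentage,
// clamped to the gauge's 0-100 range.
function utilizationPercent(available: number, total: number): number {
  if (!(total > 0)) return 0; // missing or zero total: avoid dividing by zero
  const used = 100 * (1 - available / total);
  return Math.min(100, Math.max(0, used));
}
```

The same "1 minus available over total" shape applies to memory and disk, which is why a single helper can serve both gauges if the ratio is computed client-side instead of in PromQL.
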
Key Metrics Charts
Time-series charts for platform-wide operational metrics:
| Chart | Metric | Time Range |
|---|---|---|
| Request Rate | Total HTTP requests per second | Last 1 hour |
| Error Rate | 5xx responses as percentage | Last 1 hour |
| Latency (p95) | 95th percentile response time | Last 1 hour |
| Active Sessions | WebSocket + chat sessions | Last 1 hour |
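
As a worked illustration of two of these metrics, the sketch below computes an error-rate percentage and a p95 latency from raw samples. It uses the nearest-rank percentile method, which is an assumption: a Prometheus-backed chart would more likely derive p95 from histogram buckets. The function names are hypothetical.

```typescript
// Error rate: 5xx responses as a percentage of all responses in the window.
function errorRatePercent(totalResponses: number, serverErrors: number): number {
  return totalResponses === 0 ? 0 : (serverErrors * 100) / totalResponses;
}

// Percentile by the nearest-rank method: the smallest sample such that at
// least p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Example: latencies in milliseconds for ten requests.
const latenciesMs = [12, 15, 11, 240, 18, 14, 13, 16, 17, 19];
const p95 = percentile(latenciesMs, 95);
```

With ten samples, the 95th percentile by nearest rank is the largest sample, so a single slow request dominates the chart. This is one reason small windows make p95 noisy.
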
Auto-Refresh
The dashboard auto-refreshes on configurable intervals:
| Component | Refresh Interval | Configurable |
|---|---|---|
| Service health | 10 seconds | Yes |
| Alerts | 15 seconds | Yes |
| Resource gauges | 15 seconds | Yes |
| Metric charts | 30 seconds | Yes |
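
One way the per-component intervals could be modeled is a defaults map merged with user overrides, with one timer per component. This is a sketch under assumptions: the component keys, `resolveIntervals`, and `startAutoRefresh` are hypothetical names, and the source does not say how overrides are supplied or validated.

```typescript
type Component = 'serviceHealth' | 'alerts' | 'resourceGauges' | 'metricCharts';

// Default refresh intervals from the table above, in seconds.
const defaultIntervals: Record<Component, number> = {
  serviceHealth: 10,
  alerts: 15,
  resourceGauges: 15,
  metricCharts: 30,
};

// Merges overrides onto the defaults, ignoring missing or non-positive values.
function resolveIntervals(
  overrides: Partial<Record<Component, number>> = {},
): Record<Component, number> {
  const result = { ...defaultIntervals };
  for (const key of Object.keys(result) as Component[]) {
    const value = overrides[key];
    if (value !== undefined && value > 0) result[key] = value;
  }
  return result;
}

// Starts one timer per component and returns a function that stops them all.
function startAutoRefresh(
  refresh: (component: Component) => void,
  intervals: Record<Component, number> = defaultIntervals,
): () => void {
  const timers = (Object.keys(intervals) as Component[]).map((component) =>
    setInterval(() => refresh(component), intervals[component] * 1000),
  );
  return () => timers.forEach((timer) => clearInterval(timer));
}
```

A caller would typically keep the returned stop function and invoke it when the dashboard unmounts or when the intervals change, then start again with the new values.
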
Wall Display Mode
A full-screen mode optimized for wall-mounted displays:
| Feature | Description |
|---|---|
| Dark theme | Reduced eye strain for constant display |
| Large fonts | Readable from distance |
| No navigation | Dashboard only, no sidebar |
| Auto-rotate | Cycle between dashboard views |
| Alert flash | Screen flash on critical alerts |
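
The features above could be captured in a small settings object, with auto-rotate reduced to picking the next view in a list. Everything named here is hypothetical: the `WallDisplayConfig` fields, the default values, and the view identifiers are illustrative assumptions, since the source does not list the views or their rotation order.

```typescript
// Hypothetical wall-display settings; field names and defaults are illustrative.
interface WallDisplayConfig {
  darkTheme: boolean;
  fontScale: number; // multiplier applied to the base font size
  showNavigation: boolean;
  rotateViews: string[]; // views to cycle through, in order
  flashOnCritical: boolean;
}

const wallDefaults: WallDisplayConfig = {
  darkTheme: true,
  fontScale: 1.5,
  showNavigation: false,
  rotateViews: ['service-health', 'alerts', 'resources', 'key-metrics'],
  flashOnCritical: true,
};

// Returns the view to show after `current`, wrapping to the first view at the
// end of the list. An unknown `current` restarts the rotation from the first view.
function nextView(views: string[], current: string): string {
  if (views.length === 0) return current;
  const index = views.indexOf(current);
  return views[(index + 1) % views.length];
}
```

Because `indexOf` returns -1 for an unknown view, the modulo arithmetic falls back to the first entry without a separate branch.
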