Ops Workbench Overview
The Ops Workbench is a Next.js/React application designed for platform operators and SRE teams to monitor, troubleshoot, and manage the MATIH platform. It provides an operations dashboard, observability tools, incident management, and a conversational chat interface for AI-assisted operations.
Application Structure
The Ops Workbench is located at frontend/ops-workbench/ and uses Next.js for server-side rendering:
| Directory | Purpose |
|---|---|
| src/pages/ | Page components for each operational area |
| src/components/ | Reusable operations UI components |
| src/hooks/ | Custom hooks for data fetching and state |
| src/services/ | API client services |
| src/stores/ | Zustand state stores |
| src/types/ | TypeScript type definitions |
| src/utils/ | Utility functions |
Pages
| Page | Component | Route | Description |
|---|---|---|---|
| Dashboard | DashboardPage | /ops/dashboard | Operations overview with key metrics |
| Observability | ObservabilityPage | /ops/observability | Health monitoring, logs, traces |
| Incidents | IncidentsPage | /ops/incidents | Incident tracking and management |
| Chat | ChatPage | /ops/chat | AI-assisted operations chat |
| Alerts | AlertsPage | /ops/alerts | Alert management and configuration |
| Deployments | DeploymentsPage | /ops/deployments | Deployment history and rollbacks |
| Infrastructure | InfrastructurePage | /ops/infrastructure | Cluster resource monitoring |
| Reliability | ReliabilityPage | /ops/reliability | SLO tracking and error budgets |
| Cost | CostPage | /ops/cost | Infrastructure cost analysis |
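The route table above could be mirrored in code as a typed map, so that navigation links and active-page detection share one source of truth. The following is only a sketch: the names `OPS_ROUTES`, `OpsPage`, and `pageForPath` are hypothetical and are not taken from the Ops Workbench source.

```typescript
// Hypothetical route map; the paths are copied from the Pages table,
// the identifier names are assumptions made for illustration.
const OPS_ROUTES = {
  dashboard: "/ops/dashboard",
  observability: "/ops/observability",
  incidents: "/ops/incidents",
  chat: "/ops/chat",
  alerts: "/ops/alerts",
  deployments: "/ops/deployments",
  infrastructure: "/ops/infrastructure",
  reliability: "/ops/reliability",
  cost: "/ops/cost",
} as const;

type OpsPage = keyof typeof OPS_ROUTES;

// Resolve a URL pathname to the page that owns it.
// Matches the route itself or any nested path beneath it.
function pageForPath(pathname: string): OpsPage | undefined {
  const pages = Object.keys(OPS_ROUTES) as OpsPage[];
  return pages.find((page) => {
    const route: string = OPS_ROUTES[page];
    return pathname === route || pathname.startsWith(route + "/");
  });
}
```

With this shape, a sidebar component could highlight the active entry by calling `pageForPath` with the current pathname, and adding a page means adding one key to the map.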
Technology Stack
| Technology | Version | Purpose |
|---|---|---|
| Next.js | 14.x | React framework with SSR |
| React | 18.x | UI framework |
| TypeScript | 5.x | Type-safe development |
| Tailwind CSS | 3.x | Utility-first styling |
| Zustand | 4.x | State management |
| TanStack Query | 5.x | Server state management |
| Recharts | 2.x | Chart visualizations |
Data Sources
The Ops Workbench aggregates data from multiple observability backends:
| Source | Data | Protocol |
|---|---|---|
| Prometheus | Metrics | PromQL via Observability API |
| Grafana Loki | Logs | LogQL via Observability API |
| Grafana Tempo | Traces | TraceQL via Observability API |
| Kubernetes API | Pod/node status | REST via Infrastructure Service |
| Ops Agent Service | AI operations | REST + WebSocket |
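Since metrics, logs, and traces are all reached through the Observability API, a client service might build requests for the three backends with a single helper. The sketch below is illustrative only: the `/observability/<backend>/query` path, the parameter names, and the `buildQueryUrl` function are assumptions, not the documented API.

```typescript
// Hypothetical sketch: the endpoint layout and parameter names are assumed
// for illustration and may differ from the real Observability API.
type Backend = "prometheus" | "loki" | "tempo";

// Query language each backend expects, as listed in the Data Sources table.
const QUERY_LANGUAGE: Record<Backend, string> = {
  prometheus: "PromQL",
  loki: "LogQL",
  tempo: "TraceQL",
};

interface ObservabilityQuery {
  backend: Backend;
  query: string; // PromQL, LogQL, or TraceQL expression
  start: Date;   // inclusive start of the time range
  end: Date;     // end of the time range
}

// Build a GET URL for a query, percent-encoding every parameter.
function buildQueryUrl(baseUrl: string, q: ObservabilityQuery): string {
  const params: Array<[string, string]> = [
    ["query", q.query],
    ["start", q.start.toISOString()],
    ["end", q.end.toISOString()],
  ];
  const queryString = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // Trim trailing slashes so the joined path has exactly one separator.
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/observability/${q.backend}/query?${queryString}`;
}
```

Keeping the backend as a discriminated string lets one data-fetching hook serve all three views, while `QUERY_LANGUAGE` can label the query editor for the selected backend.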
Development
cd frontend/ops-workbench
npm install
npm run dev  # Starts the development server on the development port
Detailed Sections
| Section | Content |
|---|---|
| Operations Dashboard | Key metrics, service health, alerts summary |
| Observability and Health | Logs, metrics, traces, health checks |
| Incident Management | Incident lifecycle, postmortems, runbooks |
| Chat Interface | AI-assisted operations conversation |