Observability and Health
The Observability page provides unified access to logs, metrics, traces, and health checks across all platform services. It aggregates data from Prometheus, Grafana Loki, and Grafana Tempo through the Observability API, enabling operators to correlate signals across the three pillars of observability.
Observability Pillars
| Pillar | Backend | Query Language | UI Component |
|---|---|---|---|
| Metrics | Prometheus | PromQL | Time-series charts |
| Logs | Grafana Loki | LogQL | Log stream viewer |
| Traces | Grafana Tempo | TraceQL | Trace waterfall |
Metrics Explorer
The metrics explorer allows operators to build custom PromQL queries and visualize results:
| Feature | Description |
|---|---|
| Query builder | Visual PromQL builder with autocomplete |
| Raw query | Direct PromQL input for advanced users |
| Chart types | Line, area, bar, heatmap, gauge |
| Time range | Adjustable time window with zoom |
| Alerts | Link metrics to alert rules |
Common Queries
| Query | Purpose |
|---|---|
| `rate(http_requests_total[5m])` | Request rate per second |
| `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` | p95 latency |
| `sum(rate(http_requests_total{status=~"5.."}[5m]))` | Error rate (5xx responses per second) |
| `ai_service_llm_tokens_total` | LLM token consumption |
| `ai_service_active_sessions` | Active chat sessions |
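The queries above can also be run outside the UI. The following is a minimal sketch, assuming direct access to Prometheus at a hypothetical address (`PROMETHEUS_URL`); the Observability API may expose a different route. It only builds the request URL for the standard Prometheus `/api/v1/query_range` endpoint, and it combines two of the queries above into an error ratio. The helper name `range_query_url` is illustrative, not part of the platform.

```python
from urllib.parse import urlencode

# Assumed address; the Observability API may proxy Prometheus under another URL.
PROMETHEUS_URL = "http://prometheus:9090"


def range_query_url(promql: str, start: float, end: float, step: str = "15s") -> str:
    """Build a Prometheus /api/v1/query_range URL for a PromQL expression.

    start and end are Unix timestamps in seconds; step is the resolution.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}"


# Share of requests that returned a 5xx status over the last 5 minutes,
# derived by dividing the error rate by the total request rate.
ERROR_RATIO = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

url = range_query_url(ERROR_RATIO, start=1_700_000_000, end=1_700_003_600)
```

The resulting URL can be fetched with any HTTP client. Dividing by the total request rate gives a ratio between 0 and 1, which is usually easier to alert on than the raw 5xx rate.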
Log Viewer
The log viewer streams and searches logs from Grafana Loki:
| Feature | Description |
|---|---|
| Service filter | Filter logs by service name |
| Severity filter | Filter by log level (debug, info, warn, error) |
| Full-text search | Search across log content |
| Time correlation | Sync log time range with metrics and traces |
| Context expansion | Show surrounding log lines |
| Live tail | Real-time log streaming |
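The service, severity, and full-text filters above map naturally onto LogQL. The sketch below shows one way they could be composed into a query and a Loki `/loki/api/v1/query_range` request URL. It assumes logs carry a `service` label and are JSON-formatted with a `level` field, as in the query examples on this page; `LOKI_URL`, `build_logql`, and `loki_query_url` are hypothetical names, not the platform's actual implementation.

```python
import json
from typing import Optional
from urllib.parse import urlencode

LOKI_URL = "http://loki:3100"  # assumed address


def build_logql(service: str, level: Optional[str] = None,
                text: Optional[str] = None) -> str:
    """Compose a LogQL query from service, severity, and search filters."""
    # json.dumps produces a double-quoted string with quotes and
    # backslashes escaped, which LogQL string literals accept.
    query = "{service=" + json.dumps(service) + "}"
    if text:
        query += " |= " + json.dumps(text)  # line filter: substring match
    if level:
        query += " | json | level=" + json.dumps(level)  # parse, then filter
    return query


def loki_query_url(logql: str, start_s: int, end_s: int, limit: int = 1000) -> str:
    """Build a Loki range-query URL; timestamps are sent as nanoseconds."""
    params = {
        "query": logql,
        "start": start_s * 1_000_000_000,
        "end": end_s * 1_000_000_000,
        "limit": limit,           # mirrors the "Max log lines" setting
        "direction": "backward",  # newest entries first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


logql = build_logql("ai-service", level="error", text="timeout")
log_url = loki_query_url(logql, 1_700_000_000, 1_700_003_600)
```

Placing the line filter before `| json` lets Loki discard non-matching lines before it parses them, which is generally cheaper.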
Log Query Examples
Filter AI Service errors:

    {service="ai-service"} |= "error"

Filter data plane errors with JSON parsing:

    {namespace="matih-data-plane"} | json | level="error"

Filter query engine timeouts:

    {service="query-engine"} |= "timeout"

Trace Explorer
The trace explorer displays distributed traces across services:
| Feature | Description |
|---|---|
| Trace search | Find traces by service, duration, status |
| Waterfall view | Visual span timeline across services |
| Span details | Attributes, events, and status per span |
| Service map | Dependency graph derived from traces |
| Correlation | Link to related logs and metrics |
Trace Search
| Filter | Options |
|---|---|
| Service | Any platform service |
| Duration | Minimum/maximum trace duration |
| Status | OK, Error |
| Time range | Configurable window |
| Tags | Custom span attributes |
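As an illustration of how these filters could translate into TraceQL, here is a minimal sketch. It assumes services are identified by the `resource.service.name` attribute and that tags are span-scoped attributes. It uses the `traceDuration` and `status` intrinsics; newer Tempo releases also offer scoped spellings such as `trace:duration`. The function `build_traceql` is hypothetical and not the page's actual query builder.

```python
import json
import re
from typing import Dict, Optional

# Durations such as "500ms", "2s", or "1.5m".
_DURATION = re.compile(r"^\d+(\.\d+)?(ns|us|ms|s|m|h)$")


def build_traceql(service: Optional[str] = None,
                  min_duration: Optional[str] = None,
                  max_duration: Optional[str] = None,
                  status: Optional[str] = None,
                  tags: Optional[Dict[str, object]] = None) -> str:
    """Turn trace search filters into a single TraceQL spanset filter."""
    conditions = []
    if service:
        conditions.append("resource.service.name = " + json.dumps(service))
    for op, value in ((">=", min_duration), ("<=", max_duration)):
        if value:
            if not _DURATION.match(value):
                raise ValueError(f"invalid duration: {value!r}")
            conditions.append(f"traceDuration {op} {value}")
    if status:
        normalized = status.lower()
        if normalized not in ("ok", "error"):
            raise ValueError(f"unsupported status: {status!r}")
        conditions.append(f"status = {normalized}")
    # Sort tags so the same filters always produce the same query string.
    for key, value in sorted((tags or {}).items()):
        conditions.append(f"span.{key} = " + json.dumps(value))
    return "{ " + " && ".join(conditions) + " }" if conditions else "{ }"


traceql = build_traceql(service="query-engine", min_duration="500ms",
                        status="Error", tags={"http.method": "GET"})
```

With no filters the function returns `{ }`, which matches every span, so the time range and the maximum trace count remain the only limits on the result.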
Health Checks
The page aggregates health check results for all services:
| Health Check | Type | Frequency |
|---|---|---|
| HTTP health | GET /health | Every 10 seconds |
| HTTP readiness | GET /health/ready | Every 15 seconds |
| TCP check | Port connectivity | Every 10 seconds |
| Database check | Query execution | Every 30 seconds |
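One way to picture the aggregation is a "worst result wins" rule per service, sketched below. The intervals come from the table above; the `Status` levels, the `critical` flag, and the function names are assumptions made for illustration, since the page does not document how results are combined.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable

# Check intervals in seconds, taken from the table above.
INTERVAL_SECONDS = {"http_health": 10, "http_readiness": 15, "tcp": 10, "database": 30}


class Status(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


@dataclass(frozen=True)
class CheckResult:
    name: str              # key from INTERVAL_SECONDS
    ok: bool               # whether the most recent run succeeded
    critical: bool = True  # hypothetical: non-critical failures only degrade


def is_due(check: str, last_run: float, now: float) -> bool:
    """Return True when the check's interval has elapsed since its last run."""
    return now - last_run >= INTERVAL_SECONDS[check]


def aggregate(results: Iterable[CheckResult]) -> Status:
    """Reduce individual check results to the worst observed status."""
    worst = Status.HEALTHY
    for result in results:
        if not result.ok:
            severity = Status.UNHEALTHY if result.critical else Status.DEGRADED
            worst = max(worst, severity)
    return worst


overall = aggregate([
    CheckResult("http_health", ok=True),
    CheckResult("database", ok=False, critical=False),
])
```

Here `overall` evaluates to `Status.DEGRADED`, because the only failing check is marked non-critical.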
Correlation
The key feature of the Observability page is signal correlation. A typical investigation moves from one signal to the next:
- Click a spike in a metrics chart.
- The log viewer automatically filters to that time window.
- Find related traces with elevated latency.
- Drill down to specific span errors.
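The workflow above depends on applying one time window to all three backends, which expect different timestamp units. The sketch below assumes a fixed padding around the clicked timestamp (the 150-second default is an arbitrary choice) and uses hypothetical names throughout. Prometheus and the Tempo search API take Unix seconds, while Loki takes nanoseconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CorrelationWindow:
    start_s: int  # Unix seconds
    end_s: int    # Unix seconds

    def prometheus_params(self) -> Dict[str, int]:
        """Prometheus range queries accept Unix seconds."""
        return {"start": self.start_s, "end": self.end_s}

    def loki_params(self) -> Dict[str, int]:
        """Loki range queries accept Unix nanoseconds."""
        return {
            "start": self.start_s * 1_000_000_000,
            "end": self.end_s * 1_000_000_000,
        }

    def tempo_params(self) -> Dict[str, int]:
        """Tempo search accepts Unix seconds."""
        return {"start": self.start_s, "end": self.end_s}


def window_around(spike_ts: int, padding_s: int = 150) -> CorrelationWindow:
    """Center a window on the clicked spike, padded on both sides."""
    if padding_s <= 0:
        raise ValueError("padding_s must be positive")
    return CorrelationWindow(spike_ts - padding_s, spike_ts + padding_s)


window = window_around(1_700_000_000)
```

Keeping the window in one immutable object means the log and trace queries cannot drift apart when the operator zooms or pans the chart.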
Configuration
| Setting | Default | Description |
|---|---|---|
| Default time range | Last 1 hour | Initial time window |
| Refresh interval | 15 seconds | Auto-refresh period |
| Max log lines | 1000 | Maximum logs per query |
| Max traces | 100 | Maximum traces per search |
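For reference, the defaults above could be modeled as a small settings object. This is a sketch only: the field names are invented, and treating the two maxima as hard caps on user-requested limits is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilitySettings:
    default_time_range_s: int = 3600  # Last 1 hour
    refresh_interval_s: int = 15      # Auto-refresh period
    max_log_lines: int = 1000         # Maximum logs per query
    max_traces: int = 100             # Maximum traces per search

    def __post_init__(self) -> None:
        # Reject zero or negative values for every setting.
        for name in ("default_time_range_s", "refresh_interval_s",
                     "max_log_lines", "max_traces"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

    def clamp_log_limit(self, requested: int) -> int:
        """Keep a requested log limit between 1 and max_log_lines."""
        return max(1, min(requested, self.max_log_lines))

    def clamp_trace_limit(self, requested: int) -> int:
        """Keep a requested trace limit between 1 and max_traces."""
        return max(1, min(requested, self.max_traces))


settings = ObservabilitySettings()
```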
Frontend Telemetry
The ops-workbench is wrapped with ObservabilityProvider from @matih/shared, which automatically:
- Initializes Sentry error tracking (when `VITE_SENTRY_DSN` is configured)
- Reports Core Web Vitals (CLS, LCP, INP, FID, FCP, TTFB) to the analytics backend
- Tracks user sessions with visibility API integration
- Catches uncaught errors and unhandled promise rejections globally
This provides full client-side observability for the operator experience, complementing the server-side metrics visible in the observability dashboards.