Observability and Health

The Observability page provides unified access to logs, metrics, traces, and health checks across all platform services. It aggregates data from Prometheus, Grafana Loki, and Grafana Tempo through the Observability API, enabling operators to correlate signals across the three pillars of observability.

Observability Pillars

Pillar	Backend	Query Language	UI Component
Metrics	Prometheus	PromQL	Time-series charts
Logs	Grafana Loki	LogQL	Log stream viewer
Traces	Grafana Tempo	TraceQL	Trace waterfall

Metrics Explorer

The metrics explorer allows operators to build custom PromQL queries and visualize results:

Feature	Description
Query builder	Visual PromQL builder with autocomplete
Raw query	Direct PromQL input for advanced users
Chart types	Line, area, bar, heatmap, gauge
Time range	Adjustable time window with zoom
Alerts	Link metrics to alert rules

Common Queries

Query	Purpose
`rate(http_requests_total[5m])`	Request rate per second
`histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))`	p95 latency
`sum(rate(http_requests_total{status=~"5.."}[5m]))`	5xx error rate (errors per second)
`ai_service_llm_tokens_total`	LLM token consumption
`ai_service_active_sessions`	Active chat sessions
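To chart one of these queries outside the UI, it can be sent to Prometheus' standard `/api/v1/query_range` endpoint. The following is a minimal sketch that only builds the request URL. The base URL is a placeholder, and the error-ratio expression is an illustrative combination of the queries above rather than one the platform is known to ship.

```python
from urllib.parse import urlencode

# Assumption: placeholder address; the real Prometheus endpoint may differ.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def build_range_query_url(query: str, start: float, end: float, step: str = "15s") -> str:
    """Build a URL for Prometheus' /api/v1/query_range endpoint.

    start and end are Unix timestamps in seconds; step is the resolution.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}"


# Illustrative: fraction of requests answered with a 5xx status.
ERROR_RATIO = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)

url = build_range_query_url(ERROR_RATIO, start=1_700_000_000, end=1_700_003_600)
```

Dividing the 5xx rate by the total request rate gives a ratio that stays comparable as traffic grows, which is often easier to alert on than the raw error rate.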

Log Viewer

The log viewer streams and searches logs from Grafana Loki:

Feature	Description
Service filter	Filter logs by service name
Severity filter	Filter by log level (debug, info, warn, error)
Full-text search	Search across log content
Time correlation	Sync log time range with metrics and traces
Context expansion	Show surrounding log lines
Live tail	Real-time log streaming

Log Query Examples

Filter AI Service errors:

{service="ai-service"} |= "error"

Filter data plane errors with JSON parsing:

{namespace="matih-data-plane"} | json | level="error"

Filter query engine timeouts:

{service="query-engine"} |= "timeout"
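The same queries can be assembled programmatically and sent to Loki's `/loki/api/v1/query_range` endpoint, which expects nanosecond timestamps. This sketch assumes a placeholder Loki address and hypothetical helper names (`build_logql`, `build_loki_url`). It does no escaping of label values, so it is only suitable for trusted input.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Assumption: placeholder address; the real Loki endpoint may differ.
LOKI_URL = "http://loki.example.internal:3100"
MAX_LOG_LINES = 1000  # matches the "Max log lines" default on this page


def build_logql(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL stream selector with an optional line-contains filter."""
    selector = ", ".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    query = "{" + selector + "}"
    if contains:
        query += f' |= "{contains}"'
    return query


def build_loki_url(query: str, start_s: int, end_s: int, limit: int = MAX_LOG_LINES) -> str:
    """Build a query_range URL; Loki takes start/end as Unix nanoseconds."""
    params = {
        "query": query,
        "start": int(start_s * 1_000_000_000),
        "end": int(end_s * 1_000_000_000),
        "limit": min(limit, MAX_LOG_LINES),  # never exceed the configured cap
        "direction": "backward",  # newest lines first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


ai_errors = build_logql({"service": "ai-service"}, contains="error")
ai_errors_url = build_loki_url(ai_errors, 1_700_000_000, 1_700_003_600)
```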

Trace Explorer

The trace explorer displays distributed traces across services:

Feature	Description
Trace search	Find traces by service, duration, status
Waterfall view	Visual span timeline across services
Span details	Attributes, events, and status per span
Service map	Dependency graph derived from traces
Correlation	Link to related logs and metrics

Trace Search

Filter	Options
Service	Any platform service
Duration	Minimum/maximum trace duration
Status	OK, Error
Time range	Configurable window
Tags	Custom span attributes
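These filters map naturally onto a TraceQL expression. The sketch below shows one way the service, duration, and status filters might be combined and passed to Tempo's `/api/search` endpoint. The function names and the Tempo address are assumptions for illustration, not the Observability API's actual interface.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: placeholder address; the real Tempo endpoint may differ.
TEMPO_URL = "http://tempo.example.internal:3200"
MAX_TRACES = 100  # matches the "Max traces" default on this page


def build_traceql(
    service: str,
    min_duration_ms: Optional[int] = None,
    max_duration_ms: Optional[int] = None,
    error_only: bool = False,
) -> str:
    """Combine the trace search filters into a single TraceQL spanset filter."""
    conditions = [f'resource.service.name = "{service}"']
    if min_duration_ms is not None:
        conditions.append(f"duration >= {min_duration_ms}ms")
    if max_duration_ms is not None:
        conditions.append(f"duration <= {max_duration_ms}ms")
    if error_only:
        conditions.append("status = error")
    return "{ " + " && ".join(conditions) + " }"


def build_tempo_search_url(traceql: str, start_s: int, end_s: int, limit: int = MAX_TRACES) -> str:
    """Build a Tempo search URL; start and end are Unix seconds."""
    params = {"q": traceql, "start": start_s, "end": end_s, "limit": min(limit, MAX_TRACES)}
    return f"{TEMPO_URL}/api/search?{urlencode(params)}"


slow_errors = build_traceql("query-engine", min_duration_ms=500, error_only=True)
slow_errors_url = build_tempo_search_url(slow_errors, 1_700_000_000, 1_700_003_600)
```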

Health Checks

The page aggregates health check results for all services:

Health Check	Type	Frequency
HTTP health	`GET /health`	Every 10 seconds
HTTP readiness	`GET /health/ready`	Every 15 seconds
TCP check	Port connectivity	Every 10 seconds
Database check	Query execution	Every 30 seconds
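The page does not specify how individual check results roll up into a per-service status. As one possible approach, the sketch below treats a service as healthy when every check passes, degraded when only some pass, and unhealthy when none do. The status names, the `CheckResult` type, and the roll-up rule are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List

# Check intervals in seconds, taken from the table above.
CHECK_INTERVALS_S = {"http_health": 10, "http_readiness": 15, "tcp": 10, "database": 30}


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class CheckResult:
    service: str
    check: str  # one of the keys in CHECK_INTERVALS_S
    passed: bool


def aggregate(results: Iterable[CheckResult]) -> Dict[str, Status]:
    """Roll individual check results up into one status per service."""
    outcomes: Dict[str, List[bool]] = {}
    for result in results:
        outcomes.setdefault(result.service, []).append(result.passed)

    summary: Dict[str, Status] = {}
    for service, passed in outcomes.items():
        if all(passed):
            summary[service] = Status.HEALTHY
        elif any(passed):
            summary[service] = Status.DEGRADED  # hypothetical rule: partial failure
        else:
            summary[service] = Status.UNHEALTHY
    return summary
```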

Correlation

The key feature of the Observability page is signal correlation. A typical investigation proceeds as follows:

1. Click a spike in a metrics chart.
2. The log viewer automatically filters to that time window.
3. Find related traces with elevated latency.
4. Drill down to specific span errors.
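The correlation workflow above can be approximated in a few lines: locate the peak in a metric series, pad it into a time window, and reuse that window for a log query and a trace query. This is a sketch under assumptions: the function names, the 60-second padding, and the 500 ms latency threshold are illustrative, and the page's actual correlation logic is not documented here.

```python
from typing import Dict, List, Tuple, Union

Sample = Tuple[int, float]  # (Unix timestamp in seconds, metric value)


def spike_window(samples: List[Sample], padding_s: int = 60) -> Tuple[int, int]:
    """Return a (start, end) window centered on the highest sample."""
    if not samples:
        raise ValueError("samples must not be empty")
    peak_ts, _ = max(samples, key=lambda sample: sample[1])
    return peak_ts - padding_s, peak_ts + padding_s


def correlated_queries(
    service: str,
    samples: List[Sample],
    padding_s: int = 60,
    min_duration_ms: int = 500,
) -> Dict[str, Union[int, str]]:
    """Derive log and trace queries that share the spike's time window."""
    start, end = spike_window(samples, padding_s)
    return {
        "start": start,
        "end": end,
        # LogQL: error lines from the service during the window.
        "logql": f'{{service="{service}"}} |= "error"',
        # TraceQL: traces from the service slower than the threshold.
        "traceql": f'{{ resource.service.name = "{service}" && duration > {min_duration_ms}ms }}',
    }


series = [(100, 1.0), (160, 9.5), (220, 2.0)]
plan = correlated_queries("ai-service", series)
```

Sharing a single `start` and `end` across all three backends is what keeps the signals aligned, so the logs and traces returned describe the same moment as the spike.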

Configuration

Setting	Default	Description
Default time range	Last 1 hour	Initial time window
Refresh interval	15 seconds	Auto-refresh period
Max log lines	1000	Maximum logs per query
Max traces	100	Maximum traces per search
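For scripts that call the backends directly, these defaults can be mirrored in a small settings object so that client-side limits match what the page enforces. The class and field names below are hypothetical. Only the default values come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservabilitySettings:
    """Hypothetical container mirroring the documented defaults."""

    default_time_range_s: int = 3600  # last 1 hour
    refresh_interval_s: int = 15
    max_log_lines: int = 1000
    max_traces: int = 100

    def clamp_log_limit(self, requested: int) -> int:
        """Keep a requested log limit between 1 and max_log_lines."""
        return max(1, min(requested, self.max_log_lines))

    def clamp_trace_limit(self, requested: int) -> int:
        """Keep a requested trace limit between 1 and max_traces."""
        return max(1, min(requested, self.max_traces))


settings = ObservabilitySettings()
```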

Frontend Telemetry

The ops-workbench is wrapped with ObservabilityProvider from @matih/shared, which automatically:

Initializes Sentry error tracking (when VITE_SENTRY_DSN is configured)
Reports Web Vitals (the Core Web Vitals CLS, LCP, and INP, plus FID, FCP, and TTFB) to the analytics backend
Tracks user sessions with visibility API integration
Catches uncaught errors and unhandled promise rejections globally

This provides full client-side observability for the operator experience, complementing the server-side metrics visible in the observability dashboards.
