MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Observability and Health

The Observability page provides unified access to logs, metrics, traces, and health checks across all platform services. It aggregates data from Prometheus, Grafana Loki, and Grafana Tempo through the Observability API, enabling operators to correlate signals across the three pillars of observability.


Observability Pillars

| Pillar | Backend | Query Language | UI Component |
|---|---|---|---|
| Metrics | Prometheus | PromQL | Time-series charts |
| Logs | Grafana Loki | LogQL | Log stream viewer |
| Traces | Grafana Tempo | TraceQL | Trace waterfall |

Metrics Explorer

The metrics explorer allows operators to build custom PromQL queries and visualize results:

| Feature | Description |
|---|---|
| Query builder | Visual PromQL builder with autocomplete |
| Raw query | Direct PromQL input for advanced users |
| Chart types | Line, area, bar, heatmap, gauge |
| Time range | Adjustable time window with zoom |
| Alerts | Link metrics to alert rules |

Common Queries

| Query | Purpose |
|---|---|
| rate(http_requests_total[5m]) | Per-second request rate, averaged over 5 minutes |
| histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) | p95 latency |
| sum(rate(http_requests_total{status=~"5.."}[5m])) | 5xx error rate (requests per second) |
| ai_service_llm_tokens_total | LLM token consumption |
| ai_service_active_sessions | Active chat sessions |
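
To run one of these queries outside the UI, a range query can be sent to the Prometheus HTTP API. The sketch below only builds the request URL for the standard /api/v1/query_range endpoint. The base URL and the 15-second step are assumptions, and this page does not say whether the Observability API proxies that endpoint unchanged.

```python
from urllib.parse import urlencode

# Assumption: placeholder address; the real Prometheus endpoint is not documented here.
PROMETHEUS_URL = "http://prometheus.example:9090"


def range_query_url(promql: str, start: int, end: int, step: str = "15s") -> str:
    """Return a URL for the Prometheus /api/v1/query_range endpoint.

    start and end are Unix timestamps in seconds; step is the resolution
    between returned data points.
    """
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}"


# 5xx error rate over a one-hour window.
url = range_query_url(
    'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    start=1_700_000_000,
    end=1_700_003_600,
)
print(url)
```

urlencode percent-encodes the braces, quotes, and brackets in the PromQL expression, so the query can be passed as-is.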

Log Viewer

The log viewer streams and searches logs from Grafana Loki:

| Feature | Description |
|---|---|
| Service filter | Filter logs by service name |
| Severity filter | Filter by log level (debug, info, warn, error) |
| Full-text search | Search across log content |
| Time correlation | Sync log time range with metrics and traces |
| Context expansion | Show surrounding log lines |
| Live tail | Real-time log streaming |

Log Query Examples

Filter AI Service errors:

{service="ai-service"} |= "error"

Filter data plane errors with JSON parsing:

{namespace="matih-data-plane"} | json | level="error"

Filter query engine timeouts:

{service="query-engine"} |= "timeout"
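
The service filter, full-text search, and severity filter map naturally onto these LogQL shapes. The following is a minimal sketch of how such a query string could be assembled. build_logql and its parameters are hypothetical names for illustration, not part of the platform's API.

```python
from typing import Dict, Optional


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(
    labels: Dict[str, str],
    contains: Optional[str] = None,
    level: Optional[str] = None,
) -> str:
    """Build a LogQL query from label matchers and optional filters.

    contains adds a line filter (|=); level parses each line as JSON
    and keeps entries whose "level" field equals the given value.
    """
    selector = ", ".join(f"{name}={_quote(value)}" for name, value in labels.items())
    query = "{" + selector + "}"
    if contains is not None:
        query += f" |= {_quote(contains)}"
    if level is not None:
        query += f" | json | level={_quote(level)}"
    return query


print(build_logql({"service": "ai-service"}, contains="error"))
print(build_logql({"namespace": "matih-data-plane"}, level="error"))
```

The two calls reproduce the first two example queries above.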

Trace Explorer

The trace explorer displays distributed traces across services:

| Feature | Description |
|---|---|
| Trace search | Find traces by service, duration, status |
| Waterfall view | Visual span timeline across services |
| Span details | Attributes, events, and status per span |
| Service map | Dependency graph derived from traces |
| Correlation | Link to related logs and metrics |

Trace Search

| Filter | Options |
|---|---|
| Service | Any platform service |
| Duration | Minimum/maximum trace duration |
| Status | OK, Error |
| Time range | Configurable window |
| Tags | Custom span attributes |
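
These filters could be expressed as a single TraceQL query. The sketch below assumes services are identified by the OpenTelemetry resource attribute resource.service.name and uses the TraceQL intrinsics traceDuration and status. How the Observability API actually translates the filters is not described on this page, and build_traceql is a hypothetical helper.

```python
from typing import Dict, List, Optional


def build_traceql(
    service: Optional[str] = None,
    min_duration_ms: Optional[int] = None,
    max_duration_ms: Optional[int] = None,
    error_only: bool = False,
    tags: Optional[Dict[str, str]] = None,
) -> str:
    """Combine the trace search filters into one TraceQL span selector."""
    conditions: List[str] = []
    if service is not None:
        conditions.append(f'resource.service.name = "{service}"')
    if min_duration_ms is not None:
        conditions.append(f"traceDuration > {min_duration_ms}ms")
    if max_duration_ms is not None:
        conditions.append(f"traceDuration < {max_duration_ms}ms")
    if error_only:
        conditions.append("status = error")
    for key, value in (tags or {}).items():
        # Custom tags are matched as span-scoped attributes.
        conditions.append(f'span.{key} = "{value}"')
    if not conditions:
        return "{}"  # an empty selector matches every span
    return "{ " + " && ".join(conditions) + " }"


print(build_traceql(service="query-engine", min_duration_ms=500, error_only=True))
```

The time range is not part of the query string; Tempo's search API takes it as separate start and end parameters.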

Health Checks

Aggregated health check results for all services:

| Health Check | Type | Frequency |
|---|---|---|
| HTTP health | GET /health | Every 10 seconds |
| HTTP readiness | GET /health/ready | Every 15 seconds |
| TCP check | Port connectivity | Every 10 seconds |
| Database check | Query execution | Every 30 seconds |
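
One plausible way to aggregate these results is to mark a service unhealthy as soon as any of its checks fails. The sketch below illustrates that rule only. CheckResult, aggregate, and the status strings are hypothetical, and the platform's actual aggregation logic may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CheckResult:
    service: str   # service the check ran against
    check: str     # e.g. "http_health", "tcp", "database"
    passed: bool   # outcome of the most recent run


def aggregate(results: Iterable[CheckResult]) -> Dict[str, str]:
    """Roll individual check results up to one status per service.

    A service is "healthy" only if every one of its checks passed.
    """
    all_passed: Dict[str, bool] = {}
    for result in results:
        all_passed[result.service] = all_passed.get(result.service, True) and result.passed
    return {svc: "healthy" if ok else "unhealthy" for svc, ok in all_passed.items()}


statuses = aggregate([
    CheckResult("ai-service", "http_health", True),
    CheckResult("ai-service", "database", False),
    CheckResult("query-engine", "http_health", True),
    CheckResult("query-engine", "tcp", True),
])
print(statuses)
```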

Correlation

The key feature of the Observability page is signal correlation, which takes an operator from a metric anomaly to the responsible span:

  1. Click a spike in a metrics chart.
  2. The log viewer automatically filters to that time window.
  3. Find related traces with elevated latency in the same window.
  4. Drill down to specific span errors.
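
The first two steps amount to turning one clicked timestamp into a shared time window for the log and trace queries. A minimal sketch follows, assuming a five-minute padding on each side (an illustrative value, not a documented default). Loki's query_range endpoint accepts start and end as Unix nanoseconds and Tempo's search endpoint accepts Unix seconds; the limits are the documented defaults for max log lines (1000) and max traces (100).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_LOG_LINES = 1000  # documented default for "Max log lines"
MAX_TRACES = 100      # documented default for "Max traces"


def correlated_params(clicked_at: datetime, padding: timedelta = timedelta(minutes=5)) -> dict:
    """Derive Loki and Tempo time-range parameters from a clicked chart point.

    clicked_at must be timezone-aware. Integer division keeps the
    nanosecond values exact instead of going through floats.
    """
    start_s = (clicked_at - padding - EPOCH) // timedelta(seconds=1)
    end_s = (clicked_at + padding - EPOCH) // timedelta(seconds=1)
    return {
        "loki": {"start": start_s * 10**9, "end": end_s * 10**9, "limit": MAX_LOG_LINES},
        "tempo": {"start": start_s, "end": end_s, "limit": MAX_TRACES},
    }


params = correlated_params(datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(params)
```

Using one window for both backends keeps the log lines and traces aligned with the spike that was clicked.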

Configuration

| Setting | Default | Description |
|---|---|---|
| Default time range | Last 1 hour | Initial time window |
| Refresh interval | 15 seconds | Auto-refresh period |
| Max log lines | 1000 | Maximum logs per query |
| Max traces | 100 | Maximum traces per search |

Frontend Telemetry

The ops-workbench is wrapped with ObservabilityProvider from @matih/shared, which automatically:

  • Initializes Sentry error tracking (when VITE_SENTRY_DSN is configured)
  • Reports Web Vitals to the analytics backend: the Core Web Vitals (CLS, LCP, INP) plus FID, FCP, and TTFB
  • Tracks user sessions with visibility API integration
  • Catches uncaught errors and unhandled promise rejections globally

This provides full client-side observability for the operator experience, complementing the server-side metrics visible in the observability dashboards.