Observability

The MATIH Platform provides full-stack observability across all 24 microservices, covering metrics, distributed tracing, log aggregation, and alerting. Observability is tenant-aware, meaning that every metric, trace, and log entry carries the tenant context, enabling per-tenant monitoring and debugging.

Observability Stack

The platform uses a combination of open-source tools for comprehensive observability:

Component	Technology	Purpose
Metrics collection	Prometheus	Time-series metrics from all services
Metrics visualization	Grafana	Dashboards, alerting, and exploration
Distributed tracing	Tempo	End-to-end request tracing
Log aggregation	Loki	Centralized log collection and search
Instrumentation	OpenTelemetry	Standardized telemetry collection
Full-text log search	Elasticsearch 8.11	Audit log search and analytics
Health monitoring	Spring Boot Actuator, FastAPI health endpoints	Service-level health checks

Three Pillars

Metrics

Every service exposes Prometheus metrics through Micrometer integration. Key metric categories include:

Category	Example Metrics	Labels
Request latency	`http_server_requests_seconds`	`method`, `uri`, `status`, `tenant_id`
Query performance	`matih_query_duration_seconds`	`query_type`, `tenant_id`, `service`
AI token consumption	`matih_ai_tokens_consumed_total`	`model`, `tenant_id`, `agent`
Event streaming	`dataplane_events_published`	`service`, `event_type`
Connection pools	`hikaricp_connections_active`	`pool`, `service`
Cache operations	`cache_gets_total`, `cache_puts_total`	`cache_name`, `result`
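
To illustrate how a tenant-labelled counter could look once rendered in Prometheus's text exposition style, here is a minimal standard-library sketch. The `LabeledCounter` class and the `example-model` value are hypothetical; this is not the platform's instrumentation, which the text above says goes through Micrometer. Only the metric name and label names come from the table.

```python
from collections import defaultdict


class LabeledCounter:
    """Hypothetical stand-in for a Prometheus counter with labels."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Require exactly the declared labels so every sample carries tenant_id.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # One line per label combination, in Prometheus text exposition style.
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


tokens = LabeledCounter("matih_ai_tokens_consumed_total", ["model", "tenant_id", "agent"])
tokens.inc(1200, model="example-model", tenant_id="acme-corp", agent="sql-agent")
tokens.inc(300, model="example-model", tenant_id="acme-corp", agent="sql-agent")
print(tokens.render())
```

Rejecting calls that omit a declared label is one way to keep `tenant_id` from silently going missing on a series; a real client library would enforce this differently.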

Distributed Tracing

OpenTelemetry spans propagate through every service boundary, enabling end-to-end request tracing:

Trace: "What was revenue last quarter?"
  |
  +-- api-gateway (2ms)
  |     +-- jwt-validation (1ms)
  |
  +-- ai-service (1800ms)
  |     +-- router-agent (250ms)
  |     +-- sql-agent (800ms)
  |     +-- query-engine (400ms)
  |           +-- trino-execution (350ms)
  |     +-- analysis-agent (300ms)
  |
  +-- audit-event-publish (5ms)

Every span carries tenant context as span attributes:

Attribute	Value
`tenant.id`	`acme-corp`
`user.id`	`user-123`
`correlation.id`	`req-abc-456`

Log Aggregation

All services produce structured JSON logs enriched with tenant context:

{
  "timestamp": "2026-02-12T10:30:00Z",
  "level": "INFO",
  "service": "ai-service",
  "tenant_id": "acme-corp",
  "user_id": "user-123",
  "correlation_id": "req-abc-456",
  "trace_id": "abc123def456",
  "message": "Query generated successfully"
}
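
A log line of this shape can be produced with a custom formatter. The following is a minimal sketch using Python's standard `logging` module; the `TenantJsonFormatter` class and the `matih.example` logger name are hypothetical, and the platform's actual logging configuration is not shown in this document.

```python
import json
import logging
from datetime import datetime, timezone


class TenantJsonFormatter(logging.Formatter):
    """Hypothetical formatter emitting the fields shown in the sample entry."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            # Context fields arrive through the `extra=` argument of the log call.
            "tenant_id": getattr(record, "tenant_id", None),
            "user_id": getattr(record, "user_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(TenantJsonFormatter(service="ai-service"))
logger = logging.getLogger("matih.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info(
    "Query generated successfully",
    extra={
        "tenant_id": "acme-corp",
        "user_id": "user-123",
        "correlation_id": "req-abc-456",
        "trace_id": "abc123def456",
    },
)
```

Including `trace_id` in each entry is what lets a log line be matched to the corresponding trace.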

Alerting

Alerts are configured in Grafana based on Prometheus metrics. Standard alert rules include:

Alert	Condition	Severity
Service down	Health check failing for more than 60 seconds	Critical
High error rate	5xx rate exceeding 5% of requests	Critical
Slow queries	p95 query latency exceeding 30 seconds	Warning
Kafka consumer lag	Consumer lag exceeding 1000 messages	Warning
Event processing failures	Failed event rate exceeding 1%	Warning
High memory usage	Container memory exceeding 90% of limit	Warning
Certificate expiry	TLS certificate expiring within 14 days	Warning
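
The threshold logic behind two of these rules can be sketched in plain Python. This is an illustration of the conditions only, not the Grafana rule definitions: the function names are hypothetical, and the nearest-rank percentile used here is an assumption that differs from how Prometheus estimates quantiles from histogram buckets.

```python
import math


def error_rate(status_codes):
    """Fraction of responses with a 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for s in status_codes if 500 <= s <= 599) / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(status_codes, query_seconds):
    """Return (alert, severity) pairs whose condition from the table is met."""
    firing = []
    if error_rate(status_codes) > 0.05:  # 5xx rate exceeding 5% of requests
        firing.append(("High error rate", "Critical"))
    if query_seconds and percentile(query_seconds, 95) > 30:  # p95 above 30 s
        firing.append(("Slow queries", "Warning"))
    return firing


statuses = [200] * 90 + [500] * 10    # 10% of requests fail
durations = [1.0] * 18 + [45.0] * 2   # 20 query samples, two very slow
print(evaluate_alerts(statuses, durations))
```

Both comparisons are strict, matching the word "exceeding" in the table: a 5xx rate of exactly 5% does not fire.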

Health Checks

Each service registers deep health checks via Spring Boot Actuator (Java services) or FastAPI health endpoints (Python services). A deep check goes beyond confirming that the process is running: it verifies that the service can perform its core functions:

Check	What It Verifies
Database connectivity	Can execute a query against PostgreSQL
Redis connectivity	Can read and write to Redis
Kafka connectivity	Can reach Kafka brokers
Downstream services	Can reach declared service dependencies
Disk space	Sufficient disk space for operation

Health status is aggregated by the observability-api service and exposed to the platform admin UI.
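
The aggregation idea can be sketched as a function that runs each named check and rolls the results up into one status. This is a hypothetical illustration, not the observability-api implementation; the `UP`/`DOWN` values follow Spring Boot Actuator's naming, and the check functions are placeholders.

```python
def run_health_checks(checks):
    """Run named check callables and aggregate them into one status."""
    components = {}
    for name, check in checks.items():
        try:
            check()
            components[name] = {"status": "UP"}
        except Exception as exc:  # any failure marks the component DOWN
            components[name] = {"status": "DOWN", "error": str(exc)}
    all_up = all(c["status"] == "UP" for c in components.values())
    return {"status": "UP" if all_up else "DOWN", "components": components}


def check_database():
    pass  # placeholder: would execute a trivial query against PostgreSQL


def check_kafka():
    raise ConnectionError("brokers unreachable")  # simulated failure


report = run_health_checks({"database": check_database, "kafka": check_kafka})
print(report["status"])
print(report["components"]["kafka"])
```

Treating any single failed dependency as `DOWN` is the strictest choice; a real deployment might instead distinguish degraded from unavailable.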

Kubernetes Namespace Organization

Observability components are deployed in dedicated namespaces:

Namespace	Components
`matih-observability`	Prometheus, Grafana, Tempo, Loki
`matih-monitoring-control-plane`	Control Plane service monitors and alerts
`matih-monitoring-data-plane`	Data Plane service monitors and alerts

Performance Targets

Metric	Target
Text-to-SQL latency	Less than 3 seconds (p95)
Simple query execution	Less than 500ms (p95)
Complex analytics query	Less than 30 seconds (p95)
Dashboard load (cached)	Less than 150ms (p95)
LLM inference	Less than 100ms (p50)
Concurrent users per tenant	1000+

Multi-Tenancy -- Per-tenant resource isolation
Architecture: Service Topology -- Service dependency and failure analysis
Architecture: Event-Driven -- Event metrics and monitoring
