Observability
The MATIH Platform provides full-stack observability across all 24 microservices, covering metrics, distributed tracing, log aggregation, and alerting. Observability is tenant-aware: every trace, log entry, and request-scoped metric carries the tenant context, which enables per-tenant monitoring and debugging.
Observability Stack
The platform uses a combination of open-source tools for comprehensive observability:
| Component | Technology | Purpose |
|---|---|---|
| Metrics collection | Prometheus | Time-series metrics from all services |
| Metrics visualization | Grafana | Dashboards, alerting, and exploration |
| Distributed tracing | Tempo | End-to-end request tracing |
| Log aggregation | Loki | Centralized log collection and search |
| Instrumentation | OpenTelemetry | Standardized telemetry collection |
| Full-text log search | Elasticsearch 8.11 | Audit log search and analytics |
| Health monitoring | Spring Boot Actuator | Service-level health checks |
Three Pillars
Metrics
Every service exposes Prometheus metrics through Micrometer integration. Key metric categories include:
| Category | Example Metrics | Labels |
|---|---|---|
| Request latency | http_server_requests_seconds | method, uri, status, tenant_id |
| Query performance | matih_query_duration_seconds | query_type, tenant_id, service |
| AI token consumption | matih_ai_tokens_consumed_total | model, tenant_id, agent |
| Event streaming | dataplane_events_published | service, event_type |
| Connection pools | hikaricp_connections_active | pool, service |
| Cache operations | cache_gets_total, cache_puts_total | cache_name, result |
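To illustrate how a tenant-labeled counter from the table above looks when scraped, here is a minimal sketch in plain Python that renders samples in the Prometheus text exposition format. The `LabeledCounter` class and the `example-model` label value are hypothetical and exist only for illustration; the platform itself exposes metrics through Micrometer, not through this code.

```python
from collections import defaultdict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class LabeledCounter:
    """A tiny in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name: str, label_names: tuple):
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Render every label combination as one exposition line."""
        lines = [f"# TYPE {self.name} counter"]
        for key in sorted(self._values):
            pairs = ",".join(
                f'{name}="{escape_label_value(value)}"'
                for name, value in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]}")
        return "\n".join(lines)


# Metric and label names come from the table above; the values are made up.
tokens = LabeledCounter("matih_ai_tokens_consumed_total", ("model", "tenant_id", "agent"))
tokens.inc(1200, model="example-model", tenant_id="acme-corp", agent="sql-agent")
tokens.inc(300, model="example-model", tenant_id="acme-corp", agent="sql-agent")
print(tokens.render())
```

Because `tenant_id` is an ordinary label, per-tenant dashboards and alerts can filter or group on it without any extra plumbing. The trade-off is cardinality: each distinct combination of label values becomes its own time series.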
Distributed Tracing
OpenTelemetry spans propagate through every service boundary, enabling end-to-end request tracing:
Trace: "What was revenue last quarter?"
|
+-- api-gateway (2ms)
|     +-- jwt-validation (1ms)
|
+-- ai-service (1800ms)
|     +-- router-agent (250ms)
|     +-- sql-agent (800ms)
|           +-- query-engine (400ms)
|                 +-- trino-execution (350ms)
|     +-- analysis-agent (300ms)
|
+-- audit-event-publish (5ms)
Every span carries tenant context as an attribute:
| Attribute | Value |
|---|---|
| tenant.id | acme-corp |
| user.id | user-123 |
| correlation.id | req-abc-456 |
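The following sketch shows one way the tenant attributes above could follow a request from parent span to child span. It is a simplified model written in plain Python, not the OpenTelemetry SDK: the `Span` class, `start_trace`, and the `query.type` attribute are hypothetical names used only for illustration.

```python
from __future__ import annotations

import secrets
from dataclasses import dataclass, field
from typing import Optional

# Attribute keys from the table above; these are copied to every child span.
CONTEXT_KEYS = ("tenant.id", "user.id", "correlation.id")


@dataclass
class Span:
    """A simplified span record (illustration only)."""

    name: str
    trace_id: str
    attributes: dict[str, str]
    parent_id: Optional[str] = None
    # 8 random bytes, rendered as 16 hex characters.
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))

    def child(self, name: str, extra: Optional[dict[str, str]] = None) -> Span:
        """Start a child span in the same trace, inheriting the tenant context."""
        attributes = {k: v for k, v in self.attributes.items() if k in CONTEXT_KEYS}
        attributes.update(extra or {})
        return Span(name, self.trace_id, attributes, parent_id=self.span_id)


def start_trace(name: str, tenant_id: str, user_id: str, correlation_id: str) -> Span:
    """Create a root span with a fresh trace ID and the tenant context attached."""
    attributes = {
        "tenant.id": tenant_id,
        "user.id": user_id,
        "correlation.id": correlation_id,
    }
    return Span(name, secrets.token_hex(16), attributes)


root = start_trace("ai-service", "acme-corp", "user-123", "req-abc-456")
sql_agent = root.child("sql-agent")
# "query.type" is a made-up span-specific attribute; it is not inherited further.
query_engine = sql_agent.child("query-engine", {"query.type": "select"})

for span in (root, sql_agent, query_engine):
    print(span.name, span.attributes["tenant.id"], span.parent_id)
```

Copying only the context keys keeps span-specific attributes local while guaranteeing that any span in the trace can be filtered by tenant.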
Log Aggregation
All services produce structured JSON logs enriched with tenant context:
{
  "timestamp": "2026-02-12T10:30:00Z",
  "level": "INFO",
  "service": "ai-service",
  "tenant_id": "acme-corp",
  "user_id": "user-123",
  "correlation_id": "req-abc-456",
  "trace_id": "abc123def456",
  "message": "Query generated successfully"
}
Alerting
Alerts are configured in Grafana based on Prometheus metrics. Standard alert rules include:
| Alert | Condition | Severity |
|---|---|---|
| Service down | Health check failing for more than 60 seconds | Critical |
| High error rate | 5xx rate exceeding 5% of requests | Critical |
| Slow queries | p95 query latency exceeding 30 seconds | Warning |
| Kafka consumer lag | Consumer lag exceeding 1000 messages | Warning |
| Event processing failures | Failed event rate exceeding 1% | Warning |
| High memory usage | Container memory exceeding 90% of limit | Warning |
| Certificate expiry | TLS certificate expiring within 14 days | Warning |
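As a rough illustration of how two of these conditions evaluate, the sketch below expresses the "High error rate" and "Service down" rules as plain Python predicates. The real rules live in Grafana and are evaluated against Prometheus metrics; the `RequestWindow` type and function names here are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestWindow:
    """Request counts observed over one evaluation window."""

    total: int
    server_errors: int  # responses with a 5xx status


def error_rate(window: RequestWindow) -> float:
    """Fraction of requests that returned 5xx; 0.0 when there was no traffic."""
    if window.total == 0:
        return 0.0
    return window.server_errors / window.total


def high_error_rate(window: RequestWindow, threshold: float = 0.05) -> bool:
    """Fire when the 5xx rate exceeds the threshold (5% in the table above)."""
    return error_rate(window) > threshold


def service_down(failing_since: Optional[float], now: float, grace_seconds: float = 60.0) -> bool:
    """Fire only once the health check has failed for longer than the grace period."""
    return failing_since is not None and (now - failing_since) > grace_seconds


print(high_error_rate(RequestWindow(total=1000, server_errors=51)))
print(service_down(failing_since=0.0, now=61.0))
```

Both predicates use a strict comparison, matching the "exceeding" and "more than" wording in the table: a window at exactly 5% errors, or a check that has failed for exactly 60 seconds, does not fire.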
Health Checks
Each service registers deep health checks via Spring Boot Actuator or FastAPI health endpoints. Health checks verify that the service can perform its core functions:
| Check | What It Verifies |
|---|---|
| Database connectivity | Can execute a query against PostgreSQL |
| Redis connectivity | Can read and write to Redis |
| Kafka connectivity | Can reach Kafka brokers |
| Downstream services | Can reach declared service dependencies |
| Disk space | Sufficient disk space for operation |
Health status is aggregated by the observability-api service and exposed to the platform admin UI.
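A deep health check can be sketched as a set of named probes whose results roll up into one overall status. The example below is an assumption about the general shape, not the observability-api implementation: the probe functions are stubs, and the `status`/`components` layout is loosely modeled on the response style of Spring Boot Actuator's health endpoint.

```python
from typing import Callable, Dict

HealthCheck = Callable[[], bool]


def run_checks(checks: Dict[str, HealthCheck]) -> dict:
    """Run each probe and report UP overall only if every component is UP."""
    components = {}
    for name, check in checks.items():
        try:
            components[name] = {"status": "UP" if check() else "DOWN"}
        except Exception as exc:
            # A probe that raises is treated as a failed component, not a crash.
            components[name] = {"status": "DOWN", "error": str(exc)}
    all_up = all(c["status"] == "UP" for c in components.values())
    return {"status": "UP" if all_up else "DOWN", "components": components}


# Hypothetical stub probes; real ones would query PostgreSQL, Redis, and Kafka.
def check_database() -> bool:
    return True


def check_redis() -> bool:
    return True


def check_kafka() -> bool:
    raise ConnectionError("broker unreachable")


report = run_checks({"database": check_database, "redis": check_redis, "kafka": check_kafka})
print(report["status"])
print(report["components"]["kafka"])
```

Catching exceptions per probe means one unreachable dependency is reported alongside the healthy ones instead of hiding them behind a single failure.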
Kubernetes Namespace Organization
Observability components are deployed in dedicated namespaces:
| Namespace | Components |
|---|---|
| matih-observability | Prometheus, Grafana, Tempo, Loki |
| matih-monitoring-control-plane | Control Plane service monitors and alerts |
| matih-monitoring-data-plane | Data Plane service monitors and alerts |
Performance Targets
| Metric | Target |
|---|---|
| Text-to-SQL latency | Less than 3 seconds (p95) |
| Simple query execution | Less than 500ms (p95) |
| Complex analytics query | Less than 30 seconds (p95) |
| Dashboard load (cached) | Less than 150ms (p95) |
| LLM inference | Less than 100ms (p50) |
| Concurrent users per tenant | 1000+ |
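Most of the targets above are stated as percentiles, so checking one means computing a percentile over observed latencies and comparing it with the limit. The sketch below uses the nearest-rank method, which is one of several percentile definitions and only an assumption here; Prometheus histograms estimate quantiles differently. The sample data is invented.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_target(samples: Sequence[float], target_ms: float, pct: float = 95) -> bool:
    """True when the chosen percentile is strictly less than the target."""
    return percentile(samples, pct) < target_ms


# Invented latencies in milliseconds: 100, 110, ..., 290.
latencies_ms = [100 + 10 * i for i in range(20)]

print(percentile(latencies_ms, 95))              # p95 of the sample
print(meets_target(latencies_ms, target_ms=500))  # simple query target
print(meets_target(latencies_ms, target_ms=150))  # cached dashboard target
```

A p95 target tolerates a slow tail: up to 5% of requests may exceed the limit while the target is still met, which is why these targets are usually paired with alerts on error rates and outliers.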
Related Pages
- Multi-Tenancy -- Per-tenant resource isolation
- Architecture: Service Topology -- Service dependency and failure analysis
- Architecture: Event-Driven -- Event metrics and monitoring