MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Observability

The MATIH Platform provides full-stack observability across all 24 microservices, covering metrics, distributed tracing, log aggregation, and alerting. Observability is tenant-aware, meaning that every metric, trace, and log entry carries the tenant context, enabling per-tenant monitoring and debugging.


Observability Stack

The platform uses a combination of open-source tools for comprehensive observability:

| Component | Technology | Purpose |
| --- | --- | --- |
| Metrics collection | Prometheus | Time-series metrics from all services |
| Metrics visualization | Grafana | Dashboards, alerting, and exploration |
| Distributed tracing | Tempo | End-to-end request tracing |
| Log aggregation | Loki | Centralized log collection and search |
| Instrumentation | OpenTelemetry | Standardized telemetry collection |
| Full-text log search | Elasticsearch 8.11 | Audit log search and analytics |
| Health monitoring | Spring Boot Actuator | Service-level health checks |

Three Pillars

Metrics

Every service exposes Prometheus metrics through Micrometer integration. Key metric categories include:

| Category | Example Metrics | Labels |
| --- | --- | --- |
| Request latency | http_server_requests_seconds | method, uri, status, tenant_id |
| Query performance | matih_query_duration_seconds | query_type, tenant_id, service |
| AI token consumption | matih_ai_tokens_consumed_total | model, tenant_id, agent |
| Event streaming | dataplane_events_published | service, event_type |
| Connection pools | hikaricp_connections_active | pool, service |
| Cache operations | cache_gets_total, cache_puts_total | cache_name, result |
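
To make the label scheme concrete, the following is a minimal sketch of a labeled counter rendered in the Prometheus text exposition format. It is an illustration only: the platform's services use Micrometer, and the `LabeledCounter` class and the `example-model` label value are hypothetical names introduced here.

```python
from collections import defaultdict


class LabeledCounter:
    """Minimal in-memory counter keyed by label values (illustration only)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Require exactly the declared labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        """Render samples as Prometheus text exposition lines.

        Label values are not escaped here; a real client library handles that.
        """
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


# Metric name and labels are taken from the table above; the values are made up.
tokens = LabeledCounter("matih_ai_tokens_consumed_total", ["model", "tenant_id", "agent"])
tokens.inc(1200, model="example-model", tenant_id="acme-corp", agent="sql-agent")
tokens.inc(300, model="example-model", tenant_id="acme-corp", agent="sql-agent")
print(tokens.render())
```

Because `tenant_id` is a label rather than part of the metric name, a per-tenant view is a label filter on the same series.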

Distributed Tracing

OpenTelemetry spans propagate through every service boundary, enabling end-to-end request tracing:

Trace: "What was revenue last quarter?"
  |
  +-- api-gateway (2ms)
  |     +-- jwt-validation (1ms)
  |
  +-- ai-service (1800ms)
  |     +-- router-agent (250ms)
  |     +-- sql-agent (800ms)
  |     +-- query-engine (400ms)
  |           +-- trino-execution (350ms)
  |     +-- analysis-agent (300ms)
  |
  +-- audit-event-publish (5ms)

Every span carries the tenant and request context as attributes:

| Attribute | Value |
| --- | --- |
| tenant.id | acme-corp |
| user.id | user-123 |
| correlation.id | req-abc-456 |
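
One way to attach these attributes consistently is to hold the request context in a context variable and copy it onto every span at creation time. The sketch below illustrates that idea with the standard library only. The `Span` class, `start_span`, and `handle_request` are hypothetical stand-ins, not the OpenTelemetry SDK or the platform's actual code.

```python
import contextvars
import uuid
from dataclasses import dataclass, field

# Request-scoped context. Assumption: a real service would populate this in
# middleware once the caller's identity has been validated.
_request_ctx = contextvars.ContextVar("request_ctx", default=None)

CONTEXT_KEYS = ("tenant.id", "user.id", "correlation.id")


@dataclass
class Span:
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)


def start_span(name, trace_id):
    """Create a span and copy the current request context onto it."""
    ctx = _request_ctx.get() or {}
    span = Span(name=name, trace_id=trace_id)
    for key in CONTEXT_KEYS:
        if key in ctx:
            span.attributes[key] = ctx[key]
    return span


def handle_request(tenant_id, user_id, correlation_id):
    """Simulate one request that produces two spans in the same trace."""
    token = _request_ctx.set(
        {"tenant.id": tenant_id, "user.id": user_id, "correlation.id": correlation_id}
    )
    try:
        trace_id = uuid.uuid4().hex
        return [start_span("router-agent", trace_id), start_span("sql-agent", trace_id)]
    finally:
        # Always clear the context so it cannot leak into the next request.
        _request_ctx.reset(token)


spans = handle_request("acme-corp", "user-123", "req-abc-456")
for span in spans:
    print(span.name, span.attributes)
```

Resetting the context in `finally` matters in a multi-tenant setting: a leaked context would tag another tenant's spans with the wrong `tenant.id`.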

Log Aggregation

All services produce structured JSON logs enriched with tenant context:

{
  "timestamp": "2026-02-12T10:30:00Z",
  "level": "INFO",
  "service": "ai-service",
  "tenant_id": "acme-corp",
  "user_id": "user-123",
  "correlation_id": "req-abc-456",
  "trace_id": "abc123def456",
  "message": "Query generated successfully"
}
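
As an illustration of how a Python service could emit entries in this shape, here is a minimal sketch built on the standard `logging` module. The `JsonFormatter` class and the `tenant_ctx` context variable are assumptions made for this example; the platform's actual logging configuration may differ.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical request-scoped context holding the tenant fields for the current call.
tenant_ctx = contextvars.ContextVar("tenant_ctx", default=None)


class JsonFormatter(logging.Formatter):
    """Format log records as single-line JSON enriched with tenant context."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        ctx = tenant_ctx.get() or {}
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "tenant_id": ctx.get("tenant_id"),
            "user_id": ctx.get("user_id"),
            "correlation_id": ctx.get("correlation_id"),
            "trace_id": ctx.get("trace_id"),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


logger = logging.getLogger("matih.example")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="ai-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tenant_ctx.set(
    {
        "tenant_id": "acme-corp",
        "user_id": "user-123",
        "correlation_id": "req-abc-456",
        "trace_id": "abc123def456",
    }
)
logger.info("Query generated successfully")
```

Including `trace_id` in each entry is what lets a log line be joined back to its trace.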

Alerting

Alerts are configured in Grafana based on Prometheus metrics. Standard alert rules include:

| Alert | Condition | Severity |
| --- | --- | --- |
| Service down | Health check failing for more than 60 seconds | Critical |
| High error rate | 5xx rate exceeding 5% of requests | Critical |
| Slow queries | p95 query latency exceeding 30 seconds | Warning |
| Kafka consumer lag | Consumer lag exceeding 1000 messages | Warning |
| Event processing failures | Failed event rate exceeding 1% | Warning |
| High memory usage | Container memory exceeding 90% of limit | Warning |
| Certificate expiry | TLS certificate expiring within 14 days | Warning |
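
The threshold logic behind three of these rules can be spelled out in a few lines. This is a sketch for reasoning about the conditions, not the Grafana rule definitions themselves; the `Alert` class, the `evaluate_alerts` function, and its parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str


def evaluate_alerts(total_requests, server_errors, consumer_lag,
                    memory_used_bytes, memory_limit_bytes):
    """Return the alerts whose thresholds (from the table above) are exceeded."""
    alerts = []
    # High error rate: 5xx responses exceed 5% of all requests.
    if total_requests > 0 and server_errors / total_requests > 0.05:
        alerts.append(Alert("High error rate", "Critical"))
    # Kafka consumer lag: more than 1000 messages behind.
    if consumer_lag > 1000:
        alerts.append(Alert("Kafka consumer lag", "Warning"))
    # High memory usage: container memory exceeds 90% of its limit.
    if memory_limit_bytes > 0 and memory_used_bytes / memory_limit_bytes > 0.90:
        alerts.append(Alert("High memory usage", "Warning"))
    return alerts


# 120 errors out of 2000 requests is 6%, and 950 of 1000 bytes is 95%.
fired = evaluate_alerts(
    total_requests=2000,
    server_errors=120,
    consumer_lag=400,
    memory_used_bytes=950,
    memory_limit_bytes=1000,
)
for alert in fired:
    print(f"[{alert.severity}] {alert.name}")
```

The comparisons are strict, matching the word "exceeding" in the table: a service sitting exactly at 5% errors does not fire.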

Health Checks

Each service registers deep health checks via Spring Boot Actuator or FastAPI health endpoints. Health checks verify that the service can perform its core functions:

| Check | What It Verifies |
| --- | --- |
| Database connectivity | Can execute a query against PostgreSQL |
| Redis connectivity | Can read and write to Redis |
| Kafka connectivity | Can reach Kafka brokers |
| Downstream services | Can reach declared service dependencies |
| Disk space | Sufficient disk space for operation |

Health status is aggregated by the observability-api service and exposed to the platform admin UI.
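
A deep health check of this kind can be sketched as a set of named probes whose results roll up into one overall status. The example below is a minimal illustration under that assumption. `run_health_checks` and the two probe functions are hypothetical, and the response shape (an overall `status` plus per-component entries) is only loosely modelled on the Spring Boot Actuator health response.

```python
def run_health_checks(checks):
    """Run each named probe and aggregate the results into one report.

    A probe signals failure by raising an exception; any failure marks the
    whole service as DOWN.
    """
    components = {}
    for name, check in checks.items():
        try:
            check()
            components[name] = {"status": "UP"}
        except Exception as exc:
            components[name] = {"status": "DOWN", "error": str(exc)}
    all_up = all(c["status"] == "UP" for c in components.values())
    return {"status": "UP" if all_up else "DOWN", "components": components}


def check_database():
    # Stand-in for executing a trivial query against PostgreSQL.
    pass


def check_kafka():
    # Simulated failure, to show how a broken dependency is reported.
    raise ConnectionError("no brokers reachable")


report = run_health_checks({"database": check_database, "kafka": check_kafka})
print(report)
```

Reporting each component separately keeps the aggregate useful for debugging: the overall status says that something is wrong, and the component entries say what.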


Kubernetes Namespace Organization

Observability components are deployed in dedicated namespaces:

| Namespace | Components |
| --- | --- |
| matih-observability | Prometheus, Grafana, Tempo, Loki |
| matih-monitoring-control-plane | Control Plane service monitors and alerts |
| matih-monitoring-data-plane | Data Plane service monitors and alerts |

Performance Targets

| Metric | Target |
| --- | --- |
| Text-to-SQL latency | Less than 3 seconds (p95) |
| Simple query execution | Less than 500ms (p95) |
| Complex analytics query | Less than 30 seconds (p95) |
| Dashboard load (cached) | Less than 150ms (p95) |
| LLM inference | Less than 100ms (p50) |
| Concurrent users per tenant | 1000+ |
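
To show what a percentile target means in practice, the sketch below checks a set of latency samples against a target using the nearest-rank percentile. The sample values are invented, and `percentile` and `meets_target` are hypothetical helpers. Prometheus estimates quantiles from histogram buckets rather than raw samples, so its results can differ from this exact calculation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_target(samples_ms, pct, target_ms):
    """True when the given percentile is strictly below the target."""
    return percentile(samples_ms, pct) < target_ms


# Invented latencies in milliseconds for a simple query, with one slow outlier.
latencies_ms = [120, 180, 210, 250, 300, 320, 350, 410, 450, 900]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
print("meets 500ms p95 target:", meets_target(latencies_ms, 95, 500))
```

With only ten samples, the single 900ms outlier is the p95 value, so the 500ms target is missed even though the median is well within it. This is why targets are stated with an explicit percentile.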
