Loki
Loki aggregates logs from all MATIH services. A collector (Fluent Bit or Promtail) ships each service's structured JSON logs to Loki, where they can be queried with LogQL.
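The collector's job is to batch log lines into the shape accepted by Loki's push endpoint (`POST /loki/api/v1/push`): a list of streams, each identified by a small label set and carrying `[timestamp_ns, line]` pairs. The sketch below builds that payload from JSON log lines. It is illustrative only. The function names are hypothetical, and the choice of `namespace` and `app` as stream labels is an assumption that mirrors the LogQL examples on this page.

```python
import json
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def rfc3339_to_unix_ns(ts: str) -> str:
    """Convert e.g. '2026-02-12T10:30:00Z' to Unix epoch nanoseconds, as a string."""
    # fromisoformat() accepts a trailing 'Z' only on Python 3.11+, so normalise it first.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    delta = dt - EPOCH  # integer arithmetic, so no float rounding
    nanos = (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000
    return str(nanos)


def build_push_payload(namespace: str, app: str, lines: list[str]) -> dict:
    """Group raw JSON log lines into one Loki stream labelled by namespace and app.

    High-cardinality fields such as tenant_id and trace_id stay inside the log
    line (queryable later with `| json`) instead of becoming stream labels.
    """
    values = [[rfc3339_to_unix_ns(json.loads(line)["timestamp"]), line] for line in lines]
    return {"streams": [{"stream": {"namespace": namespace, "app": app}, "values": values}]}


if __name__ == "__main__":
    line = json.dumps({"timestamp": "2026-02-12T10:30:00Z", "level": "INFO",
                       "service": "ai-service", "message": "Query completed successfully"})
    print(json.dumps(build_push_payload("matih-data-plane", "ai-service", [line]), indent=2))
```

The label set is kept deliberately small because Loki creates a separate stream for every distinct combination of label values.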
Architecture
```text
+------------------+     +------------------+     +------------------+
|   Service Pods   |     |    Fluent Bit    |     |       Loki       |
|   (JSON logs)    |---->|   (DaemonSet)    |---->|  (Log storage)   |
+------------------+     +------------------+     +------------------+
                                                           |
                                                           v
                                                  +------------------+
                                                  |     Grafana      |
                                                  | (LogQL queries)  |
                                                  +------------------+
```

Log Format
All MATIH services emit structured JSON logs:
```json
{
  "timestamp": "2026-02-12T10:30:00Z",
  "level": "INFO",
  "service": "ai-service",
  "tenant_id": "tenant-acme",
  "trace_id": "abc123def456",
  "message": "Query completed successfully",
  "duration_ms": 234,
  "user_id": "user-001"
}
```

LogQL Examples
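Grafana is the usual entry point, but the same LogQL can be sent directly to Loki's HTTP API through `GET /loki/api/v1/query_range`. A LogQL selector carries no time window of its own, so a phrase like "in the last hour" is expressed through the request's `start` and `end` parameters (or through Grafana's time picker). A minimal sketch follows. The `http://loki:3100` address is a hypothetical in-cluster URL (3100 is Loki's default HTTP port), and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical in-cluster address; 3100 is Loki's default HTTP port.
LOKI_URL = "http://loki:3100"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ns(dt: datetime) -> int:
    """Exact nanoseconds since the Unix epoch for a timezone-aware datetime."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000


def query_range_url(logql: str, start: datetime, end: datetime, limit: int = 100) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": to_unix_ns(start),
        "end": to_unix_ns(end),
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    end = datetime.now(timezone.utc)
    # "Errors from the AI service in the last hour": the hour is set by start/end.
    print(query_range_url('{namespace="matih-data-plane", app="ai-service"} |= "ERROR"',
                          end - timedelta(hours=1), end))
```

Fetching the URL (for example with `urllib.request.urlopen`) returns JSON whose `data.result` holds the matching log streams. A metric query such as the per-service error count returns a `matrix` result instead, whose resolution is controlled by the endpoint's `step` parameter.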
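```logql
# Lines containing "ERROR" from the AI service (the time window, e.g. last hour, comes from Grafana's time picker)
{namespace="matih-data-plane", app="ai-service"} |= "ERROR"
# Parse the JSON fields, then keep ERROR-level entries that took longer than 5 seconds
{namespace="matih-data-plane"} | json | level="ERROR" | duration_ms > 5000
# Count lines containing "ERROR" per app over 5-minute windows
sum by (app) (count_over_time(
  {namespace="matih-data-plane"} |= "ERROR" [5m]
))
```

Retention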
| Tier | Retention | Storage |
|---|---|---|
| Hot | 7 days | SSD |
| Warm | 30 days | HDD |
| Cold | 90 days | S3/MinIO |
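The tiers above can be read as age boundaries. The sketch below encodes the table under that reading, assuming each limit is the maximum age of a log entry in that tier: an entry moves from Hot to Warm at 7 days, to Cold at 30 days, and is deleted at 90 days. The constant and function names are hypothetical, and the sketch says nothing about how the tiers are actually enforced.

```python
from typing import Optional

# (tier, maximum age in days, storage) from the retention table.
# Assumption: limits are cumulative ages, not per-tier durations.
TIERS = [
    ("Hot", 7, "SSD"),
    ("Warm", 30, "HDD"),
    ("Cold", 90, "S3/MinIO"),
]


def tier_for_age(age_days: float) -> Optional[tuple[str, str]]:
    """Return (tier, storage) for a log entry of the given age, or None once it has expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for tier, max_days, storage in TIERS:
        if age_days < max_days:
            return tier, storage
    return None  # older than the last tier's limit: past retention
```

Under this reading, any log older than 90 days is no longer queryable.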