# Prometheus
Prometheus collects metrics from all MATIH services via ServiceMonitor CRDs, with separate instances for the control plane and data plane.
## Deployment
Each plane has a dedicated Prometheus instance:
| Instance | Namespace | Retention | Storage |
|---|---|---|---|
| Control Plane | matih-monitoring-control-plane | 15 days | 50Gi SSD |
| Data Plane | matih-monitoring-data-plane | 15 days | 100Gi SSD |
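
The retention and storage values in the table map onto the Prometheus Operator's `Prometheus` custom resource. A minimal sketch for the control-plane instance (field values are taken from the table; the storage class name `ssd` is an assumption):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: control-plane
  namespace: matih-monitoring-control-plane
spec:
  retention: 15d            # 15-day retention from the table above
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: ssd   # assumed SSD storage class name
        resources:
          requests:
            storage: 50Gi       # 50Gi from the table above
```

The data-plane instance would be identical apart from the namespace and a 100Gi storage request.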
## ServiceMonitor Pattern
Every MATIH service deploys a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-service
  labels:
    app.kubernetes.io/name: ai-service
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ai-service
  endpoints:
    - port: http
      path: /metrics        # Python services
      interval: 30s
      scrapeTimeout: 10s
```

For Java Spring Boot services, the metrics path is `/actuator/prometheus`.
## Scrape Configuration
| Service Type | Metrics Path | Port | Interval |
|---|---|---|---|
| Java Spring Boot | /actuator/prometheus | 8080 | 30s |
| Python FastAPI | /metrics | 8000 | 30s |
| Node.js | /metrics | 3000 | 30s |
| Trino | /v1/status | 8080 | 30s |
| Kafka (JMX) | /metrics | 9404 | 30s |
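
For a service type that deviates from the Python default, only the `endpoints` section of the ServiceMonitor changes. A sketch for a Java Spring Boot service (the port name `http` is an assumption; path and interval come from the table above):

```yaml
endpoints:
  - port: http
    path: /actuator/prometheus   # Spring Boot Actuator metrics endpoint
    interval: 30s
    scrapeTimeout: 10s
```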
## Key Metrics
| Metric | Type | Purpose |
|---|---|---|
| http_requests_total | Counter | Request rate per endpoint |
| http_request_duration_seconds | Histogram | Latency distribution |
| ai_service_inference_requests_per_second | Gauge | AI inference throughput |
| ai_service_llm_token_usage_rate | Gauge | LLM token consumption |
| kafka_consumer_lag | Gauge | Kafka consumer lag |
| trino_query_duration_seconds | Histogram | Query execution time |
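
These metrics can feed alerting and recording rules via the Operator's `PrometheusRule` CRD. A hypothetical sketch (the rule names and the lag threshold are assumptions, not values from this document):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-example-rules        # hypothetical rule name
spec:
  groups:
    - name: matih.examples
      rules:
        - alert: KafkaConsumerLagHigh
          expr: kafka_consumer_lag > 10000   # threshold is an assumption
          for: 10m
        # Precompute p99 latency from the histogram metric above
        - record: service:http_request_duration_p99:5m
          expr: >
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```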
## Pod Annotations
Services expose metrics via pod annotations for legacy scraping:
```yaml
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8000"
  prometheus.io/path: "/metrics"
```
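
These annotations only take effect if a scrape job translates them into scrape targets. A sketch of the conventional annotation-based discovery job (the job name is an assumption; the relabeling pattern is the standard `kubernetes_sd_configs` idiom for `prometheus.io/*` annotations):

```yaml
scrape_configs:
  - job_name: kubernetes-pods     # assumed job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Override the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```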