# Prometheus
Prometheus collects metrics from all MATIH services via ServiceMonitor CRDs, with separate instances for the control plane and data plane.
## Deployment
Each plane has a dedicated Prometheus instance:
| Instance | Namespace | Retention | Storage |
|---|---|---|---|
| Control Plane | matih-monitoring-control-plane | 15 days | 50Gi SSD |
| Data Plane | matih-monitoring-data-plane | 15 days | 100Gi SSD |
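
The retention and storage values in the table map onto the Prometheus Operator's `Prometheus` custom resource. A minimal sketch for the control-plane instance (field values are taken from the table; the storage class name `ssd` is an assumption):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: control-plane
  namespace: matih-monitoring-control-plane
spec:
  retention: 15d            # 15-day retention from the table above
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: ssd   # assumed SSD storage class name
        resources:
          requests:
            storage: 50Gi       # 50Gi from the table above
```

The data-plane instance would be identical apart from the namespace and a 100Gi storage request.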
## ServiceMonitor Pattern
Every MATIH service deploys a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-service
  labels:
    app.kubernetes.io/name: ai-service
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ai-service
  endpoints:
    - port: http
      path: /metrics        # Python services
      interval: 30s
      scrapeTimeout: 10s
```

For Java Spring Boot services, the metrics path is `/actuator/prometheus`.
## Scrape Configuration
| Service Type | Metrics Path | Port | Interval |
|---|---|---|---|
| Java Spring Boot | /actuator/prometheus | 8080 | 30s |
| Python FastAPI | /metrics | 8000 | 30s |
| Node.js | /metrics | 3000 | 30s |
| Trino | /v1/status | 8080 | 30s |
| Kafka (JMX) | /metrics | 9404 | 30s |
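
For a service type that deviates from the Python default, only the `endpoints` section of the ServiceMonitor changes. A sketch for a Java Spring Boot service (the port name `http` is an assumption; path and interval come from the table above):

```yaml
endpoints:
  - port: http
    path: /actuator/prometheus   # Spring Boot Actuator metrics endpoint
    interval: 30s
    scrapeTimeout: 10s
```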
## Key Metrics
| Metric | Type | Purpose |
|---|---|---|
| http_requests_total | Counter | Request rate per endpoint |
| http_request_duration_seconds | Histogram | Latency distribution |
| ai_service_inference_requests_per_second | Gauge | AI inference throughput |
| ai_service_llm_token_usage_rate | Gauge | LLM token consumption |
| kafka_consumer_lag | Gauge | Kafka consumer lag |
| trino_query_duration_seconds | Histogram | Query execution time |
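
These metrics can feed alerting and recording rules via the Operator's `PrometheusRule` CRD. A hypothetical sketch (the rule names and the lag threshold are assumptions, not values from this document):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-example-rules        # hypothetical rule name
spec:
  groups:
    - name: matih.examples
      rules:
        - alert: KafkaConsumerLagHigh
          expr: kafka_consumer_lag > 10000   # threshold is an assumption
          for: 10m
        # Precompute p99 latency from the histogram metric above
        - record: service:http_request_duration_p99:5m
          expr: >
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```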
## Pod Annotations
Services expose metrics via pod annotations for legacy scraping:
```yaml
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8000"
  prometheus.io/path: "/metrics"
```
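
These annotations only take effect if a scrape job translates them into scrape targets. A sketch of the conventional annotation-based discovery job (the job name is an assumption; the relabeling pattern is the standard `kubernetes_sd_configs` idiom for `prometheus.io/*` annotations):

```yaml
scrape_configs:
  - job_name: kubernetes-pods     # assumed job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Override the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```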