Prometheus Setup
Prometheus is the metrics backbone of MATIH's observability stack. It scrapes metrics from all services via ServiceMonitor CRDs, stores time-series data, evaluates alerting rules, and serves as the primary data source for Grafana dashboards.
Installation
Prometheus is deployed as part of the kube-prometheus-stack Helm chart, which includes Prometheus Operator, Grafana, Alertmanager, and default recording rules.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace matih-monitoring \
  --create-namespace \
  -f infrastructure/monitoring/prometheus/values.yaml
Service Discovery
Prometheus discovers scrape targets automatically through ServiceMonitor CRDs:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-service
  namespace: matih-monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: ai-service
  namespaceSelector:
    matchNames:
      - matih-data-plane
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
Scrape Configuration
| Parameter | Value | Description |
|---|---|---|
| Scrape interval | 15s | Default scrape interval |
| Evaluation interval | 15s | Rule evaluation interval |
| Retention | 15d | Metrics retention period |
| Storage | 50Gi PVC | Persistent storage for TSDB |
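The values in this table would normally be set in the chart's values file. The fragment below is a minimal sketch of how they might be expressed, using the standard kube-prometheus-stack keys under `prometheus.prometheusSpec`. The actual contents of `infrastructure/monitoring/prometheus/values.yaml` are not shown in this document, and the access mode is an assumption.

```yaml
# Sketch only: mirrors the Scrape Configuration table, not the real values.yaml.
prometheus:
  prometheusSpec:
    scrapeInterval: 15s        # default scrape interval
    evaluationInterval: 15s    # rule evaluation interval
    retention: 15d             # metrics retention period
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumption: single-writer volume
          resources:
            requests:
              storage: 50Gi                # persistent storage for the TSDB
```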
Service Metrics Endpoints
| Service | Port | Path | Technology |
|---|---|---|---|
| AI Service | 8000 | /metrics | prometheus_client (Python) |
| Data Quality Service | 8001 | /metrics | prometheus_client (Python) |
| ML Service | 8002 | /metrics | prometheus_client (Python) |
| Pipeline Service | 8003 | /metrics | prometheus_client (Python) |
| Ops Agent Service | 8004 | /metrics | prometheus_client (Python) |
| Ontology Service | 8005 | /metrics | prometheus_client (Python) |
| dbt Server | 8006 | /metrics | prometheus-fastapi-instrumentator (Python) |
| Auth Proxy | 8007 | /metrics | prometheus_client (Python) |
| Query Engine | 8080 | /actuator/prometheus | Micrometer (Spring Boot) |
| IAM Service | 8081 | /actuator/prometheus | Micrometer (Spring Boot) |
| Tenant Service | 8082 | /actuator/prometheus | Micrometer (Spring Boot) |
| API Gateway | 8080 | /actuator/prometheus | Micrometer (Spring Boot) |
| Config Service | 8888 | /actuator/prometheus | Micrometer (Spring Boot) |
| Notification Service | 8085 | /actuator/prometheus | Micrometer (Spring Boot) |
| Audit Service | 8086 | /actuator/prometheus | Micrometer (Spring Boot) |
| Billing Service | 8087 | /actuator/prometheus | Micrometer (Spring Boot) |
| Observability API | 8088 | /actuator/prometheus | Micrometer (Spring Boot) |
| Infrastructure Service | 8089 | /actuator/prometheus | Micrometer (Spring Boot) |
| Platform Registry | 8084 | /actuator/prometheus | Micrometer (Spring Boot) |
| Catalog Service | 8090 | /actuator/prometheus | Micrometer (Spring Boot) |
| BI Service | 8091 | /actuator/prometheus | Micrometer (Spring Boot) |
| Governance Service | 8092 | /actuator/prometheus | Micrometer (Spring Boot) |
| Semantic Layer | 8093 | /actuator/prometheus | Micrometer (Spring Boot) |
| Data Plane Agent | 8094 | /actuator/prometheus | Micrometer (Spring Boot) |
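Spring Boot services expose metrics at `/actuator/prometheus` rather than `/metrics`, so their ServiceMonitor needs a different `path`. The sketch below adapts the earlier `ai-service` example for the IAM Service. The resource name, the `app: iam-service` label, the `matih-control-plane` namespace, and the `http` port name are all assumptions made for illustration; substitute the labels and namespace the service actually uses.

```yaml
# Sketch only: names, labels, and namespace below are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: iam-service
  namespace: matih-monitoring
  labels:
    release: monitoring          # same label as the ai-service example
spec:
  selector:
    matchLabels:
      app: iam-service           # assumption: label on the IAM Service's Service object
  namespaceSelector:
    matchNames:
      - matih-control-plane      # assumption: namespace where the IAM Service runs
  endpoints:
    - port: http                 # assumption: named Service port mapping to 8081
      path: /actuator/prometheus # Micrometer endpoint from the table above
      interval: 15s
```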
Key Metric Families
HTTP and LLM Metrics
| Metric | Type | Description |
|---|---|---|
| matih_http_requests_total | Counter | Total HTTP requests by status class |
| matih_http_request_duration_seconds | Histogram | Request latency distribution |
| matih_llm_requests_total | Counter | LLM API calls |
| matih_llm_tokens_total | Counter | LLM token consumption |
| matih_active_sessions | Gauge | Currently active user sessions |
Provisioning Pipeline Metrics
| Metric | Type | Description |
|---|---|---|
| matih_provisioning_started_total | Counter | Tenant provisioning started |
| matih_provisioning_completed_total | Counter | Tenant provisioning completed |
| matih_provisioning_failed_total | Counter | Tenant provisioning failed |
| matih_provisioning_step_duration_seconds | Histogram | Per-step provisioning duration |
| matih_provisioning_k8s_api | Timer | Per-method K8s API call duration (createNamespace, createResourceQuota, etc.) |
| matih_provisioning_helm_deploy | Timer | Per-service Helm deployment duration |
| matih_provisioning_data_infra | Timer | Data infrastructure operation duration (provision_database, setup_cache, etc.) |
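The failure counter lends itself to a simple alert. The rule below is a hedged sketch rather than a rule known to ship with MATIH: the alert name, the 15-minute window, the severity label, and the annotation text are all assumptions.

```yaml
# Sketch only: alert name, window, and severity are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-provisioning-alerts
  namespace: matih-monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: matih-provisioning
      rules:
        - alert: TenantProvisioningFailed
          # Fires if any provisioning run failed during the last 15 minutes.
          expr: sum(increase(matih_provisioning_failed_total[15m])) > 0
          labels:
            severity: warning
          annotations:
            summary: One or more tenant provisioning runs failed in the last 15 minutes.
```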
Domain Metrics
| Metric | Type | Description |
|---|---|---|
| matih_gateway_requests_total | Counter | Gateway request counts by service, tenant, and status |
| matih_gateway_auth_failures_total | Counter | Authentication failure counts by reason |
| matih_gateway_route_latency_seconds | Timer | Route-level latency with percentile histogram |
| matih_billing_operations_total | Counter | Billing operation counts by operation and tier |
| matih_billing_metering_records_total | Counter | Metering record ingestion count |
| matih_billing_invoice_generation_seconds | Timer | Invoice generation duration |
| matih_notification_sent_total | Counter | Notification delivery counts by channel and status |
| matih_notification_delivery_failures_total | Counter | Delivery failure counts by channel and error type |
| matih_notification_latency_seconds | Timer | Notification delivery latency |
| matih_iam_auth_attempts_total | Counter | Authentication attempt counts by outcome |
| matih_iam_token_generations_total | Counter | Token generation counts |
| matih_iam_role_operations_total | Counter | Role management operation counts |
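Two related counters can be combined into a ratio, which is often more useful for alerting than a raw count. The sketch below compares gateway authentication failures against total gateway requests. The 5% threshold, the 10-minute `for` duration, and the alert name are assumptions chosen for illustration only.

```yaml
# Sketch only: threshold and durations are illustrative, not tuned values.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-gateway-alerts
  namespace: matih-monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: matih-gateway
      rules:
        - alert: GatewayAuthFailureRatioHigh
          # Share of gateway requests that ended in an authentication failure.
          expr: |
            sum(rate(matih_gateway_auth_failures_total[5m]))
              /
            sum(rate(matih_gateway_requests_total[5m]))
              > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: More than 5% of gateway requests are failing authentication.
```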
dbt Server Metrics
| Metric | Type | Description |
|---|---|---|
| dbt_run_duration_seconds | Histogram | dbt command execution duration |
| dbt_run_total | Counter | Total dbt command executions |
| dbt_model_compilation_errors_total | Counter | dbt model compilation errors |
| dbt_active_runs | Gauge | Currently running dbt commands |
| dbt_test_results_total | Counter | dbt test results (pass/fail/warn/error) |
Accessing Prometheus
# Port forward for local access
kubectl port-forward svc/monitoring-prometheus 9090:9090 -n matih-monitoring
Then access the Prometheus UI at http://localhost:9090.
Federation
For multi-cluster deployments, a central Prometheus instance uses federation to aggregate metrics from the Prometheus instances in each regional cluster.
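Federation works by having the central instance scrape the `/federate` endpoint of each regional Prometheus. A minimal sketch of such a scrape job is shown below, expressed as a kube-prometheus-stack `additionalScrapeConfigs` entry. The job name, the target hostnames, and the `match[]` selector (which pulls only `matih_`-prefixed series) are assumptions; the actual federation setup is not described in this document.

```yaml
# Sketch only: targets and selector are hypothetical placeholders.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: federate-regional
        honor_labels: true            # keep the labels set by the regional instances
        metrics_path: /federate
        params:
          'match[]':
            - '{__name__=~"matih_.*"}'  # assumption: federate only MATIH metrics
        static_configs:
          - targets:
              - prometheus.region-a.example.com:9090
              - prometheus.region-b.example.com:9090
```

Keeping the `match[]` selector narrow limits how much data the central instance pulls on each scrape. Federating every series from every region tends to be expensive.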