MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Prometheus Setup

Prometheus is the metrics backbone of MATIH's observability stack. It scrapes metrics from all services via ServiceMonitor CRDs, stores time-series data, evaluates alerting rules, and serves as the primary data source for Grafana dashboards.


Installation

Prometheus is deployed as part of the kube-prometheus-stack Helm chart, which includes Prometheus Operator, Grafana, Alertmanager, and default recording rules.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace matih-monitoring \
  --create-namespace \
  -f infrastructure/monitoring/prometheus/values.yaml

Service Discovery

Prometheus discovers scrape targets automatically through ServiceMonitor CRDs:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ai-service
  namespace: matih-monitoring
  labels:
    release: monitoring
spec:
  selector:
    matchLabels:
      app: ai-service
  namespaceSelector:
    matchNames:
      - matih-data-plane
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

Scrape Configuration

| Parameter | Value | Description |
| --- | --- | --- |
| Scrape interval | 15s | Default scrape interval |
| Evaluation interval | 15s | Rule evaluation interval |
| Retention | 15d | Metrics retention period |
| Storage | 50Gi PVC | Persistent storage for TSDB |
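
As an illustration, the settings above could be expressed in the chart values roughly as follows. This is a sketch, not the contents of the actual infrastructure/monitoring/prometheus/values.yaml; the keys are standard kube-prometheus-stack values, while the access mode is an assumption.

```yaml
# Hypothetical excerpt of values.yaml for kube-prometheus-stack.
prometheus:
  prometheusSpec:
    scrapeInterval: 15s        # default scrape interval
    evaluationInterval: 15s    # rule evaluation interval
    retention: 15d             # metrics retention period
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumption: single-writer volume
          resources:
            requests:
              storage: 50Gi    # PVC backing the TSDB
```

A ServiceMonitor endpoint that sets its own `interval` overrides the default scrape interval for that target.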

Service Metrics Endpoints

| Service | Port | Path | Technology |
| --- | --- | --- | --- |
| AI Service | 8000 | /metrics | prometheus_client (Python) |
| Data Quality Service | 8001 | /metrics | prometheus_client (Python) |
| ML Service | 8002 | /metrics | prometheus_client (Python) |
| Pipeline Service | 8003 | /metrics | prometheus_client (Python) |
| Ops Agent Service | 8004 | /metrics | prometheus_client (Python) |
| Ontology Service | 8005 | /metrics | prometheus_client (Python) |
| dbt Server | 8006 | /metrics | prometheus-fastapi-instrumentator (Python) |
| Auth Proxy | 8007 | /metrics | prometheus_client (Python) |
| Query Engine | 8080 | /actuator/prometheus | Micrometer (Spring Boot) |
| IAM Service | 8081 | /actuator/prometheus | Micrometer (Spring Boot) |
| Tenant Service | 8082 | /actuator/prometheus | Micrometer (Spring Boot) |
| API Gateway | 8080 | /actuator/prometheus | Micrometer (Spring Boot) |
| Config Service | 8888 | /actuator/prometheus | Micrometer (Spring Boot) |
| Notification Service | 8085 | /actuator/prometheus | Micrometer (Spring Boot) |
| Audit Service | 8086 | /actuator/prometheus | Micrometer (Spring Boot) |
| Billing Service | 8087 | /actuator/prometheus | Micrometer (Spring Boot) |
| Observability API | 8088 | /actuator/prometheus | Micrometer (Spring Boot) |
| Infrastructure Service | 8089 | /actuator/prometheus | Micrometer (Spring Boot) |
| Platform Registry | 8084 | /actuator/prometheus | Micrometer (Spring Boot) |
| Catalog Service | 8090 | /actuator/prometheus | Micrometer (Spring Boot) |
| BI Service | 8091 | /actuator/prometheus | Micrometer (Spring Boot) |
| Governance Service | 8092 | /actuator/prometheus | Micrometer (Spring Boot) |
| Semantic Layer | 8093 | /actuator/prometheus | Micrometer (Spring Boot) |
| Data Plane Agent | 8094 | /actuator/prometheus | Micrometer (Spring Boot) |
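
For the Spring Boot services, a ServiceMonitor differs from the Python one mainly in the metrics path. The sketch below uses the IAM Service as an example; the `app` label, the target namespace, and the port name `http` are assumptions and must match the actual Kubernetes Service definition.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: iam-service
  namespace: matih-monitoring
  labels:
    release: monitoring            # matches the Helm release so the Operator selects it
spec:
  selector:
    matchLabels:
      app: iam-service             # assumed Service label
  namespaceSelector:
    matchNames:
      - matih-control-plane        # hypothetical namespace for the IAM Service
  endpoints:
    - port: http                   # assumed name of the Service port that maps to 8081
      path: /actuator/prometheus   # Micrometer endpoint from the table above
      interval: 15s
```

Note that `port` refers to the named port on the Service, not the numeric container port.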

Key Metric Families

HTTP and LLM Metrics

| Metric | Type | Description |
| --- | --- | --- |
| matih_http_requests_total | Counter | Total HTTP requests by status class |
| matih_http_request_duration_seconds | Histogram | Request latency distribution |
| matih_llm_requests_total | Counter | LLM API calls |
| matih_llm_tokens_total | Counter | LLM token consumption |
| matih_active_sessions | Gauge | Currently active user sessions |
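
To show how these metrics can feed rule evaluation, here is a minimal PrometheusRule sketch with two recording rules. The rule names, group name, and 5-minute window are illustrative assumptions; the latency rule relies only on the standard `_bucket` series and `le` label that every Prometheus histogram exposes.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-http-recording       # hypothetical name
  namespace: matih-monitoring
  labels:
    release: monitoring            # assumed to match the chart's rule selector
spec:
  groups:
    - name: matih-http.rules
      rules:
        # Platform-wide p95 request latency over the last 5 minutes.
        - record: matih:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(
              0.95,
              sum by (le) (rate(matih_http_request_duration_seconds_bucket[5m]))
            )
        # Platform-wide request rate per second over the last 5 minutes.
        - record: matih:http_requests:rate5m
          expr: sum(rate(matih_http_requests_total[5m]))
```

Precomputed series like these keep Grafana panels cheap to render, since the dashboard queries one series instead of aggregating every bucket on each refresh.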

Provisioning Pipeline Metrics

| Metric | Type | Description |
| --- | --- | --- |
| matih_provisioning_started_total | Counter | Tenant provisioning started |
| matih_provisioning_completed_total | Counter | Tenant provisioning completed |
| matih_provisioning_failed_total | Counter | Tenant provisioning failed |
| matih_provisioning_step_duration_seconds | Histogram | Per-step provisioning duration |
| matih_provisioning_k8s_api | Timer | Per-method K8s API call duration (createNamespace, createResourceQuota, etc.) |
| matih_provisioning_helm_deploy | Timer | Per-service Helm deployment duration |
| matih_provisioning_data_infra | Timer | Data infrastructure operation duration (provision_database, setup_cache, etc.) |
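
An alerting rule on the failure counter might look like the sketch below. The alert name, 15-minute window, and severity label are assumptions chosen for illustration, not the platform's actual alert definitions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-provisioning-alerts  # hypothetical name
  namespace: matih-monitoring
  labels:
    release: monitoring            # assumed to match the chart's rule selector
spec:
  groups:
    - name: matih-provisioning.alerts
      rules:
        # Fires when the failure counter increased during the last 15 minutes.
        - alert: TenantProvisioningFailed
          expr: sum(increase(matih_provisioning_failed_total[15m])) > 0
          labels:
            severity: warning      # assumed severity; routing is handled by Alertmanager
          annotations:
            summary: Tenant provisioning failed in the last 15 minutes
```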

Domain Metrics

| Metric | Type | Description |
| --- | --- | --- |
| matih_gateway_requests_total | Counter | Gateway request counts by service, tenant, and status |
| matih_gateway_auth_failures_total | Counter | Authentication failure counts by reason |
| matih_gateway_route_latency_seconds | Timer | Route-level latency with percentile histogram |
| matih_billing_operations_total | Counter | Billing operation counts by operation and tier |
| matih_billing_metering_records_total | Counter | Metering record ingestion count |
| matih_billing_invoice_generation_seconds | Timer | Invoice generation duration |
| matih_notification_sent_total | Counter | Notification delivery counts by channel and status |
| matih_notification_delivery_failures_total | Counter | Delivery failure counts by channel and error type |
| matih_notification_latency_seconds | Timer | Notification delivery latency |
| matih_iam_auth_attempts_total | Counter | Authentication attempt counts by outcome |
| matih_iam_token_generations_total | Counter | Token generation counts |
| matih_iam_role_operations_total | Counter | Role management operation counts |

dbt Server Metrics

| Metric | Type | Description |
| --- | --- | --- |
| dbt_run_duration_seconds | Histogram | dbt command execution duration |
| dbt_run_total | Counter | Total dbt command executions |
| dbt_model_compilation_errors_total | Counter | dbt model compilation errors |
| dbt_active_runs | Gauge | Currently running dbt commands |
| dbt_test_results_total | Counter | dbt test results (pass/fail/warn/error) |

Accessing Prometheus

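# Port forward for local access.
# The service name depends on the Helm release name and any fullnameOverride in values.yaml;
# if it is not found, list the services with: kubectl get svc -n matih-monitoring
kubectl port-forward svc/monitoring-prometheus 9090:9090 -n matih-monitoring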

Then access the Prometheus UI at http://localhost:9090.


Federation

For multi-cluster deployments, a central Prometheus instance uses federation to aggregate metrics from the Prometheus instances in the regional clusters.
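
A federation scrape job on the central instance could be sketched as follows, using the chart's `additionalScrapeConfigs` value. The job name, target hostnames, 30s interval, and `match[]` selectors are hypothetical; the selectors assume that only the `matih_` and `dbt_` metric families need to be aggregated.

```yaml
# Hypothetical values.yaml excerpt for the central Prometheus instance.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: federate-regional
        honor_labels: true           # keep the labels assigned by the regional instances
        metrics_path: /federate      # Prometheus federation endpoint
        scrape_interval: 30s         # assumed; federation is often scraped less frequently
        params:
          "match[]":
            - '{__name__=~"matih_.*"}'
            - '{__name__=~"dbt_.*"}'
        static_configs:
          - targets:
              # Placeholder addresses for the regional Prometheus instances.
              - prometheus.region-a.example.internal:9090
              - prometheus.region-b.example.internal:9090
```

Restricting `match[]` to the series that are actually needed centrally keeps the federated scrape small; pulling every series from every region tends to overload both sides.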