Custom Metrics
MATIH services export custom application-level metrics using Prometheus client libraries. Python services use prometheus_client, and Java services use Spring Boot Micrometer. These custom metrics complement the default infrastructure metrics with business and operational insights.
Python Services (AI Service)
Python services use the prometheus_client library:

```python
from prometheus_client import Counter, Histogram, Gauge

# Request metrics
http_requests_total = Counter(
    "matih_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],
)
http_request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# LLM metrics
llm_requests_total = Counter(
    "matih_llm_requests_total",
    "Total LLM API calls",
    ["model", "status"],
)
llm_tokens_total = Counter(
    "matih_llm_tokens_total",
    "Total LLM tokens consumed",
    ["model", "direction"],  # direction: input/output
)

# Active connections
active_sessions = Gauge(
    "matih_active_sessions",
    "Currently active user sessions",
    ["tenant_id"],
)
```

Java Services (Spring Boot)
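Java services use Micrometer with the Prometheus registry. The registry's naming convention converts dotted meter names to snake_case and appends type and unit suffixes, so the counter matih.provisioning.started is exposed as matih_provisioning_started_total: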
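
```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class CustomMetrics {

    private final Counter provisioningStarted;
    private final Counter provisioningCompleted;
    private final Timer provisioningDuration;

    // Spring injects the application's MeterRegistry; all meters are registered once at startup.
    public CustomMetrics(MeterRegistry registry) {
        this.provisioningStarted = Counter.builder("matih.provisioning.started")
                .tag("type", "provisioning")
                .register(registry);
        this.provisioningCompleted = Counter.builder("matih.provisioning.completed")
                .tag("type", "provisioning")
                .register(registry);
        this.provisioningDuration = Timer.builder("matih.provisioning.duration")
                .register(registry);
    }
}
```

Metric Naming Conventions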
| Convention | Example | Description |
|---|---|---|
| Prefix | matih_ | All custom metrics start with matih_ |
| Counters | matih_http_requests_total | _total suffix |
| Histograms | matih_http_request_duration_seconds | _seconds or _bytes suffix |
| Gauges | matih_active_sessions | Descriptive name |
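
The conventions in the table above lend themselves to an automated check, for example in CI. The following is a minimal sketch rather than part of MATIH itself: the function name check_metric_name and the exact rules (lowercase snake_case, and _seconds or _bytes as the only accepted histogram suffixes) are assumptions derived from the table.

```python
import re

# Assumption: custom metric names are restricted to lowercase snake_case.
_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def check_metric_name(name: str, kind: str) -> list[str]:
    """Return the convention violations for a metric name (empty list if none).

    `kind` is one of "counter", "histogram", or "gauge".
    """
    problems = []
    if not _NAME_RE.match(name):
        problems.append("name must be lowercase snake_case")
    if not name.startswith("matih_"):
        problems.append("missing matih_ prefix")
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counters must end in _total")
    if kind == "histogram" and not name.endswith(("_seconds", "_bytes")):
        problems.append("histograms must end in _seconds or _bytes")
    return problems


# Usage: a compliant counter yields no problems; a bare name yields two.
print(check_metric_name("matih_http_requests_total", "counter"))
print(check_metric_name("http_requests", "counter"))
```

Returning a list of messages instead of raising on the first violation lets a CI job report every problem with a name in one pass.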
Key Custom Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_http_requests_total | Counter | method, endpoint, status_class | HTTP request counts |
| matih_http_request_duration_seconds | Histogram | method, endpoint | Request latency |
| matih_llm_requests_total | Counter | model, status | LLM API calls |
| matih_llm_tokens_total | Counter | model, direction | Token consumption |
| matih_provisioning_started_total | Counter | type | Provisioning starts |
| matih_provisioning_completed_total | Counter | type | Provisioning completions |
| matih_provisioning_failed_total | Counter | type, step | Provisioning failures |
| matih_provisioning_step_duration_seconds | Histogram | step | Step durations |
| matih_active_sessions | Gauge | tenant_id | Active sessions |
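
To illustrate how the HTTP request metrics in the table above might be updated, here is a minimal sketch using prometheus_client. The helpers status_class and record_request and the endpoint /api/v1/query are hypothetical names introduced for illustration. The private CollectorRegistry is only there so the snippet runs standalone; a service would normally register on the default registry.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# A private registry keeps this sketch self-contained.
registry = CollectorRegistry()

http_requests_total = Counter(
    "matih_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],
    registry=registry,
)
http_request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
    registry=registry,
)


def status_class(status_code: int) -> str:
    """Collapse a status code into its class, e.g. 404 -> "4xx"."""
    return f"{status_code // 100}xx"


def record_request(method: str, endpoint: str, status_code: int, seconds: float) -> None:
    """Hypothetical helper: update both metrics for one finished request."""
    http_requests_total.labels(
        method=method, endpoint=endpoint, status_class=status_class(status_code)
    ).inc()
    http_request_duration.labels(method=method, endpoint=endpoint).observe(seconds)


# Two example requests against a made-up endpoint.
record_request("GET", "/api/v1/query", 200, 0.12)
record_request("GET", "/api/v1/query", 503, 1.8)
```

Labelling by status class rather than by exact status code keeps the number of time series per endpoint small, which matches the status_class label listed in the table.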
Provisioning Pipeline Metrics
The tenant provisioning pipeline exports detailed metrics at every stage of the provisioning lifecycle. These metrics are recorded by ProvisioningMetrics (a Spring @Component with MeterRegistry injection) and are wired into both ProvisioningStepExecutor and ProvisioningService.
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_provisioning_k8s_api | Timer | operation, tier, outcome | Per-method K8s API call duration (createNamespace, createResourceQuota, etc.) |
| matih_provisioning_helm_deploy | Timer | service, tier, outcome | Per-service Helm deployment duration |
| matih_provisioning_data_infra | Timer | operation, tier, outcome | Data infrastructure operation duration (provision_database, setup_cache, etc.) |
Domain Metrics by Service
Each control-plane service exports domain-specific metrics via dedicated @Component metrics classes that follow the same MeterRegistry injection pattern.
API Gateway (GatewayMetrics)
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_gateway_requests_total | Counter | service, tenant, status | Gateway request counts |
| matih_gateway_auth_failures_total | Counter | reason | Authentication failure counts |
| matih_gateway_route_latency_seconds | Timer | route | Route-level latency with percentile histogram |
Billing Service (BillingMetrics)
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_billing_operations_total | Counter | operation, tier | Billing operation counts |
| matih_billing_metering_records_total | Counter | - | Metering record ingestion count |
| matih_billing_invoice_generation_seconds | Timer | - | Invoice generation duration with percentile histogram |
Notification Service (NotificationMetrics)
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_notification_sent_total | Counter | channel, status | Notification delivery counts |
| matih_notification_delivery_failures_total | Counter | channel, error_type | Delivery failure counts |
| matih_notification_latency_seconds | Timer | channel | Notification delivery latency with percentile histogram |
IAM Service (IamMetrics)
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_iam_auth_attempts_total | Counter | outcome | Authentication attempt counts |
| matih_iam_token_generations_total | Counter | - | Token generation counts |
| matih_iam_role_operations_total | Counter | operation | Role management operation counts |
dbt Server Metrics
The dbt-server (Python/FastAPI) exports both auto-instrumented HTTP metrics via prometheus-fastapi-instrumentator and custom dbt operation metrics.
| Metric | Type | Labels | Description |
|---|---|---|---|
| dbt_run_duration_seconds | Histogram | command, project, outcome | dbt command execution duration (buckets: 1s to 30min) |
| dbt_run_total | Counter | command, outcome | Total dbt command executions |
| dbt_model_compilation_errors_total | Counter | project | dbt model compilation errors |
| dbt_active_runs | Gauge | - | Currently running dbt commands |
| dbt_test_results_total | Counter | outcome | dbt test results (pass/fail/warn/error) |
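
As an illustration of how the duration, count, and active-run metrics in the table above could be recorded together, the sketch below wraps a dbt command in a context manager. It is not the dbt-server's actual code: the wrapper name track_dbt_run, the outcome values "success" and "failure", the project name "analytics", and the specific bucket boundaries (chosen only to span the documented 1 s to 30 min range) are all assumptions.

```python
import time
from contextlib import contextmanager

from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this sketch self-contained.
registry = CollectorRegistry()

dbt_run_duration = Histogram(
    "dbt_run_duration_seconds",
    "dbt command execution duration",
    ["command", "project", "outcome"],
    buckets=[1, 5, 15, 60, 300, 900, 1800],  # assumed boundaries, 1 s to 30 min
    registry=registry,
)
dbt_run_total = Counter(
    "dbt_run_total",
    "Total dbt command executions",
    ["command", "outcome"],
    registry=registry,
)
dbt_active_runs = Gauge(
    "dbt_active_runs",
    "Currently running dbt commands",
    registry=registry,
)


@contextmanager
def track_dbt_run(command: str, project: str):
    """Hypothetical wrapper that records all three metrics around one dbt command."""
    dbt_active_runs.inc()
    start = time.perf_counter()
    outcome = "success"
    try:
        yield
    except Exception:
        outcome = "failure"
        raise
    finally:
        # Runs on both success and failure, so the gauge never leaks.
        dbt_active_runs.dec()
        elapsed = time.perf_counter() - start
        dbt_run_duration.labels(command=command, project=project, outcome=outcome).observe(elapsed)
        dbt_run_total.labels(command=command, outcome=outcome).inc()


with track_dbt_run("run", "analytics"):
    pass  # a real caller would invoke dbt here
```

Doing the bookkeeping in a finally block means a failed command is still counted and timed, and the exception continues to propagate to the caller.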
Exposing Metrics
Python (FastAPI)
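
```python
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()  # or reuse the service's existing FastAPI instance

# Serve the default Prometheus registry as an ASGI sub-application at /metrics
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
```

Java (Spring Boot)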
Add to application.yml:
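
```yaml
management:
  endpoints:
    web:
      exposure:
        include: prometheus, health
  # Spring Boot 2.x property path; Spring Boot 3.x moved this setting to
  # management.prometheus.metrics.export.enabled
  metrics:
    export:
      prometheus:
        enabled: true
```

Best Practices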
- Always use the matih_ prefix for custom metrics
- Keep label cardinality low (avoid high-cardinality labels like user_id)
- Use histograms over summaries for latency metrics (histogram buckets can be aggregated across instances, whereas precomputed summary quantiles cannot)
- Define explicit histogram buckets matching your SLO thresholds
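
The cardinality advice above applies to the endpoint label as much as to user identifiers: a raw URL path containing IDs produces one time series per ID. Below is a minimal sketch of one way to bound it, assuming identifier-like path segments are either integers or UUIDs. The function name normalize_endpoint and the {id} placeholder are hypothetical. Where the web framework exposes the matched route template, using that directly is usually simpler.

```python
import re

# Assumed patterns for identifier-like path segments: UUIDs and plain integers.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")
_INT = re.compile(r"^\d+$")


def normalize_endpoint(path: str) -> str:
    """Replace identifier-like segments so the label has a bounded set of values."""
    path_only = path.split("?", 1)[0]  # drop any query string
    segments = []
    for segment in path_only.split("/"):
        if _UUID.match(segment) or _INT.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


# Usage: both calls map to a small, fixed set of label values.
print(normalize_endpoint("/tenants/42/sessions"))
print(normalize_endpoint("/users/123e4567-e89b-12d3-a456-426614174000?verbose=1"))
```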