MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Custom Metrics

MATIH services export custom application-level metrics using Prometheus client libraries. Python services use prometheus_client, and Java services use Spring Boot Micrometer. These custom metrics complement the default infrastructure metrics with business and operational insights.


Python Services (AI Service)

Python services use the prometheus_client library:

from prometheus_client import Counter, Histogram, Gauge
 
# Request metrics
http_requests_total = Counter(
    "matih_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],
)
 
http_request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
 
# LLM metrics
llm_requests_total = Counter(
    "matih_llm_requests_total",
    "Total LLM API calls",
    ["model", "status"],
)
 
llm_tokens_total = Counter(
    "matih_llm_tokens_total",
    "Total LLM tokens consumed",
    ["model", "direction"],  # direction: input/output
)
 
# Active connections
active_sessions = Gauge(
    "matih_active_sessions",
    "Currently active user sessions",
    ["tenant_id"],
)
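For illustration, here is a minimal sketch (assuming prometheus_client is installed) of how these metrics are updated at request time. The label values shown are examples only; names mirror the definitions above:

```python
from prometheus_client import Counter, Histogram, REGISTRY

http_requests_total = Counter(
    "matih_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],
)
http_request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
)

# Record one completed request
http_requests_total.labels(method="GET", endpoint="/health", status_class="2xx").inc()
http_request_duration.labels(method="GET", endpoint="/health").observe(0.042)

# Read the counter back from the default registry
value = REGISTRY.get_sample_value(
    "matih_http_requests_total",
    {"method": "GET", "endpoint": "/health", "status_class": "2xx"},
)
print(value)  # 1.0
```

In practice this recording logic lives in request middleware rather than handler code, so every endpoint is instrumented uniformly.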

Java Services (Spring Boot)

Java services use Spring Boot Micrometer with Prometheus registry:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class CustomMetrics {
    private final Counter provisioningStarted;
    private final Counter provisioningCompleted;
    private final Timer provisioningDuration;

    public CustomMetrics(MeterRegistry registry) {
        this.provisioningStarted = Counter.builder("matih.provisioning.started")
            .tag("type", "provisioning")
            .register(registry);
        this.provisioningCompleted = Counter.builder("matih.provisioning.completed")
            .tag("type", "provisioning")
            .register(registry);
        this.provisioningDuration = Timer.builder("matih.provisioning.duration")
            .register(registry);
    }

    // Called when a provisioning run begins
    public void recordStarted() {
        provisioningStarted.increment();
    }

    // Called when a provisioning run finishes successfully
    public void recordCompleted(java.time.Duration elapsed) {
        provisioningCompleted.increment();
        provisioningDuration.record(elapsed);
    }
}

Metric Naming Conventions

| Convention | Example | Description |
|------------|---------|-------------|
| Prefix | matih_ | All custom metrics start with matih_ |
| Counters | matih_http_requests_total | _total suffix |
| Histograms | matih_http_request_duration_seconds | _seconds or _bytes suffix |
| Gauges | matih_active_sessions | Descriptive name |
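These conventions can be checked mechanically. The helper below is a hypothetical sketch (not part of the codebase) that validates a metric name against the rules in the table:

```python
import re

VALID_BASE = re.compile(r"^matih_[a-z0-9_]+$")

def follows_convention(name: str, metric_type: str) -> bool:
    """Check a metric name against the MATIH naming conventions above."""
    if not VALID_BASE.match(name):
        return False
    if metric_type == "counter":
        return name.endswith("_total")
    if metric_type == "histogram":
        return name.endswith(("_seconds", "_bytes"))
    return True  # gauges only need the prefix and a descriptive name

print(follows_convention("matih_http_requests_total", "counter"))  # True
print(follows_convention("matih_active_sessions", "gauge"))        # True
print(follows_convention("active_sessions", "gauge"))              # False
```

A check like this could run in code review tooling or a unit test to catch naming drift early.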

Key Custom Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_http_requests_total | Counter | method, endpoint, status_class | HTTP request counts |
| matih_http_request_duration_seconds | Histogram | method, endpoint | Request latency |
| matih_llm_requests_total | Counter | model, status | LLM API calls |
| matih_llm_tokens_total | Counter | model, direction | Token consumption |
| matih_provisioning_started_total | Counter | type | Provisioning starts |
| matih_provisioning_completed_total | Counter | type | Provisioning completions |
| matih_provisioning_failed_total | Counter | type, step | Provisioning failures |
| matih_provisioning_step_duration_seconds | Histogram | step | Step durations |
| matih_active_sessions | Gauge | tenant_id | Active sessions |
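The status_class label exists to keep cardinality low: instead of one label value per HTTP status code, codes are collapsed into their class. A hypothetical helper illustrating the idea:

```python
def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class (2xx, 4xx, ...)
    so the status_class label stays low-cardinality."""
    return f"{status_code // 100}xx"

print(status_class(200))  # 2xx
print(status_class(404))  # 4xx
print(status_class(503))  # 5xx
```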

Provisioning Pipeline Metrics

The tenant provisioning pipeline exports detailed metrics at every stage of the provisioning lifecycle. These metrics are recorded by ProvisioningMetrics (a Spring @Component with MeterRegistry injection) and are wired into both ProvisioningStepExecutor and ProvisioningService.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_provisioning_k8s_api | Timer | operation, tier, outcome | Per-method K8s API call duration (createNamespace, createResourceQuota, etc.) |
| matih_provisioning_helm_deploy | Timer | service, tier, outcome | Per-service Helm deployment duration |
| matih_provisioning_data_infra | Timer | operation, tier, outcome | Data infrastructure operation duration (provision_database, setup_cache, etc.) |
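The Timers above pair a duration with an outcome label so that success and failure latencies can be graphed separately. The actual implementation is Java/Micrometer; the following is a Python sketch of the same pattern (assuming prometheus_client, with the outcome label added for illustration):

```python
import time
from prometheus_client import Histogram, REGISTRY

provisioning_step_duration = Histogram(
    "matih_provisioning_step_duration_seconds",
    "Provisioning step durations",
    ["step", "outcome"],
)

def timed_step(step, fn):
    """Run fn, recording its duration and a success/failure outcome."""
    start = time.perf_counter()
    outcome = "success"
    try:
        return fn()
    except Exception:
        outcome = "failure"
        raise
    finally:
        provisioning_step_duration.labels(step=step, outcome=outcome).observe(
            time.perf_counter() - start
        )

timed_step("create_namespace", lambda: None)
count = REGISTRY.get_sample_value(
    "matih_provisioning_step_duration_seconds_count",
    {"step": "create_namespace", "outcome": "success"},
)
print(count)  # 1.0
```

Recording the outcome in a finally block guarantees a sample is emitted even when the step throws.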

Domain Metrics by Service

Each control-plane service exports domain-specific metrics via dedicated @Component metrics classes that follow the same MeterRegistry injection pattern.

API Gateway (GatewayMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_gateway_requests_total | Counter | service, tenant, status | Gateway request counts |
| matih_gateway_auth_failures_total | Counter | reason | Authentication failure counts |
| matih_gateway_route_latency_seconds | Timer | route | Route-level latency with percentile histogram |

Billing Service (BillingMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_billing_operations_total | Counter | operation, tier | Billing operation counts |
| matih_billing_metering_records_total | Counter | - | Metering record ingestion count |
| matih_billing_invoice_generation_seconds | Timer | - | Invoice generation duration with percentile histogram |

Notification Service (NotificationMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_notification_sent_total | Counter | channel, status | Notification delivery counts |
| matih_notification_delivery_failures_total | Counter | channel, error_type | Delivery failure counts |
| matih_notification_latency_seconds | Timer | channel | Notification delivery latency with percentile histogram |

IAM Service (IamMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_iam_auth_attempts_total | Counter | outcome | Authentication attempt counts |
| matih_iam_token_generations_total | Counter | - | Token generation counts |
| matih_iam_role_operations_total | Counter | operation | Role management operation counts |

dbt Server Metrics

The dbt-server (Python/FastAPI) exports both auto-instrumented HTTP metrics via prometheus-fastapi-instrumentator and custom dbt operation metrics.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| dbt_run_duration_seconds | Histogram | command, project, outcome | dbt command execution duration (buckets: 1s to 30min) |
| dbt_run_total | Counter | command, outcome | Total dbt command executions |
| dbt_model_compilation_errors_total | Counter | project | dbt model compilation errors |
| dbt_active_runs | Gauge | - | Currently running dbt commands |
| dbt_test_results_total | Counter | outcome | dbt test results (pass/fail/warn/error) |
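Because dbt runs span seconds to tens of minutes, the default Prometheus buckets (which top out at 10s) are a poor fit. A sketch of an explicit bucket definition covering 1s to 30min follows; the exact boundaries here are an assumption, not necessarily the server's actual values:

```python
from prometheus_client import Histogram, REGISTRY

# Explicit buckets spanning 1s to 30min, per the dbt metrics table above.
dbt_run_duration = Histogram(
    "dbt_run_duration_seconds",
    "dbt command execution duration",
    ["command", "project", "outcome"],
    buckets=[1, 5, 15, 30, 60, 120, 300, 600, 1200, 1800],
)

dbt_run_duration.labels(command="run", project="analytics", outcome="success").observe(42.0)
total = REGISTRY.get_sample_value(
    "dbt_run_duration_seconds_sum",
    {"command": "run", "project": "analytics", "outcome": "success"},
)
print(total)  # 42.0
```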

Exposing Metrics

Python (FastAPI)

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Mount the Prometheus exposition endpoint at /metrics
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Java (Spring Boot)

Add to application.yml (Spring Boot 2.x keys shown; Spring Boot 3.x renames the export flag to management.prometheus.metrics.export.enabled):

management:
  endpoints:
    web:
      exposure:
        include: prometheus, health
  metrics:
    export:
      prometheus:
        enabled: true
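With the endpoint exposed, Prometheus scrapes it on its regular interval. A hypothetical scrape job is sketched below; the job name, target, and port are placeholders, and Spring Boot serves metrics at /actuator/prometheus rather than /metrics:

```yaml
scrape_configs:
  - job_name: matih-billing          # placeholder job name
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["billing-service:8080"]  # placeholder host:port
```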

Best Practices

  • Always use the matih_ prefix for custom metrics
  • Keep label cardinality low (avoid high-cardinality labels like user_id)
  • Use histograms over summaries for latency metrics (histograms are aggregatable)
  • Define explicit histogram buckets matching your SLO thresholds