MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Custom Metrics

MATIH services export custom application-level metrics using Prometheus client libraries. Python services use prometheus_client, and Java services use Spring Boot Micrometer. These custom metrics complement the default infrastructure metrics with business and operational insights.


Python Services (AI Service)

Python services use the prometheus_client library:

from prometheus_client import Counter, Histogram, Gauge
 
# Request metrics
http_requests_total = Counter(
    "matih_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],
)
 
http_request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
 
# LLM metrics
llm_requests_total = Counter(
    "matih_llm_requests_total",
    "Total LLM API calls",
    ["model", "status"],
)
 
llm_tokens_total = Counter(
    "matih_llm_tokens_total",
    "Total LLM tokens consumed",
    ["model", "direction"],  # direction: input/output
)
 
# Active connections
active_sessions = Gauge(
    "matih_active_sessions",
    "Currently active user sessions",
    ["tenant_id"],
)
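For illustration, here is a minimal sketch (assuming prometheus_client is installed) of how these metrics are updated at request time. The label values shown are examples only; names mirror the definitions above:

```python
from prometheus_client import Counter, Histogram, REGISTRY

http_requests_total = Counter(
    "matih_http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_class"],
)
http_request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "HTTP request duration",
    ["method", "endpoint"],
)

# Record one completed request
http_requests_total.labels(method="GET", endpoint="/health", status_class="2xx").inc()
http_request_duration.labels(method="GET", endpoint="/health").observe(0.042)

# Read the counter back from the default registry
value = REGISTRY.get_sample_value(
    "matih_http_requests_total",
    {"method": "GET", "endpoint": "/health", "status_class": "2xx"},
)
print(value)  # 1.0
```

In practice this recording logic lives in request middleware rather than handler code, so every endpoint is instrumented uniformly.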

Java Services (Spring Boot)

Java services use Spring Boot Micrometer with Prometheus registry:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class CustomMetrics {
    private final Counter provisioningStarted;
    private final Counter provisioningCompleted;
    private final Timer provisioningDuration;

    public CustomMetrics(MeterRegistry registry) {
        this.provisioningStarted = Counter.builder("matih.provisioning.started")
            .tag("type", "provisioning")
            .register(registry);
        this.provisioningCompleted = Counter.builder("matih.provisioning.completed")
            .tag("type", "provisioning")
            .register(registry);
        this.provisioningDuration = Timer.builder("matih.provisioning.duration")
            .register(registry);
    }

    // Called when a provisioning run begins
    public void recordStarted() {
        provisioningStarted.increment();
    }

    // Called when a provisioning run finishes successfully
    public void recordCompleted(java.time.Duration elapsed) {
        provisioningCompleted.increment();
        provisioningDuration.record(elapsed);
    }
}

Metric Naming Conventions

| Convention | Example | Description |
|------------|---------|-------------|
| Prefix | matih_ | All custom metrics start with matih_ |
| Counters | matih_http_requests_total | _total suffix |
| Histograms | matih_http_request_duration_seconds | _seconds or _bytes suffix |
| Gauges | matih_active_sessions | Descriptive name |
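These conventions can be checked mechanically. The helper below is a hypothetical sketch (not part of the codebase) that validates a metric name against the rules in the table:

```python
import re

VALID_BASE = re.compile(r"^matih_[a-z0-9_]+$")

def follows_convention(name: str, metric_type: str) -> bool:
    """Check a metric name against the MATIH naming conventions above."""
    if not VALID_BASE.match(name):
        return False
    if metric_type == "counter":
        return name.endswith("_total")
    if metric_type == "histogram":
        return name.endswith(("_seconds", "_bytes"))
    return True  # gauges only need the prefix and a descriptive name

print(follows_convention("matih_http_requests_total", "counter"))  # True
print(follows_convention("matih_active_sessions", "gauge"))        # True
print(follows_convention("active_sessions", "gauge"))              # False
```

A check like this could run in code review tooling or a unit test to catch naming drift early.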

Key Custom Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_http_requests_total | Counter | method, endpoint, status_class | HTTP request counts |
| matih_http_request_duration_seconds | Histogram | method, endpoint | Request latency |
| matih_llm_requests_total | Counter | model, status | LLM API calls |
| matih_llm_tokens_total | Counter | model, direction | Token consumption |
| matih_provisioning_started_total | Counter | type | Provisioning starts |
| matih_provisioning_completed_total | Counter | type | Provisioning completions |
| matih_provisioning_failed_total | Counter | type, step | Provisioning failures |
| matih_provisioning_step_duration_seconds | Histogram | step | Step durations |
| matih_active_sessions | Gauge | tenant_id | Active sessions |
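The status_class label exists to keep cardinality low: instead of one label value per HTTP status code, codes are collapsed into their class. A hypothetical helper illustrating the idea:

```python
def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class (2xx, 4xx, ...)
    so the status_class label stays low-cardinality."""
    return f"{status_code // 100}xx"

print(status_class(200))  # 2xx
print(status_class(404))  # 4xx
print(status_class(503))  # 5xx
```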

Provisioning Pipeline Metrics

The tenant provisioning pipeline exports detailed metrics at every stage of the provisioning lifecycle. These metrics are recorded by ProvisioningMetrics (a Spring @Component with MeterRegistry injection) and are wired into both ProvisioningStepExecutor and ProvisioningService.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_provisioning_k8s_api | Timer | operation, tier, outcome | Per-method K8s API call duration (createNamespace, createResourceQuota, etc.) |
| matih_provisioning_helm_deploy | Timer | service, tier, outcome | Per-service Helm deployment duration |
| matih_provisioning_data_infra | Timer | operation, tier, outcome | Data infrastructure operation duration (provision_database, setup_cache, etc.) |
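The Timers above pair a duration with an outcome label so that success and failure latencies can be graphed separately. The actual implementation is Java/Micrometer; the following is a Python sketch of the same pattern (assuming prometheus_client, with the outcome label added for illustration):

```python
import time
from prometheus_client import Histogram, REGISTRY

provisioning_step_duration = Histogram(
    "matih_provisioning_step_duration_seconds",
    "Provisioning step durations",
    ["step", "outcome"],
)

def timed_step(step, fn):
    """Run fn, recording its duration and a success/failure outcome."""
    start = time.perf_counter()
    outcome = "success"
    try:
        return fn()
    except Exception:
        outcome = "failure"
        raise
    finally:
        provisioning_step_duration.labels(step=step, outcome=outcome).observe(
            time.perf_counter() - start
        )

timed_step("create_namespace", lambda: None)
count = REGISTRY.get_sample_value(
    "matih_provisioning_step_duration_seconds_count",
    {"step": "create_namespace", "outcome": "success"},
)
print(count)  # 1.0
```

Recording the outcome in a finally block guarantees a sample is emitted even when the step throws.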

Domain Metrics by Service

Each control-plane service exports domain-specific metrics via dedicated @Component metrics classes that follow the same MeterRegistry injection pattern.

API Gateway (GatewayMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_gateway_requests_total | Counter | service, tenant, status | Gateway request counts |
| matih_gateway_auth_failures_total | Counter | reason | Authentication failure counts |
| matih_gateway_route_latency_seconds | Timer | route | Route-level latency with percentile histogram |

Billing Service (BillingMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_billing_operations_total | Counter | operation, tier | Billing operation counts |
| matih_billing_metering_records_total | Counter | - | Metering record ingestion count |
| matih_billing_invoice_generation_seconds | Timer | - | Invoice generation duration with percentile histogram |

Notification Service (NotificationMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_notification_sent_total | Counter | channel, status | Notification delivery counts |
| matih_notification_delivery_failures_total | Counter | channel, error_type | Delivery failure counts |
| matih_notification_latency_seconds | Timer | channel | Notification delivery latency with percentile histogram |

IAM Service (IamMetrics)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| matih_iam_auth_attempts_total | Counter | outcome | Authentication attempt counts |
| matih_iam_token_generations_total | Counter | - | Token generation counts |
| matih_iam_role_operations_total | Counter | operation | Role management operation counts |

dbt Server Metrics

The dbt-server (Python/FastAPI) exports both auto-instrumented HTTP metrics via prometheus-fastapi-instrumentator and custom dbt operation metrics.

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| dbt_run_duration_seconds | Histogram | command, project, outcome | dbt command execution duration (buckets: 1s to 30min) |
| dbt_run_total | Counter | command, outcome | Total dbt command executions |
| dbt_model_compilation_errors_total | Counter | project | dbt model compilation errors |
| dbt_active_runs | Gauge | - | Currently running dbt commands |
| dbt_test_results_total | Counter | outcome | dbt test results (pass/fail/warn/error) |
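Because dbt runs span seconds to tens of minutes, the default Prometheus buckets (which top out at 10s) are a poor fit. A sketch of an explicit bucket definition covering 1s to 30min follows; the exact boundaries here are an assumption, not necessarily the server's actual values:

```python
from prometheus_client import Histogram, REGISTRY

# Explicit buckets spanning 1s to 30min, per the dbt metrics table above.
dbt_run_duration = Histogram(
    "dbt_run_duration_seconds",
    "dbt command execution duration",
    ["command", "project", "outcome"],
    buckets=[1, 5, 15, 30, 60, 120, 300, 600, 1200, 1800],
)

dbt_run_duration.labels(command="run", project="analytics", outcome="success").observe(42.0)
total = REGISTRY.get_sample_value(
    "dbt_run_duration_seconds_sum",
    {"command": "run", "project": "analytics", "outcome": "success"},
)
print(total)  # 42.0
```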

Exposing Metrics

Python (FastAPI)

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Mount the Prometheus exposition endpoint at /metrics
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Java (Spring Boot)

Add to application.yml (Spring Boot 2.x keys shown; Spring Boot 3.x renames the export flag to management.prometheus.metrics.export.enabled):

management:
  endpoints:
    web:
      exposure:
        include: prometheus, health
  metrics:
    export:
      prometheus:
        enabled: true
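With the endpoint exposed, Prometheus scrapes it on its regular interval. A hypothetical scrape job is sketched below; the job name, target, and port are placeholders, and Spring Boot serves metrics at /actuator/prometheus rather than /metrics:

```yaml
scrape_configs:
  - job_name: matih-billing          # placeholder job name
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["billing-service:8080"]  # placeholder host:port
```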

Best Practices

  • Always use the matih_ prefix for custom metrics
  • Keep label cardinality low (avoid high-cardinality labels like user_id)
  • Use histograms over summaries for latency metrics (histograms are aggregatable)
  • Define explicit histogram buckets matching your SLO thresholds