MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Chapter 19: Observability and Operations

Operating a platform with 30+ microservices, 12+ data infrastructure components, and multi-tenant workloads demands comprehensive observability. MATIH implements a three-pillar observability strategy (metrics, traces, and logs), augmented with alerting, operational runbooks, disaster recovery procedures, and automated health checks. This chapter is a guide to monitoring, diagnosing, and operating the MATIH platform in production.


What You Will Learn

By the end of this chapter, you will understand:

  • Monitoring with Prometheus for metrics collection, Grafana for visualization, and ServiceMonitor CRDs for automatic target discovery
  • Distributed tracing with OpenTelemetry instrumentation and Tempo as the trace storage backend, enabling request-level visibility across microservice boundaries
  • Structured logging with structlog (Python) and Spring logging (Java), collected by Fluent-bit/Promtail into Loki for centralized querying
  • Alerting through Alertmanager with severity-based routing, PagerDuty integration, and escalation policies
  • Operational runbooks for common scenarios including service failures, database issues, scaling events, and tenant provisioning problems
  • Disaster recovery procedures for data backup, service restoration, and cross-region failover
  • Health checks using platform-wide validation scripts and per-service health endpoints
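
As a rough illustration of the structured-logging item above, the sketch below emits one JSON object per log line using only the Python standard library as a stand-in for structlog. The `JsonFormatter` class, the `build_logger` helper, and the field names (`tenant_id`, `trace_id`, `duration_ms`) are assumptions made for this example, not MATIH's actual log schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical schema)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Key-value context is passed via `extra={"context": {...}}`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
log = build_logger("example-service", buffer)
log.info(
    "query_completed",
    extra={"context": {"tenant_id": "tenant-a", "trace_id": "abc123", "duration_ms": 42}},
)
line = buffer.getvalue().strip()
record = json.loads(line)
print(line)
```

Keeping every record on a single line with stable keys lets a log collector forward it unchanged and lets queries filter on fields rather than on free text. Carrying a trace identifier in each record is one common way to jump from a log line to the matching trace.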

Chapter Structure

| Section | Description | Audience |
|---|---|---|
| Monitoring | Prometheus, Grafana, ServiceMonitors, dashboards, and custom metrics | SREs, platform engineers |
| Distributed Tracing | OpenTelemetry, Tempo, trace propagation, and span analysis | Backend developers, SREs |
| Structured Logging | structlog, Spring logging, Loki, Fluent-bit/Promtail, log queries | All developers |
| Alerting | Alertmanager, PagerDuty, severity routing, incident response | SREs, on-call engineers |
| Operational Runbooks | Step-by-step procedures for common operational scenarios | SREs, on-call engineers |
| Disaster Recovery | Backup, restore, failover, and recovery procedures | Platform engineers, SREs |
| Health Checks | Platform status scripts, per-service health endpoints | All engineers |
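
The trace propagation mentioned in the Distributed Tracing row can be sketched with the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry. This is a simplified illustration, not MATIH's instrumentation: the function names are hypothetical, only version `00` headers are accepted, and the all-zero ID checks required by the specification are omitted.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Build the header a service would send on its outgoing calls.

    The trace ID and flags are preserved and only the span ID changes, so a
    trace backend can stitch spans from different services into one trace.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Missing or malformed header: start a fresh trace instead.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Usage: service A starts a trace, service B continues it.
incoming = new_traceparent()
outgoing = child_traceparent(incoming)
recovered = child_traceparent("not-a-valid-header")
```

In practice the OpenTelemetry SDK handles this header automatically. The sketch only shows which parts of the context survive a hop between services.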

Observability Architecture

+---------------------------------------------------------------+
|  Application Services                                          |
|  +-- Metrics (Prometheus client) --> Prometheus --> Grafana    |
|  +-- Traces  (OpenTelemetry SDK) --> Tempo --> Grafana        |
|  +-- Logs    (structlog/Spring)  --> Fluent-bit/Promtail      |
|  |                                     --> Loki --> Grafana    |
+---------------------------------------------------------------+
                      |
                      v
+---------------------------------------------------------------+
|  matih-observability namespace                                 |
|                                                                |
|  +-- Prometheus --+  +-- Grafana ---+  +-- Loki --------+    |
|  | Server         |  | Dashboards   |  | Log aggregation|    |
|  | Alertmanager   |  | Data sources |  | Retention: 7d  |    |
|  | Retention: 15d |  | Alerts       |  +---------------+    |
|  +---------------+  +--------------+                         |
|                                                                |
|  +-- Tempo --------+  +-- Promtail/Fluent-bit (DaemonSet) -+ |
|  | Trace storage    |  | Log collection from every node     | |
|  | Retention: 7d    |  | Labels: namespace, pod, container  | |
|  +------------------+  +-----------------------------------+ |
+---------------------------------------------------------------+
                      |
                      v
+---------------------------------------------------------------+
|  Alert Routing                                                 |
|  +-- Alertmanager --> Slack (#matih-alerts)                   |
|  |               --> PagerDuty (critical only)                |
|  |               --> Email (weekly digests)                    |
+---------------------------------------------------------------+
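
The alert routing layer in the diagram can be approximated in a few lines. This is a sketch under stated assumptions, not the Alertmanager configuration itself: it assumes every alert goes to the Slack channel, that PagerDuty is paged only for a `severity` value of `critical`, and that the weekly email digest is a count of alerts per severity. The `Alert` class, the receiver strings, and the sample alert names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # assumed label values: "critical", "warning", "info"


def receivers_for(alert: Alert) -> List[str]:
    """Return the receivers an alert is delivered to."""
    receivers = ["slack:#matih-alerts"]  # assumption: Slack sees every alert
    if alert.severity == "critical":
        receivers.append("pagerduty")  # page on-call for critical alerts only
    return receivers


def weekly_digest(alerts: Iterable[Alert]) -> Dict[str, int]:
    """Summarize a week of alerts as counts per severity for the email digest."""
    return dict(Counter(alert.severity for alert in alerts))


# Usage with illustrative alerts.
week = [
    Alert("ServiceDown", "critical"),
    Alert("HighLatency", "warning"),
    Alert("HighLatency", "warning"),
]
routes = {alert.name: receivers_for(alert) for alert in week}
digest = weekly_digest(week)
```

Restricting paging to critical alerts keeps on-call noise low, while the chat channel and digest preserve visibility into lower-severity signals.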

Observability Stack Components

| Component | Version | Namespace | Purpose |
|---|---|---|---|
| Prometheus | 2.48+ | matih-observability | Metrics collection and storage |
| Alertmanager | 0.26+ | matih-observability | Alert routing and deduplication |
| Grafana | 10.2+ | matih-observability | Dashboard visualization |
| Loki | 2.9+ | matih-observability | Log aggregation |
| Promtail | 2.9+ | matih-observability | Log collection (DaemonSet) |
| Tempo | 2.3+ | matih-observability | Distributed trace storage |
| OpenTelemetry Collector | 0.90+ | matih-observability | Trace/metric collection gateway |
| Observability API | 1.0.0 | matih-observability | Unified query API for dashboards |

Monitoring Namespaces

MATIH separates monitoring resources from application resources using dedicated monitoring namespaces:

| Namespace | Contents | Purpose |
|---|---|---|
| matih-observability | Prometheus, Grafana, Loki, Tempo | Observability infrastructure |
| matih-monitoring-control-plane | ServiceMonitors, PrometheusRules | CP monitoring configuration |
| matih-monitoring-data-plane | ServiceMonitors, PrometheusRules | DP monitoring configuration |

This separation provides:

  • Access control: Monitoring admins can manage ServiceMonitors without accessing application resources
  • Blast radius: Monitoring configuration changes do not affect application deployments
  • Audit trail: Changes to monitoring are tracked separately from application changes

Key Metrics at a Glance

| Metric Category | Example Metric | Alert Threshold |
|---|---|---|
| Availability | up metric filtered by job | == 0 for 5 min |
| Latency | http_server_requests_seconds_bucket | p95 > 2s for 5 min |
| Error rate | HTTP request count filtered by 5xx status | > 5% for 5 min |
| Saturation | container_memory_working_set_bytes | > 80% of limit |
| Throughput | http_server_requests_seconds_count | Baseline dependent |
| AI-specific | ai_service_llm_token_usage_rate | > budget threshold |
| Kafka | kafka_consumer_group_lag | > 10000 messages |
| Database | pg_stat_activity_count | > 80% of max_connections |
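
To show how thresholds like these might translate into alert conditions, the sketch below pairs a few candidate PromQL expressions with a small function that mimics a "for 5 min" hold period. The expressions are assumptions, not the platform's actual rules: the `job` value is a placeholder, and the `status` label on the request metrics depends on how the services are instrumented. The `is_firing` helper is hypothetical and only approximates how a rule moves from pending to firing.

```python
from datetime import datetime, timedelta

# Candidate PromQL for some rows of the table (label names are assumptions).
ALERT_EXPRESSIONS = {
    "availability": 'up{job="example-service"} == 0',
    "latency_p95": (
        "histogram_quantile(0.95, sum by (le, job) "
        "(rate(http_server_requests_seconds_bucket[5m]))) > 2"
    ),
    "error_rate": (
        'sum by (job) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) '
        "/ sum by (job) (rate(http_server_requests_seconds_count[5m])) > 0.05"
    ),
    "kafka_lag": "kafka_consumer_group_lag > 10000",
}


def is_firing(samples, threshold, hold=timedelta(minutes=5)):
    """Return True if the last sample ends a breach lasting at least `hold`.

    `samples` is an iterable of (timestamp, value) pairs in time order. Any
    sample at or below the threshold resets the pending timer.
    """
    pending_since = None
    firing = False
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp
            firing = timestamp - pending_since >= hold
        else:
            pending_since = None
            firing = False
    return firing


# Usage: p95 latency sampled once a minute against the 2s threshold.
start = datetime(2024, 1, 1, 12, 0)
sustained = [(start + timedelta(minutes=i), 2.5) for i in range(6)]
with_dip = [(ts, 1.0 if i == 3 else v) for i, (ts, v) in enumerate(sustained)]
sustained_fires = is_firing(sustained, threshold=2.0)
dip_fires = is_firing(with_dip, threshold=2.0)
```

The hold period is what keeps a brief spike from paging anyone: a single sample below the threshold restarts the timer.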

Prerequisites

Before working with MATIH observability:

  • Familiarity with Prometheus query language (PromQL)
  • Understanding of distributed tracing concepts (spans, traces, context propagation)
  • Basic knowledge of log querying with LogQL (Loki)
  • Access to the MATIH Grafana instance
