Chapter 19: Observability and Operations
Operating a platform with 30+ microservices, 12+ data infrastructure components, and multi-tenant workloads demands comprehensive observability. MATIH implements a three-pillar observability strategy -- metrics, traces, and logs -- augmented with alerting, operational runbooks, disaster recovery procedures, and automated health checks. This chapter provides a complete guide to monitoring, diagnosing, and operating the MATIH platform in production.
What You Will Learn
By the end of this chapter, you will understand:
- Monitoring with Prometheus for metrics collection, Grafana for visualization, and ServiceMonitor CRDs for automatic target discovery
- Distributed tracing with OpenTelemetry instrumentation and Tempo as the trace storage backend, enabling request-level visibility across microservice boundaries
- Structured logging with structlog (Python) and Spring logging (Java), collected by Fluent-bit/Promtail into Loki for centralized querying
- Alerting through Alertmanager with severity-based routing, PagerDuty integration, and escalation policies
- Operational runbooks for common scenarios including service failures, database issues, scaling events, and tenant provisioning problems
- Disaster recovery procedures for data backup, service restoration, and cross-region failover
- Health checks using platform-wide validation scripts and per-service health endpoints
Chapter Structure
| Section | Description | Audience |
|---|---|---|
| Monitoring | Prometheus, Grafana, ServiceMonitors, dashboards, and custom metrics | SREs, platform engineers |
| Distributed Tracing | OpenTelemetry, Tempo, trace propagation, and span analysis | Backend developers, SREs |
| Structured Logging | structlog, Spring logging, Loki, Fluent-bit/Promtail, log queries | All developers |
| Alerting | Alertmanager, PagerDuty, severity routing, incident response | SREs, on-call engineers |
| Operational Runbooks | Step-by-step procedures for common operational scenarios | SREs, on-call engineers |
| Disaster Recovery | Backup, restore, failover, and recovery procedures | Platform engineers, SREs |
| Health Checks | Platform status scripts, per-service health endpoints | All engineers |
Observability Architecture
+---------------------------------------------------------------+
| Application Services                                          |
|   +-- Metrics (Prometheus client) --> Prometheus --> Grafana  |
|   +-- Traces (OpenTelemetry SDK)  --> Tempo      --> Grafana  |
|   +-- Logs (structlog/Spring)     --> Fluent-bit/Promtail     |
|                                   --> Loki       --> Grafana  |
+---------------------------------------------------------------+
                                |
                                v
+---------------------------------------------------------------+
| matih-observability namespace                                 |
|                                                               |
|  +-- Prometheus --+  +-- Grafana ---+  +-- Loki --------+     |
|  | Server         |  | Dashboards   |  | Log aggregation|     |
|  | Alertmanager   |  | Data sources |  | Retention: 7d  |     |
|  | Retention: 15d |  | Alerts       |  +----------------+     |
|  +----------------+  +--------------+                         |
|                                                               |
|  +-- Tempo --------+  +-- Promtail/Fluent-bit (DaemonSet) -+  |
|  | Trace storage   |  | Log collection from every node     |  |
|  | Retention: 7d   |  | Labels: namespace, pod, container  |  |
|  +-----------------+  +------------------------------------+  |
+---------------------------------------------------------------+
                                |
                                v
+---------------------------------------------------------------+
| Alert Routing                                                 |
|   +-- Alertmanager --> Slack (#matih-alerts)                  |
|                    --> PagerDuty (critical only)              |
|                    --> Email (weekly digests)                 |
+---------------------------------------------------------------+
Observability Stack Components
| Component | Version | Namespace | Purpose |
|---|---|---|---|
| Prometheus | 2.48+ | matih-observability | Metrics collection and storage |
| Alertmanager | 0.26+ | matih-observability | Alert routing and deduplication |
| Grafana | 10.2+ | matih-observability | Dashboard visualization |
| Loki | 2.9+ | matih-observability | Log aggregation |
| Promtail | 2.9+ | matih-observability | Log collection (DaemonSet) |
| Tempo | 2.3+ | matih-observability | Distributed trace storage |
| OpenTelemetry Collector | 0.90+ | matih-observability | Trace/metric collection gateway |
| Observability API | 1.0.0 | matih-observability | Unified query API for dashboards |
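The severity-based routing that Alertmanager performs in this stack (Slack for alerts, PagerDuty for critical alerts only) can be illustrated with a small sketch. This is not the platform's Alertmanager configuration: the `Alert` class, the `route` function, the severity values, and the assumption that every alert is posted to Slack are hypothetical and used only to show the routing rule.

```python
from dataclasses import dataclass

# Receiver names mirror the Alert Routing box in the architecture diagram.
SLACK = "slack:#matih-alerts"
PAGERDUTY = "pagerduty"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # hypothetical label values: "critical", "warning", "info"


def route(alert: Alert) -> list[str]:
    """Return the receivers notified immediately for one alert.

    Assumption: every alert is posted to Slack. The weekly email digest is
    a scheduled summary, so it is not modeled as a per-alert receiver here.
    """
    receivers = [SLACK]
    if alert.severity == "critical":
        # PagerDuty pages on-call engineers for critical alerts only.
        receivers.append(PAGERDUTY)
    return receivers


if __name__ == "__main__":
    for alert in (Alert("ServiceDown", "critical"), Alert("HighLatency", "warning")):
        print(alert.name, "->", route(alert))
```

In Alertmanager itself, a rule like this would normally be expressed as a routing tree with label matchers on the severity label rather than as application code.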
Monitoring Namespaces
MATIH separates monitoring resources from application resources using dedicated monitoring namespaces:
| Namespace | Contents | Purpose |
|---|---|---|
| matih-observability | Prometheus, Grafana, Loki, Tempo | Observability infrastructure |
| matih-monitoring-control-plane | ServiceMonitors, PrometheusRules | Control plane monitoring configuration |
| matih-monitoring-data-plane | ServiceMonitors, PrometheusRules | Data plane monitoring configuration |
This separation provides:
- Access control: Monitoring admins can manage ServiceMonitors without accessing application resources
- Blast radius: Monitoring configuration changes do not affect application deployments
- Audit trail: Changes to monitoring are tracked separately from application changes
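To make the separation concrete, the sketch below builds a ServiceMonitor manifest that lives in a monitoring namespace but selects Services in a different application namespace. The `apiVersion`, `kind`, and `spec` fields follow the Prometheus Operator ServiceMonitor resource; the function name, the service name `query-engine`, the application namespace `matih-data-plane`, the `app` label key, and the port name are illustrative assumptions, not values taken from the MATIH repository.

```python
import json


def build_service_monitor(
    name: str,
    monitoring_namespace: str,
    target_namespace: str,
    app_label: str,
    port: str = "http-metrics",  # hypothetical named port on the Service
    path: str = "/metrics",      # assumed metrics path
    interval: str = "30s",       # assumed scrape interval
) -> dict:
    """Build a ServiceMonitor that is stored apart from the workload it scrapes."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        # The resource itself lives in the monitoring namespace.
        "metadata": {"name": name, "namespace": monitoring_namespace},
        "spec": {
            # Discover Services in the application namespace instead.
            "namespaceSelector": {"matchNames": [target_namespace]},
            "selector": {"matchLabels": {"app": app_label}},
            "endpoints": [{"port": port, "path": path, "interval": interval}],
        },
    }


if __name__ == "__main__":
    manifest = build_service_monitor(
        name="query-engine",
        monitoring_namespace="matih-monitoring-data-plane",
        target_namespace="matih-data-plane",  # hypothetical application namespace
        app_label="query-engine",
    )
    print(json.dumps(manifest, indent=2))
```

For a manifest like this to take effect, the Prometheus instance must also be configured to pick up ServiceMonitors from the monitoring namespaces; how MATIH does that is not shown here.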
Key Metrics at a Glance
| Metric Category | Example Metric | Alert Threshold |
|---|---|---|
| Availability | up metric filtered by job | == 0 for 5 min |
| Latency | http_server_requests_seconds_bucket | p95 > 2s for 5 min |
| Error rate | HTTP request count filtered by 5xx status | > 5% for 5 min |
| Saturation | container_memory_working_set_bytes | > 80% of limit |
| Throughput | http_server_requests_seconds_count | Baseline dependent |
| AI-specific | ai_service_llm_token_usage_rate | > budget threshold |
| Kafka | kafka_consumer_group_lag | > 10,000 messages |
| Database | pg_stat_activity_count | > 80% of max_connections |
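The latency and error-rate thresholds above are derived values, not raw metrics. The sketch below shows, under simplifying assumptions, how a p95 can be estimated from cumulative histogram buckets (the same linear interpolation idea behind PromQL's `histogram_quantile`) and how an error ratio is compared against the 5% threshold. The sample bucket counts and request counts are invented for illustration, and the "for 5 min" duration condition is not modeled.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with an infinite bound,
    like the `le` buckets of http_server_requests_seconds_bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


def error_ratio(errors_5xx: float, total_requests: float) -> float:
    """Share of requests that returned a 5xx status."""
    return errors_5xx / total_requests if total_requests else 0.0


# Invented sample: cumulative request counts per latency bucket (seconds).
sample_buckets = [(0.5, 80.0), (1.0, 90.0), (2.0, 96.0), (math.inf, 100.0)]
p95 = histogram_quantile(0.95, sample_buckets)
latency_breached = p95 > 2.0                      # threshold: p95 > 2s
errors_breached = error_ratio(7, 100) > 0.05      # threshold: > 5%

if __name__ == "__main__":
    print(f"p95 = {p95:.2f}s, latency breached: {latency_breached}")
    print(f"error ratio breached: {errors_breached}")
```

In this sample the p95 lands inside the 1s to 2s bucket, so the latency threshold is not breached, while 7 errors in 100 requests exceeds the 5% error threshold. In Prometheus the equivalent latency check would typically apply `histogram_quantile(0.95, ...)` to a `rate()` over the bucket series.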
Prerequisites
Before working with MATIH observability:
- Familiarity with Prometheus query language (PromQL)
- Understanding of distributed tracing concepts (spans, traces, context propagation)
- Basic knowledge of log querying with LogQL (Loki)
- Access to the MATIH Grafana instance