MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Prometheus Rules

MATIH uses Prometheus recording rules for pre-computed aggregations and alerting rules for automated incident detection. Rules are organized into groups covering service health, HTTP errors, provisioning, and SLO compliance.


Rule Files

| File | Description |
| --- | --- |
| prometheus/alerts/matih-alerts.yml | Service health and HTTP error alerts |
| prometheus/rules/provisioning.yaml | Provisioning pipeline alerts |
| prometheus/rules/platform.yaml | Platform-level recording rules |
| prometheus/rules/slo.yaml | SLO compliance rules |
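
If Prometheus is configured from a static file rather than through an operator, loading these rule files might look like the sketch below. It assumes the paths in the table are relative to the directory that holds prometheus.yml; the evaluation interval is an illustrative value, not a documented MATIH setting.

```yaml
# prometheus.yml (sketch): relative rule paths are resolved against this file's directory
global:
  evaluation_interval: 30s   # assumed value; a rule group can override it with its own interval
rule_files:
  - prometheus/alerts/matih-alerts.yml
  - prometheus/rules/provisioning.yaml
  - prometheus/rules/platform.yaml
  - prometheus/rules/slo.yaml
```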

Service Health Alerts

| Alert | Expression | For | Severity |
| --- | --- | --- | --- |
| ServiceDown | up == 0 | 2m | critical |
| ServiceHighRestartRate | increase(container_restarts_total[1h]) > 5 | 5m | warning |
| HighMemoryUsage | Memory usage above 85% of limit | 5m | warning |
| HighCPUUsage | CPU rate above 90% | 10m | warning |
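
As a rough illustration, the first two rows of the table could be written as alerting rules like this. The expressions, durations, and severities come from the table; the group name and the summary wording are assumptions made for the sketch.

```yaml
groups:
  - name: matih.service-health   # hypothetical group name
    rules:
      - alert: ServiceDown
        expr: up == 0            # scrape target has been unreachable
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      - alert: ServiceHighRestartRate
        expr: increase(container_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container restarted more than 5 times in the last hour"
```

The `for` clause keeps an alert in the pending state until the condition has held for the whole duration, which filters out brief scrape failures.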

HTTP Error Alerts

| Alert | Expression | For | Severity |
| --- | --- | --- | --- |
| HighErrorRate | 5xx rate above 5% | 5m | critical |
| High4xxRate | 4xx rate above 20% | 5m | warning |
| SlowRequests | p95 latency above 5 seconds | 5m | warning |
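
The table describes these conditions in words only. One plausible PromQL rendering of High4xxRate and SlowRequests is sketched below, assuming the `matih_http_requests_total` counter (with its `status_class` label) and the `matih_http_request_duration_seconds_bucket` histogram used in the recording rules on this page. The `"4xx"` label value, the 5m rate window, and the group name are assumptions.

```yaml
groups:
  - name: matih.http-errors   # hypothetical group name
    rules:
      - alert: High4xxRate
        # Share of requests per job answered with a 4xx status; 0.20 = 20%
        expr: |
          sum(rate(matih_http_requests_total{status_class="4xx"}[5m])) by (job)
          /
          sum(rate(matih_http_requests_total[5m])) by (job)
          > 0.20
        for: 5m
        labels:
          severity: warning

      - alert: SlowRequests
        # p95 latency per job, in seconds, estimated from histogram buckets
        expr: |
          histogram_quantile(0.95, sum(rate(matih_http_request_duration_seconds_bucket[5m])) by (le, job)) > 5
        for: 5m
        labels:
          severity: warning
```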

Provisioning Alerts

| Alert | Expression | For | Severity |
| --- | --- | --- | --- |
| ProvisioningFailureRateHigh | Failure rate above 10% | 5m | critical |
| ProvisioningStuckInProgress | More than 10 active for 30m | 30m | warning |
| ProvisioningConsecutiveFailures | 5+ failures with 0 successes | 5m | critical |
| ProvisioningStepSlowdown | p90 step duration above 120s | 10m | warning |
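
This page does not name the provisioning metrics, so the following is only a sketch of how ProvisioningFailureRateHigh might be expressed. The metric `matih_provisioning_total`, its `status` label, the 15m window, and the group name are all hypothetical; only the 10% threshold, the 5m duration, and the severity come from the table.

```yaml
groups:
  - name: matih.provisioning   # hypothetical group name
    rules:
      - alert: ProvisioningFailureRateHigh
        # matih_provisioning_total and its status label are hypothetical names
        expr: |
          sum(rate(matih_provisioning_total{status="failed"}[15m]))
          /
          sum(rate(matih_provisioning_total[15m]))
          > 0.10
        for: 5m
        labels:
          severity: critical
```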

Recording Rules

Recording rules pre-compute expensive aggregations for dashboard performance:

groups:
  - name: matih.recording
    interval: 30s  # evaluate this group every 30 seconds
    rules:
      # Per-job request rate, split by status class (for example 5xx)
      - record: matih:http_requests:rate5m
        expr: sum(rate(matih_http_requests_total[5m])) by (job, status_class)

      # p99 request latency per job, estimated from histogram buckets
      - record: matih:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(matih_http_request_duration_seconds_bucket[5m])) by (le, job))

      # Fraction of requests per job that returned a 5xx status
      - record: matih:error_rate:5m
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[5m])) by (job)
          /
          sum(rate(matih_http_requests_total[5m])) by (job)
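
Once a series is recorded, alerts and dashboards can query it by name instead of repeating the raw expression. The sketch below shows how the HighErrorRate condition (5xx rate above 5% for 5m) could be built on `matih:error_rate:5m`; whether MATIH's alert is actually defined this way is an assumption, and the group name is hypothetical.

```yaml
groups:
  - name: matih.recorded-alerts   # hypothetical group name
    rules:
      - alert: HighErrorRate
        # Reads the pre-computed ratio; 0.05 = 5%
        expr: matih:error_rate:5m > 0.05
        for: 5m
        labels:
          severity: critical
```

The recorded series is refreshed once per group interval (30s above), so an alert built on it can lag the raw data by up to that interval.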

Alert Annotations

All alerts include standard annotations:

| Annotation | Description |
| --- | --- |
| summary | Brief description of the alert |
| description | Detailed description with current values |
| runbook_url | Link to the relevant operational runbook |
| dashboard_url | Link to the relevant Grafana dashboard |
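
An annotations block carrying all four fields might look like the sketch below. It uses Prometheus alert templating (`$labels`, `$value`, and the built-in `humanizePercentage` function) to embed current values; the message wording and both URLs are placeholders, not real MATIH endpoints.

```yaml
annotations:
  summary: "High 5xx error rate on {{ $labels.job }}"
  description: "5xx error ratio is {{ $value | humanizePercentage }} over the last 5 minutes."
  runbook_url: "https://runbooks.example.com/high-error-rate"   # placeholder URL
  dashboard_url: "https://grafana.example.com/d/placeholder"    # placeholder URL
```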

Adding Custom Rules

To add a new alerting rule, create a PrometheusRule custom resource in the matih-monitoring namespace:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: matih-monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: custom.alerts
      rules:
        - alert: CustomAlert
          expr: custom_metric > 100  # placeholder metric and threshold; substitute your own
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Custom alert description"