Prometheus Rules
MATIH uses Prometheus recording rules for pre-computed aggregations and alerting rules for automated incident detection. Rules are organized into groups covering service health, HTTP errors, provisioning, and SLO compliance.
Rule Files
| File | Description |
|---|---|
| prometheus/alerts/matih-alerts.yml | Service health and HTTP error alerts |
| prometheus/rules/provisioning.yaml | Provisioning pipeline alerts |
| prometheus/rules/platform.yaml | Platform-level recording rules |
| prometheus/rules/slo.yaml | SLO compliance rules |
Service Health Alerts
| Alert | Expression | For | Severity |
|---|---|---|---|
| ServiceDown | up == 0 | 2m | critical |
| ServiceHighRestartRate | increase(container_restarts_total[1h]) > 5 | 5m | warning |
| HighMemoryUsage | Memory usage above 85% of limit | 5m | warning |
| HighCPUUsage | CPU rate above 90% | 10m | warning |
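
The memory and CPU rows describe their conditions in prose rather than PromQL. As a rough sketch, the ServiceDown and HighMemoryUsage rows might be written as the rules below. This assumes cAdvisor's `container_memory_working_set_bytes` and kube-state-metrics' `kube_pod_container_resource_limits` are being scraped; the group name `matih.service-health` is hypothetical, and the expressions actually shipped in `prometheus/alerts/matih-alerts.yml` may differ.

```yaml
groups:
  - name: matih.service-health   # hypothetical group name
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      - alert: HighMemoryUsage
        # Assumed metrics: cAdvisor working set divided by the
        # kube-state-metrics memory limit for the same container.
        expr: |
          sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            /
          sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.85
        for: 5m
        labels:
          severity: warning
```

Containers without a memory limit have no series on the right-hand side, so under these assumptions they never match and never fire this alert.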
HTTP Error Alerts
| Alert | Expression | For | Severity |
|---|---|---|---|
| HighErrorRate | 5xx rate above 5% | 5m | critical |
| High4xxRate | 4xx rate above 20% | 5m | warning |
| SlowRequests | p95 latency above 5 seconds | 5m | warning |
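
These thresholds are also stated in prose. As a sketch, the 5xx and latency rows could be expressed with the metrics used in the Recording Rules section of this page: `matih_http_requests_total` with its `status_class` label, and the `matih_http_request_duration_seconds_bucket` histogram. The group name is hypothetical and the expressions are an assumption, not a copy of the shipped rules.

```yaml
groups:
  - name: matih.http-errors   # hypothetical group name
    rules:
      - alert: HighErrorRate
        # Fraction of requests answered with 5xx over the last 5 minutes, per job.
        expr: |
          sum by (job) (rate(matih_http_requests_total{status_class="5xx"}[5m]))
            /
          sum by (job) (rate(matih_http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: SlowRequests
        # p95 latency in seconds, estimated from the histogram buckets.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, job) (rate(matih_http_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
```

A job that receives no traffic produces a 0/0 ratio (NaN), which does not satisfy the comparison, so idle services do not trigger HighErrorRate in this sketch.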
Provisioning Alerts
| Alert | Expression | For | Severity |
|---|---|---|---|
| ProvisioningFailureRateHigh | Failure rate above 10% | 5m | critical |
| ProvisioningStuckInProgress | More than 10 active for 30m | 30m | warning |
| ProvisioningConsecutiveFailures | 5+ failures with 0 successes | 5m | critical |
| ProvisioningStepSlowdown | p90 step duration above 120s | 10m | warning |
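
The provisioning metrics are not named on this page, so the following is only an illustration of the shape such rules could take. The metric names `matih_provisioning_requests_total` and `matih_provisioning_step_duration_seconds_bucket`, their `status` and `step` labels, the rate windows, and the group name are all hypothetical; check `prometheus/rules/provisioning.yaml` for the real definitions.

```yaml
groups:
  - name: matih.provisioning   # hypothetical group name
    rules:
      - alert: ProvisioningFailureRateHigh
        # Hypothetical counter with a `status` label ("success" or "failed").
        expr: |
          sum(rate(matih_provisioning_requests_total{status="failed"}[15m]))
            /
          sum(rate(matih_provisioning_requests_total[15m]))
            > 0.10
        for: 5m
        labels:
          severity: critical
      - alert: ProvisioningStepSlowdown
        # Hypothetical histogram of per-step durations, in seconds.
        expr: |
          histogram_quantile(
            0.90,
            sum by (le, step) (rate(matih_provisioning_step_duration_seconds_bucket[10m]))
          ) > 120
        for: 10m
        labels:
          severity: warning
```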
Recording Rules
Recording rules pre-compute expensive aggregations for dashboard performance:
```yaml
groups:
  - name: matih.recording
    interval: 30s
    rules:
      - record: matih:http_requests:rate5m
        expr: sum(rate(matih_http_requests_total[5m])) by (job, status_class)
      - record: matih:http_latency:p99
        expr: histogram_quantile(0.99, sum(rate(matih_http_request_duration_seconds_bucket[5m])) by (le, job))
      - record: matih:error_rate:5m
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[5m])) by (job)
          /
          sum(rate(matih_http_requests_total[5m])) by (job)
```

Alert Annotations
All alerts include standard annotations:
| Annotation | Description |
|---|---|
| summary | Brief description of the alert |
| description | Detailed description with current values |
| runbook_url | Link to the relevant operational runbook |
| dashboard_url | Link to the relevant Grafana dashboard |
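
To illustrate how the four annotations fit together, here is a sketch of one alert that fills them in with Prometheus templating. It builds on the `matih:error_rate:5m` series from the Recording Rules section. The group name, the annotation wording, and both URLs are placeholders, not the values used by the platform.

```yaml
groups:
  - name: matih.annotation-example   # hypothetical group name
    rules:
      - alert: HighErrorRate
        expr: matih:error_rate:5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          # {{ $labels.job }} and {{ $value }} are filled in when the alert fires.
          summary: "High 5xx error rate on {{ $labels.job }}"
          description: "5xx error rate is {{ $value | humanizePercentage }} over the last 5 minutes."
          # Placeholder URLs; substitute the real runbook and dashboard locations.
          runbook_url: "https://runbooks.example.com/high-error-rate"
          dashboard_url: "https://grafana.example.com/d/example-dashboard"
```

Putting the current value in `description` via `{{ $value }}` is what lets a responder see how far past the threshold the service is without opening the dashboard first.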
Adding Custom Rules
To add a new alerting rule, create a PrometheusRule custom resource in the matih-monitoring namespace:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alerts
  namespace: matih-monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: custom.alerts
      rules:
        - alert: CustomAlert
          # 100 is a placeholder; replace it with the numeric threshold for your metric.
          expr: custom_metric > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Custom alert description"
```