Alert Rules Library
MATIH maintains a library of Prometheus alerting rules organized by category. Each alert carries a severity classification, annotations that include a runbook link, and, where available, a dashboard URL for quick investigation.
Alert Categories
| Category | Description | Rule File |
|---|---|---|
| Service Health | Service availability and resource usage | matih-alerts.yml |
| HTTP Errors | Request error rates and latency | matih-alerts.yml |
| Provisioning | Tenant provisioning pipeline | provisioning.yaml |
| SLO | Service level objective compliance | slo.yaml |
| Infrastructure | Kubernetes and database health | infrastructure.yaml |
Service Health Alerts
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ServiceDown | up == 0 | 2 minutes | critical |
| ServiceHighRestartRate | More than 5 restarts per hour | 5 minutes | warning |
| HighMemoryUsage | Memory above 85% of limit | 5 minutes | warning |
| HighCPUUsage | CPU rate above 90% | 10 minutes | warning |
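As an illustration of how a row in this table maps to a rule definition, the following is a minimal sketch of ServiceDown. The expression, duration, and severity come from the table; the group name, the category label value, the annotation wording, and the runbook URL are placeholders, not the contents of matih-alerts.yml.

```yaml
groups:
  - name: service-health            # hypothetical group name
    rules:
      - alert: ServiceDown
        expr: up == 0               # condition from the table
        for: 2m                     # duration from the table
        labels:
          severity: critical
          category: service-health  # assumed label value
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 2 minutes."
          runbook_url: "https://runbooks.example.com/service-down"  # placeholder URL
```

The `for` clause keeps the alert in the pending state until the condition has held for the full duration, so a single failed scrape does not page anyone.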
HTTP Error Alerts
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| HighErrorRate | 5xx rate above 5% | 5 minutes | critical |
| High4xxRate | 4xx rate above 20% | 5 minutes | warning |
| SlowRequests | p95 latency above 5 seconds | 5 minutes | warning |
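The ratio and percentile conditions above could be written roughly as follows. This sketch assumes a counter named `http_requests_total` with a `status` label and a histogram named `http_request_duration_seconds`. Both names are common conventions, not confirmed MATIH metrics, so substitute whatever the services actually export. Annotations are omitted for brevity.

```yaml
groups:
  - name: http-errors               # hypothetical group name
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status, per job.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: SlowRequests
        # p95 latency estimated from histogram buckets, per job.
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
```

Aggregating with `sum by (job)` before dividing gives one ratio per service rather than one per instance, which tends to be less noisy for low-traffic pods.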
Provisioning Alerts
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ProvisioningFailureRateHigh | Failure rate above 10% in 15 minutes | 5 minutes | critical |
| ProvisioningStuckInProgress | More than 10 active for 30 minutes | 30 minutes | warning |
| ProvisioningConsecutiveFailures | 5+ failures with 0 successes in 10 minutes | 5 minutes | critical |
| ProvisioningStepSlowdown | p90 step duration above 120 seconds | 10 minutes | warning |
SLO Alerts
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| SLOErrorBudgetBurnRateHigh | Burn rate exceeds 14.4x (1-hour window) | 2 minutes | critical |
| SLOLatencyBudgetLow | Latency compliance below 99% | 10 minutes | warning |
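To make the 14.4x threshold concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). Assuming a 30-day SLO window (720 hours), a sustained 14.4x burn consumes 14.4 / 720 = 2% of the budget in one hour. The sketch below assumes a 99.9% availability target and the metric name `http_requests_total`. Neither is stated in this document, so treat both as placeholders.

```yaml
groups:
  - name: slo                       # hypothetical group name
    rules:
      - alert: SLOErrorBudgetBurnRateHigh
        # burn rate = error ratio / (1 - SLO target)
        # With an assumed 99.9% target the budget is 0.001, so a 14.4x burn
        # corresponds to an error ratio of 14.4 * 0.001 = 0.0144 (1.44%).
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
```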
Infrastructure Alerts
| Alert | Condition | Duration | Severity |
|---|---|---|---|
| PostgreSQLDown | Database unreachable | 1 minute | critical |
| KafkaBrokerDown | Kafka broker offline | 2 minutes | critical |
| HighDiskUsage | Disk above 85% | 15 minutes | warning |
| CertificateExpiringSoon | TLS certificate expires in under 14 days | 1 hour | warning |
| PVCNearlyFull | PVC above 90% capacity | 10 minutes | warning |
Alert Annotation Standards
Every alert must include the annotations marked as required below; dashboard_url is recommended but optional:
| Annotation | Description | Required |
|---|---|---|
| summary | Brief one-line description | Yes |
| description | Detailed description with metric values | Yes |
| runbook_url | Link to the operational runbook | Yes |
| dashboard_url | Link to the relevant Grafana dashboard | Recommended |
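An annotations block that follows the table above might look like the fragment below. It uses Prometheus template variables (`$labels`, `$value`) to put the metric value into the description. The sketch assumes the alert expression returns a ratio between 0 and 1 and that the series carries a `pod` label. The URLs are placeholders.

```yaml
annotations:
  summary: "High memory usage on {{ $labels.pod }}"
  # humanizePercentage renders a ratio such as 0.87 as a percentage.
  description: "Memory usage is at {{ $value | humanizePercentage }} of the configured limit (threshold: 85%)."
  runbook_url: "https://runbooks.example.com/high-memory-usage"    # placeholder URL
  dashboard_url: "https://grafana.example.com/d/service-overview"  # placeholder URL
```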
Adding New Alerts
- Define the alert in a PrometheusRule CRD or YAML file
- Include all required annotations
- Set appropriate severity and category labels
- Test with `promtool check rules rules-file.yaml`
- Deploy by applying the PrometheusRule to the monitoring namespace
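Put together, the steps above could produce a manifest like the following sketch. It wraps a ServiceHighRestartRate rule in a PrometheusRule resource. The resource name, the `release` label, the category label value, and the URL are hypothetical, and the expression assumes kube-state-metrics is installed and exporting `kube_pod_container_status_restarts_total`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-example-alerts        # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus             # assumption: must match the operator's rule selector
spec:
  groups:
    - name: service-health
      rules:
        - alert: ServiceHighRestartRate
          # More than 5 container restarts within the last hour.
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
            category: service-health  # assumed label value
          annotations:
            summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting frequently"
            description: "{{ $value }} restarts in the last hour in namespace {{ $labels.namespace }}."
            runbook_url: "https://runbooks.example.com/high-restart-rate"  # placeholder URL
```

Note that `promtool check rules` expects a plain rule file with `groups:` at the top level, so when a rule lives in a PrometheusRule manifest, the content under `spec` needs to be validated as a standalone file. Once it passes, the manifest can be applied with `kubectl apply -f` against the monitoring namespace.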