Alertmanager
Alertmanager handles alert deduplication, grouping, routing, and notification delivery for all Prometheus-generated alerts.
Alert Routing

    route:
      receiver: "default"
      group_by: ["alertname", "namespace", "service"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
          receiver: "pagerduty-critical"
          repeat_interval: 1h
        - match:
            severity: warning
          receiver: "slack-warnings"
          repeat_interval: 4h

Receivers
| Receiver | Channel | Severity |
|---|---|---|
| pagerduty-critical | PagerDuty | Critical |
| slack-warnings | Slack | Warning |
| email-daily | Email | Info (daily digest) |
| webhook-custom | HTTP POST | Custom integrations |
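
The receivers in the table could be declared roughly as follows. This is a minimal sketch rather than the deployed configuration: the integration key, Slack webhook URL, channel name, SMTP settings, email address, and webhook URL are all placeholders, and the empty `default` receiver is an assumption, since its behavior is not described here.

```yaml
# Sketch only: every credential, URL, and address below is a placeholder.
global:
  smtp_smarthost: "smtp.example.com:587"   # assumed SMTP relay
  smtp_from: "alertmanager@example.com"    # assumed sender address

receivers:
  - name: "default"                        # assumed: no integrations attached
  - name: "pagerduty-critical"
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"
        channel: "#alerts-warnings"        # hypothetical channel name
        send_resolved: true
  - name: "email-daily"
    email_configs:
      - to: "oncall@example.com"           # hypothetical recipient
  - name: "webhook-custom"
    webhook_configs:
      - url: "http://example.internal/alert-hook"  # hypothetical endpoint
```

Each `name` must match a `receiver` value referenced in the routing tree, otherwise Alertmanager rejects the configuration at load time.
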
Key Alert Rules
| Alert | Condition | Severity |
|---|---|---|
| ServiceDown | Instance unreachable for 5m | Critical |
| HighErrorRate | Error rate > 5% for 10m | Warning |
| HighLatency | P95 latency > 5s for 10m | Warning |
| KafkaConsumerLag | Lag > 10000 for 15m | Warning |
| PostgresReplicationLag | Lag > 30s | Critical |
| DiskSpaceLow | Disk usage > 85% | Warning |
| MemoryPressure | Memory usage > 90% | Critical |
| PodCrashLooping | > 3 restarts in 10m | Critical |
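
A few of these rules might look like the following in a Prometheus rule file. This is a hedged sketch: the group name is made up, and the expressions assume the standard `up` metric, node_exporter filesystem metrics, and the kube-state-metrics restart counter are being scraped. The actual expressions used in this deployment may differ.

```yaml
groups:
  - name: key-alerts-sketch                # hypothetical group name
    rules:
      - alert: ServiceDown
        expr: up == 0                      # scrape target unreachable
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} of job {{ $labels.job }} is unreachable"

      - alert: DiskSpaceLow
        # Assumes node_exporter; the table gives no duration, so no `for` is set.
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85
        labels:
          severity: warning

      - alert: PodCrashLooping
        # Assumes kube-state-metrics is installed.
        expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
        labels:
          severity: critical
```

The lowercase `severity` label values matter: they are what the `match` clauses in the routing tree compare against when selecting a receiver.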