SLO Monitoring
MATIH tracks Service Level Objectives (SLOs) using Prometheus recording rules and Grafana dashboards. SLOs define the acceptable thresholds for availability, latency, and error rates, with error budget tracking to guide release decisions.
SLO Definitions
| SLO | Target | Measurement Window |
|---|---|---|
| Availability | 99.9% | 30-day rolling |
| Latency (p99) | Under 1 second | 5-minute window |
| Error Rate | Under 1% | 5-minute window |
| Provisioning Success | 95% | 24-hour rolling |
SLI Metrics
Service Level Indicators (SLIs) are the raw metrics used to compute SLO compliance:
| SLI | Metric | Good Event | Total Event |
|---|---|---|---|
| Availability | up | up == 1 | All scrape attempts |
| Request Success | matih_http_requests_total | Non-5xx responses | All responses |
| Latency | matih_http_request_duration_seconds | Requests under 1s | All requests |
| Provisioning | matih_provisioning_completed_total | Successful completions | All attempts |
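Each row in the table reduces to the same calculation: good events divided by total events, compared against the SLO target. The following is a minimal sketch of that calculation, not MATIH code. The `Sli` class, the `meets_target` helper, and the sample counts are hypothetical, and treating a window with no traffic as compliant is an assumption that may not suit every service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    """Good and total event counts for one SLI over a measurement window."""
    good: float
    total: float

    def ratio(self) -> float:
        # Assumption: with no events in the window, nothing violated the SLO.
        return 1.0 if self.total == 0 else self.good / self.total


def meets_target(sli: Sli, target: float) -> bool:
    """Return True when the good/total ratio is at or above the SLO target."""
    return sli.ratio() >= target


# Hypothetical sample: 99,950 non-5xx responses out of 100,000 total.
request_success = Sli(good=99_950, total=100_000)
print(f"{request_success.ratio():.2%}")       # 99.95%
print(meets_target(request_success, 0.999))   # True
```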
Recording Rules
groups:
  - name: slo-recording
    interval: 30s
    rules:
      # Per-target fraction of successful scrapes over the 30-day window
      - record: matih:slo:availability:ratio
        expr: avg_over_time(up{job=~"matih.*"}[30d])
      # Share of responses that were 5xx over the last 5 minutes
      - record: matih:slo:error_rate:5m
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[5m]))
          /
          sum(rate(matih_http_requests_total[5m]))
      # Share of requests that completed in 1 second or less
      - record: matih:slo:latency_under_1s:ratio
        expr: |
          sum(rate(matih_http_request_duration_seconds_bucket{le="1.0"}[5m]))
          /
          sum(rate(matih_http_request_duration_seconds_count[5m]))

Error Budget
The error budget represents the allowed amount of downtime or errors within the SLO window:
error_budget = 1 - SLO_target
budget_consumed = 1 - (good_events / total_events)
budget_remaining = error_budget - budget_consumed
For a 99.9% availability SLO over 30 days:
- Error budget: 0.1% = 43.2 minutes of downtime
- If 20 minutes of downtime have occurred: 46.3% of the budget is consumed (20 / 43.2), leaving 53.7%
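The arithmetic behind the 99.9% example can be checked with a short script. This is an illustrative sketch only; the function names are hypothetical and it assumes downtime is measured in whole-service minutes.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_consumed_fraction(downtime_minutes: float, budget_minutes: float) -> float:
    """Fraction of the error budget used up by the observed downtime."""
    return downtime_minutes / budget_minutes


budget = error_budget_minutes(0.999, 30)
consumed = budget_consumed_fraction(20, budget)

print(f"budget: {budget:.1f} min")                                   # budget: 43.2 min
print(f"consumed: {consumed:.1%}, remaining: {1 - consumed:.1%}")    # consumed: 46.3%, remaining: 53.7%
```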
SLO Alerting Rules
groups:
  - name: slo-alerts
    rules:
      - alert: SLOErrorBudgetBurnRateHigh
        # 0.001 is the error budget of a 99.9% target; 14.4 is the burn-rate multiplier
        expr: |
          matih:slo:error_rate:5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is too high"
          description: "Error rate exceeds 14.4x the budgeted rate. If sustained, this consumes 2% of the 30-day error budget per hour and exhausts it in about 50 hours."
      - alert: SLOLatencyBudgetLow
        expr: |
          matih:slo:latency_under_1s:ratio < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO compliance is below target"

Multi-Window Burn Rate
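MATIH implements the multi-window, multi-burn-rate alerting strategy described in the Google SRE Workbook. A burn rate of 1x consumes the error budget exactly over the 30-day SLO window, so higher burn rates trigger faster and more severe alerts: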
| Window | Burn Rate | Alert After | Severity |
|---|---|---|---|
| 1 hour | 14.4x | 2 minutes | Critical (page) |
| 6 hours | 6x | 5 minutes | Critical (page) |
| 1 day | 3x | 30 minutes | Warning (ticket) |
| 3 days | 1x | 6 hours | Warning (ticket) |
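To make the table concrete, the sketch below derives three numbers for each tier: the error-rate threshold the burn rate implies, how long the 30-day budget would last at that rate, and how much of the budget is spent over one alert window. It assumes a 99.9% target (an error budget of 0.001); the constants and function names are hypothetical.

```python
SLO_TARGET = 0.999          # assumed target, giving an error budget of 0.001
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window

# (alert window in hours, burn rate) pairs taken from the table above
TIERS = [(1, 14.4), (6, 6.0), (24, 3.0), (72, 1.0)]


def error_rate_threshold(burn_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Error rate at which the budget burns `burn_rate` times faster than allowed."""
    return burn_rate * (1.0 - slo_target)


def hours_to_exhaustion(burn_rate: float, window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Hours until a full budget is gone if the burn rate stays constant."""
    return window_hours / burn_rate


def budget_spent_in_window(burn_rate: float, alert_window_hours: float,
                           window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Fraction of the total budget consumed during one alert window."""
    return burn_rate * alert_window_hours / window_hours


for alert_window, burn in TIERS:
    print(
        f"{alert_window:>3}h window, {burn:>4}x: "
        f"threshold={error_rate_threshold(burn):.4f}, "
        f"exhausted in {hours_to_exhaustion(burn):.0f}h, "
        f"spends {budget_spent_in_window(burn, alert_window):.0%} of budget"
    )
```

Under these assumptions, the 14.4x tier corresponds to a 1.44% error rate, about 50 hours to exhaustion, and 2% of the budget spent per hour.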
Grafana SLO Dashboard
The SLO dashboard displays:
| Panel | Description |
|---|---|
| SLO Compliance | Current compliance percentage vs. target |
| Error Budget Remaining | Percentage of error budget remaining |
| Burn Rate | Current burn rate vs. threshold |
| SLO History | 30-day trend of SLO compliance |
| Budget Forecast | Projected budget exhaustion date |
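One way a Budget Forecast panel could be computed is by linear extrapolation from the remaining budget and the current burn rate. The sketch below illustrates that idea under the assumption of a constant burn rate. It is not a description of how the MATIH dashboard derives its forecast, and `forecast_exhaustion` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def forecast_exhaustion(
    now: datetime,
    budget_remaining_fraction: float,
    burn_rate: float,
    slo_window: timedelta = timedelta(days=30),
) -> Optional[datetime]:
    """Project when the error budget reaches zero at a constant burn rate."""
    if burn_rate <= 0:
        return None  # budget is not being consumed, so there is no exhaustion date
    if budget_remaining_fraction <= 0:
        return now   # budget is already exhausted
    # At a burn rate of 1x, a full budget lasts exactly one SLO window.
    return now + slo_window * (budget_remaining_fraction / burn_rate)


# Hypothetical reading: half the budget left, burning at twice the allowed rate.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(forecast_exhaustion(now, 0.5, 2.0))  # 2024-01-08 12:00:00+00:00
```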