SLO Monitoring

MATIH tracks Service Level Objectives (SLOs) using Prometheus recording rules and Grafana dashboards. SLOs define the acceptable thresholds for availability, latency, and error rates, with error budget tracking to guide release decisions.

SLO Definitions

SLO	Target	Measurement Window
Availability	99.9%	30-day rolling
Latency (p99)	Under 1 second	5-minute window
Error Rate	Under 1%	5-minute window
Provisioning Success	95%	24-hour rolling

SLI Metrics

Service Level Indicators (SLIs) are the raw metrics used to compute SLO compliance:

SLI	Metric	Good Event	Total Event
Availability	`up`	`up == 1`	All scrape attempts
Request Success	`matih_http_requests_total`	Non-5xx responses	All responses
Latency	`matih_http_request_duration_seconds`	Requests under 1s	All requests
Provisioning	`matih_provisioning_completed_total`	Successful completions	All attempts

Recording Rules

groups:
  - name: slo-recording
    interval: 30s
    rules:
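      # Fraction of scrapes over the last 30 days in which each target was up (one series per target).
      - record: matih:slo:availability:ratio
        expr: avg_over_time(up{job=~"matih.*"}[30d])

      # Share of requests answered with a 5xx status over the last 5 minutes.
      - record: matih:slo:error_rate:5m
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[5m]))
          /
          sum(rate(matih_http_requests_total[5m]))

      # Share of requests completed within 1 second over the last 5 minutes.
      # The le value must match the exposed bucket label exactly; some client libraries emit "1" rather than "1.0".
      - record: matih:slo:latency_under_1s:ratio
        expr: |
          sum(rate(matih_http_request_duration_seconds_bucket{le="1.0"}[5m]))
          /
          sum(rate(matih_http_request_duration_seconds_count[5m]))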
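
The Provisioning Success SLO from the tables above has no recording rule in this group. A minimal sketch of one follows, under stated assumptions: the group name, the rule name, and the `status="success"` label selector are hypothetical, since the labels exposed on `matih_provisioning_completed_total` are not listed here. Adjust the selector to however successful completions are actually marked.

```yaml
groups:
  - name: slo-provisioning-sketch   # hypothetical group name
    rules:
      # Successful provisioning completions divided by all completions, 24-hour window.
      # ASSUMPTION: a `status` label with the value "success" marks the good events.
      - record: matih:slo:provisioning_success:ratio   # hypothetical rule name
        expr: |
          sum(increase(matih_provisioning_completed_total{status="success"}[24h]))
          /
          sum(increase(matih_provisioning_completed_total[24h]))
```

A value below 0.95 would then indicate that the 95% target is being missed for the current 24-hour window.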

Error Budget

The error budget represents the allowed amount of downtime or errors within the SLO window:

error_budget = 1 - SLO_target
budget_consumed = 1 - (good_events / total_events)
budget_remaining = error_budget - budget_consumed
budget_consumed_pct = budget_consumed / error_budget

For a 99.9% availability SLO over 30 days:

Error budget: 0.1% of 43,200 minutes = 43.2 minutes of downtime
If 20 minutes of downtime have occurred: 20 / 43.2 = 46.3% of the budget consumed, 53.7% remaining
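
The budget formulas can also be expressed as a recording rule built on `matih:slo:availability:ratio`. This is a sketch only: the group and rule names are hypothetical, and the 0.999 target is hard-coded from the SLO table rather than read from any configuration.

```yaml
groups:
  - name: slo-error-budget-sketch   # hypothetical group name
    rules:
      # Fraction of the availability error budget still unspent:
      # 1 = untouched, 0 = exhausted, negative = SLO violated.
      # avg_over_time(up[...]) keeps target labels, so this yields one series per target.
      - record: matih:slo:error_budget_remaining:ratio   # hypothetical rule name
        expr: |
          1 - (
            (1 - matih:slo:availability:ratio)
            /
            (1 - 0.999)
          )
```

Dividing by the budget (rather than subtracting from it) gives a value that is directly comparable across SLOs with different targets.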

SLO Alerting Rules

groups:
  - name: slo-alerts
    rules:
      # 14.4 is the burn rate; 0.001 is the error budget of the 99.9% target.
      - alert: SLOErrorBudgetBurnRateHigh
        expr: |
          matih:slo:error_rate:5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is too high"
          description: "At a 14.4x burn rate, 2% of the 30-day error budget is consumed per hour and the budget is exhausted in about 50 hours."
 
      - alert: SLOLatencyBudgetLow
        expr: |
          matih:slo:latency_under_1s:ratio < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO compliance is below target"

Multi-Window Burn Rate

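MATIH implements the multi-window, multi-burn-rate alerting strategy recommended in the Google SRE Workbook. Each row pairs a lookback window with the burn rate (a multiple of the rate that would exactly exhaust the budget over 30 days) at which an alert fires: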

Window	Burn Rate	Alert After	Severity
1 hour	14.4x	2 minutes	Critical (page)
6 hours	6x	5 minutes	Critical (page)
1 day	3x	30 minutes	Warning (ticket)
3 days	1x	6 hours	Warning (ticket)
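
As a sketch of how the two paging rows might be written as alerting rules, each alert below requires both a long window and a short window to exceed the threshold, so the alert clears soon after the errors stop. The alert and group names are hypothetical, and the short windows (5 minutes and 30 minutes) are an assumption taken from the SRE Workbook's 1/12 guideline, not from the table above.

```yaml
groups:
  - name: slo-burn-rate-sketch   # hypothetical group name
    rules:
      # 1-hour window at 14.4x burn; the 5-minute rule confirms the burn is still ongoing.
      - alert: SLOErrorBudgetFastBurn   # hypothetical alert name
        expr: |
          (
            sum(rate(matih_http_requests_total{status_class="5xx"}[1h]))
            /
            sum(rate(matih_http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          matih:slo:error_rate:5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical

      # 6-hour window at 6x burn, confirmed by a 30-minute window (assumed short window).
      - alert: SLOErrorBudgetSlowBurn   # hypothetical alert name
        expr: |
          (
            sum(rate(matih_http_requests_total{status_class="5xx"}[6h]))
            /
            sum(rate(matih_http_requests_total[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(matih_http_requests_total{status_class="5xx"}[30m]))
            /
            sum(rate(matih_http_requests_total[30m]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
```

In PromQL the comparison operators bind more tightly than `and`, so each threshold check is evaluated before the two sides are combined. The ticket-level rows (1 day at 3x, 3 days at 1x) would follow the same pattern with `severity: warning`.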

Grafana SLO Dashboard

The SLO dashboard displays:

Panel	Description
SLO Compliance	Current compliance percentage vs. target
Error Budget Remaining	Percentage of error budget remaining
Burn Rate	Current burn rate vs. threshold
SLO History	30-day trend of SLO compliance
Budget Forecast	Projected budget exhaustion date
