MATIH Platform is in active MVP development. Documentation reflects current implementation status.

SLO Monitoring

MATIH tracks Service Level Objectives (SLOs) using Prometheus recording rules and Grafana dashboards. SLOs define the acceptable thresholds for availability, latency, and error rates, with error budget tracking to guide release decisions.


SLO Definitions

| SLO | Target | Measurement Window |
| --- | --- | --- |
| Availability | 99.9% | 30-day rolling |
| Latency (p99) | Under 1 second | 5-minute window |
| Error Rate | Under 1% | 5-minute window |
| Provisioning Success | 95% | 24-hour rolling |

SLI Metrics

Service Level Indicators (SLIs) are the raw metrics used to compute SLO compliance:

| SLI | Metric | Good Event | Total Event |
| --- | --- | --- | --- |
| Availability | `up` | `up == 1` | All scrape attempts |
| Request Success | `matih_http_requests_total` | Non-5xx responses | All responses |
| Latency | `matih_http_request_duration_seconds` | Requests under 1s | All requests |
| Provisioning | `matih_provisioning_completed_total` | Successful completions | All attempts |

Recording Rules

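groups:
  - name: slo-recording
    interval: 30s
    rules:
      # Per scrape target: fraction of scrapes over the last 30 days where up == 1.
      - record: matih:slo:availability:ratio
        expr: avg_over_time(up{job=~"matih.*"}[30d])

      # Fraction of requests in the last 5 minutes that returned a 5xx status.
      - record: matih:slo:error_rate:5m
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[5m]))
          /
          sum(rate(matih_http_requests_total[5m]))

      # Fraction of requests in the last 5 minutes that completed within 1 second.
      # Requires the histogram to expose a bucket with le="1.0".
      - record: matih:slo:latency_under_1s:ratio
        expr: |
          sum(rate(matih_http_request_duration_seconds_bucket{le="1.0"}[5m]))
          /
          sum(rate(matih_http_request_duration_seconds_count[5m]))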
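
The SLI table lists a provisioning indicator with a 24-hour rolling window, but the rules above do not record it. A minimal sketch of such a rule follows. The group name, the record name, and the `status="success"` label are assumptions made for illustration; the source names only the metric `matih_provisioning_completed_total`, not its labels.

```yaml
groups:
  - name: slo-recording-provisioning   # hypothetical group name
    interval: 30s
    rules:
      # Share of provisioning attempts that succeeded over the last 24 hours.
      # ASSUMPTION: the counter carries a `status` label whose value is
      # "success" for successful completions. Adjust to the real label set.
      - record: matih:slo:provisioning_success:ratio   # hypothetical record name
        expr: |
          sum(increase(matih_provisioning_completed_total{status="success"}[24h]))
          /
          sum(increase(matih_provisioning_completed_total[24h]))
```

Comparing the recorded value against 0.95 would give compliance with the 95% target. If no provisioning attempts occurred in the window, the expression yields no usable value, so a dashboard or alert built on it should treat missing data explicitly.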

Error Budget

The error budget represents the allowed amount of downtime or errors within the SLO window:

error_budget = 1 - SLO_target
budget_consumed = 1 - (good_events / total_events)
budget_remaining = error_budget - budget_consumed

For a 99.9% availability SLO over 30 days:

  • Error budget: 0.1% of 30 days (43,200 minutes) = 43.2 minutes of downtime
  • If 20 minutes of downtime have occurred: 20 / 43.2 = 46.3% of the budget is consumed and 53.7% remains
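
The same calculation can be expressed as a recording rule, so that the remaining budget is available as a time series. This is a sketch under assumptions: the group and record names are hypothetical, and averaging availability across all scrape targets with `avg()` is one possible aggregation, not necessarily the one MATIH uses. The result is the fraction of the budget still unspent, that is, remaining budget divided by total budget, rather than the absolute difference in the formulas above.

```yaml
groups:
  - name: slo-error-budget   # hypothetical group name
    rules:
      # Fraction of the 30-day availability error budget that is still unspent.
      #   consumed fraction  = (1 - availability) / (1 - SLO target)
      #   remaining fraction = 1 - consumed fraction
      # ASSUMPTION: availability is averaged over all matching scrape targets.
      - record: matih:slo:error_budget:remaining_ratio   # hypothetical record name
        expr: |
          1 - (
            (1 - avg(matih:slo:availability:ratio))
            /
            (1 - 0.999)
          )
```

With 20 minutes of downtime in 30 days, availability is about 0.999537, so the rule evaluates to roughly 0.537, matching the 53.7% remaining in the example above. The value goes negative once the budget is overspent.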

SLO Alerting Rules

groups:
  - name: slo-alerts
    rules:
      - alert: SLOErrorBudgetBurnRateHigh
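        # 14.4x burn rate against a 0.1% error budget (99.9% target).
        expr: |
          matih:slo:error_rate:5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is too high"
          description: "Error rate is above 14.4x the sustainable rate. If sustained, this consumes 2% of the 30-day error budget per hour and exhausts it in about 50 hours."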
 
      - alert: SLOLatencyBudgetLow
        expr: |
          matih:slo:latency_under_1s:ratio < 0.99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO compliance is below target"

Multi-Window Burn Rate

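MATIH implements the multi-window, multi-burn-rate alerting strategy described in the Google SRE Workbook. Burn rate is the observed error rate divided by the error rate the SLO allows: at 1x the budget lasts exactly the 30-day window, and at 14.4x it is gone in about 50 hours.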

| Window | Burn Rate | Alert After | Severity |
| --- | --- | --- | --- |
| 1 hour | 14.4x | 2 minutes | Critical (page) |
| 6 hours | 6x | 5 minutes | Critical (page) |
| 1 day | 3x | 30 minutes | Warning (ticket) |
| 3 days | 1x | 6 hours | Warning (ticket) |
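
The table above could be turned into rules along the following lines. This is a sketch, not the rule file MATIH ships: the group names, the `:1h`, `:6h`, `:1d`, and `:3d` record names, and the alert names are hypothetical, and the 0.001 error budget is carried over from the existing `SLOErrorBudgetBurnRateHigh` rule.

```yaml
groups:
  - name: slo-burn-rate-windows   # hypothetical group name
    interval: 30s
    rules:
      # Error ratio over each window listed in the table.
      - record: matih:slo:error_rate:1h
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[1h]))
          /
          sum(rate(matih_http_requests_total[1h]))
      - record: matih:slo:error_rate:6h
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[6h]))
          /
          sum(rate(matih_http_requests_total[6h]))
      - record: matih:slo:error_rate:1d
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[1d]))
          /
          sum(rate(matih_http_requests_total[1d]))
      - record: matih:slo:error_rate:3d
        expr: |
          sum(rate(matih_http_requests_total{status_class="5xx"}[3d]))
          /
          sum(rate(matih_http_requests_total[3d]))

  - name: slo-burn-rate-alerts   # hypothetical group name
    rules:
      # Threshold = burn rate * error budget (0.001 for a 99.9% target).
      - alert: SLOBurnRate1h
        expr: matih:slo:error_rate:1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
      - alert: SLOBurnRate6h
        expr: matih:slo:error_rate:6h > (6 * 0.001)
        for: 5m
        labels:
          severity: critical
      - alert: SLOBurnRate1d
        expr: matih:slo:error_rate:1d > (3 * 0.001)
        for: 30m
        labels:
          severity: warning
      - alert: SLOBurnRate3d
        expr: matih:slo:error_rate:3d > (1 * 0.001)
        for: 6h
        labels:
          severity: warning
```

The sketch follows the table by using a `for` duration per window. The SRE Workbook variant instead pairs each long window with a short one (for example 1 hour with 5 minutes) joined by `and`, which lets an alert clear quickly once the error rate recovers.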

Grafana SLO Dashboard

The SLO dashboard displays:

| Panel | Description |
| --- | --- |
| SLO Compliance | Current compliance percentage vs. target |
| Error Budget Remaining | Percentage of error budget remaining |
| Burn Rate | Current burn rate vs. threshold |
| SLO History | 30-day trend of SLO compliance |
| Budget Forecast | Projected budget exhaustion date |
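
The Burn Rate and Budget Forecast panels need derived series. One way to produce them is sketched below, built on the existing `matih:slo:error_rate:5m` rule. The group and record names are hypothetical, the 0.001 budget mirrors the existing alert threshold, and the forecast assumes the full 30-day budget is still available, which is a simplification.

```yaml
groups:
  - name: slo-dashboard   # hypothetical group name
    rules:
      # Burn rate: observed error ratio divided by the allowed error ratio.
      # A value of 1 spends the budget exactly over the 30-day window.
      - record: matih:slo:burn_rate:5m   # hypothetical record name
        expr: matih:slo:error_rate:5m / 0.001

      # Days until a full 30-day budget would be spent at the current burn rate.
      # ASSUMPTION: ignores budget already consumed; yields +Inf when the
      # error rate is zero.
      - record: matih:slo:error_budget:days_to_exhaustion   # hypothetical record name
        expr: 30 / (matih:slo:error_rate:5m / 0.001)
```

At a burn rate of 14.4 the second rule evaluates to about 2.08 days, or 50 hours. A 5-minute window is noisy for forecasting, so a dashboard would likely use a longer window for this panel.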