MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Alert Rules Library

MATIH maintains a library of Prometheus alerting rules organized by category. Each alert carries a severity label, annotations that include a runbook link, and, where available, a dashboard URL for quick investigation.


Alert Categories

| Category | Description | Rule File |
|---|---|---|
| Service Health | Service availability and resource usage | matih-alerts.yml |
| HTTP Errors | Request error rates and latency | matih-alerts.yml |
| Provisioning | Tenant provisioning pipeline | provisioning.yaml |
| SLO | Service level objective compliance | slo.yaml |
| Infrastructure | Kubernetes and database health | infrastructure.yaml |

Service Health Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ServiceDown | up == 0 | 2 minutes | critical |
| ServiceHighRestartRate | More than 5 restarts per hour | 5 minutes | warning |
| HighMemoryUsage | Memory above 85% of limit | 5 minutes | warning |
| HighCPUUsage | CPU rate above 90% | 10 minutes | warning |
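
As an illustration, the HighMemoryUsage row could be written as the following rule. This is a sketch, not the contents of matih-alerts.yml: the group name, the `category` label value, the runbook URL, and the choice of cAdvisor and kube-state-metrics metrics are all assumptions.

```yaml
groups:
  - name: service-health            # hypothetical group name
    rules:
      - alert: HighMemoryUsage
        # Working-set memory as a fraction of the container memory limit.
        # Assumes cAdvisor and kube-state-metrics are scraped by Prometheus.
        expr: |
          sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            /
          sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.85
        for: 5m
        labels:
          severity: warning
          category: service-health  # hypothetical label value
        annotations:
          summary: "Memory above 85% of limit on {{ $labels.pod }}"
          description: "Container {{ $labels.container }} is using {{ $value | humanizePercentage }} of its memory limit."
          runbook_url: "https://runbooks.example.com/high-memory-usage"  # placeholder URL
```

Containers without a memory limit have no matching series on the right-hand side of the division, so they never trigger this rule.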

HTTP Error Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| HighErrorRate | 5xx rate above 5% | 5 minutes | critical |
| High4xxRate | 4xx rate above 20% | 5 minutes | warning |
| SlowRequests | p95 latency above 5 seconds | 5 minutes | warning |
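
The ratio and percentile conditions above might be expressed as follows. The metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `status` label, the 5-minute rate windows, and the runbook URLs are assumptions made for this sketch; the actual rules may use different names.

```yaml
groups:
  - name: http-errors               # hypothetical group name
    rules:
      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx rate above 5% for {{ $labels.job }}"
          description: "{{ $value | humanizePercentage }} of requests to {{ $labels.job }} are failing with 5xx."
          runbook_url: "https://runbooks.example.com/high-error-rate"  # placeholder URL
      - alert: SlowRequests
        # p95 latency estimated from a histogram; "le" must be kept in the aggregation.
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 5 seconds for {{ $labels.job }}"
          description: "p95 request latency for {{ $labels.job }} is {{ $value | humanizeDuration }}."
          runbook_url: "https://runbooks.example.com/slow-requests"  # placeholder URL
```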

Provisioning Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| ProvisioningFailureRateHigh | Failure rate above 10% in 15 minutes | 5 minutes | critical |
| ProvisioningStuckInProgress | More than 10 active for 30 minutes | 30 minutes | warning |
| ProvisioningConsecutiveFailures | 5+ failures with 0 successes in 10 minutes | 5 minutes | critical |
| ProvisioningStepSlowdown | p90 step duration above 120 seconds | 10 minutes | warning |
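
The "failures with no successes" condition combines two counters. A minimal sketch, assuming a hypothetical counter `matih_provisioning_total` with a `result` label (the real metric name in provisioning.yaml is not documented here):

```yaml
groups:
  - name: provisioning              # hypothetical group name
    rules:
      - alert: ProvisioningConsecutiveFailures
        # At least 5 failures and no successes in the last 10 minutes.
        # "or vector(0)" keeps the right-hand side defined when no success
        # series exists yet; otherwise the "and" would match nothing.
        expr: |
          sum(increase(matih_provisioning_total{result="failure"}[10m])) >= 5
          and
          (sum(increase(matih_provisioning_total{result="success"}[10m])) or vector(0)) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tenant provisioning is failing repeatedly"
          description: "{{ $value | humanize }} provisioning failures and no successes in the last 10 minutes."
          runbook_url: "https://runbooks.example.com/provisioning-failures"  # placeholder URL
```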

SLO Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| SLOErrorBudgetBurnRateHigh | Burn rate exceeds 14.4x (1-hour window) | 2 minutes | critical |
| SLOLatencyBudgetLow | Latency compliance below 99% | 10 minutes | warning |
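
Burn rate is the observed error ratio divided by the error ratio the SLO allows. The SLO target is not stated here, so the sketch below assumes a 99.9% availability objective over 30 days. Under that assumption the allowed error ratio is 0.001, the 14.4x threshold corresponds to an error ratio of 0.0144, and one hour at that rate consumes 14.4 / 720 = 2% of the 30-day budget. The metric name and runbook URL are also assumptions.

```yaml
groups:
  - name: slo                       # hypothetical group name
    rules:
      - alert: SLOErrorBudgetBurnRateHigh
        # Error ratio over the 1-hour window compared with 14.4x the
        # allowed error ratio (1 - 0.999 = 0.001) of the assumed SLO.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate above 14.4x"
          description: "Error ratio over the last hour is {{ $value | humanizePercentage }}."
          runbook_url: "https://runbooks.example.com/slo-burn-rate"  # placeholder URL
```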

Infrastructure Alerts

| Alert | Condition | Duration | Severity |
|---|---|---|---|
| PostgreSQLDown | Database unreachable | 1 minute | critical |
| KafkaBrokerDown | Kafka broker offline | 2 minutes | critical |
| HighDiskUsage | Disk above 85% | 15 minutes | warning |
| CertificateExpiringSoon | TLS certificate expires in under 14 days | 1 hour | warning |
| PVCNearlyFull | PVC above 90% capacity | 10 minutes | warning |
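
Two of these rules could look like the sketch below. It assumes PVC usage comes from the kubelet volume stats metrics and that certificates are managed by cert-manager; if infrastructure.yaml relies on other exporters, the expressions will differ. The runbook URLs are placeholders.

```yaml
groups:
  - name: infrastructure            # hypothetical group name
    rules:
      - alert: PVCNearlyFull
        # Used bytes as a fraction of capacity, per PersistentVolumeClaim.
        expr: |
          kubelet_volume_stats_used_bytes
            /
          kubelet_volume_stats_capacity_bytes
            > 0.90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} above 90% capacity"
          description: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value | humanizePercentage }} full."
          runbook_url: "https://runbooks.example.com/pvc-nearly-full"  # placeholder URL
      - alert: CertificateExpiringSoon
        # Seconds until expiry, compared with 14 days expressed in seconds.
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires in under 14 days"
          description: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in {{ $value | humanizeDuration }}."
          runbook_url: "https://runbooks.example.com/certificate-expiring"  # placeholder URL
```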

Alert Annotation Standards

Every alert must include the annotations marked as required below:

| Annotation | Description | Required |
|---|---|---|
| summary | Brief one-line description | Yes |
| description | Detailed description with metric values | Yes |
| runbook_url | Link to the operational runbook | Yes |
| dashboard_url | Link to the relevant Grafana dashboard | Recommended |
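
A rule that carries all four annotations might look like this, using ServiceDown as the example. The URLs, the group name, and the `category` label value are placeholders, not the values used in the actual rule files.

```yaml
groups:
  - name: service-health            # hypothetical group name
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          category: service-health  # hypothetical label value
        annotations:
          # One line, readable in a notification title.
          summary: "{{ $labels.job }} is down"
          # Longer text; templates pull in label values from the firing series.
          description: "Target {{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 2 minutes."
          runbook_url: "https://runbooks.example.com/service-down"        # placeholder URL
          dashboard_url: "https://grafana.example.com/d/service-health"   # placeholder URL
```

The `{{ $labels.<name> }}` and `{{ $value }}` templates are expanded by Prometheus when the alert fires, which is how the description can carry the metric values the standard asks for.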

Adding New Alerts

  1. Define the alert in a PrometheusRule CRD or YAML file
  2. Include all required annotations
  3. Set appropriate severity and category labels
  4. Test with promtool check rules rules-file.yaml
  5. Deploy by applying the PrometheusRule to the monitoring namespace
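
The steps above can be sketched as a single PrometheusRule manifest. This assumes the Prometheus Operator is installed; the resource name, the `release` label (which has to match whatever rule selector the Prometheus instance uses), the runbook URL, and the kube-state-metrics restart counter are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: matih-example-alerts        # hypothetical resource name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumption: must match the Prometheus ruleSelector
spec:
  groups:
    - name: service-health          # hypothetical group name
      rules:
        - alert: ServiceHighRestartRate
          # More than 5 container restarts within the last hour.
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
            category: service-health  # hypothetical label value
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
            description: "Container {{ $labels.container }} in {{ $labels.pod }} restarted {{ $value | humanize }} times in the last hour."
            runbook_url: "https://runbooks.example.com/high-restart-rate"  # placeholder URL
```

Note that promtool check rules expects a plain rule file with `groups:` at the top level, so for a PrometheusRule the content under `spec` needs to be extracted into its own file before testing.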