MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Alert Rules Library

MATIH maintains a library of Prometheus alerting rules organized by category. Each alert carries a severity classification, annotations that link to an operational runbook, and, where available, a dashboard URL for quick investigation.

Alert Categories

Category	Description	Rule File
Service Health	Service availability and resource usage	`matih-alerts.yml`
HTTP Errors	Request error rates and latency	`matih-alerts.yml`
Provisioning	Tenant provisioning pipeline	`provisioning.yaml`
SLO	Service level objective compliance	`slo.yaml`
Infrastructure	Kubernetes and database health	`infrastructure.yaml`

Service Health Alerts

Alert	Condition	Duration	Severity
`ServiceDown`	`up == 0`	2 minutes	critical
`ServiceHighRestartRate`	More than 5 restarts per hour	5 minutes	warning
`HighMemoryUsage`	Memory above 85% of limit	5 minutes	warning
`HighCPUUsage`	CPU rate above 90%	10 minutes	warning
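
As an illustration of how two of the rows above could be written as Prometheus rules, here is a minimal sketch. The `up == 0` condition and the durations come from the table. The memory expression assumes cAdvisor and kube-state-metrics metric names, and the group name and `category` label value are hypothetical. The actual contents of `matih-alerts.yml` may differ.

```yaml
groups:
  - name: service-health            # hypothetical group name
    rules:
      # Fires when a scrape target has been unreachable for 2 minutes.
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          category: service-health  # hypothetical label value
        # Annotations omitted for brevity.

      # Working-set memory above 85% of the container memory limit.
      # Assumes cAdvisor (container_memory_working_set_bytes) and
      # kube-state-metrics (kube_pod_container_resource_limits) are scraped.
      - alert: HighMemoryUsage
        expr: |
          sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
            /
          sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
            > 0.85
        for: 5m
        labels:
          severity: warning
          category: service-health
```

The `for` field corresponds to the Duration column: the condition must hold continuously for that long before the alert fires.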

HTTP Error Alerts

Alert	Condition	Duration	Severity
`HighErrorRate`	5xx rate above 5%	5 minutes	critical
`High4xxRate`	4xx rate above 20%	5 minutes	warning
`SlowRequests`	p95 latency above 5 seconds	5 minutes	warning
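
The thresholds above could be expressed roughly as follows. This is a sketch only: the metric names `http_requests_total` and `http_request_duration_seconds_bucket`, the `status` label, and the 5-minute rate window are assumptions, since the source does not say which HTTP metrics the services expose.

```yaml
groups:
  - name: http-errors               # hypothetical group name
    rules:
      # Share of 5xx responses above 5% of all requests, per job.
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical

      # p95 latency above 5 seconds, computed from a histogram.
      - alert: SlowRequests
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
```

The `le` label has to survive the aggregation in `SlowRequests`, because `histogram_quantile` needs the bucket boundaries to estimate the percentile.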

Provisioning Alerts

Alert	Condition	Duration	Severity
`ProvisioningFailureRateHigh`	Failure rate above 10% in 15 minutes	5 minutes	critical
`ProvisioningStuckInProgress`	More than 10 active for 30 minutes	30 minutes	warning
`ProvisioningConsecutiveFailures`	5+ failures with 0 successes in 10 minutes	5 minutes	critical
`ProvisioningStepSlowdown`	p90 step duration above 120 seconds	10 minutes	warning
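
To make the failure-rate and consecutive-failure conditions concrete, the sketch below assumes a single hypothetical counter, `provisioning_attempts_total`, with a `result` label of `success` or `failure`. The real metric names in `provisioning.yaml` are not given in this document, so treat every name here as a placeholder.

```yaml
groups:
  - name: provisioning              # hypothetical group name
    rules:
      # More than 10% of provisioning attempts failed over the last 15 minutes.
      - alert: ProvisioningFailureRateHigh
        expr: |
          sum(increase(provisioning_attempts_total{result="failure"}[15m]))
            /
          sum(increase(provisioning_attempts_total[15m]))
            > 0.10
        for: 5m
        labels:
          severity: critical

      # At least 5 failures and no successes in the last 10 minutes.
      # "or vector(0)" keeps the right-hand side present when no success
      # series exists yet; otherwise "and" would match nothing.
      - alert: ProvisioningConsecutiveFailures
        expr: |
          sum(increase(provisioning_attempts_total{result="failure"}[10m])) >= 5
            and
          (sum(increase(provisioning_attempts_total{result="success"}[10m])) or vector(0)) == 0
        for: 5m
        labels:
          severity: critical
```

In both rules the range selector matches the window named in the Condition column, and `for` matches the Duration column.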

SLO Alerts

Alert	Condition	Duration	Severity
`SLOErrorBudgetBurnRateHigh`	Burn rate exceeds 14.4x (1-hour window)	2 minutes	critical
`SLOLatencyBudgetLow`	Latency compliance below 99%	10 minutes	warning
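
A burn rate is the observed error ratio divided by the error ratio the SLO allows, so a 14.4x threshold translates into a concrete error ratio once a target is fixed. The sketch below assumes a 99.9% availability target and the hypothetical metric `http_requests_total`. Neither is stated in this document.

```yaml
groups:
  - name: slo                       # hypothetical group name
    rules:
      - alert: SLOErrorBudgetBurnRateHigh
        # Assumed SLO target: 99.9%, so the allowed error ratio is 0.001.
        # Burn rate 14.4x => observed error ratio above 14.4 * 0.001 = 0.0144.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
```

At a sustained 14.4x burn rate, one hour consumes 14.4 / 720 = 2% of a 30-day error budget, which is why this threshold is commonly paired with a 1-hour window.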

Infrastructure Alerts

Alert	Condition	Duration	Severity
`PostgreSQLDown`	Database unreachable	1 minute	critical
`KafkaBrokerDown`	Kafka broker offline	2 minutes	critical
`HighDiskUsage`	Disk above 85%	15 minutes	warning
`CertificateExpiringSoon`	TLS certificate expires in under 14 days	1 hour	warning
`PVCNearlyFull`	PVC above 90% capacity	10 minutes	warning
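
Three of these conditions might look like the following, assuming the commonly used exporters: `pg_up` from postgres_exporter, `kubelet_volume_stats_*` from the kubelet, and `certmanager_certificate_expiration_timestamp_seconds` from cert-manager. Whether MATIH uses these exact sources is an assumption.

```yaml
groups:
  - name: infrastructure            # hypothetical group name
    rules:
      # postgres_exporter reports pg_up == 0 when it cannot reach the database.
      - alert: PostgreSQLDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical

      # Used bytes above 90% of the volume capacity.
      - alert: PVCNearlyFull
        expr: |
          kubelet_volume_stats_used_bytes
            /
          kubelet_volume_stats_capacity_bytes
            > 0.90
        for: 10m
        labels:
          severity: warning

      # Less than 14 days (in seconds) until the certificate expires.
      - alert: CertificateExpiringSoon
        expr: |
          certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
```

Note that `pg_up` disappears entirely if the exporter itself is down, so a rule like this is usually complemented by the generic `up == 0` check on the exporter's scrape job.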

Alert Annotation Standards

Every alert must include the annotations marked as required below; `dashboard_url` is recommended but optional:

Annotation	Description	Required
`summary`	Brief one-line description	Yes
`description`	Detailed description with metric values	Yes
`runbook_url`	Link to the operational runbook	Yes
`dashboard_url`	Link to the relevant Grafana dashboard	Recommended
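
A rule that follows these standards could look like the sketch below. The expression and metric name are hypothetical and the two URLs are placeholders on `example.com`. `$labels`, `$value`, and `humanizePercentage` are standard Prometheus template features.

```yaml
groups:
  - name: annotation-example        # hypothetical group name
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          # Required: brief one-line description.
          summary: "High 5xx error rate on {{ $labels.job }}"
          # Required: detail including the current metric value.
          description: "5xx responses are {{ $value | humanizePercentage }} of requests on {{ $labels.job }} (threshold 5%)."
          # Required: placeholder URL, replace with the real runbook location.
          runbook_url: "https://runbooks.example.com/HighErrorRate"
          # Recommended: placeholder Grafana URL.
          dashboard_url: "https://grafana.example.com/d/http-overview?var-job={{ $labels.job }}"
```

Quoting the annotation values keeps YAML from misreading the `{{ ... }}` template braces.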

Adding New Alerts

1. Define the alert in a PrometheusRule CRD or YAML file.
2. Include all required annotations.
3. Set appropriate severity and category labels.
4. Test with `promtool check rules rules-file.yaml`.
5. Deploy by applying the PrometheusRule to the monitoring namespace.
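
The steps above can be sketched as a single PrometheusRule manifest. The alert itself (`ExampleTargetFlapping`), the resource name, the `release` label, the `category` value, and the URL are all hypothetical. The `apiVersion`, `kind`, and `spec.groups` layout are those of the Prometheus Operator CRD.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts              # hypothetical resource name
  namespace: monitoring
  labels:
    # Hypothetical: must match the ruleSelector of your Prometheus resource.
    release: prometheus
spec:
  groups:
    - name: example
      rules:
        # Illustrative alert: a target changed up/down state more than
        # 4 times in the last 15 minutes.
        - alert: ExampleTargetFlapping
          expr: changes(up[15m]) > 4
          for: 5m
          labels:
            severity: warning
            category: service-health
          annotations:
            summary: "{{ $labels.job }} is flapping"
            description: "Target {{ $labels.instance }} changed state {{ $value }} times in 15 minutes."
            runbook_url: "https://runbooks.example.com/ExampleTargetFlapping"
```

The content under `spec` has the same `groups` structure as a plain rule file. As far as I know, `promtool check rules` expects that plain form, so the `spec` body may need to be extracted into its own file before validation.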
