Platform Health
Platform health checks validate the overall state of the MATIH platform by checking all services, infrastructure components, and inter-service connectivity. The primary tool is the platform-status.sh script, which provides a single-command overview of the entire platform.
Platform Status Script

    ./scripts/tools/platform-status.sh

This script checks:
| Check | Description |
|---|---|
| Pod status | All pods in MATIH namespaces are Running |
| Service endpoints | All services have healthy endpoints |
| Resource usage | CPU and memory within thresholds |
| PVC status | All persistent volume claims are Bound |
| Recent restarts | No unexpected pod restarts |
| Deployment status | All deployments have desired replicas |
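
To illustrate the pod status check, the sketch below filters the output of `kubectl get pods --no-headers` for pods that are not healthy. It is not the actual content of platform-status.sh. The function name `unhealthy_pods`, the column layout (NAME READY STATUS RESTARTS AGE, as printed for a single namespace), and the choice to treat Completed pods as healthy are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: unhealthy_pods is a hypothetical helper, not part of platform-status.sh.
# Input: lines from `kubectl get pods --no-headers` for one namespace,
# assumed columns: NAME READY STATUS RESTARTS AGE.

# Reads pod lines on stdin and prints "name status" for every pod whose
# STATUS is neither Running nor Completed. Returns 1 if any were found.
unhealthy_pods() {
  awk '
    $3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

With a placeholder variable `NAMESPACE` set to one of the MATIH namespaces, it could be called as `kubectl get pods -n "$NAMESPACE" --no-headers | unhealthy_pods`. Reading from stdin keeps the filter separate from the cluster call, so it can be exercised with canned input.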
Health Check Script

    ./scripts/disaster-recovery/health-check.sh

The health check script performs deeper validation:
| Check | Description |
|---|---|
| Service health endpoints | HTTP health check for each service |
| Database connectivity | PostgreSQL connection test |
| Cache connectivity | Redis ping test |
| Message queue | Kafka broker availability |
| DNS resolution | Internal service DNS resolution |
| Certificate validity | TLS certificate expiry check |
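
A script that performs several independent checks like these usually needs a way to run each one, report the result, and count failures. The sketch below shows one possible shape for that. It is not taken from health-check.sh; the helper `run_check` and the global counter `failures` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: a minimal check runner, not the contents of health-check.sh.

failures=0

# run_check NAME COMMAND [ARGS...]
# Runs the command with its output suppressed, prints PASS or FAIL next to
# the check name, and increments the global failure counter on a non-zero exit.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$name"
  else
    printf 'FAIL  %s\n' "$name"
    failures=$((failures + 1))
  fi
}
```

For example, a database connectivity check could be written as `run_check "PostgreSQL" pg_isready -h "$DB_HOST" -p 5432`, and a service health endpoint as `run_check "API health" curl -fsS "$HEALTH_URL"`, where `DB_HOST` and `HEALTH_URL` are placeholders. The script can then exit non-zero when `failures` is greater than 0. Because the counter is a shell variable, `run_check` has to be called in the current shell rather than inside a command substitution or pipeline.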
Port Validation

    ./scripts/tools/validate-ports.sh

Validates that all service ports match the configuration in scripts/config/components.yaml, detecting any port drift between configuration and deployment.
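
Port drift detection comes down to comparing two mappings from service name to port. The sketch below shows that comparison under simplifying assumptions: both sides have already been flattened into text files with one `service port` pair per line. How validate-ports.sh actually reads components.yaml or queries the cluster is not shown here, and the function name `port_drift` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: port_drift is a hypothetical helper, not part of validate-ports.sh.
# Both input files are assumed to contain "service port" pairs, one per line,
# and the first file is assumed to be non-empty.

# port_drift EXPECTED_FILE ACTUAL_FILE
# Prints one line per mismatched or missing service.
# Returns 1 if any drift was found, 0 otherwise.
port_drift() {
  awk '
    NR == FNR { expected[$1] = $2; next }   # first file: configured ports
    { seen[$1] = 1 }                        # second file: deployed ports
    ($1 in expected) && expected[$1] != $2 {
      printf "DRIFT %s: configured %s, deployed %s\n", $1, expected[$1], $2
      drift = 1
    }
    END {
      for (svc in expected)
        if (!(svc in seen)) {
          printf "MISSING %s: configured %s\n", svc, expected[svc]
          drift = 1
        }
      exit (drift ? 1 : 0)
    }
  ' "$1" "$2"
}
```

Services that are deployed but absent from the configuration are ignored in this sketch; whether those should count as drift is a policy decision the source does not specify.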
Tenant Ingress Validation

    ./scripts/tools/validate-tenant-ingress.sh --tenant acme

Validates the tenant-specific ingress configuration:
| Check | Description |
|---|---|
| Ingress resource | Ingress exists for the tenant |
| TLS certificate | Certificate is valid and not expired |
| DNS resolution | Tenant domain resolves correctly |
| Backend service | Backend service is reachable |
| HTTP response | Returns 200 for health endpoint |
Automated Health Checks
Platform health checks run automatically:
| Schedule | Check | Alert on Failure |
|---|---|---|
| Every 15s | Kubernetes liveness/readiness probes | Container restart (liveness); pod removed from Service endpoints (readiness) |
| Every 30s | Prometheus scrape | ServiceDown alert |
| Every 5m | Platform status (synthetic) | Slack notification |
| Daily | Full health check | Email report |
Health Dashboard
The Grafana Platform Overview dashboard provides a visual health summary:
| Panel | Description |
|---|---|
| Service Status Matrix | Green/red grid of all services |
| Pod Readiness | Percentage of pods in Ready state |
| Recent Alerts | Currently firing alerts |
| Resource Utilization | CPU and memory heatmap |