Platform Health

Platform health checks validate the overall state of the MATIH platform: its services, infrastructure components, and inter-service connectivity. The primary tool is the platform-status.sh script, which reports on the entire platform from a single command.

Platform Status Script

./scripts/tools/platform-status.sh

This script checks:

Check	Description
Pod status	All pods in MATIH namespaces are Running
Service endpoints	All services have healthy endpoints
Resource usage	CPU and memory within thresholds
PVC status	All persistent volume claims are Bound
Recent restarts	No unexpected pod restarts
Deployment status	All deployments have desired replicas
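
The pod status and recent restart checks can be pictured as a pass over the JSON that `kubectl get pods -o json` prints. The following is a minimal sketch under that assumption, not the contents of platform-status.sh; the function name `summarize_pods` and the restart threshold of 3 are hypothetical.

```python
from typing import Any, Dict, List


def summarize_pods(pod_list: Dict[str, Any], max_restarts: int = 3) -> Dict[str, Any]:
    """Flag pods that are not Running or that restarted too often.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    `max_restarts` is an illustrative threshold, not a platform setting.
    """
    not_running: List[str] = []
    restarting: List[str] = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        # The table above requires every pod to be in the Running phase.
        if status.get("phase") != "Running":
            not_running.append(name)
        # Sum restart counts across all containers in the pod.
        restarts = sum(
            c.get("restartCount", 0) for c in status.get("containerStatuses", [])
        )
        if restarts > max_restarts:
            restarting.append(name)
    return {
        "not_running": not_running,
        "restarting": restarting,
        "healthy": not not_running and not restarting,
    }
```

The function takes already-parsed JSON, so it can be exercised against saved `kubectl` output without access to a cluster.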

Health Check Script

./scripts/disaster-recovery/health-check.sh

The health check script performs deeper validation:

Check	Description
Service health endpoints	HTTP health check for each service
Database connectivity	PostgreSQL connection test
Cache connectivity	Redis ping test
Message queue	Kafka broker availability
DNS resolution	Internal service DNS resolution
Certificate validity	TLS certificate expiry check
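
As an illustration of the certificate validity check, the sketch below computes the days remaining before a certificate expires, using Python's standard `ssl` module. It is not taken from health-check.sh; the helper names and the 30-day warning window are assumptions.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format found in getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def is_expiring_soon(not_after: str, warn_days: float = 30.0,
                     now: Optional[float] = None) -> bool:
    """True if the certificate expires within `warn_days` (assumed window)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the peer certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Separating the network call (`fetch_not_after`) from the date arithmetic keeps the expiry logic checkable without a live endpoint.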

Port Validation

./scripts/tools/validate-ports.sh

Validates that all service ports match the configuration in scripts/config/components.yaml, detecting any port drift between configuration and deployment.
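
Port drift detection amounts to comparing two name-to-port mappings. The sketch below assumes both sides have already been loaded into plain dictionaries; the layout of components.yaml is not shown here, so no parsing is attempted, and the function name and service names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_port_drift(
    configured: Dict[str, int], deployed: Dict[str, int]
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return services whose configured and deployed ports differ.

    Each value is (configured_port, deployed_port); None marks a service
    that is missing on that side.
    """
    drift: Dict[str, Tuple[Optional[int], Optional[int]]] = {}
    for name in sorted(set(configured) | set(deployed)):
        want = configured.get(name)
        have = deployed.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


# Hypothetical service names and ports, for illustration only.
configured = {"service-a": 8080, "service-b": 8081, "service-c": 8082}
deployed = {"service-a": 8080, "service-b": 9090}
print(find_port_drift(configured, deployed))
```

An empty result means configuration and deployment agree; a `None` on either side of a pair points to a service that exists in only one of the two.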

Tenant Ingress Validation

./scripts/tools/validate-tenant-ingress.sh --tenant acme

Validates the tenant-specific ingress configuration:

Check	Description
Ingress resource	Ingress exists for the tenant
TLS certificate	Certificate is valid and not expired
DNS resolution	Tenant domain resolves correctly
Backend service	Backend service is reachable
HTTP response	Returns 200 for health endpoint
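
The final check in the table, an HTTP 200 from the health endpoint, might look like the sketch below. The `/health` path, the function names, and the example domain are assumptions for illustration; validate-tenant-ingress.sh may probe a different path.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a 4xx or 5xx status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, or TLS error


def tenant_health_ok(
    base_url: str,
    path: str = "/health",  # assumed health path
    get_status: Callable[[str], int] = http_status,
) -> bool:
    """True if the tenant's health endpoint returns HTTP 200."""
    return get_status(base_url.rstrip("/") + path) == 200
```

A call such as `tenant_health_ok("https://acme.example.com")` would issue a real request; passing a stub as `get_status` lets the logic be checked without network access.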

Automated Health Checks

Platform health checks also run automatically on the following schedule:

Schedule	Check	Alert on Failure
Every 15s	Kubernetes liveness/readiness probes	Container restart (liveness); removal from Service endpoints (readiness)
Every 30s	Prometheus scrape	`ServiceDown` alert
Every 5m	Platform status (synthetic)	Slack notification
Daily	Full health check	Email report

Health Dashboard

The Grafana Platform Overview dashboard provides a visual health summary:

Panel	Description
Service Status Matrix	Green/red grid of all services
Pod Readiness	Percentage of pods in Ready state
Recent Alerts	Currently firing alerts
Resource Utilization	CPU and memory heatmap

Service Health Dependency Checks