MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Platform Health

Platform health checks validate the overall state of the MATIH platform by checking all services, infrastructure components, and inter-service connectivity. The primary tool is the platform-status.sh script, which provides a single-command overview of the entire platform.


Platform Status Script

./scripts/tools/platform-status.sh

This script checks:

| Check | Description |
|---|---|
| Pod status | All pods in MATIH namespaces are Running |
| Service endpoints | All services have healthy endpoints |
| Resource usage | CPU and memory within thresholds |
| PVC status | All persistent volume claims are Bound |
| Recent restarts | No unexpected pod restarts |
| Deployment status | All deployments have desired replicas |
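
As an illustration of the pod status and restart checks, here is a minimal sketch that filters the output of `kubectl get pods --no-headers`. It is not the actual contents of platform-status.sh. The function names, the restart threshold, and the namespace in the usage comment are hypothetical, and treating `Completed` pods as healthy is an assumption made here so that finished jobs are not flagged.

```shell
#!/usr/bin/env bash
# Sketch only: both functions read `kubectl get pods --no-headers` output on
# stdin, where the columns are NAME READY STATUS RESTARTS AGE.

# Print the names of pods whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Print the names of pods whose RESTARTS count exceeds the given threshold.
restarted_pods() {
  awk -v max="$1" '$4 + 0 > max { print $1 }'
}

# Hypothetical usage ("matih-data-plane" is an assumed namespace name):
#   kubectl get pods -n matih-data-plane --no-headers | unhealthy_pods
#   kubectl get pods -n matih-data-plane --no-headers | restarted_pods 5
```

Empty output from either function means the corresponding check passed.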

Health Check Script

./scripts/disaster-recovery/health-check.sh

The health check script performs deeper validation:

| Check | Description |
|---|---|
| Service health endpoints | HTTP health check for each service |
| Database connectivity | PostgreSQL connection test |
| Cache connectivity | Redis ping test |
| Message queue | Kafka broker availability |
| DNS resolution | Internal service DNS resolution |
| Certificate validity | TLS certificate expiry check |
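
Checks of this kind can share a small runner that executes a command, reports PASS or FAIL, and counts the failures. The sketch below shows one possible shape and is not taken from health-check.sh. The `run_check` helper, the `failures` counter, and every host name and port in the usage comments are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: a tiny check runner. Call run_check directly (not inside a
# command substitution) so that the failure counter is updated in this shell.
failures=0

# Run a command quietly and report PASS or FAIL under the given label.
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS ${label}"
  else
    echo "FAIL ${label}"
    failures=$((failures + 1))
  fi
}

# Hypothetical wiring (all host names and ports are assumed):
#   run_check "postgres" pg_isready -h postgres.matih.svc -p 5432
#   run_check "redis"    redis-cli -h redis.matih.svc ping
#   run_check "dns"      getent hosts api.matih.svc.cluster.local
#   run_check "api"      curl -fsS --max-time 5 http://api.matih.svc:8080/health
#   [ "$failures" -eq 0 ] || exit 1
```

Each check is an ordinary command whose exit status decides the outcome, so a new check is one more `run_check` line.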

Port Validation

./scripts/tools/validate-ports.sh

This script validates that every service port matches the configuration in scripts/config/components.yaml and reports any port drift between configuration and deployment.


Tenant Ingress Validation

./scripts/tools/validate-tenant-ingress.sh --tenant acme

This script validates the ingress configuration for the given tenant (acme in the example above):

| Check | Description |
|---|---|
| Ingress resource | Ingress exists for the tenant |
| TLS certificate | Certificate is valid and not expired |
| DNS resolution | Tenant domain resolves correctly |
| Backend service | Backend service is reachable |
| HTTP response | Returns 200 for health endpoint |
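
The HTTP response check can be sketched with curl's `-w '%{http_code}'` write-out option, which prints only the status code. The URL scheme `https://<tenant>.<base-domain>/health`, the base domain argument, and all function names below are assumptions for illustration. The real script may build the tenant URL differently.

```shell
#!/usr/bin/env bash
# Sketch only: verify that a tenant health endpoint answers with HTTP 200.

# Print only the HTTP status code for a URL (curl prints 000 if the
# connection fails).
http_status() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# Succeed only when the status code is exactly 200.
is_healthy_status() {
  [ "$1" = "200" ]
}

# $1: tenant name, $2: base domain (hypothetical URL layout).
check_tenant_health() {
  local tenant="$1"
  local base_domain="$2"
  local url="https://${tenant}.${base_domain}/health"
  local code
  code=$(http_status "$url")
  if is_healthy_status "$code"; then
    echo "PASS ${url} (${code})"
  else
    echo "FAIL ${url} (${code})"
    return 1
  fi
}

# Hypothetical usage:
#   check_tenant_health acme example.com
```

Keeping the HTTP call in its own function makes the pass/fail logic easy to exercise without network access, by substituting a stub for `http_status`.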

Automated Health Checks

Platform health checks run automatically:

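| Schedule | Check | Alert on Failure |
|---|---|---|
| Every 15s | Kubernetes liveness/readiness probes | Pod restart (liveness); removal from service endpoints (readiness) |
| Every 30s | Prometheus scrape | ServiceDown alert |
| Every 5m | Platform status (synthetic) | Slack notification |
| Daily | Full health check | Email report |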

Health Dashboard

The Grafana Platform Overview dashboard provides a visual health summary:

| Panel | Description |
|---|---|
| Service Status Matrix | Green/red grid of all services |
| Pod Readiness | Percentage of pods in Ready state |
| Recent Alerts | Currently firing alerts |
| Resource Utilization | CPU and memory heatmap |