MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Service Restart

This runbook covers safely restarting MATIH services with appropriate pre-checks, restart procedures, and post-restart verification.


Symptoms

  • ServiceDown alert firing for a specific service
  • ServiceHighRestartRate alert indicating instability
  • Service returning 5xx errors consistently
  • CrashLoopBackOff pod status

Impact

Restarting a service causes a brief interruption to requests routed to the restarting pod(s). With multiple replicas, rolling restarts minimize impact.


Prerequisites

  • Access to the MATIH platform scripts
  • Knowledge of which service needs restarting
  • Understanding of current platform status

Steps

1. Assess Current State

./scripts/tools/platform-status.sh

Review the output to understand which services are healthy and which are affected.

2. Check Service Health

./scripts/disaster-recovery/health-check.sh

Identify the specific error or failure pattern.

3. Check Recent Changes

Review recent deployments or configuration changes that may have caused the issue. Check the git log for recent commits to the affected service.
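One way to narrow the git log to the affected service is sketched below. The helper name `recent_changes` and the idea that each service lives under its own directory are assumptions for illustration; adjust the path to match the actual repository layout.

```shell
# Hypothetical helper (not part of the MATIH scripts): list non-merge commits
# from the last N days that touched a given path.
# $1 = path to the service's source directory (assumed layout)
# $2 = look-back window in days (defaults to 3)
recent_changes() {
  local service_path="$1"
  local days="${2:-3}"
  git log --since="${days} days ago" --oneline --no-merges -- "$service_path"
}
```

For example, `recent_changes <path-to-service> 7` lists the last week of commits for that path. A commit that landed shortly before the alert fired is a likely candidate for the cause.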

4. Perform Rolling Restart

Use the service deployment script to redeploy:

./scripts/tools/service-build-deploy.sh <service-name>

This performs a rolling restart, so the service stays available while pods are replaced, provided it runs more than one replica.
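To confirm that the rollout finished, it may help to watch it directly. The sketch below assumes the service runs as a Kubernetes Deployment and that `kubectl` is configured for the cluster; the helper name `wait_for_rollout`, the deployment name, and the namespace are placeholders, not values defined by this runbook.

```shell
# Hypothetical helper: block until a Deployment's rollout completes or times out.
# $1 = deployment name (assumed to match the service name)
# $2 = namespace (assumption; use the namespace the service is deployed to)
# $3 = timeout, e.g. 300s (defaults to 300s)
wait_for_rollout() {
  local deployment="$1"
  local namespace="$2"
  local timeout="${3:-300s}"
  kubectl rollout status "deployment/${deployment}" -n "$namespace" --timeout="$timeout"
}
```

`kubectl rollout status` exits non-zero if the rollout does not complete within the timeout, which is a reasonable signal to move on to the full rebuild or to escalation.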

5. Full Rebuild (if needed)

If a rolling restart does not resolve the issue:

./scripts/tools/full-service-rebuild.sh <service-name>

Verification

After the restart:

  1. Run the health check script to verify the service is healthy
  2. Check the Grafana dashboard for the service to confirm metrics are normal
  3. Verify the alert has resolved in Alertmanager
  4. Monitor for 15 minutes to ensure stability

./scripts/disaster-recovery/health-check.sh
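The 15-minute monitoring window can be automated with a small polling loop. This is a minimal sketch: `monitor_health` is a hypothetical helper, and it assumes the health check script exits with a non-zero status when a check fails, which this runbook does not confirm.

```shell
# Hypothetical helper: run a check command repeatedly for a fixed period.
# $1 = total duration in seconds (900 = 15 minutes)
# $2 = pause between checks in seconds
# remaining arguments = the check command to run
# Returns 1 on the first failed check, 0 if every check passes.
monitor_health() {
  local duration="$1"
  local interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + duration ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if ! "$@"; then
      echo "health check failed at $(date)" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo "no failures for ${duration}s"
}
```

For example, `monitor_health 900 60 ./scripts/disaster-recovery/health-check.sh` runs the check once a minute for 15 minutes and stops early on the first failure.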

Escalation

If the service continues to fail after restart:

  1. Check pod logs for the specific error
  2. Review the deployment YAML for misconfigurations
  3. Check if dependent services (database, Kafka, Redis) are healthy
  4. Escalate to the service owner team
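Before escalating, it can be useful to capture the failing pod's state so the service owner team has something concrete to work from. The sketch below assumes a Kubernetes cluster reachable through `kubectl`; the helper name `collect_pod_diagnostics`, the pod name, and the namespace are placeholders.

```shell
# Hypothetical helper: print a pod's events and the logs of its previous
# (crashed) container instance.
# $1 = pod name
# $2 = namespace (assumption; use the namespace the service is deployed to)
collect_pod_diagnostics() {
  local pod="$1"
  local namespace="$2"
  # Events, restart count, and last termination reason
  kubectl describe pod "$pod" -n "$namespace"
  # Last 200 log lines from the container instance that ran before the restart
  kubectl logs "$pod" -n "$namespace" --previous --tail=200
}
```

The `--previous` flag only returns output when the container has restarted at least once, which is the case for a pod in CrashLoopBackOff. Redirect the output to a file to attach it to the escalation.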