Service Restart
This runbook covers safely restarting MATIH services with appropriate pre-checks, restart procedures, and post-restart verification.
Symptoms
- `ServiceDown` alert firing for a specific service
- `ServiceHighRestartRate` alert indicating instability
- Service returning 5xx errors consistently
- Pods in `CrashLoopBackOff` status
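When triaging, the `CrashLoopBackOff` count can be pulled straight from `kubectl get pods` output (assuming a Kubernetes deployment, which the pod statuses above suggest). The helper below is a sketch that parses the listing on stdin, so it works the same in scripts and offline:

```shell
# Count pods whose STATUS column reads CrashLoopBackOff.
# Reads `kubectl get pods` output on stdin; STATUS is the third column
# in the default listing (NAME READY STATUS RESTARTS AGE).
count_crashloops() {
  awk 'NR > 1 && $3 == "CrashLoopBackOff" { n++ } END { print n + 0 }'
}

# Usage (namespace is a placeholder):
#   kubectl get pods -n <namespace> | count_crashloops
```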
Impact
Restarting a service causes a brief interruption to requests routed to the restarting pod(s). With multiple replicas, rolling restarts minimize impact.
Prerequisites
- Access to the MATIH platform scripts
- Knowledge of which service needs restarting
- Understanding of current platform status
Steps
1. Assess Current State
```shell
./scripts/tools/platform-status.sh
```

Review the output to understand which services are healthy and which are affected.
2. Check Service Health
```shell
./scripts/disaster-recovery/health-check.sh
```

Identify the specific error or failure pattern.
3. Check Recent Changes
Review recent deployments or configuration changes that may have caused the issue. Check the git log for recent commits to the affected service.
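A quick way to scope the git log to one service, assuming each service lives in its own directory (the path layout is an assumption; adjust to the repository structure):

```shell
# List the most recent commits that touched a given path, newest first,
# to correlate the failure with a recent change.
recent_changes() {
  # $1: path to the service's source directory
  git log --oneline -n 10 -- "$1"
}

# Usage (path is a placeholder):
#   recent_changes services/<service-name>
```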
4. Perform Rolling Restart
Use the service deployment script to redeploy:
```shell
./scripts/tools/service-build-deploy.sh <service-name>
```

This performs a rolling restart that maintains availability during the restart.
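If the deployment script is unavailable, a rolling restart can be triggered directly with kubectl. This is a sketch assuming one Kubernetes Deployment per service, not necessarily what the script does internally:

```shell
# Restart a deployment's pods via its rolling-update strategy,
# then block until the rollout completes or times out.
rolling_restart() {
  # $1: deployment name
  kubectl rollout restart "deployment/$1" &&
  kubectl rollout status "deployment/$1" --timeout=5m
}
```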
5. Full Rebuild (if needed)
If a rolling restart does not resolve the issue:
```shell
./scripts/tools/full-service-rebuild.sh <service-name>
```

Verification
After the restart:
- Run the health check script to verify the service is healthy
- Check the Grafana dashboard for the service to confirm metrics are normal
- Verify the alert has resolved in Alertmanager
- Monitor for 15 minutes to ensure stability
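The monitoring step can be scripted as a simple poll loop over the runbook's health check (the interval and count below are illustrative; 90 checks at 10-second intervals is roughly 15 minutes):

```shell
# Poll the health check repeatedly; fail fast on the first unhealthy result.
monitor_service() {
  # $1: seconds between checks, $2: number of checks
  i=0
  while [ "$i" -lt "$2" ]; do
    ./scripts/disaster-recovery/health-check.sh || return 1
    sleep "$1"
    i=$((i + 1))
  done
}

# Usage: ~15 minutes of checks at 10-second intervals
#   monitor_service 10 90
```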
```shell
./scripts/disaster-recovery/health-check.sh
```

Escalation
If the service continues to fail after restart:
- Check pod logs for the specific error
- Review the deployment YAML for misconfigurations
- Check if dependent services (database, Kafka, Redis) are healthy
- Escalate to the service owner team
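For the log check, assuming Kubernetes, the previous container's logs usually hold the crash reason, and the pod description lists restart events. A sketch (pod name is a placeholder):

```shell
# Inspect a failing pod: logs from the crashed (previous) container,
# then events and restart reasons from the pod description.
inspect_pod() {
  # $1: pod name
  kubectl logs "$1" --previous --tail=100
  kubectl describe pod "$1"
}

# Usage:
#   inspect_pod <pod-name>
```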