Service Restart

This runbook describes how to restart MATIH services safely: the pre-checks to run, the restart procedure, and the post-restart verification.

Symptoms

ServiceDown alert firing for a specific service
ServiceHighRestartRate alert indicating instability
Service returning 5xx errors consistently
CrashLoopBackOff pod status
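
The pod-level symptoms above can be confirmed from the command line. The following is a minimal sketch, assuming the services run on Kubernetes and `kubectl` is configured for the cluster. The helper name `unhealthy_pods` and the namespace `matih` are hypothetical and not taken from the platform scripts.

```shell
# Filter `kubectl get pods --no-headers` output (NAME READY STATUS RESTARTS AGE)
# down to pods that are crash-looping or have restarted more than a threshold.
# Hypothetical helper; not part of the MATIH scripts.
unhealthy_pods() {
  limit="${1:-3}"
  awk -v limit="$limit" \
    '$3 == "CrashLoopBackOff" || ($4 + 0) > limit { print $1, $3, $4 }'
}

# Usage (the namespace is an assumption):
#   kubectl get pods -n matih --no-headers | unhealthy_pods 3
```

Each output line gives the pod name, its status, and its restart count, which helps decide which service to restart.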

Impact

Restarting a service briefly interrupts requests routed to the restarting pod(s). With multiple replicas, a rolling restart minimizes the impact because the remaining replicas keep serving traffic while each pod is replaced.

Prerequisites

Access to the MATIH platform scripts
Knowledge of which service needs restarting
Understanding of current platform status

Steps

1. Assess Current State

./scripts/tools/platform-status.sh

Review the output to understand which services are healthy and which are affected.

2. Check Service Health

./scripts/disaster-recovery/health-check.sh

Identify the specific error or failure pattern.

3. Check Recent Changes

Review recent deployments or configuration changes that may have caused the issue. Check the git log for recent commits to the affected service.
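
As an illustration of the git log check, the sketch below lists recent commits that touched a single service's directory. The helper name, the `services/<service-name>` path layout, and the default time window are assumptions, not details from the MATIH repository.

```shell
# List commits from a recent window that touched one service's directory.
# Hypothetical helper; the path and time window are illustrative.
recent_service_commits() {
  service_path="$1"
  since="${2:-3 days ago}"
  git log --since="$since" --date=short \
    --pretty='format:%h %ad %an %s' -- "$service_path"
}

# Usage, run from the repository root (the path is an assumption):
#   recent_service_commits services/<service-name> "2 days ago"
```

Limiting the log to one path keeps unrelated commits out of the list, so a suspect change is easier to spot.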

4. Perform Rolling Restart

Use the service deployment script to redeploy:

./scripts/tools/service-build-deploy.sh <service-name>

This performs a rolling restart that maintains availability during the restart.
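
To confirm that the rollout actually completed, one option is to wait on the rollout status. This is a sketch, assuming each service is a Kubernetes Deployment named after the service. The function name and the default namespace `matih` are hypothetical, and the deployment script may already perform an equivalent wait.

```shell
# Block until the Deployment's rolling update finishes or the timeout expires.
# Hypothetical helper; the Deployment naming and namespace are assumptions.
wait_for_rollout() {
  service="$1"
  namespace="${2:-matih}"   # assumed namespace
  timeout="${3:-300s}"
  kubectl rollout status "deployment/${service}" -n "$namespace" --timeout="$timeout"
}

# Usage:
#   wait_for_rollout <service-name> matih 300s
```

`kubectl rollout status` exits non-zero if the rollout does not finish within the timeout, so the function can gate the verification steps.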

5. Full Rebuild (if needed)

If a rolling restart does not resolve the issue, run a full rebuild of the service:

./scripts/tools/full-service-rebuild.sh <service-name>

Verification

After the restart:

Run the health check script to verify the service is healthy
Check the Grafana dashboard for the service to confirm metrics are normal
Verify the alert has resolved in Alertmanager
Monitor for 15 minutes to ensure stability

./scripts/disaster-recovery/health-check.sh
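
The 15-minute monitoring step can be partly automated by running the check on an interval. The sketch below assumes the health check script exits with a non-zero status when a service is unhealthy, which this runbook does not confirm. The helper name `monitor_stability` is hypothetical.

```shell
# Run a check command repeatedly for a fixed window; stop at the first failure.
# Hypothetical helper; duration and interval are given in seconds.
monitor_stability() {
  duration="$1"
  interval="$2"
  shift 2
  end=$(( $(date +%s) + duration ))
  while [ "$(date +%s)" -lt "$end" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "check failed at $(date '+%H:%M:%S')" >&2
      return 1
    fi
    sleep "$interval"
  done
  echo "stable for ${duration}s"
}

# Usage: check every 60 seconds for 15 minutes
#   monitor_stability 900 60 ./scripts/disaster-recovery/health-check.sh
```

Stopping at the first failure shows quickly whether the restart held, without waiting out the full window.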

Escalation

If the service continues to fail after restart:

Check pod logs for the specific error
Review the deployment YAML for misconfigurations
Check if dependent services (database, Kafka, Redis) are healthy
Escalate to the service owner team
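
Before escalating, it helps to capture the pod logs and pod events so the service owner has something concrete to look at. The following is a sketch, assuming Kubernetes and `kubectl` access. The helper name and the default namespace `matih` are hypothetical.

```shell
# Gather first-line diagnostics for one pod before escalating.
# Hypothetical helper; the namespace default is an assumption.
collect_pod_diagnostics() {
  pod="$1"
  namespace="${2:-matih}"   # assumed namespace
  # Logs from the previous (crashed) container instance, if there was one.
  kubectl logs "$pod" -n "$namespace" --previous --tail=200 \
    || echo "no previous container logs available" >&2
  # Events, restart counts, probe failures, and the last termination reason.
  kubectl describe pod "$pod" -n "$namespace"
}

# Usage:
#   collect_pod_diagnostics <pod-name> matih > diagnostics.txt
```

The `--previous` flag matters for crash-looping pods, because the current container may not have run long enough to log the error.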
