MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations > Disaster Recovery

Chaos Engineering

Chaos engineering validates MATIH's resilience by deliberately injecting failures into a production-like environment. The platform runs controlled experiments to verify that services degrade gracefully, alerts fire correctly, and recovery procedures work as expected.


Principles

  1. Start small -- Begin with minor disruptions and increase severity
  2. Define steady state -- Know what "normal" looks like before testing
  3. Hypothesize -- State what you expect to happen before each experiment
  4. Minimize blast radius -- Run experiments in non-production first
  5. Automate -- Use tools to ensure reproducible experiments

Experiment Categories

| Category | Description | Example |
| --- | --- | --- |
| Pod failure | Kill or crash individual pods | Terminate an AI Service pod |
| Network partition | Block traffic between services | Isolate the database from services |
| Resource exhaustion | Consume CPU, memory, or disk | Fill a PVC to 95% |
| Dependency failure | Make an external dependency unavailable | Block Pinecone API access |
| Latency injection | Add artificial latency | Add 5s delay to database queries |
| DNS failure | Corrupt DNS resolution | Block DNS for a specific service |

Tools

| Tool | Description | Use Case |
| --- | --- | --- |
| Chaos Mesh | Kubernetes-native chaos engineering | Pod, network, and I/O failures |
| Litmus | Kubernetes chaos experiments | Predefined experiment library |
| Gremlin | Enterprise chaos platform | Advanced failure scenarios |
| Custom scripts | Platform-specific tests | MATIH-specific scenarios |

Experiment: Pod Failure

Hypothesis

When a single AI Service pod is terminated, the service continues to serve requests through remaining replicas with no user-visible impact.

Procedure

  1. Verify steady state (all health checks pass, error rate is zero)
  2. Terminate one AI Service pod
  3. Observe Kubernetes recreating the pod
  4. Verify requests are served by remaining replicas
  5. Verify the ServiceHighRestartRate alert does not fire (single restart)
  6. Verify the new pod becomes healthy
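Step 2 of this procedure can be expressed as a Chaos Mesh experiment (one of the tools listed above). A minimal sketch, assuming the AI Service pods carry an `app: ai-service` label and run in a `staging` namespace (both names are illustrative, not confirmed platform values):

```yaml
# Chaos Mesh PodChaos experiment: kill one randomly selected AI Service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: ai-service-pod-kill
  namespace: staging
spec:
  action: pod-kill   # terminate the pod; Kubernetes recreates it
  mode: one          # affect a single randomly chosen matching pod
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: ai-service   # assumed label for AI Service pods
```

Apply it with `kubectl apply -f`, then follow steps 3 through 6 to verify recovery; deleting the resource ends the experiment.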

Expected Result

  • No 5xx errors observed
  • Request latency may briefly increase
  • New pod is healthy within 60 seconds

Experiment: Database Latency

Hypothesis

When PostgreSQL response time increases to 5 seconds, services degrade gracefully with timeout errors rather than cascading failures.

Procedure

  1. Verify steady state
  2. Inject 5-second latency on database connections
  3. Observe service behavior and error rates
  4. Verify circuit breakers activate
  5. Remove latency injection
  6. Verify services recover
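Step 2 can be implemented with a Chaos Mesh NetworkChaos experiment that delays traffic reaching the database pods. A sketch, assuming PostgreSQL runs in-cluster with an `app: postgresql` label in a `staging` namespace (illustrative names):

```yaml
# Chaos Mesh NetworkChaos experiment: add 5s latency to all traffic
# reaching the PostgreSQL pods; automatically removed after 10 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: postgres-latency
  namespace: staging
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: postgresql   # assumed label for the database pods
  delay:
    latency: "5s"
  duration: "10m"   # bounded blast radius; deleting the resource is the kill switch
```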

Expected Result

  • Services return timeout errors instead of crashing
  • Circuit breakers prevent connection pool exhaustion
  • Recovery is automatic when latency returns to normal
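The circuit-breaker behavior this experiment validates can be sketched in a few lines. This is a hypothetical illustration, not the platform's actual implementation: after a run of consecutive failures the breaker opens and fails fast (sparing the connection pool), then permits a trial call once a cool-down elapses.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    fails fast while open, and retries after a cool-down period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up a connection-pool slot.
                raise RuntimeError("circuit open")
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

In the latency experiment above, this is the mechanism expected to turn a 5-second database stall into fast timeout errors rather than a cascading failure.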

Experiment Schedule

| Frequency | Experiment | Environment |
| --- | --- | --- |
| Weekly | Pod failure (random service) | Staging |
| Monthly | Network partition (database) | Staging |
| Quarterly | Cross-region failover | Staging (with production-like data) |
| Annually | Full disaster recovery drill | Staging |
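The weekly pod-failure experiment can be automated with a Chaos Mesh Schedule resource, which runs an experiment on a cron-style recurrence. A sketch; the cron slot and namespace are illustrative choices, not confirmed platform settings:

```yaml
# Run a random pod-kill in staging every Monday at 10:00 (cluster time).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
  namespace: staging
spec:
  schedule: "0 10 * * 1"
  type: PodChaos
  historyLimit: 5             # keep records of recent runs for reporting
  concurrencyPolicy: Forbid   # never overlap experiments
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
```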

Reporting

Each experiment generates a report:

| Section | Content |
| --- | --- |
| Hypothesis | What was expected |
| Procedure | What was done |
| Observations | What actually happened |
| Alerts fired | Which alerts triggered (or should have) |
| Recovery time | Time to return to steady state |
| Action items | Improvements identified |

Safety Guidelines

  • Never run chaos experiments in production without explicit approval
  • Always have a kill switch to immediately stop the experiment
  • Run experiments during business hours with the team available
  • Start with staging and only move to production after staging validation
  • Document every experiment and its results