# Chaos Engineering
Chaos engineering validates MATIH's resilience by deliberately injecting failures into the production-like environment. The platform uses controlled experiments to verify that services degrade gracefully, alerts fire correctly, and recovery procedures work as expected.
## Principles
- Start small -- Begin with minor disruptions and increase severity
- Define steady state -- Know what "normal" looks like before testing
- Hypothesize -- State what you expect to happen before each experiment
- Minimize blast radius -- Run experiments in non-production first
- Automate -- Use tools to ensure reproducible experiments
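The "define steady state" principle can be expressed as an automated check that runs before and after every experiment. A minimal sketch, assuming metrics have already been collected (the metric names and thresholds below are illustrative, not actual MATIH values):

```python
# Sketch of an automated steady-state check. The metric names and
# thresholds are hypothetical examples, not actual MATIH values.

def is_steady_state(metrics: dict) -> bool:
    """Return True if the system looks 'normal' before an experiment."""
    checks = [
        # All desired replicas are healthy
        metrics.get("healthy_replicas", 0) >= metrics.get("desired_replicas", 1),
        # No 5xx errors observed
        metrics.get("error_rate", 1.0) == 0.0,
        # Latency within an illustrative SLO
        metrics.get("p99_latency_ms", float("inf")) < 500,
    ]
    return all(checks)
```

An experiment runner would abort immediately if this returns `False` before injection starts, and use it again afterwards to measure recovery time.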
## Experiment Categories
| Category | Description | Example |
|---|---|---|
| Pod failure | Kill or crash individual pods | Terminate an AI Service pod |
| Network partition | Block traffic between services | Isolate the database from services |
| Resource exhaustion | Consume CPU, memory, or disk | Fill a PVC to 95% |
| Dependency failure | Make an external dependency unavailable | Block Pinecone API access |
| Latency injection | Add artificial latency | Add 5s delay to database queries |
| DNS failure | Corrupt DNS resolution | Block DNS for a specific service |
## Tools
| Tool | Description | Use Case |
|---|---|---|
| Chaos Mesh | Kubernetes-native chaos engineering | Pod, network, and I/O failures |
| Litmus | Kubernetes chaos experiments | Predefined experiment library |
| Gremlin | Enterprise chaos platform | Advanced failure scenarios |
| Custom scripts | Platform-specific tests | MATIH-specific scenarios |
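As an example of how Chaos Mesh expresses a pod-failure experiment, a manifest along these lines would terminate one randomly selected pod matching a label selector (the namespace and label values here are assumptions, not actual MATIH configuration):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: ai-service-pod-kill
  namespace: matih-staging   # assumed namespace
spec:
  action: pod-kill
  mode: one                  # kill exactly one matching pod
  selector:
    labelSelectors:
      app: ai-service        # assumed label
```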
## Experiment: Pod Failure
### Hypothesis
When a single AI Service pod is terminated, the service continues to serve requests through remaining replicas with no user-visible impact.
### Procedure
- Verify steady state (all health checks pass, error rate is zero)
- Terminate one AI Service pod
- Observe Kubernetes recreating the pod
- Verify requests are served by remaining replicas
- Verify the `ServiceHighRestartRate` alert does not fire (single restart)
- Verify the new pod becomes healthy
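The "verify requests are served by remaining replicas" step can be automated by probing the service throughout the experiment and summarizing the collected status codes. A sketch of the summary step (the probing loop itself, e.g. with an HTTP client, is omitted):

```python
from collections import Counter

def summarize_probes(status_codes: list[int]) -> dict:
    """Summarize HTTP status codes collected while the pod was down."""
    counts = Counter(status_codes)
    server_errors = sum(n for code, n in counts.items() if code >= 500)
    return {
        "total": len(status_codes),
        "5xx": server_errors,
        # Hypothesis: no user-visible impact means zero 5xx responses
        "hypothesis_holds": server_errors == 0,
    }
```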
### Expected Result
- No 5xx errors observed
- Request latency may briefly increase
- New pod is healthy within 60 seconds
## Experiment: Database Latency
### Hypothesis
When PostgreSQL response time increases to 5 seconds, services degrade gracefully with timeout errors rather than cascading failures.
### Procedure
- Verify steady state
- Inject 5-second latency on database connections
- Observe service behavior and error rates
- Verify circuit breakers activate
- Remove latency injection
- Verify services recover
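The latency-injection step can be driven by Chaos Mesh's network delay action; a manifest roughly like the following would add 5 seconds of latency to traffic reaching pods matching a label (namespace, label, and duration are assumptions for illustration):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: postgres-latency
  namespace: matih-staging   # assumed namespace
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: postgresql        # assumed label
  delay:
    latency: "5s"
  duration: "10m"            # experiment auto-expires as a safety net
```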
### Expected Result
- Services return timeout errors rather than crashing
- Circuit breakers prevent connection pool exhaustion
- Recovery is automatic when latency returns to normal
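The circuit-breaker behavior this experiment verifies can be sketched as a minimal state machine: after enough consecutive failures the breaker opens and fails fast, which is what keeps slow database calls from exhausting the connection pool. The threshold and reset time below are illustrative, not MATIH's actual configuration:

```python
import time

class CircuitBreaker:
    """Minimal sketch: opens after `max_failures` consecutive failures,
    then allows a trial call after `reset_after` seconds (half-open)."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast instead of tying up a connection
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

During the experiment, the expectation is that timeout errors accumulate until the breaker opens, and that the breaker closes again once latency returns to normal.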
## Experiment Schedule
| Frequency | Experiment | Environment |
|---|---|---|
| Weekly | Pod failure (random service) | Staging |
| Monthly | Network partition (database) | Staging |
| Quarterly | Cross-region failover | Staging (with production-like data) |
| Annually | Full disaster recovery drill | Staging |
## Reporting
Each experiment generates a report:
| Section | Content |
|---|---|
| Hypothesis | What was expected |
| Procedure | What was done |
| Observations | What actually happened |
| Alerts fired | Which alerts triggered (or should have) |
| Recovery time | Time to return to steady state |
| Action items | Improvements identified |
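Reports with this fixed structure lend themselves to mechanical generation. A sketch that renders the sections above as markdown (the dataclass and field names mirror the table but are otherwise an arbitrary choice):

```python
from dataclasses import dataclass, fields

@dataclass
class ExperimentReport:
    """One report per experiment, mirroring the section table above."""
    hypothesis: str
    procedure: str
    observations: str
    alerts_fired: str
    recovery_time: str
    action_items: str

    def to_markdown(self) -> str:
        lines = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            lines.append(f"## {title}\n{getattr(self, f.name)}")
        return "\n\n".join(lines)
```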
## Safety Guidelines
- Never run chaos experiments in production without explicit approval
- Always have a kill switch to immediately stop the experiment
- Run experiments during business hours with the team available
- Start with staging and only move to production after staging validation
- Document every experiment and its results