MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations > Disaster Recovery

Chaos Engineering

Chaos engineering validates MATIH's resilience by deliberately injecting failures into a production-like environment. The platform runs controlled experiments to verify that services degrade gracefully, alerts fire correctly, and recovery procedures work as expected.


Principles

  1. Start small -- Begin with minor disruptions and increase severity
  2. Define steady state -- Know what "normal" looks like before testing
  3. Hypothesize -- State what you expect to happen before each experiment
  4. Minimize blast radius -- Run experiments in non-production first
  5. Automate -- Use tools to ensure reproducible experiments

Experiment Categories

| Category | Description | Example |
| --- | --- | --- |
| Pod failure | Kill or crash individual pods | Terminate an AI Service pod |
| Network partition | Block traffic between services | Isolate the database from services |
| Resource exhaustion | Consume CPU, memory, or disk | Fill a PVC to 95% |
| Dependency failure | Make an external dependency unavailable | Block Pinecone API access |
| Latency injection | Add artificial latency | Add 5s delay to database queries |
| DNS failure | Corrupt DNS resolution | Block DNS for a specific service |

Tools

| Tool | Description | Use Case |
| --- | --- | --- |
| Chaos Mesh | Kubernetes-native chaos engineering | Pod, network, and I/O failures |
| Litmus | Kubernetes chaos experiments | Predefined experiment library |
| Gremlin | Enterprise chaos platform | Advanced failure scenarios |
| Custom scripts | Platform-specific tests | MATIH-specific scenarios |

Experiment: Pod Failure

Hypothesis

When a single AI Service pod is terminated, the service continues to serve requests through remaining replicas with no user-visible impact.

Procedure

  1. Verify steady state (all health checks pass, error rate is zero)
  2. Terminate one AI Service pod
  3. Observe Kubernetes recreating the pod
  4. Verify requests are served by remaining replicas
  5. Verify the ServiceHighRestartRate alert does not fire (single restart)
  6. Verify the new pod becomes healthy
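Step 2 of this procedure can be expressed as a Chaos Mesh experiment (one of the tools listed above). A minimal sketch, assuming the AI Service pods carry an `app: ai-service` label and run in a `staging` namespace (both names are illustrative, not confirmed platform values):

```yaml
# Chaos Mesh PodChaos experiment: kill one randomly selected AI Service pod.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: ai-service-pod-kill
  namespace: staging
spec:
  action: pod-kill   # terminate the pod; Kubernetes recreates it
  mode: one          # affect a single randomly chosen matching pod
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: ai-service   # assumed label for AI Service pods
```

Apply it with `kubectl apply -f`, then follow steps 3 through 6 to verify recovery; deleting the resource ends the experiment.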

Expected Result

  • No 5xx errors observed
  • Request latency may briefly increase
  • New pod is healthy within 60 seconds

Experiment: Database Latency

Hypothesis

When PostgreSQL response time increases to 5 seconds, services degrade gracefully with timeout errors rather than cascading failures.

Procedure

  1. Verify steady state
  2. Inject 5-second latency on database connections
  3. Observe service behavior and error rates
  4. Verify circuit breakers activate
  5. Remove latency injection
  6. Verify services recover
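Step 2 can be implemented with a Chaos Mesh NetworkChaos experiment that delays traffic reaching the database pods. A sketch, assuming PostgreSQL runs in-cluster with an `app: postgresql` label in a `staging` namespace (illustrative names):

```yaml
# Chaos Mesh NetworkChaos experiment: add 5s latency to all traffic
# reaching the PostgreSQL pods; automatically removed after 10 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: postgres-latency
  namespace: staging
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: postgresql   # assumed label for the database pods
  delay:
    latency: "5s"
  duration: "10m"   # bounded blast radius; deleting the resource is the kill switch
```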

Expected Result

  • Services return timeout errors instead of crashing
  • Circuit breakers prevent connection pool exhaustion
  • Recovery is automatic when latency returns to normal
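The circuit-breaker behavior this experiment validates can be sketched in a few lines. This is a hypothetical illustration, not the platform's actual implementation: after a run of consecutive failures the breaker opens and fails fast (sparing the connection pool), then permits a trial call once a cool-down elapses.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    fails fast while open, and retries after a cool-down period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up a connection-pool slot.
                raise RuntimeError("circuit open")
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

In the latency experiment above, this is the mechanism expected to turn a 5-second database stall into fast timeout errors rather than a cascading failure.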

Experiment Schedule

| Frequency | Experiment | Environment |
| --- | --- | --- |
| Weekly | Pod failure (random service) | Staging |
| Monthly | Network partition (database) | Staging |
| Quarterly | Cross-region failover | Staging (with production-like data) |
| Annually | Full disaster recovery drill | Staging |
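The weekly pod-failure experiment can be automated with a Chaos Mesh Schedule resource, which runs an experiment on a cron-style recurrence. A sketch; the cron slot and namespace are illustrative choices, not confirmed platform settings:

```yaml
# Run a random pod-kill in staging every Monday at 10:00 (cluster time).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: weekly-pod-kill
  namespace: staging
spec:
  schedule: "0 10 * * 1"
  type: PodChaos
  historyLimit: 5             # keep records of recent runs for reporting
  concurrencyPolicy: Forbid   # never overlap experiments
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - staging
```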

Reporting

Each experiment generates a report:

| Section | Content |
| --- | --- |
| Hypothesis | What was expected |
| Procedure | What was done |
| Observations | What actually happened |
| Alerts fired | Which alerts triggered (or should have) |
| Recovery time | Time to return to steady state |
| Action items | Improvements identified |

Safety Guidelines

  • Never run chaos experiments in production without explicit approval
  • Always have a kill switch to immediately stop the experiment
  • Run experiments during business hours with the team available
  • Start with staging and only move to production after staging validation
  • Document every experiment and its results