chaos-engineering-resilience

Chaos engineering framework for injecting controlled failures, validating system resilience, and automating disaster recovery testing in distributed environments.

Introduction

The chaos-engineering-resilience skill provides a framework for testing the fault tolerance and reliability of distributed systems through controlled, agent-driven failure injection. It is designed for SREs, DevOps engineers, and quality engineers who want to go beyond passive monitoring and deliberately break systems to uncover hidden weaknesses before they impact production. Each experiment follows the standard chaos-engineering cycle: define steady-state metrics, formulate a failure hypothesis, inject a realistic failure, and validate that the system recovers. With this skill, agents can coordinate a specialized fleet (qe-chaos-engineer, qe-performance-tester, and qe-production-intelligence) to run experiments under safety-first protocols such as automatic rollback and a defined blast radius.
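The experiment cycle described above (steady state, hypothesis, injection, recovery validation) can be sketched as a small classifier over three metric snapshots taken before, during, and after the failure. This is a minimal illustration only: the type names, field names, verdict labels, and threshold values are assumptions made for the example, not the skill's actual API.

```typescript
// Hypothetical types for illustration; they are not the skill's actual API.
interface Metrics {
  errorRate: number;     // fraction of failed requests (0.01 = 1%)
  p99LatencyMs: number;  // 99th-percentile latency in milliseconds
  throughputRps: number; // requests served per second
}

interface SteadyState {
  maxErrorRate: number;
  maxP99LatencyMs: number;
  minThroughputRps: number;
}

interface Experiment {
  hypothesis: string;       // what is expected to remain true during the failure
  steadyState: SteadyState; // measurable definition of "normal"
}

type Verdict =
  | "invalid-baseline"   // system was already unhealthy before injection
  | "hypothesis-held"    // stayed within steady state during the failure
  | "degraded-recovered" // left steady state, but recovered afterwards
  | "failed-to-recover"; // still outside steady state after the failure ended

function withinSteadyState(m: Metrics, s: SteadyState): boolean {
  return (
    m.errorRate <= s.maxErrorRate &&
    m.p99LatencyMs <= s.maxP99LatencyMs &&
    m.throughputRps >= s.minThroughputRps
  );
}

// Classify an experiment from snapshots taken before, during, and after injection.
function evaluate(exp: Experiment, before: Metrics, during: Metrics, after: Metrics): Verdict {
  const s = exp.steadyState;
  if (!withinSteadyState(before, s)) return "invalid-baseline";
  if (withinSteadyState(during, s)) return "hypothesis-held";
  return withinSteadyState(after, s) ? "degraded-recovered" : "failed-to-recover";
}

// Example values are made up for the sketch.
const experiment: Experiment = {
  hypothesis: "Checkout stays within its SLO when one cache node is terminated",
  steadyState: { maxErrorRate: 0.01, maxP99LatencyMs: 500, minThroughputRps: 800 },
};
const healthy: Metrics = { errorRate: 0.001, p99LatencyMs: 180, throughputRps: 950 };
const degraded: Metrics = { errorRate: 0.08, p99LatencyMs: 2400, throughputRps: 400 };

console.log(evaluate(experiment, healthy, degraded, healthy)); // degraded-recovered
```

Checking the baseline first matters: if the system is already outside its steady state, the run says nothing about the hypothesis and should not proceed to injection.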

  • Automated failure injection covering network latency, packet loss, instance termination, disk failure, CPU stress, and failures of service-level dependencies.

  • Support for industry-standard tools including tc, toxiproxy, Chaos Monkey, Gremlin, and LitmusChaos.

  • Intelligent blast radius management with gradual progression from development and staging environments to production subsets (1%, 10%, 50%, 100%).

  • Real-time observation of steady-state metrics such as error rates, p99 latency, and throughput to trigger automatic rollbacks when thresholds are breached.

  • Automated generation of incident response runbooks based on observed failure recovery patterns and system behavior.

  • Seamless integration with distributed systems architectures and cloud-native infrastructure for comprehensive resilience validation.
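As an illustration of the network latency and packet loss injection listed above, the following sketch builds argument lists for the Linux `tc` command with the `netem` queueing discipline. How the skill itself invokes `tc` is not documented here, so the `NetworkFault` type, the helper names, and the `eth0` interface are assumptions for the example. Running the resulting commands requires root privileges on a Linux host.

```typescript
// Hypothetical helper types; only the tc/netem command syntax is real.
interface NetworkFault {
  device: string;       // network interface, e.g. "eth0" (assumed)
  delayMs?: number;     // added latency in milliseconds
  jitterMs?: number;    // random variation applied to the delay
  lossPercent?: number; // percentage of packets to drop
}

// Arguments for: tc qdisc add dev <device> root netem [delay ...] [loss ...]
function netemAddArgs(f: NetworkFault): string[] {
  const base = ["qdisc", "add", "dev", f.device, "root", "netem"];
  const args = [...base];
  if (f.delayMs !== undefined) {
    args.push("delay", `${f.delayMs}ms`);
    if (f.jitterMs !== undefined) args.push(`${f.jitterMs}ms`);
  }
  if (f.lossPercent !== undefined) args.push("loss", `${f.lossPercent}%`);
  if (args.length === base.length) {
    throw new Error("fault must set delayMs or lossPercent");
  }
  return args;
}

// Arguments for removing the root qdisc again (the rollback step).
function netemRemoveArgs(device: string): string[] {
  return ["qdisc", "del", "dev", device, "root"];
}

const fault: NetworkFault = { device: "eth0", delayMs: 200, jitterMs: 20, lossPercent: 1 };
console.log(["tc", ...netemAddArgs(fault)].join(" "));
// tc qdisc add dev eth0 root netem delay 200ms 20ms loss 1%
console.log(["tc", ...netemRemoveArgs(fault.device)].join(" "));
// tc qdisc del dev eth0 root
```

Returning argument arrays rather than a shell string lets the caller pass them to a process-spawning API without shell quoting, and pairing every "add" with a "del" keeps the rollback path explicit.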

Best Practices

  • Always start experiments in non-production environments to establish baselines before graduating to live traffic.

  • Ensure steady-state metrics (normal operational behavior) are clearly defined and measurable prior to any injection.

  • Use the experiment structure to document specific hypotheses and expected outcomes for every test run.

  • Monitor the blast radius closely and use the automatic rollback features to keep an experiment from turning into an unplanned outage.

  • Regularly update the memory namespace with new runbooks and baseline metrics to improve the agent's contextual awareness over time.
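The staged blast-radius progression (1%, 10%, 50%, 100%) and threshold-triggered rollback mentioned above can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the `Thresholds` and `Observation` types, the `runStages` function, and the callback signatures are hypothetical, and the callbacks are synchronous for brevity where a real implementation would most likely be asynchronous.

```typescript
// Hypothetical types for illustration; they are not the skill's actual API.
interface Thresholds {
  maxErrorRate: number;    // fraction of failed requests
  maxP99LatencyMs: number; // p99 latency ceiling in milliseconds
}

interface Observation {
  errorRate: number;
  p99LatencyMs: number;
}

interface StageResult {
  percent: number;   // share of traffic or instances exposed to the failure
  breached: boolean; // true if a threshold was exceeded at this stage
}

const STAGES = [1, 10, 50, 100];

// Widen the blast radius stage by stage; roll back and stop on the first breach.
function runStages(
  thresholds: Thresholds,
  observeAt: (percent: number) => Observation, // apply the failure at this stage and measure
  rollback: () => void,                        // undo the injected failure
): StageResult[] {
  const results: StageResult[] = [];
  for (const percent of STAGES) {
    const o = observeAt(percent);
    const breached =
      o.errorRate > thresholds.maxErrorRate ||
      o.p99LatencyMs > thresholds.maxP99LatencyMs;
    results.push({ percent, breached });
    if (breached) {
      rollback();
      break;
    }
  }
  return results;
}

// Simulated system whose error rate grows with the blast radius.
let rollbackCount = 0;
const stageResults = runStages(
  { maxErrorRate: 0.02, maxP99LatencyMs: 500 },
  (percent) => ({ errorRate: percent * 0.001, p99LatencyMs: 200 }),
  () => { rollbackCount += 1; },
);
console.log(stageResults, rollbackCount);
```

In this simulated run the 1% and 10% stages stay under the error-rate threshold, the 50% stage breaches it, the rollback fires once, and the 100% stage is never attempted.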

Repository Stats

Stars: 329
Forks: 65
Open Issues: 4
Language: TypeScript
Default Branch: main
Sync Status: Idle
Last Synced: Apr 29, 2026, 01:29 AM