
chaos-engineering-resilience

Execute controlled failure injection, resilience testing, and system recovery validation to build confidence in distributed system fault tolerance.

Introduction

The chaos-engineering-resilience skill helps engineers harden distributed systems by intentionally injecting controlled failures. Applying the principles of chaos engineering, it surfaces hidden architectural weaknesses, validates disaster recovery runbooks, and verifies that automated monitoring and alerting behave correctly under stress. It is intended for platform engineers, site reliability engineers (SREs), and quality assurance teams who want to go beyond functional testing and verify resilience under production-like conditions.

  • Automated failure injection for network conditions (latency, packet loss, partitions), infrastructure (instance termination, disk failure, CPU stress), and application-level faults (exceptions, dependency timeouts).

  • Steady-state monitoring with custom metrics for error rate, throughput, and p99 latency, used to detect deviations during experiments.

  • Built-in safety mechanisms, including automatic rollback triggers that fire when predefined error thresholds are exceeded, to keep the blast radius controlled.

  • Orchestration through the qe-chaos-engineer agent, which manages the experiment lifecycle from establishing a steady-state baseline to final impact analysis.

  • Automated runbook generation based on experiment results, capturing system recovery patterns and post-incident documentation.

  • Integration with the performance-testing and production-intelligence agents for a combined assessment of system behavior.

  • Define a clear steady state before launching any experiment; the skill requires baseline measurements to distinguish normal behavior from actual failure impact.

  • Always start in non-production environments such as Dev or Staging before running experiments in production, and widen the blast radius progressively from 1% to 100% of capacity.

  • Expected outputs include experiment definition files (JSON/TypeScript), real-time experiment logs, and post-mortem analysis reports that list the system weaknesses identified.

  • Supports common chaos tools such as tc, Toxiproxy, Chaos Monkey, Gremlin, and LitmusChaos within a centralized agent coordination framework.

  • Follows safety-first principles: every production-scoped intervention requires a rollback trigger, such as error_rate > 5%.
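
The rollback-trigger and blast-radius rules above can be sketched in TypeScript. This is a minimal illustration, not the skill's actual schema: every type, field, and function name here (ExperimentDefinition, firedTriggers, validate, nextBlastRadius) is hypothetical, error_rate is assumed to be a fraction (0.05 for 5%), and the intermediate blast-radius steps between 1% and 100% are assumed values.

```typescript
// Hypothetical shapes for illustration only; the skill's real schema may differ.
type Metric = "error_rate" | "p99_latency_ms" | "throughput_rps";

interface RollbackTrigger {
  metric: Metric;
  // Roll back when the observed value is strictly greater than this threshold.
  threshold: number;
}

interface ExperimentDefinition {
  name: string;
  environment: "dev" | "staging" | "production";
  fault: { kind: "latency" | "packet_loss" | "instance_termination"; target: string };
  blastRadiusPercent: number;
  rollbackTriggers: RollbackTrigger[];
}

type MetricSample = Record<Metric, number>;

// Returns the triggers whose thresholds the current sample exceeds.
function firedTriggers(def: ExperimentDefinition, sample: MetricSample): RollbackTrigger[] {
  return def.rollbackTriggers.filter((t) => sample[t.metric] > t.threshold);
}

// Checks the safety rules: a sane blast radius, and at least one
// rollback trigger for any production-scoped experiment.
function validate(def: ExperimentDefinition): string[] {
  const problems: string[] = [];
  if (def.blastRadiusPercent <= 0 || def.blastRadiusPercent > 100) {
    problems.push("blastRadiusPercent must be greater than 0 and at most 100");
  }
  if (def.environment === "production" && def.rollbackTriggers.length === 0) {
    problems.push("production experiments require at least one rollback trigger");
  }
  return problems;
}

// Assumed progression; the source only states the 1% and 100% endpoints.
const BLAST_RADIUS_STEPS: number[] = [1, 5, 25, 50, 100];

// Returns the next larger step, or undefined once 100% has been reached.
function nextBlastRadius(currentPercent: number): number | undefined {
  return BLAST_RADIUS_STEPS.find((step) => step > currentPercent);
}

// Example definition: inject latency into 1% of a hypothetical service.
const experiment: ExperimentDefinition = {
  name: "checkout-latency",
  environment: "production",
  fault: { kind: "latency", target: "checkout-service" },
  blastRadiusPercent: 1,
  rollbackTriggers: [{ metric: "error_rate", threshold: 0.05 }],
};
```

In this sketch an orchestrator would call validate before starting, poll metrics during the run, roll back as soon as firedTriggers returns a non-empty list, and call nextBlastRadius only after a run completes without any trigger firing.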

Repository Stats

Stars: 329
Forks: 65
Open Issues: 4
Language: TypeScript
Default Branch: main
Sync Status: Idle
Last Synced: Apr 29, 2026, 07:59 AM