
/chaos-test - Chaos Engineering Experiment Execution

Execute controlled chaos engineering experiments to validate system resilience. Defines steady-state hypotheses, injects faults across network, resource, application, and infrastructure layers, monitors impact, and generates resilience reports.

System Prompt

EXECUTION DIRECTIVE: When the user invokes this command, you MUST:

  1. Start immediately - do not ask clarifying questions; the only required prompt is the confirmation in step 3
  2. Load the skill at skills/chaos-engineering-patterns/SKILL.md
  3. Obtain confirmation - this command injects faults, so get explicit user confirmation before any injection
  4. Define steady-state hypothesis - establish baseline metrics and success criteria
  5. Select fault type - network, resource, application, or infrastructure
  6. Inject controlled fault - apply fault to target service/pod for specified duration
  7. Monitor impact - track metrics, logs, and system behavior during fault injection
  8. Generate resilience report - document experiment results, blast radius, and recovery behavior
  9. Abort mechanism - keep an emergency stop available throughout the experiment in case the fault causes an unexpected cascade
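
The lifecycle above can be sketched in a few lines of Python. This is a minimal illustration, not the command's actual implementation: `Hypothesis`, `run_experiment`, the metric keys `success_rate` and `p99_ms`, and the `abort_below` threshold are all hypothetical names chosen for the example. The caller supplies the `inject`, `remove`, and `sample` callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hypothesis:
    """Steady-state thresholds (hypothetical field names)."""
    min_success_rate: float  # percent
    max_p99_ms: float

    def holds(self, metrics: Dict[str, float]) -> bool:
        return (metrics["success_rate"] >= self.min_success_rate
                and metrics["p99_ms"] <= self.max_p99_ms)


def run_experiment(
    hypothesis: Hypothesis,
    inject: Callable[[], None],
    remove: Callable[[], None],
    sample: Callable[[], Dict[str, float]],
    *,
    confirmed: bool,
    ticks: int = 3,
    abort_below: float = 90.0,
) -> dict:
    # Fault injection is gated on explicit confirmation.
    if not confirmed:
        raise PermissionError("fault injection requires explicit confirmation")

    # Capture the baseline; skip if the system is not in steady state.
    baseline = sample()
    if not hypothesis.holds(baseline):
        return {"status": "skipped", "baseline": baseline}

    observations: List[Dict[str, float]] = []
    aborted = False
    inject()
    try:
        # Monitor during the fault; stop early if the abort threshold is crossed.
        for _ in range(ticks):
            metrics = sample()
            observations.append(metrics)
            if metrics["success_rate"] < abort_below:
                aborted = True
                break
    finally:
        # The fault is always removed, even on abort or on an exception.
        remove()

    after = sample()
    return {
        "status": "aborted" if aborted else "completed",
        "baseline": baseline,
        "during": observations,
        "after": after,
        "recovered": hypothesis.holds(after),
    }
```

Putting the rollback in `finally` is the design choice worth noting: the abort path and the normal path share the same cleanup, so a failed monitoring call cannot leave the fault in place.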

Usage

# Basic experiment
/chaos-test --experiment network-partition --target api-service --duration 60s

# Specific fault type
/chaos-test --fault-type network --target payments-pod --duration 120s

# Preview without injection
/chaos-test --experiment cpu-stress --dry-run

# Emergency abort
/chaos-test --abort

# Full GameDay mode
/chaos-test --gameday --experiment black-friday-simulation

Options

| Option | Description | Default |
|---|---|---|
| --experiment | Experiment name or file path | Interactive selection |
| --fault-type | Fault category: network, resource, application, or infrastructure | Auto-detect from experiment |
| --target | Target service/pod/deployment | All services |
| --duration | Fault injection duration (e.g., 60s, 5m) | 60s |
| --dry-run | Preview experiment without fault injection | false |
| --abort | Emergency stop ongoing experiment | N/A |
| --gameday | Full GameDay mode with team coordination | false |
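
As an illustration of how these options and their defaults fit together, here is a hedged sketch using Python's standard `argparse`. It is not the command's real parser; `parse_duration` and `build_parser` are hypothetical helpers, and only the `s` and `m` suffixes shown in the table are handled because the source does not specify other units.

```python
import argparse
import re

# Only the suffixes that appear in the documented examples (60s, 5m).
_UNIT_SECONDS = {"s": 1, "m": 60}


def parse_duration(text: str) -> int:
    """Convert a duration such as '60s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+)([sm])", text.strip())
    if not match:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="/chaos-test")
    parser.add_argument("--experiment", help="experiment name or file path")
    parser.add_argument(
        "--fault-type",
        choices=["network", "resource", "application", "infrastructure"],
    )
    parser.add_argument("--target", help="service, pod, or deployment")
    # Documented default is 60s, stored here as seconds.
    parser.add_argument("--duration", type=parse_duration, default=60)
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--abort", action="store_true")
    parser.add_argument("--gameday", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--fault-type", "network", "--target", "payments-pod", "--duration", "120s"]
    )
    print(args.fault_type, args.target, args.duration)
```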

Related Commands

  • /canary-check - Validate canary deployments before chaos testing
  • /smoke-test - Run smoke tests to establish baseline before chaos
  • /postmortem - Generate postmortem after chaos experiment reveals issues

Success Output

✅ Chaos Experiment Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━
Experiment: network-partition-api
Target: payments-service
Duration: 120s
Fault Type: Network (latency +500ms)

📊 Steady-State Metrics
- Pre-fault: 99.95% success rate, p99 200ms
- During fault: 98.2% success rate, p99 850ms
- Post-fault: 99.94% success rate, p99 210ms

🔍 Resilience Findings
✅ Circuit breaker activated after 15s
✅ Failover to backup service successful
⚠️ Recovery took 45s (target: 30s)
❌ 3 customer transactions failed (retry failed)

📝 Report: chaos-reports/network-partition-api-2026-02-01.md
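
The pre-fault and post-fault lines in the sample output suggest a simple comparison for deciding whether the system returned to steady state. The sketch below assumes tolerances that are not stated in the source (0.05 percentage points on success rate, 10% on p99 latency). `Snapshot` and `returned_to_steady_state` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    success_rate: float  # percent
    p99_ms: float


def returned_to_steady_state(
    pre: Snapshot,
    post: Snapshot,
    rate_tolerance: float = 0.05,     # assumed: allowed drop in percentage points
    latency_tolerance: float = 0.10,  # assumed: allowed relative p99 increase
) -> bool:
    """Compare post-fault metrics against the pre-fault baseline."""
    rate_ok = post.success_rate >= pre.success_rate - rate_tolerance
    latency_ok = post.p99_ms <= pre.p99_ms * (1 + latency_tolerance)
    return rate_ok and latency_ok


if __name__ == "__main__":
    pre = Snapshot(success_rate=99.95, p99_ms=200)
    during = Snapshot(success_rate=98.2, p99_ms=850)
    post = Snapshot(success_rate=99.94, p99_ms=210)
    print("during fault within baseline:", returned_to_steady_state(pre, during))
    print("post fault within baseline:", returned_to_steady_state(pre, post))
```

With the figures from the sample output, the during-fault snapshot falls outside the baseline and the post-fault snapshot falls inside it under these assumed tolerances.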

Completion Checklist

  • Steady-state hypothesis defined
  • Baseline metrics captured
  • Fault injected to target service
  • Impact monitored in real-time
  • Blast radius contained
  • System recovered to steady-state
  • Resilience report generated
  • Action items identified for improvement

Failure Indicators

  • Uncontrolled cascade - fault spreads beyond expected blast radius
  • No recovery - system does not return to steady-state after fault removal
  • Monitoring gap - unable to measure impact metrics
  • Abort failed - emergency stop mechanism does not work
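
One way to turn the four indicators above into an automated check is sketched below. The function name, its parameters, and the example service names are illustrative assumptions; in particular, how degraded services are detected is left to the caller.

```python
from typing import List, Optional, Set


def failure_indicators(
    expected_blast_radius: Set[str],
    degraded_services: Set[str],
    recovered: bool,
    metrics_available: bool,
    abort_succeeded: Optional[bool] = None,  # None means no abort was attempted
) -> List[str]:
    """Return the failure indicators that apply to a finished experiment."""
    findings: List[str] = []

    # Any degraded service outside the expected set counts as a cascade.
    escaped = sorted(degraded_services - expected_blast_radius)
    if escaped:
        findings.append("Uncontrolled cascade: " + ", ".join(escaped))
    if not recovered:
        findings.append("No recovery")
    if not metrics_available:
        findings.append("Monitoring gap")
    if abort_succeeded is False:
        findings.append("Abort failed")
    return findings


if __name__ == "__main__":
    print(failure_indicators(
        expected_blast_radius={"payments-service"},
        degraded_services={"payments-service", "checkout"},
        recovered=True,
        metrics_available=True,
    ))
```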

When NOT to Use

  • ❌ Production systems without GameDay approval and stakeholder notification
  • ❌ Systems with known critical issues or ongoing incidents
  • ❌ Services without circuit breakers or resilience patterns
  • ❌ Environments lacking monitoring and observability
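
These exclusions lend themselves to a pre-flight gate that runs before any fault is injected. The following is a sketch under assumptions: `TargetState` and its fields are hypothetical, and how each field is populated (approval records, incident tracker, service metadata) is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetState:
    """Hypothetical description of the environment being targeted."""
    is_production: bool
    gameday_approved: bool
    open_incidents: int
    has_resilience_patterns: bool
    has_monitoring: bool


def preflight_blockers(state: TargetState) -> List[str]:
    """List the reasons, if any, why the experiment should not run."""
    blockers: List[str] = []
    if state.is_production and not state.gameday_approved:
        blockers.append("production target without GameDay approval")
    if state.open_incidents > 0:
        blockers.append("ongoing incidents or known critical issues")
    if not state.has_resilience_patterns:
        blockers.append("no circuit breakers or resilience patterns")
    if not state.has_monitoring:
        blockers.append("no monitoring or observability")
    return blockers


if __name__ == "__main__":
    staging = TargetState(False, False, 0, True, True)
    print(preflight_blockers(staging))  # an empty list means the gate passes
```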

Anti-Patterns

  • ❌ Running chaos experiments without steady-state hypothesis
  • ❌ Injecting faults without abort mechanism
  • ❌ Testing in production without runbook or incident response plan
  • ❌ Ignoring blast radius - causing unintended system-wide outages

Principles

  • #3 Complete Execution - Execute full experiment lifecycle (hypothesis → inject → monitor → report)
  • #9 Based on Facts - Use metrics and observability data to measure resilience

Full Standard: CODITECT-STANDARD-AUTOMATION.md