Advanced Guide

Chaos Engineering

Inject controlled failures into your agents and workflows to prove they degrade gracefully — before they fail in production.

Chaos injection IS destructive testing

Never run chaos scenarios against a namespace holding real user data or production workloads. Use a dedicated test namespace, a fresh Neon branch, or a staging org. If you're unsure whether a namespace is safe to chaos-test, assume it isn't.

If you're the only admin on the org and the scenario targets auth or agent lifecycle, you can lock yourself out. Have a recovery plan before you start.

What is chaos engineering in ACE3?
Controlled failure injection for AI agents and workflows

Chaos engineering is the practice of injecting controlled failures into a running system to verify it degrades gracefully and recovers. Netflix pioneered it for microservices with Chaos Monkey. ACE3 brings it to AI agents: inject LLM timeouts, kill agents mid-workflow, corrupt event bus messages, then watch how the system handles it.

Unlike traditional chaos engineering (where failures are mostly network and service-level), agent chaos covers the AI-specific failure modes: provider rate limits, malformed LLM responses, token exhaustion, stale context, and mutation stress on the genome.

Current state: chaos tooling is early. A few named scenarios work end-to-end; the full catalog and blast-radius enforcement are still being built. Plan #190 (the chaos-engineering namespace) tracks the roadmap. Treat what's in this guide as the minimum viable surface; more scenarios land in upcoming releases.

The principles (steady-state hypothesis first)
Based on principlesofchaos.org

The #1 principle of chaos engineering: define a steady-state hypothesis BEFORE you inject anything. If you can't measure normal, you can't measure the impact.

1. Build a hypothesis around steady-state behaviour

Before injecting anything, write down what “normal” looks like: agent p95 response time, success rate, event bus throughput. Example: “Under normal load, my command center agent responds in <2s at p95 with >99% success.”
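
One way to make this concrete is to encode the hypothesis as data and check measured metrics against it. A minimal sketch in Python; the structure and field names are illustrative, not an ACE3 schema:

# A steady-state hypothesis as checkable data.
# Field names are illustrative, not part of any ACE3 schema.
hypothesis = {
    "agent_response_p95_ms": 2000,   # p95 latency must stay under 2s
    "agent_success_rate": 0.99,      # success rate must stay above 99%
}

def hypothesis_holds(metrics: dict) -> bool:
    """True if the measured metrics satisfy the steady-state hypothesis."""
    return (
        metrics["p95_ms"] <= hypothesis["agent_response_p95_ms"]
        and metrics["success_rate"] >= hypothesis["agent_success_rate"]
    )

# A normal-load measurement window should pass before you inject anything
assert hypothesis_holds({"p95_ms": 1850, "success_rate": 0.998})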

2. Vary real-world events

Simulate failures that actually happen in production — LLM rate limits, provider timeouts, connection pool exhaustion — not theoretical ones. Test what will break you, not what's clever.

3. Run experiments in production (eventually)

Mature chaos programmes graduate to running in production with tight blast radius. Start in a test namespace, then staging, then a dedicated chaos org. Never skip steps.

4. Automate experiments to run continuously

Weekly game days catch regressions faster than ad-hoc chaos sessions. Schedule chaos campaigns and feed the results back into the observer for pattern learning.
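
A scheduled campaign can be a short script that fires scenarios on a timer and checks the results. A sketch, assuming a generic mcp_client.call wrapper over the MCP tools (the wrapper is hypothetical; ace_chaos_inject and ace_chaos_report are the tools described below):

# Weekly game day: run each available scenario in sequence.
# `mcp_client` is a hypothetical wrapper; the real invocation
# depends on your MCP client library.
import time

SCENARIOS = ["storage.failure", "agent.kill_mid_execution"]

def weekly_game_day(mcp_client):
    for scenario in SCENARIOS:
        run = mcp_client.call("ace_chaos_inject", {
            "scenario": scenario,
            "namespace": "test-chaos-ns",
            "duration_s": 30,
            "blast_radius": {"max_agents_affected": 3, "max_error_rate": 0.5},
        })
        time.sleep(120)  # wait out baseline + chaos + recovery windows
        report = mcp_client.call("ace_chaos_report", {"run_id": run["run_id"]})
        if not report["hypothesis_held"]:
            print(f"{scenario}: hypothesis broken; file an issue")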

5. Minimise blast radius

Every scenario has hard limits: only this namespace, no more than N agents affected, auto-halt on SLO breach. Chaos must never escape its intended boundary.

Available scenarios
What you can inject today

The scenario catalog is being built out in phases (Plan #190). Current foundation includes:

Scenario                    What it does                                            Status
agent.kill_mid_execution    Stops an agent mid-run to test orchestrator recovery   Available
storage.failure             Simulates DB unreachable for N seconds                  Available
llm.timeout.30s             LLM provider times out after 30s (tests fallback)       Coming
llm.rate_limit.burst        Provider returns 429 for 60s                            Coming
network.latency.500ms       Adds 500ms to every DB call                             Coming
workflow.deadlock           Injects a cyclic dependency in a running DAG            Coming
Injecting chaos via MCP
From Claude Code, Cursor, or any MCP-enabled client

Use ace_chaos_inject to run a scenario against a target namespace. Always pass a blast radius — the engine will refuse scenarios without one.

# Inject a storage failure for 30 seconds in my test namespace
ace_chaos_inject(
  scenario="storage.failure",
  namespace="test-chaos-ns",
  duration_s=30,
  blast_radius={
    "max_agents_affected": 3,
    "max_error_rate": 0.5
  }
)

# Read the report after the run completes
ace_chaos_report(run_id="<returned from inject>")

The engine measures baseline steady-state (60s), injects the chaos, measures under-chaos behaviour, then measures recovery. The report includes degradation percentage, recovery time, and any agents that failed to come back.
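
The degradation figure is plain arithmetic over those phases. A sketch of one plausible way to compute it from the baseline and under-chaos numbers (the engine's exact formula isn't documented here; the values match the sample report below):

baseline    = {"p95_ms": 1850, "success_rate": 0.998}
under_chaos = {"p95_ms": 4200, "success_rate": 0.83}

# Latency degradation relative to baseline: (4200 - 1850) / 1850 ≈ 1.27
latency_degradation = (under_chaos["p95_ms"] - baseline["p95_ms"]) / baseline["p95_ms"]

# Success-rate drop in absolute terms: 0.998 - 0.83 ≈ 0.17
success_drop = baseline["success_rate"] - under_chaos["success_rate"]

print(f"p95 degraded {latency_degradation:.0%}, success rate dropped {success_drop:.1%}")
# -> p95 degraded 127%, success rate dropped 16.8%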

Blast radius & auto-rollback
Hard limits that prevent chaos from escaping

Every chaos scenario declares a blast radius. The engine refuses to run scenarios without one, and auto-halts a scenario that breaches its limits mid-run.

blast_radius = {
  "namespace": "test-chaos-ns",          # chaos confined to this namespace
  "max_agents_affected": 3,              # never hit more than 3 agents
  "max_duration_s": 300,                 # auto-terminate after 5 min
  "max_error_rate": 0.5,                 # halt if error rate exceeds 50%
  "environment": "neon_branch"           # NOT production, NOT staging
}

Enforcement rules

  • A scenario touching any namespace outside the blast radius raises BlastRadiusViolation
  • A scenario exceeding max_agents_affected raises BlastRadiusExceeded
  • A scenario whose observed error rate exceeds max_error_rate auto-rolls back mid-run
  • A scenario running longer than max_duration_s is killed and reported as timed out
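
A minimal sketch of those checks in Python. Only the two exception names come from the engine; the function and field names here are illustrative:

class BlastRadiusViolation(Exception):
    """Scenario touched a namespace outside its declared blast radius."""

class BlastRadiusExceeded(Exception):
    """Scenario affected more agents than the blast radius allows."""

def check_limits(blast_radius: dict, observed: dict) -> str:
    """Return the action to take for one observation tick (names illustrative)."""
    if observed["namespace"] != blast_radius["namespace"]:
        raise BlastRadiusViolation(observed["namespace"])
    if observed["agents_affected"] > blast_radius["max_agents_affected"]:
        raise BlastRadiusExceeded(str(observed["agents_affected"]))
    if observed["error_rate"] > blast_radius["max_error_rate"]:
        return "rollback"      # auto-rollback mid-run
    if observed["elapsed_s"] > blast_radius["max_duration_s"]:
        return "kill"          # killed and reported as timed out
    return "continue"
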
Reading chaos reports
What the report tells you

After a chaos run, ace_chaos_report(run_id) returns a structured summary you can act on.

{
  "run_id": "chaos-2026-04-15-0001",
  "scenario": "storage.failure",
  "hypothesis": {
    "agent_response_p95_ms": 2000,
    "agent_success_rate": 0.99
  },
  "baseline":    { "p95_ms": 1850, "success_rate": 0.998 },
  "under_chaos": { "p95_ms": 4200, "success_rate": 0.83 },
  "recovery":    { "time_to_baseline_s": 18 },
  "hypothesis_held": false,
  "notes": "Success rate dropped below 0.90 threshold during 30s of storage failure"
}

If hypothesis_held is false, the system didn't behave as expected — either your hypothesis was wrong or you found a resilience bug. Either way, it's a learning event.
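
Because the report is structured, you can gate on it mechanically, for example failing a CI job whenever a scheduled run breaks its hypothesis. A sketch, assuming the report was saved to a JSON file (the filename is hypothetical):

import json
import sys

# Load a saved chaos report and fail the pipeline if the hypothesis broke.
with open("chaos-report.json") as f:   # hypothetical path
    report = json.load(f)

if not report["hypothesis_held"]:
    print(f"Chaos run {report['run_id']} broke its hypothesis: {report['notes']}")
    sys.exit(1)  # fail CI so the regression gets triaged, not ignored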

Recovery and rollback
What to do if something didn't come back

Well-behaved scenarios clean up after themselves — the chaos stops, agents recover, state returns to baseline. If something didn't recover:

  • Check ace_chaos_report for the agents-failed-to-recover list
  • Restart stuck agents via ace_agent_status(name, action="restart")
  • Roll back the genome if you injected genome.bad_mutation: use ace_genome_fork from a known-good snapshot
  • Worst case, archive the test namespace and create a fresh one — Neon branches make this near-instant
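
The first two steps on this list can be scripted. A sketch, again assuming a generic mcp_client.call wrapper (hypothetical) over the tools named above; the report field name is also an assumption:

def recover_stuck_agents(mcp_client, run_id: str):
    """Restart every agent the chaos report says failed to recover."""
    report = mcp_client.call("ace_chaos_report", {"run_id": run_id})
    for agent in report.get("agents_failed_to_recover", []):  # field name assumed
        mcp_client.call("ace_agent_status", {"name": agent, "action": "restart"})
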
What NOT to do
Avoidable self-inflicted outages
  • Don't run chaos in production — graduate through Neon branch → staging-chaos org → production
  • Don't run chaos without a blast radius — the engine will refuse, but external scripts might not
  • Don't run chaos with real user data in the namespace — use a fresh test namespace
  • Don't run chaos if you're the only admin — you can lock yourself out of auth scenarios
  • Don't skip the hypothesis — chaos without a hypothesis is noise, not data
  • Don't ignore failed hypotheses — they're the whole point; file them as issues