Chaos Engineering
Inject controlled failures into your agents and workflows to prove they degrade gracefully — before they fail in production.
> **Chaos injection IS destructive testing.** Never run chaos scenarios against a namespace holding real user data or production workloads. Use a dedicated test namespace, a fresh Neon branch, or a staging org. If you're unsure whether a namespace is safe to chaos-test, assume it isn't.
>
> If you're the only admin on the org and the scenario targets auth or agent lifecycle, you can lock yourself out. Have a recovery plan before you start.
Chaos engineering is the practice of injecting controlled failures into a running system to verify it degrades gracefully and recovers. Netflix pioneered it for microservices with Chaos Monkey. ACE3 brings it to AI agents: inject LLM timeouts, kill agents mid-workflow, corrupt event bus messages, then watch how the system handles it.
Unlike traditional chaos engineering (where failures are mostly network and service-level), agent chaos covers the AI-specific failure modes: provider rate limits, malformed LLM responses, token exhaustion, stale context, and mutation stress on the genome.
Current state: chaos tooling is early. A few named scenarios work end-to-end; the full catalog and blast-radius enforcement are still being built. Plan #190 (chaos-engineering namespace) tracks the roadmap. Treat what's in this guide as the minimum viable surface — more scenarios land in upcoming releases.
The #1 principle of chaos engineering: define a steady-state hypothesis BEFORE you inject anything. If you can't measure normal, you can't measure the impact.
1. Build a hypothesis around steady-state behaviour
Before injecting anything, write down what “normal” looks like: agent p95 response time, success rate, event bus throughput. Example: “Under normal load, my command center agent responds in <2s at p95 with >99% success.”
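A steady-state hypothesis is most useful when it is machine-checkable. Here is a minimal sketch of encoding the hypothesis above as data and judging observed metrics against it; all names are illustrative, not part of the ACE3 API (the sample numbers come from the report example later in this guide):

```python
# Illustrative steady-state hypothesis: thresholds "normal" must satisfy.
HYPOTHESIS = {
    "p95_ms_max": 2000,       # agent p95 response time under 2s
    "success_rate_min": 0.99, # at least 99% of requests succeed
}

def hypothesis_holds(observed: dict, hypothesis: dict = HYPOTHESIS) -> bool:
    """True if the observed metrics satisfy the steady-state hypothesis."""
    return (
        observed["p95_ms"] <= hypothesis["p95_ms_max"]
        and observed["success_rate"] >= hypothesis["success_rate_min"]
    )

baseline = {"p95_ms": 1850, "success_rate": 0.998}
under_chaos = {"p95_ms": 4200, "success_rate": 0.83}

print(hypothesis_holds(baseline))     # True
print(hypothesis_holds(under_chaos))  # False
```

Once the hypothesis is data, the same check can score the baseline window, the under-chaos window, and recovery.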
2. Vary real-world events
Simulate failures that actually happen in production — LLM rate limits, provider timeouts, connection pool exhaustion — not theoretical ones. Test what will break you, not what's clever.
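To make the principle concrete, here is a self-contained sketch of simulating one realistic failure, a provider 429 burst, against a caller with retry logic. Both the fake provider and the caller are stand-ins for illustration, not ACE3 components:

```python
class RateLimitedProvider:
    """Fake LLM provider: the first `burst` calls fail with 429, then succeed."""
    def __init__(self, burst: int):
        self.remaining_429s = burst

    def complete(self, prompt: str) -> str:
        if self.remaining_429s > 0:
            self.remaining_429s -= 1
            raise RuntimeError("429 Too Many Requests")
        return f"completion for: {prompt}"

def call_with_retry(provider, prompt: str, max_attempts: int = 5) -> str:
    """Retry on simulated 429s with exponential backoff (sleep omitted here)."""
    delay_s = 1.0
    for _attempt in range(max_attempts):
        try:
            return provider.complete(prompt)
        except RuntimeError:
            delay_s *= 2  # backoff doubles each attempt; a real caller would sleep
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")

provider = RateLimitedProvider(burst=3)
print(call_with_retry(provider, "ping"))  # survives 3 simulated 429s
```

The point is the shape of the test: the injected failure (a 429 burst) is one your provider actually emits, and the assertion is about recovery, not about the failure itself.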
3. Run experiments in production (eventually)
Mature chaos programmes graduate to running in production with tight blast radius. Start in a test namespace, then staging, then a dedicated chaos org. Never skip steps.
4. Automate experiments to run continuously
Weekly game days catch regressions faster than ad-hoc chaos sessions. Schedule chaos campaigns and feed the results back into the observer for pattern learning.
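One way to keep campaigns declarative is a scenario list that a weekly job iterates, injecting each entry through whatever inject function your tooling exposes. This is a sketch; the `inject` callable below is a stub, and a real run would call the chaos engine instead:

```python
# Hypothetical weekly campaign: scenarios to inject, in order.
CAMPAIGN = [
    {"scenario": "storage.failure", "duration_s": 30},
    {"scenario": "agent.kill_mid_execution", "duration_s": 10},
]

def run_campaign(campaign: list, inject) -> list:
    """Inject every scenario in order and collect the returned run ids."""
    return [inject(**step) for step in campaign]

# Stubbed inject for illustration; swap in the real chaos-engine call.
run_ids = run_campaign(CAMPAIGN, inject=lambda scenario, duration_s: f"run-{scenario}")
print(run_ids)  # ['run-storage.failure', 'run-agent.kill_mid_execution']
```

Collecting run ids per campaign gives the observer a stable stream of reports to learn patterns from.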
5. Minimise blast radius
Every scenario has hard limits: only this namespace, no more than N agents affected, auto-halt on SLO breach. Chaos must never escape its intended boundary.
The scenario catalog is being built out in phases (Plan #190). Current foundation includes:
| Scenario | What it does | Status |
|---|---|---|
| `agent.kill_mid_execution` | Stops an agent mid-run to test orchestrator recovery | Available |
| `storage.failure` | Simulates DB unreachable for N seconds | Available |
| `llm.timeout.30s` | LLM provider times out after 30s (tests fallback) | Coming |
| `llm.rate_limit.burst` | Provider returns 429 for 60s | Coming |
| `network.latency.500ms` | Adds 500ms to every DB call | Coming |
| `workflow.deadlock` | Injects a cyclic dependency in a running DAG | Coming |
Use `ace_chaos_inject` to run a scenario against a target namespace. Always pass a blast radius — the engine will refuse scenarios without one.
```python
# Inject a storage failure for 30 seconds in my test namespace
ace_chaos_inject(
    scenario="storage.failure",
    namespace="test-chaos-ns",
    duration_s=30,
    blast_radius={
        "max_agents_affected": 3,
        "max_error_rate": 0.5
    }
)

# Read the report after the run completes
ace_chaos_report(run_id="<returned from inject>")
```

The engine measures baseline steady-state (60s), injects the chaos, measures under-chaos behaviour, then measures recovery. The report includes the degradation percentage, recovery time, and any agents that failed to come back.
Every chaos scenario declares a blast radius. The engine refuses to run scenarios without one, and auto-halts a scenario that breaches its limits mid-run.
```python
blast_radius = {
    "namespace": "test-chaos-ns",  # chaos confined to this namespace
    "max_agents_affected": 3,      # never hit more than 3 agents
    "max_duration_s": 300,         # auto-terminate after 5 min
    "max_error_rate": 0.5,         # halt if error rate exceeds 50%
    "environment": "neon_branch"   # NOT production, NOT staging
}
```

Enforcement rules:
- A scenario touching any namespace outside the blast radius raises `BlastRadiusViolation`
- A scenario exceeding `max_agents_affected` raises `BlastRadiusExceeded`
- A scenario whose observed error rate exceeds `max_error_rate` auto-rolls back mid-run
- A scenario running longer than `max_duration_s` is killed and reported as timed out
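These rules can be modelled as a small checker that the engine runs against each mid-run observation. The exception names come from the rules above, but the checker itself is an illustrative sketch, not the engine's actual implementation:

```python
class BlastRadiusViolation(Exception):
    """Chaos touched a namespace outside the declared blast radius."""

class BlastRadiusExceeded(Exception):
    """A numeric blast-radius limit was breached mid-run."""

def enforce(observed: dict, radius: dict) -> None:
    """Raise if a mid-run observation breaches the declared blast radius."""
    if observed["namespace"] != radius["namespace"]:
        raise BlastRadiusViolation(f"namespace {observed['namespace']!r} is out of bounds")
    if observed["agents_affected"] > radius["max_agents_affected"]:
        raise BlastRadiusExceeded("max_agents_affected exceeded")
    if observed["error_rate"] > radius["max_error_rate"]:
        raise BlastRadiusExceeded("max_error_rate exceeded; rolling back")

radius = {"namespace": "test-chaos-ns", "max_agents_affected": 3, "max_error_rate": 0.5}
enforce({"namespace": "test-chaos-ns", "agents_affected": 2, "error_rate": 0.1}, radius)  # ok
```

The duration limit works the same way but is enforced by a timer rather than a per-observation check, so it is omitted here.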
After a chaos run, `ace_chaos_report(run_id)` returns a structured summary you can act on.
```json
{
  "run_id": "chaos-2026-04-15-0001",
  "scenario": "storage.failure",
  "hypothesis": {
    "agent_response_p95_ms": 2000,
    "agent_success_rate": 0.99
  },
  "baseline": { "p95_ms": 1850, "success_rate": 0.998 },
  "under_chaos": { "p95_ms": 4200, "success_rate": 0.83 },
  "recovery": { "time_to_baseline_s": 18 },
  "hypothesis_held": false,
  "notes": "Success rate dropped below 0.90 threshold during 30s of storage failure"
}
```

If `hypothesis_held` is false, the system didn't behave as expected — either your hypothesis was wrong or you found a resilience bug. Either way, it's a learning event.
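That two-way split (wrong hypothesis vs. resilience bug) can be automated. Here is a sketch of a triage function; the field names follow the example report, but the triage policy itself is hypothetical:

```python
def triage(report: dict) -> str:
    """Turn a chaos report into a follow-up action (illustrative policy)."""
    if report["hypothesis_held"]:
        return "pass"
    hypothesis = report["hypothesis"]
    under = report["under_chaos"]
    if under["success_rate"] < hypothesis["agent_success_rate"]:
        return "file-resilience-bug"  # real failures occurred under chaos
    return "revisit-hypothesis"       # metrics held; the hypothesis was off

report = {
    "hypothesis": {"agent_response_p95_ms": 2000, "agent_success_rate": 0.99},
    "under_chaos": {"p95_ms": 4200, "success_rate": 0.83},
    "hypothesis_held": False,
}
print(triage(report))  # file-resilience-bug
```

A function like this is what lets weekly campaigns feed a backlog instead of a dashboard nobody reads.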
Well-behaved scenarios clean up after themselves — the chaos stops, agents recover, state returns to baseline. If something didn't recover:

- Check `ace_chaos_report` for the agents-failed-to-recover list
- Restart stuck agents via `ace_agent_status(name, action="restart")`
- Roll back the genome if you injected `genome.bad_mutation`: use `ace_genome_fork` from a known-good snapshot
- Worst case, archive the test namespace and create a fresh one — Neon branches make this near-instant
- Don't run chaos in production — graduate through Neon branch → staging-chaos org → production
- Don't run chaos without a blast radius — the engine will refuse, but external scripts might not
- Don't run chaos with real user data in the namespace — use a fresh test namespace
- Don't run chaos if you're the only admin — auth scenarios can lock you out
- Don't skip the hypothesis — chaos without a hypothesis is noise, not data
- Don't ignore failed hypotheses — they're the whole point; file them as issues
Related: MCP Tools Reference • Background Agents • Genome Evolution