The Night 8 Agents Crashed
April 7, 2026, 7:16 PM PT. I opened Telegram to check on the agents. Nothing. Silence across every channel.
Anton wasn't responding. Hermes was replaying stale sessions. Code Claw was stuck in an infinite restart loop. Scout was offline. Kevin's SSH was broken. The whole fleet was down.
The Triage
Two hours. Every agent fixed one at a time:
- Anton: Model override needed. Forced back to Opus.
- Code Claw: Gemini's thinkingBudget was set to 0 when it should have been -1 (dynamic). That caused an infinite crash loop: the model tried to "think" with zero budget, errored, restarted, and tried again.
- Hermes: Replaying a stale session from hours ago. Cleared the stuck state.
- Scout: Keepalive connection dropped. Restarted.
- Kevin: SSH connection broken + wrong model. Fixed both.
- Grok Claw: So broken it had to be rebuilt from scratch.
- xMCP: Spamming 400 errors nonstop. Disabled entirely.
- Brain-recall: Causing more problems than it solved. Disabled.
By 9:17 PM, everything was back online.
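Code Claw's restart loop is the kind of failure a supervisor can cut short. Here is a minimal sketch of a restart guard with exponential backoff and a hard cap. It is not how our fleet is actually wired: `run_with_restart_limit`, `start_agent`, and `CrashLoopError` are hypothetical names made up for illustration.

```python
import time
from typing import Callable


class CrashLoopError(RuntimeError):
    """Raised when an agent keeps failing and restarts are halted."""


def run_with_restart_limit(
    start_agent: Callable[[], None],
    max_restarts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run start_agent, restarting on failure with exponential backoff.

    Returns the number of restarts that were needed. Raises CrashLoopError
    once max_restarts is used up, so the failure surfaces to a human
    instead of looping forever.
    """
    restarts = 0
    while True:
        try:
            start_agent()
            return restarts
        except Exception as exc:
            if restarts >= max_restarts:
                raise CrashLoopError(
                    f"gave up after {restarts} restarts: {exc}"
                ) from exc
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay_s * (2 ** restarts))
            restarts += 1
```

The `sleep` parameter is injectable so the guard can be tested without waiting. The point of the cap is that a bad config fails loudly after a few attempts rather than quietly burning through restarts all evening.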
What We Learned
1. Cascade failures are real. When one agent crashes, check them all. A shared dependency (like a gateway restart or a model API outage) can take down everything simultaneously.
2. Configuration errors are time bombs. Code Claw's thinkingBudget=0 had been set days earlier and worked fine until the model API changed behavior. Then it turned into an infinite crash loop.
3. Some things need to die. xMCP and brain-recall weren't worth fixing, so we disabled both. Fewer moving parts means fewer failure modes.
4. Recovery playbooks matter. We now have a skill (a documented procedure) for exactly this scenario. Next time, it should take 30 minutes, not 2 hours.
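Lessons 1 and 2 point at the same habit: check every agent's config before it starts, not after it crashes. Below is a rough sketch of what such a preflight check could look like. The config shape (a dict with `model` and `thinkingBudget` keys) and the function names are assumptions for illustration, and the rule that rejects 0 reflects what bit us in this incident rather than a general rule for every model.

```python
from typing import Any


def validate_agent_config(name: str, config: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the config passed."""
    problems: list[str] = []

    if not config.get("model"):
        problems.append(f"{name}: no model set")

    # Assumed fleet rule taken from the incident: -1 (dynamic) or a positive
    # budget is accepted; 0 is what sent Code Claw into a crash loop.
    budget = config.get("thinkingBudget")
    if budget is not None:
        if isinstance(budget, bool) or not isinstance(budget, int):
            problems.append(
                f"{name}: thinkingBudget must be an integer, got {budget!r}"
            )
        elif budget == 0 or budget < -1:
            problems.append(
                f"{name}: thinkingBudget={budget} is not allowed; "
                "use -1 or a positive value"
            )

    return problems


def validate_fleet(configs: dict[str, dict[str, Any]]) -> list[str]:
    """Check every agent, not just the one that crashed."""
    problems: list[str] = []
    for name, config in configs.items():
        problems.extend(validate_agent_config(name, config))
    return problems
```

Calling `validate_fleet` with one entry per agent returns every problem in a single pass, which fits the "check them all" rule: one list to read instead of eight separate debugging sessions.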
The Irony
Our AI agents are supposed to be self-healing, self-improving autonomous systems. But when they all go down at once, a human still has to sit there at 7 PM on a Tuesday night and fix them one by one.
That's the reality of production AI. It works beautifully 99% of the time. The other 1% is pure chaos.
We're building for the 1%.