Agent Smell
Coinage from jared-zoneraich (analogous to code smell): a set of surface-level operational metrics for sanity-checking an agent run when you can't easily evaluate end-to-end correctness.
What to count
- How many tool calls did it make?
- How many retries / error recoveries?
- Wall-clock time?
- Which tools did it pick, and in what ratio?
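As a rough illustration, the counts above could be derived from a recorded run like this. This is a minimal sketch, not a prescribed format: the `ToolCall` record, its fields, and the `smell_metrics` helper are all hypothetical names, and it assumes the harness already logs each tool call along with start and end timestamps.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One logged tool invocation (hypothetical record shape)."""
    tool: str               # tool name, e.g. "grep" or "edit"
    ok: bool                # False if the call returned an error
    is_retry: bool = False  # True if this call repeats a failed one


def smell_metrics(calls, started_at, finished_at):
    """Summarize a run into the surface-level metrics listed above."""
    total = len(calls)
    counts = Counter(c.tool for c in calls)
    return {
        "tool_calls": total,
        "retries": sum(c.is_retry for c in calls),
        "errors": sum(not c.ok for c in calls),
        "wall_clock_s": finished_at - started_at,
        # Share of calls per tool; empty when the run made no calls.
        "tool_mix": {t: n / total for t, n in counts.items()} if total else {},
    }
```

The output is a flat dictionary on purpose, so it can be logged per run and compared against earlier runs without any knowledge of what the agent was asked to do.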
Why it matters
The claude-code-master-loop architecture trades deterministic orchestration for model flexibility, which breaks traditional eval. You can no longer compare intermediate states to a golden trace, because the trace is nondeterministic: two correct runs may take different tool-call paths. Zoneraich's mental model:
- Back-test first. Capture historical runs and replay.
- Point-in-time tests for specific tool-call decisions.
- End-to-end integration tests ("did it fix the bug?").
- Agent smell on top as a cheap, always-on sanity layer: drift in tool-call counts often predicts a regression before outputs do.
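The drift check in the last point can be sketched as a comparison against a baseline taken from historical runs. This is an assumption-laden example: the `smell_drift` helper, the metric names, and the 50% relative tolerance are all illustrative choices, not values from the source.

```python
def smell_drift(baseline, current, tolerance=0.5):
    """Return the metrics whose relative change from baseline exceeds tolerance.

    Both arguments map metric name -> number (hypothetical shape).
    The result maps each flagged metric to a (baseline, current) pair.
    """
    flagged = {}
    for name, base in baseline.items():
        cur = current.get(name, 0)
        if base == 0:
            # No relative change is defined; flag any nonzero value.
            if cur != 0:
                flagged[name] = (base, cur)
        elif abs(cur - base) / base > tolerance:
            flagged[name] = (base, cur)
    return flagged


baseline = {"tool_calls": 12, "retries": 1, "wall_clock_s": 40.0}
current = {"tool_calls": 31, "retries": 1, "wall_clock_s": 45.0}
print(smell_drift(baseline, current))  # {'tool_calls': (12, 31)}
```

A flagged metric is a prompt to look closer, not a verdict: the run may still be correct, which is why this layer sits on top of the other tests rather than replacing them.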
Companion pattern: rigorous tools
Offload determinism to the tools, not the loop. Treat each tool as a pure function with input/output tests (matching the harness-engineering discipline). When a tool is itself a sub-agent, you're back in recursion land and must eval end-to-end.
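To make the pure-function idea concrete, here is a minimal sketch of a deterministic tool with input/output tests. The `apply_edit` tool and its exactly-one-match rule are hypothetical, chosen only to show the shape of such a test; they are not taken from any particular agent harness.

```python
def apply_edit(text: str, old: str, new: str) -> str:
    """Replace the single occurrence of `old` in `text` with `new`.

    Pure function: same inputs always give the same output, no I/O.
    Raises ValueError when the match is missing or ambiguous, so the
    agent loop gets a clear error instead of a silent wrong edit.
    """
    matches = text.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {matches}")
    return text.replace(old, new, 1)


def test_apply_edit():
    # Happy path: one unambiguous match.
    assert apply_edit("a = 1\nb = 2\n", "b = 2", "b = 3") == "a = 1\nb = 3\n"
    # Ambiguous and missing matches must both be rejected.
    for old in ("= ", "c = 9"):
        try:
            apply_edit("a = 1\nb = 2\n", old, "x")
        except ValueError:
            pass
        else:
            raise AssertionError(f"expected ValueError for {old!r}")


test_apply_edit()
```

Because the tool is deterministic, these tests stay stable no matter which path the model takes to reach the call, which is what lets the loop itself remain flexible.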