Agent Smell

A coinage from jared-zoneraich (analogous to code smell): a set of surface-level operational metrics for sanity-checking an agent run when you can't easily eval end-to-end correctness.

What to count

  • How many tool calls did it make?
  • How many retries / error recoveries?
  • Wall-clock time?
  • Which tools did it pick, and in what ratio?
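
These counts fall out of the run trace directly. A minimal sketch, assuming a hypothetical trace format (a list of event dicts with a `type`, a `tool` name, and a `ts` timestamp — none of these names come from the original note):

```python
from collections import Counter

def agent_smell(events):
    """Compute surface-level 'agent smell' metrics from a run trace.

    `events` is a hypothetical trace format: dicts with a `type`
    ("tool_call" or "retry"), a `tool` name, and a `ts` timestamp in seconds.
    """
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    retries = sum(1 for e in events if e["type"] == "retry")
    by_tool = Counter(e["tool"] for e in tool_calls)
    total = len(tool_calls)
    return {
        "tool_calls": total,
        "retries": retries,
        "wall_clock_s": events[-1]["ts"] - events[0]["ts"] if events else 0.0,
        "tool_ratio": {t: n / total for t, n in by_tool.items()},
    }

trace = [
    {"type": "tool_call", "tool": "read_file", "ts": 0.0},
    {"type": "tool_call", "tool": "grep", "ts": 1.2},
    {"type": "retry", "tool": "grep", "ts": 1.5},
    {"type": "tool_call", "tool": "grep", "ts": 2.0},
]
print(agent_smell(trace))
```

The point is that none of this requires knowing whether the run was *correct* — it's all cheap surface signal.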

Why it matters

The claude-code-master-loop architecture trades deterministic orchestration for model flexibility — which breaks traditional eval. You can no longer just compare intermediate states to a golden trace because the trace is nondeterministic. Zoneraich's mental model:

  1. Back-test first. Capture historical runs and replay.
  2. Point-in-time tests for specific tool-call decisions.
  3. End-to-end integration tests ("did it fix the bug?").
  4. Agent smell on top as a cheap always-on sanity layer — drift in tool-call counts often predicts a regression before outputs do.

Companion pattern: rigorous tools

Offload determinism to the tools, not the loop. Treat each tool as a pure function with input/output tests (matching the harness-engineering discipline). When a tool is itself a sub-agent, you're back in recursion land and must eval end-to-end.
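
What "tool as a pure function" means in practice: no hidden state, so plain input/output assertions pin its behavior regardless of how the nondeterministic loop invokes it. A hypothetical example (the tool and its behavior are invented for illustration):

```python
def extract_todos(source: str) -> list[str]:
    """A hypothetical agent tool as a pure function: same input, same
    output, no side effects -- so it can be pinned with plain I/O tests."""
    return [
        line.split("TODO:", 1)[1].strip()
        for line in source.splitlines()
        if "TODO:" in line
    ]

# Input/output tests live with the tool, deterministic and always-on,
# independent of the agent loop that happens to call it.
assert extract_todos("x = 1  # TODO: rename\ny = 2") == ["rename"]
assert extract_todos("no todos here") == []
```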