Agent Smell

A coinage from jared-zoneraich (analogous to code smell): a set of surface-level operational metrics for sanity-checking an agent run when you can't easily eval end-to-end correctness.

What to count

  • How many tool calls did it make?
  • How many retries / error recoveries?
  • Wall-clock time?
  • Which tools did it pick, and in what ratio?
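
These counts fall out of the run trace directly. A minimal sketch, assuming a hypothetical trace format (a list of event dicts with a `type`, a `tool` name, and a `ts` timestamp — none of these names come from the original note):

```python
from collections import Counter

def agent_smell(events):
    """Compute surface-level 'agent smell' metrics from a run trace.

    `events` is a hypothetical trace format: dicts with a `type`
    ("tool_call" or "retry"), a `tool` name, and a `ts` timestamp in seconds.
    """
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    retries = sum(1 for e in events if e["type"] == "retry")
    by_tool = Counter(e["tool"] for e in tool_calls)
    total = len(tool_calls)
    return {
        "tool_calls": total,
        "retries": retries,
        "wall_clock_s": events[-1]["ts"] - events[0]["ts"] if events else 0.0,
        "tool_ratio": {t: n / total for t, n in by_tool.items()},
    }

trace = [
    {"type": "tool_call", "tool": "read_file", "ts": 0.0},
    {"type": "tool_call", "tool": "grep", "ts": 1.2},
    {"type": "retry", "tool": "grep", "ts": 1.5},
    {"type": "tool_call", "tool": "grep", "ts": 2.0},
]
print(agent_smell(trace))
```

The point is that none of this requires knowing whether the run was *correct* — it's all cheap surface signal.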

Why it matters

The claude-code-master-loop architecture trades deterministic orchestration for model flexibility — which breaks traditional eval. You can no longer just compare intermediate states to a golden trace because the trace is nondeterministic. Zoneraich's mental model:

  1. Back-test first. Capture historical runs and replay.
  2. Point-in-time tests for specific tool-call decisions.
  3. End-to-end integration tests ("did it fix the bug?").
  4. Agent smell on top as a cheap always-on sanity layer — drift in tool-call counts often predicts a regression before outputs do.

Companion pattern: rigorous tools

Offload determinism to the tools, not the loop. Treat each tool as a pure function with input/output tests (matching the harness-engineering discipline). When a tool is itself a sub-agent, you're back in recursion land and must eval end-to-end.
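
What "tool as a pure function" means in practice: no hidden state, so plain input/output assertions pin its behavior regardless of how the nondeterministic loop invokes it. A hypothetical example (the tool and its behavior are invented for illustration):

```python
def extract_todos(source: str) -> list[str]:
    """A hypothetical agent tool as a pure function: same input, same
    output, no side effects -- so it can be pinned with plain I/O tests."""
    return [
        line.split("TODO:", 1)[1].strip()
        for line in source.splitlines()
        if "TODO:" in line
    ]

# Input/output tests live with the tool, deterministic and always-on,
# independent of the agent loop that happens to call it.
assert extract_todos("x = 1  # TODO: rename\ny = 2") == ["rename"]
assert extract_todos("no todos here") == []
```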