Error Analysis as Detective Work¶
Konstanty's framing: production eval quality is gated by a craft skill most teams never practice — going through failures one at a time, writing out what happened and why, and updating your mental model of the user.
The claim¶
"You're a detective … I'm a detective. I'm like, okay, imagine I'm a user that comes to this platform, and I want to, you know, literally step into someone's brain."
And the structural reason teams skip it:
"What I think people don't do very often is that they don't do error analysis, cuz it takes so much time … it seems like so much effort into something that have to change anyway, iterate in a week or two."
Result: teams hesitate, never build the skill, and then don't know how much signal they're leaving on the floor. "First I don't think they even know how much they can get through it cuz they never get through one."
Why it matters¶
The failure modes an LLM agent produces aren't deterministic bugs. "That's not really deterministic. And it's also not a bug that you can systematically improve." So the debugging loop isn't "read stack trace → fix line." It's:
- Read a trace.
- Ask what the user actually wanted.
- Ask what the agent misread or missed.
- Formulate a hypothesis about the class of users / queries that would trigger this.
- Turn it into a failure mode + an eval case.
This is closer to qualitative UX research than to software QA. The core skill is user empathy at high cadence.
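The loop above can be made concrete as a pair of records: one reviewed trace yields a hypothesis about a failure class, which is then promoted into a regression eval case. A minimal sketch; all names and fields here are hypothetical, not from any particular eval framework.

```python
from dataclasses import dataclass

# Hypothetical record types for the detective loop: what the user wanted,
# what the agent misread, and the class of queries expected to trigger it.
@dataclass
class FailureHypothesis:
    trace_id: str
    user_intent: str      # what the user actually wanted
    agent_misread: str    # what the agent missed or misread
    trigger_class: str    # class of users/queries that would trigger this

@dataclass
class EvalCase:
    query: str
    expected_behavior: str
    failure_mode: str     # links back to the hypothesis's trigger_class

def to_eval_case(h: FailureHypothesis, query: str, expected: str) -> EvalCase:
    """Promote a reviewed failure into a regression eval case."""
    return EvalCase(query=query, expected_behavior=expected,
                    failure_mode=h.trigger_class)

# Example: one trace reviewed by hand becomes one eval case.
h = FailureHypothesis(
    trace_id="trace-042",
    user_intent="compare two pricing plans",
    agent_misread="treated the question as a single-plan lookup",
    trigger_class="comparative queries phrased as a single noun phrase",
)
case = to_eval_case(h, query="enterprise vs pro pricing",
                    expected="side-by-side comparison of both plans")
```

The point of the structure is the `trigger_class` field: the eval case encodes the hypothesis about a whole class of failures, not just the one trace that surfaced it.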
Relation to LLM-as-judge¶
Konstanty uses LLMs to accelerate error analysis, not to replace it: "I know it's also a slippery slope cuz you should not try to test LLM with helping you out with something that you know, you judge another LLM. But there are ways that you can actually do it." See llm-as-a-judge · llm-judge-calibration for the Mabrouk-style rigor this demands.
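One way to walk that slippery slope with rigor: before letting an LLM judge triage failures at scale, label an overlap sample by hand and measure agreement. A minimal calibration sketch; the function, labels, and threshold below are illustrative assumptions, not a prescribed protocol.

```python
# Hypothetical calibration check: compare judge labels against human labels
# on the traces both have seen, and only delegate triage if agreement is high.
def judge_agreement(judge_labels: dict, human_labels: dict) -> float:
    """Fraction of overlapping traces where judge and human agree."""
    overlap = set(judge_labels) & set(human_labels)
    if not overlap:
        return 0.0
    agree = sum(judge_labels[t] == human_labels[t] for t in overlap)
    return agree / len(overlap)

# Illustrative labels: the human reviewed a subset of the judge's traces.
judge = {"t1": "fail", "t2": "pass", "t3": "fail", "t4": "pass"}
human = {"t1": "fail", "t2": "pass", "t3": "pass"}

rate = judge_agreement(judge, human)   # 2 of 3 overlapping traces agree
trusted = rate >= 0.9                  # assumed threshold; tune per use case
```

The human-labeled sample never goes away; it is the ground truth the judge is re-calibrated against as the product and its failure modes shift.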
Cross-links¶
- eval-lifecycle-pre-to-production
- silent-failure-dropoff
- agent-taste — the engineer-side analog (taste = internalized error analysis)