Eval Lifecycle — Pre-ship to Production

Konstanty's core claim: the eval system you ship with will not survive first contact with real users, and this is a structural property, not a tooling shortfall.

The lifecycle

  1. Idea phase. Evals start the moment the product idea exists: "Evals start the moment the idea of the product starts." Purpose: keep the team honest about what "good" means before code is written.
  2. Pre-ship evals. Simulated user scenarios, user profiles, offline datasets (see the harness sketch after this list). Purpose: "Evals pre-production are there for you not to get burnt."
  3. Ship. First real users arrive. Pre-ship assumptions collapse. "You have new users, you have weird users … you don't expect certain types of questions."
  4. Production evals. Completely different failure modes. Konstanty's canonical example: pre-ship, a shopping agent is tested with "I want to find Adidas size 47"; in production, users say "I want to have shoes like LeBron James." Different semantics, different retrieval, different failure shape. "Then you switch from one set of failure modes to another."
  5. Iteration as DNA. Evals aren't a launch milestone; they're a constant team practice. "Evals should be constant within the development team … And it should be part of the DNA."
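
To make the pre-ship step concrete, here is a minimal sketch of an offline eval harness running simulated scenarios, using the talk's shoe queries as cases. `EvalCase`, `run_offline_eval`, and the agent callable are hypothetical names for illustration, not an implementation Konstanty prescribes.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    check: Callable[[str], bool]  # did the agent's answer pass?

# Pre-ship cases cover the queries the team *imagines* users will send.
pre_ship_cases = [
    EvalCase(
        query="I want to find Adidas size 47",
        check=lambda answer: "adidas" in answer.lower() and "47" in answer,
    ),
]

# Production (step 4) surfaces queries with different semantics entirely,
# e.g. "I want to have shoes like LeBron James": the retrieval target is
# no longer a literal brand/size string, so this pre-ship check is moot.

def run_offline_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Pass rate of `agent` over an offline, simulated-user dataset."""
    passed = sum(case.check(agent(case.query)) for case in cases)
    return passed / len(cases)

# Usage: run_offline_eval(my_agent, pre_ship_cases) -> pass rate in [0, 1]
```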

The 20-evaluator anti-pattern

The structural failure mode of immature teams: setting up a library of generic evaluators (toxicity, hallucination, relevance, fluency, faithfulness, …) that have no line back to a business outcome. Konstanty: "Setting up 20 types of different evaluators that are not connected to your goal, and what's hiding inside of them … it's an instruction to fail." The symptom is "a constant dance of trying to figure out which metric, which evaluator is better for you, for what you're trying to aim at."

The fix: a small set of core evaluators tied to business metrics ("your puppies that you're going to always want to take care of"), plus a rotating periphery of exploratory ones. Weight evaluators by impact on the business outcome, not by coverage.
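
One way "impact-weighted, not coverage-weighted" could look in code, as a sketch: each core evaluator carries an explicit weight tied to a business metric and feeds the release score, while the exploratory periphery rotates and is tracked but never gates a release. Evaluator names and weights here are invented assumptions, not from the talk.

```python
# Core evaluators: each tied to a business outcome, weighted by impact
# ("the puppies you always take care of").
CORE_EVALUATORS = {
    "checkout_task_success": 0.6,   # ties directly to conversion
    "retrieval_relevance":   0.4,   # ties directly to search abandonment
}

# Exploratory periphery: rotated in and out; observed, never release-gating.
EXPLORATORY_EVALUATORS = ["toxicity", "fluency", "faithfulness"]

def release_score(scores: dict[str, float]) -> float:
    """Impact-weighted score over core evaluators only."""
    return sum(w * scores[name] for name, w in CORE_EVALUATORS.items())

# Example:
# release_score({"checkout_task_success": 0.9, "retrieval_relevance": 0.7})
# -> 0.82
```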

Unit test ≠ eval

"We talk about unit test versus you know, LLM testing. And yeah, they usually they yeah, it's not really the same answer … a big misconception for some people like we did unit test. We don't have to do evals." The semantics axis — whether the LLM behaves appropriately in a circumstance — is not measurable via unit tests.