Eval Lifecycle — Pre-ship to Production¶
Konstanty's core claim: the eval system you ship with will not survive first contact with real users, and this is a structural property, not a tooling shortfall.
The lifecycle¶
- Idea phase. Evals start the moment the product idea exists. "Eval start the moment the idea of the product starts." Purpose: keep the team honest about what "good" means before code is written.
- Pre-ship evals. Simulated user scenarios, user profiles, offline datasets. Purpose, in Konstanty's words: "this is kind of evals pre-production is for you not to get like burnt."
- Ship. First real users arrive. Pre-ship assumptions collapse. "You have new users, you have weird users … you don't expect certain types of questions."
- Production evals. Completely different failure modes. Konstanty's canonical example: a shopping agent pre-ship tests "I want to find Adidas size 47" → users in production say "I want to have shoes like LeBron James." Different semantics, different retrieval, different failure shape (see the sketch after this list). "Then, you switch from different failure modes to different failure mode."
- Iteration as DNA. Evals aren't a launch milestone; they're a constant team practice. "Evals should be constant within the development team … And it should be part of the DNA."
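A minimal sketch, in Python, of why the pre-ship dataset collapses. Everything here (the `EvalCase` shape, the `must_mention` containment check, the fake answer) is hypothetical and not Konstanty's tooling; the point is that a check written for explicit-attribute queries cannot even express what a correct answer to the production query looks like.

```python
# Hypothetical pre-ship eval cases vs. the production query that breaks them.
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str                # what the simulated user asks
    must_mention: list[str]   # strings the agent's answer is expected to contain


# Pre-ship dataset: simulated users who state explicit attributes (brand, size).
PRE_SHIP_CASES = [
    EvalCase("I want to find Adidas size 47", must_mention=["Adidas", "47"]),
    EvalCase("Show me black running shoes under $100", must_mention=["running"]),
]

# Production query of the kind the dataset never anticipated: no brand, no size,
# the meaning carried entirely by a cultural reference.
PRODUCTION_QUERY = "I want to have shoes like LeBron James"


def passes(case: EvalCase, answer: str) -> bool:
    """String-containment check: workable for explicit-attribute queries,
    meaningless for the LeBron-style query."""
    return all(token.lower() in answer.lower() for token in case.must_mention)


if __name__ == "__main__":
    fake_answer = "Here are Adidas Ultraboost options in size 47."
    print(passes(PRE_SHIP_CASES[0], fake_answer))  # True: pre-ship case covered
    # No must_mention list captures what a good answer to PRODUCTION_QUERY is;
    # that judgment only exists once production evals look at real traffic.
```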
The 20-evaluator anti-pattern¶
The structural failure mode of immature teams: setting up a library of generic evaluators (toxicity, hallucination, relevance, fluency, faithfulness, …) that have no line back to a business outcome. Konstanty: "Setting at 20 types of different evaluators that are not connected to your goal … I would say it's stretching to fail." And, on the churn it produces: "A constant dance of like trying to figure out which metric which evaluator is better for you, what you're trying to aim … And what's he hiding inside of them. It's kind of a … instruction to fail."
The fix: core evaluators tied to business metrics ("your puppies that you're going to always want to take care of"), with a rotating periphery of exploratory ones. Impact-weighted, not coverage-weighted.
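A sketch of what that split can look like in code. The evaluator names, the string-matching stand-ins for real checks, and the `IMPACT_WEIGHTS` dict are all illustrative assumptions; the structure is the point: a small, stable, impact-weighted core next to a periphery that gets rotated and pruned.

```python
# Hypothetical core-vs-periphery evaluator registry (names are illustrative).
from typing import Callable

Evaluator = Callable[[str, str], float]  # (query, answer) -> score in [0, 1]


def purchasable_item(query: str, answer: str) -> float:
    """Core: tied to a business metric (did the agent surface something the
    user can actually buy?). A real version would check product IDs, not text."""
    return 1.0 if "add to cart" in answer.lower() else 0.0


def mentions_competitor(query: str, answer: str) -> float:
    """Exploratory: rotated in while investigating a suspected issue,
    retired once the question is answered."""
    return 1.0 if "competitor" in answer.lower() else 0.0


# The "puppies you always take care of": few, stable, business-linked.
CORE_EVALUATORS: dict[str, Evaluator] = {"purchasable_item": purchasable_item}

# Rotating periphery: generic or investigative checks, reviewed and pruned.
EXPLORATORY_EVALUATORS: dict[str, Evaluator] = {"mentions_competitor": mentions_competitor}

# Weights reflect expected impact on the business metric,
# not how many evaluators happen to exist.
IMPACT_WEIGHTS = {"purchasable_item": 1.0}


def weighted_core_score(query: str, answer: str) -> float:
    total = sum(IMPACT_WEIGHTS.values())
    return sum(
        IMPACT_WEIGHTS[name] * fn(query, answer) for name, fn in CORE_EVALUATORS.items()
    ) / total
```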
Unit test ≠ eval¶
"We talk about unit test versus you know, LLM testing. And yeah, they usually they yeah, it's not really the same answer … a big misconception for some people like we did unit test. We don't have to do evals." The semantics axis — whether the LLM behaves appropriately in a circumstance — is not measurable via unit tests.
Cross-links¶
- llm-as-a-judge · llm-judge-calibration — judge as eval instrument
- silent-failure-dropoff — the signal pre-ship evals can't see
- error-analysis-as-detective-work — how production evals actually improve