Silent Failure — User Drop-off as Eval Signal¶
Konstanty's thesis: the most important eval signal in production is one most teams don't instrument — silent user abandonment.
The claim¶
"A lot of users, and that's also what evals why evals matter is they just drop off. There's not a lot of users that going to tell you, 'I don't like that.' … these are the I think the biggest loss that you can have is the user that are, you know, can talking to you and suddenly drop off cuz they're not satisfied."
Thumbs-down is rare. Drop-off is the modal unhappy signal. If your eval surface is "rated interactions + explicit feedback," you are measuring the small vocal tail and missing the silent majority.
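To make "measure the silent majority" concrete, here is a minimal sketch of instrumenting drop-off directly. The `Session` shape and field names are assumptions for illustration, not Prosus's actual schema; the idea is just that a session with no conversion, no explicit rating, and the assistant speaking last is a silent abandonment.

```python
from dataclasses import dataclass

@dataclass
class Session:
    messages: list[tuple[str, str]]  # (role, text) turns in order
    converted: bool                  # did the session end in an order/purchase?
    rated: bool                      # did the user leave explicit feedback?

def silent_dropoff_rate(sessions: list[Session]) -> float:
    """Fraction of sessions with no conversion, no explicit feedback,
    and the assistant speaking last (the user simply walked away)."""
    if not sessions:
        return 0.0
    silent = [
        s for s in sessions
        if not s.converted and not s.rated
        and s.messages and s.messages[-1][0] == "assistant"
    ]
    return len(silent) / len(sessions)
```

If your eval dashboard only tracks rated interactions, this number is invisible; tracking it alongside thumbs-down rates is the point of the claim above.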
Operational moves¶
- Conversion-match traces. Prosus joins conversation traces to conversion outcomes (did the user actually order/buy?). An evaluator's score matters only insofar as it predicts conversion. "We match it with the conversion … which conversation with our what evaluator outcome ended up in conversion. Or which ended up with frustration."
- Frustration signals are noisy but useful. Repetition ("more, give me more, give me more" = the same request three times, i.e. three consecutive failures), caps-lock, curse words — these are behavioral flags, not silver bullets: "I thought another silver bullet is if there's curse words in it … Somebody might just be angry cuz they had a bad day."
- Edge case: frustration → eventual conversion. "There are also certain cases when users use the product, then they're very frustrated, but eventually end up buying everything." The mapping between emotional valence and outcome is not monotonic.
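The moves above can be sketched in two helpers. Both are illustrative assumptions, not Prosus's pipeline: a detector for the noisy frustration flags (repetition, caps-lock, cursing), and an agreement check that asks how often an evaluator's pass/fail verdict actually lines up with conversion. The curse-word lexicon and the `(evaluator_pass, converted)` trace shape are hypothetical.

```python
import re
from collections import Counter

CURSE_WORDS = {"damn", "wtf"}  # placeholder lexicon, an assumption

def frustration_flags(user_turns: list[str]) -> dict[str, bool]:
    """Noisy behavioral flags; per the talk, each alone is weak evidence."""
    norm = [t.strip().lower() for t in user_turns]
    repeats = max(Counter(norm).values(), default=0)
    return {
        # same request 3x, i.e. three consecutive failures ("give me more" x3)
        "repetition": repeats >= 3,
        "caps_lock": any(t.isupper() and len(t) > 3 for t in user_turns),
        "cursing": any(
            w in CURSE_WORDS for t in norm for w in re.findall(r"\w+", t)
        ),
    }

def evaluator_conversion_agreement(traces: list[tuple[bool, bool]]) -> float:
    """traces: (evaluator_pass, converted) per conversation.
    Fraction where the evaluator's verdict matches the conversion outcome,
    the test of whether a score actually predicts the thing you care about."""
    if not traces:
        return 0.0
    return sum(p == c for p, c in traces) / len(traces)
```

The frustration-to-conversion edge case above is exactly why these flags feed the join rather than a verdict: a conversation can trip every flag and still convert, so the flags are features to correlate with outcomes, not labels.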
The "surprise me" trap¶
A concrete product-side lesson from the Prosus food-ordering agent: the team built a "surprise me" intent because designers imagined users would want it, and created a self-inflicted eval problem: how do you score a response to a fully open-ended prompt with no ground truth? "We came up with kind of like surprise me intent and created a bit of an issue for ourselves cuz how do you answer a question surprise me." See concepts/ai/eval-lifecycle-pre-to-production for why designer-imagined intents routinely miss production reality.
Cross-links¶
- eval-lifecycle-pre-to-production — why pre-ship evals miss this
- error-analysis-as-detective-work — how to trace drop-off back to specific failures