Source Index — Maggie Konstanty: Evals in 2026
- Who: maggie-konstanty, AI PM at Prosus
- Where: MLOps.community Podcast #372
- When: 2026
- Duration: 41 minutes (2,457 s)
- URL: https://youtu.be/9EjWR3QpJYk
- Raw: raw/transcripts/maggie-konstanty-evals-2026.plain.txt
TL;DR
Eval systems you ship with don't survive production. Accuracy metrics lie. 20 uncorrelated evaluators is a fast path to wasted time. Most users don't complain — they drop off. LLM-as-judge is a tool, not an answer. Error analysis is the unglamorous skill that separates teams who actually improve from teams who drown in green dashboards.
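The "uncorrelated evaluators" point implies a concrete sanity check: score each offline evaluator against the outcome you actually care about (here, whether the user silently dropped off) and prune the ones that do not move with it. A minimal sketch, assuming you log per-trace evaluator scores plus a retention flag; the evaluator names and numbers below are invented for illustration, not from the episode.

```python
import statistics

# Hypothetical per-trace evaluator scores and a per-trace outcome flag.
# None of these names or numbers come from the episode.
evaluator_scores = {
    "answer_relevance": [0.9, 0.4, 0.8, 0.7, 0.2, 0.95],
    "politeness":       [1.0, 1.0, 0.9, 1.0, 1.0, 1.0],  # nearly always green, explains little
    "format_valid":     [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
}
user_retained = [1, 0, 1, 1, 0, 1]  # did the user keep going, or silently drop off

for name, scores in evaluator_scores.items():
    try:
        r = statistics.correlation(scores, user_retained)  # Pearson r, Python 3.10+
    except statistics.StatisticsError:
        r = float("nan")  # constant scores: correlation is undefined
    print(f"{name:>16}: corr with retention = {r:+.2f}")
```

Evaluators that never move, or that move independently of retention, are candidates for cutting before they become dashboard noise.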
Key quotes (grounded)
- On eval lifecycle: "Eval start before you ship the agent or we ship before you ship the product, and then you have to completely drift from your approach to a different eval." — eval-lifecycle-pre-to-production
- On accuracy: "My agent is accurate 95% of time. What does it mean? … If you're hitting 100% something broken … that's not trustworthy metric to me."
- On the 20-evaluator trap: "Setting at 20 types of different evaluators that are not connected to your goal, I would say it's stretching to fail."
- On silent failure: "A lot of users … just drop off. There's not a lot of users that going to tell you, I don't like that." — silent-failure-dropoff
- On error analysis: "You're a detective … literally step into someone's brain." — error-analysis-as-detective-work
- On DNA: "Evals should be constant within the development team. Eval start the moment the idea of the product starts … And it should be part of the DNA."
- On tooling: "If I export more than 1,000 traces, they suddenly slow down … takes hours. They also don't enable sampling as far as I'm concerned." — arize (a generic trace-sampling sketch follows this quote list)
- On LeBron shoes: "Test versus queries, like I want to find Adidas size 47 … user would say like I want to have shoes like LeBron James. And you know, that's a different question."
- On surprise-me: "We came up with kind of like surprise me intent and created a bit of an issue for ourselves cuz how do you answer a question surprise me."
- On adversarial team evals: "Anybody that breaks it gets 50 bucks Amazon gift card or whatever … really trying to get people thinking outside the box."
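The detective framing plus the export complaint suggest a low-tech workaround: sample a fixed, reviewable number of traces yourself instead of scrolling a slow dashboard. A minimal sketch, assuming traces were exported to a local JSONL file; the file name, field names, and sample size are assumptions, and this is not any vendor's API.

```python
import json
import random

SAMPLE_SIZE = 50   # small enough to actually read end to end
random.seed(7)     # fixed seed so teammates review the same sample

# Assumed layout: one JSON trace per line in a local export file.
with open("traces_export.jsonl", encoding="utf-8") as f:
    traces = [json.loads(line) for line in f if line.strip()]

sample = random.sample(traces, min(SAMPLE_SIZE, len(traces)))

for trace in sample:
    # Read each trace end to end and write a free-text note on what went
    # wrong; that note-taking is the error-analysis step, not this script.
    print(trace.get("trace_id"), "|", str(trace.get("user_query", ""))[:80])
```

Stratifying the sample by intent (literal product search vs. "shoes like LeBron James" vs. "surprise me") is a natural next step if one intent dominates the traffic.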
New wiki entries from this source
- Entities: maggie-konstanty, prosus, arize
- Concepts: eval-lifecycle-pre-to-production, silent-failure-dropoff, error-analysis-as-detective-work
Cross-links into existing wiki
- llm-as-a-judge, llm-judge-calibration, gepa — Mabrouk's calibration rigor is the counterweight to "LLM-judge as default"
- agent-taste — error analysis as PM-side taste
- mahmoud-mabrouk, agenta — tool-vendor view on the same problem
- harness-engineering — Lopopolo's "quality bar" has a PM-side analog here: impact-weighted eval coverage (see the sketch below)
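One reading of "impact-weighted eval coverage", sketched here as an assumption rather than anything stated on the podcast: weight each known failure mode by how often it appears in production traffic, then measure how much of that weighted traffic the current eval set covers. Failure-mode names and counts below are invented.

```python
# Hypothetical production frequencies per failure mode (invented numbers).
production_counts = {
    "wrong_product_size": 420,
    "celebrity_style_query": 180,   # e.g. "shoes like LeBron James"
    "surprise_me_intent": 60,
    "checkout_handoff_error": 15,
}
covered_by_evals = {"wrong_product_size", "surprise_me_intent"}

total = sum(production_counts.values())
covered = sum(n for mode, n in production_counts.items() if mode in covered_by_evals)

# Naive coverage says 2 of 4 modes (50%); impact weighting says ~71%,
# and it also shows which uncovered mode matters most.
print(f"impact-weighted coverage: {covered / total:.0%}")
```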