Source Index — Maggie Konstanty: Evals in 2026
- Who: maggie-konstanty, AI PM at Prosus
- Where: MLOps.community Podcast #372
- When: 2026
- Duration: 41 minutes (2,457 s)
- URL: https://youtu.be/9EjWR3QpJYk
- Raw: raw/transcripts/maggie-konstanty-evals-2026.plain.txt
TL;DR
Eval systems you ship with don't survive production. Accuracy metrics lie. 20 uncorrelated evaluators is a fast path to wasted time. Most users don't complain — they drop off. LLM-as-judge is a tool, not an answer. Error analysis is the unglamorous skill that separates teams who actually improve from teams who drown in green dashboards.
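The "uncorrelated evaluators" point implies a concrete sanity check: score each offline evaluator against the outcome you actually care about (here, whether the user silently dropped off) and prune the ones that do not move with it. A minimal sketch, assuming you log per-trace evaluator scores plus a retention flag; the evaluator names and numbers below are invented for illustration, not from the episode.

```python
import statistics

# Hypothetical per-trace evaluator scores and a per-trace outcome flag.
# None of these names or numbers come from the episode.
evaluator_scores = {
    "answer_relevance": [0.9, 0.4, 0.8, 0.7, 0.2, 0.95],
    "politeness":       [1.0, 1.0, 0.9, 1.0, 1.0, 1.0],  # nearly always green, explains little
    "format_valid":     [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
}
user_retained = [1, 0, 1, 1, 0, 1]  # did the user keep going, or silently drop off

for name, scores in evaluator_scores.items():
    try:
        r = statistics.correlation(scores, user_retained)  # Pearson r, Python 3.10+
    except statistics.StatisticsError:
        r = float("nan")  # constant scores: correlation is undefined
    print(f"{name:>16}: corr with retention = {r:+.2f}")
```

Evaluators that never move, or that move independently of retention, are candidates for cutting before they become dashboard noise.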
Key quotes (grounded)
- On eval lifecycle: "Eval start before you ship the agent or we ship before you ship the product, and then you have to completely drift from your approach to a different eval." — eval-lifecycle-pre-to-production
- On accuracy: "My agent is accurate 95% of time. What does it mean? … If you're hitting 100% something broken … that's not trustworthy metric to me."
- On the 20-evaluator trap: "Setting at 20 types of different evaluators that are not connected to your goal, I would say it's stretching to fail."
- On silent failure: "A lot of users … just drop off. There's not a lot of users that going to tell you, I don't like that." — silent-failure-dropoff
- On error analysis: "You're a detective … literally step into someone's brain." — error-analysis-as-detective-work
- On DNA: "Evals should be constant within the development team. Eval start the moment the idea of the product starts … And it should be part of the DNA."
- On tooling: "If I export more than 1,000 traces, they suddenly slow down … takes hours. They also don't enable sampling as far as I'm concerned." — arize (a generic trace-sampling sketch follows this quote list)
- On LeBron shoes: "Test versus queries, like I want to find Adidas size 47 … user would say like I want to have shoes like LeBron James. And you know, that's a different question."
- On surprise-me: "We came up with kind of like surprise me intent and created a bit of an issue for ourselves cuz how do you answer a question surprise me."
- On adversarial team evals: "Anybody that breaks it gets 50 bucks Amazon gift card or whatever … really trying to get people thinking outside the box."
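The detective framing plus the export complaint suggest a low-tech workaround: sample a fixed, reviewable number of traces yourself instead of scrolling a slow dashboard. A minimal sketch, assuming traces were exported to a local JSONL file; the file name, field names, and sample size are assumptions, and this is not any vendor's API.

```python
import json
import random

SAMPLE_SIZE = 50   # small enough to actually read end to end
random.seed(7)     # fixed seed so teammates review the same sample

# Assumed layout: one JSON trace per line in a local export file.
with open("traces_export.jsonl", encoding="utf-8") as f:
    traces = [json.loads(line) for line in f if line.strip()]

sample = random.sample(traces, min(SAMPLE_SIZE, len(traces)))

for trace in sample:
    # Read each trace end to end and write a free-text note on what went
    # wrong; that note-taking is the error-analysis step, not this script.
    print(trace.get("trace_id"), "|", str(trace.get("user_query", ""))[:80])
```

Stratifying the sample by intent (literal product search vs. "shoes like LeBron James" vs. "surprise me") is a natural next step if one intent dominates the traffic.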
New wiki entries from this source
- Entities: maggie-konstanty, prosus, arize
- Concepts: eval-lifecycle-pre-to-production, silent-failure-dropoff, error-analysis-as-detective-work
Cross-links into existing wiki
- llm-as-a-judge, llm-judge-calibration, gepa — Mabrouk's calibration rigor is the counterweight to "LLM-judge as default"
- agent-taste — error analysis as PM-side taste
- mahmoud-mabrouk, agenta — tool-vendor view on the same problem
- harness-engineering — Lopopolo's "quality bar" has a PM-side analog here: impact-weighted eval coverage (see the sketch below)
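One reading of "impact-weighted eval coverage", sketched here as an assumption rather than anything stated on the podcast: weight each known failure mode by how often it appears in production traffic, then measure how much of that weighted traffic the current eval set covers. Failure-mode names and counts below are invented.

```python
# Hypothetical production frequencies per failure mode (invented numbers).
production_counts = {
    "wrong_product_size": 420,
    "celebrity_style_query": 180,   # e.g. "shoes like LeBron James"
    "surprise_me_intent": 60,
    "checkout_handoff_error": 15,
}
covered_by_evals = {"wrong_product_size", "surprise_me_intent"}

total = sum(production_counts.values())
covered = sum(n for mode, n in production_counts.items() if mode in covered_by_evals)

# Naive coverage says 2 of 4 modes (50%); impact weighting says ~71%,
# and it also shows which uncovered mode matters most.
print(f"impact-weighted coverage: {covered / total:.0%}")
```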