Source Index — Maggie Konstanty: Evals in 2026

Who: Maggie Konstanty, AI PM at Prosus
Where: MLOps.community Podcast #372
When: 2026
Duration: 41 minutes (2,457s)
URL: https://youtu.be/9EjWR3QpJYk
Raw: raw/transcripts/maggie-konstanty-evals-2026.plain.txt

TL;DR

Eval systems you ship with don't survive production. Accuracy metrics lie. 20 uncorrelated evaluators is a fast path to wasted time. Most users don't complain — they drop off. LLM-as-judge is a tool, not an answer. Error analysis is the unglamorous skill that separates teams who actually improve from teams who drown in green dashboards.

Key quotes (grounded)

  • On eval lifecycle: "Eval start before you ship the agent or we ship before you ship the product, and then you have to completely drift from your approach to a different eval." → eval-lifecycle-pre-to-production
  • On accuracy: "My agent is accurate 95% of time. What does it mean? … If you're hitting 100% something broken … that's not trustworthy metric to me."
  • On the 20-evaluator trap: "Setting at 20 types of different evaluators that are not connected to your goal, I would say it's stretching to fail."
  • On silent failure: "A lot of users … just drop off. There's not a lot of users that going to tell you, I don't like that." → silent-failure-dropoff
  • On error analysis: "You're a detective … literally step into someone's brain." → error-analysis-as-detective-work
  • On DNA: "Evals should be constant within the development team. Eval start the moment the idea of the product starts … And it should be part of the DNA."
  • On tooling: "If I export more than 1,000 traces, they suddenly slow down … takes hours. They also don't enable sampling as far as I'm concerned." → arize
  • On LeBron shoes: "Test versus queries, like I want to find Adidas size 47 … user would say like I want to have shoes like LeBron James. And you know, that's a different question."
  • On surprise-me: "We came up with kind of like surprise me intent and created a bit of an issue for ourselves cuz how do you answer a question surprise me."
  • On adversarial team evals: "Anybody that breaks it gets 50 bucks Amazon gift card or whatever … really trying to get people thinking outside the box."
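The tooling complaint in the quotes above (bulk trace exports are slow and the platform offers no sampling) has a common client-side workaround: sample the trace stream yourself before analysis. A minimal sketch using reservoir sampling is below; the trace shape (`id`, `latency_ms`) and the idea of streaming rather than bulk-exporting are hypothetical illustrations, not any specific vendor's API.

```python
import random

def reservoir_sample(traces, k, seed=0):
    """Uniformly sample k items from a stream of unknown length.

    Lets you look at a representative subset of traces without
    waiting on a full multi-thousand-trace export.
    """
    rng = random.Random(seed)  # seeded for reproducible samples
    sample = []
    for i, trace in enumerate(traces):
        if i < k:
            sample.append(trace)          # fill the reservoir first
        else:
            j = rng.randint(0, i)         # keep each item with prob k/(i+1)
            if j < k:
                sample[j] = trace
    return sample

# Hypothetical usage: a generator standing in for a paginated trace API.
traces = ({"id": n, "latency_ms": n * 3} for n in range(10_000))
subset = reservoir_sample(traces, k=100)
```

Because every trace has equal probability of landing in the reservoir, aggregate error rates measured on the subset are unbiased estimates of the full stream, which is usually all an error-analysis pass needs.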

New wiki entries from this source