Maggie Konstanty¶

AI Product Manager at prosus, building and evaluating LLM agents for food ordering and ecommerce at consumer scale. Publicly known for an unusually unfiltered take on the actual state of LLM evaluation — contra marketing-grade narratives from tooling vendors.

Positions¶

Pre-ship evals will not survive first contact with users. "Eval start before you ship the agent … then you have to completely drift from your approach to a different eval. So, you don't continue with the same evaluation system once you have an agent in production."
Accuracy as a top-line number is meaningless. "My agent is accurate 95% of time. What does it mean? … If you're hitting 100% something broken … 95% I'm also like, okay, that's really high. Yeah, something is fishy with this data."
20 uncorrelated evaluators is instruction to fail. "Setting at 20 types of different evaluators that are not connected to your goal, I would say it's stretching to fail." See concepts/ai/eval-lifecycle-pre-to-production.
Silent failure is the real risk. "A lot of users … just drop off. There's not a lot of users that going to tell you, I don't like that." See concepts/ai/silent-failure-dropoff.
Error analysis is detective work. "You're a detective … imagine I'm a user that comes to this platform, literally step into someone's brain." See concepts/ai/error-analysis-as-detective-work.
LLM-as-judge is a tool, not an answer. Skeptical of using another LLM to measure accuracy; prefers business-metric-tied evals, impact-weighted failure modes, and sampled traces over 100% coverage.
Evals as team DNA, not a launch checklist. "Evals should be constant within the development team. Eval start the moment the idea of the product starts."

Tooling stance (2026)¶

Uses arize / Phoenix but refuses their UI for writing evaluators; wrote "I'm not going to go for UI solution" and prefers custom-coded pipelines with API export. Specific pain points: slow trace export (>1k traces = batched, hours); no sampling strategies built in; 100k traces × 6 evaluators at turn level is cost-prohibitive — sampling is mandatory.

Sources¶

maggie-konstanty-evals-2026 — MLOps.community Podcast #372