Mahmoud Mabrouk¶
Co-founder and CEO of agenta (open-source LLMOps platform). 15+ years in machine learning; prior academic work on ML applied to computational biology and protein structure prediction. Current focus: sampling and auto-optimization workflows for agent reliability.
Signature message¶
Off-the-shelf LLM-as-a-judge prompts don't work. If you drop "rate whether this output is a hallucination" into a production observability pipeline, you'll get confident scores that don't correlate with reality: if the LLM could detect hallucinations that way, your app would have worked from day one.
The fix is calibration: optimize judge prompts against human annotations using algorithms like GEPA.
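The calibration loop above can be sketched as follows. This is a minimal illustration, not GEPA itself (GEPA evolves prompts reflectively; here the candidate judges are stubbed as plain functions, and the dataset, judge names, and labels are all hypothetical): score each candidate judge by its agreement with human annotations, then keep the best one.

```python
# Sketch of judge calibration against human labels. All data and judge
# functions here are hypothetical stand-ins; in practice each judge would
# be an LLM call with a different prompt template.

def agreement(judge, dataset):
    """Fraction of examples where the judge matches the human label."""
    return sum(judge(out) == label for out, label in dataset) / len(dataset)

# Human-annotated examples: (model_output, is_hallucination).
annotated = [
    ("The Eiffel Tower is in Berlin.", True),
    ("Water boils at 100 C at sea level.", False),
    ("The moon is made of cheese.", True),
    ("Python was created by Guido van Rossum.", False),
]

def naive_judge(output):
    # Off-the-shelf prompt: confidently flags nothing.
    return False

def tuned_judge(output):
    # Stand-in for a judge prompt tuned against the annotations above.
    return any(word in output for word in ("Berlin", "cheese"))

candidates = {"naive": naive_judge, "calibrated": tuned_judge}
best = max(candidates, key=lambda name: agreement(candidates[name], annotated))
print(best, agreement(candidates[best], annotated))  # → calibrated 1.0
```

The point of the sketch: the naive judge scores only 0.5 agreement despite always answering, which is exactly the failure mode the signature message describes, and only measuring against human annotations reveals it.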
Cross-references¶
- agenta — his company
- gepa — algorithm he demonstrates
- llm-as-a-judge — the problem class
- llm-judge-calibration — his proposed solution pattern