Mahmoud Mabrouk¶
Co-founder and CEO of agenta (open-source LLMOps platform). 15+ years in machine learning; prior academic work on ML applied to computational biology and protein structure prediction. Current focus: sampling and auto-optimization workflows for agent reliability.
Signature message¶
Off-the-shelf LLM-as-a-judge prompts don't work. If you drop "rate whether this output is a hallucination" into a production observability pipeline, you'll get confident scores that don't correlate with reality: if the LLM could detect hallucinations that way, your app would have worked from day one.
The fix is calibration: optimize judge prompts against human annotations using algorithms like GEPA.
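The calibration loop above can be sketched as follows. This is a minimal illustration, not GEPA itself (GEPA evolves prompts reflectively; here the candidate judges are stubbed as plain functions, and the dataset, judge names, and labels are all hypothetical): score each candidate judge by its agreement with human annotations, then keep the best one.

```python
# Sketch of judge calibration against human labels. All data and judge
# functions here are hypothetical stand-ins; in practice each judge would
# be an LLM call with a different prompt template.

def agreement(judge, dataset):
    """Fraction of examples where the judge matches the human label."""
    return sum(judge(out) == label for out, label in dataset) / len(dataset)

# Human-annotated examples: (model_output, is_hallucination).
annotated = [
    ("The Eiffel Tower is in Berlin.", True),
    ("Water boils at 100 C at sea level.", False),
    ("The moon is made of cheese.", True),
    ("Python was created by Guido van Rossum.", False),
]

def naive_judge(output):
    # Off-the-shelf prompt: confidently flags nothing.
    return False

def tuned_judge(output):
    # Stand-in for a judge prompt tuned against the annotations above.
    return any(word in output for word in ("Berlin", "cheese"))

candidates = {"naive": naive_judge, "calibrated": tuned_judge}
best = max(candidates, key=lambda name: agreement(candidates[name], annotated))
print(best, agreement(candidates[best], annotated))  # → calibrated 1.0
```

The point of the sketch: the naive judge scores only 0.5 agreement despite always answering, which is exactly the failure mode the signature message describes, and only measuring against human annotations reveals it.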
Cross-references¶
- agenta — his company
- gepa — algorithm he demonstrates
- llm-as-a-judge — the problem class
- llm-judge-calibration — his proposed solution pattern