Mahmoud Mabrouk — Judge the Judge: LLM Evaluators That Actually Work with GEPA¶
Source index. ~41-min talk/workshop. Raw at mahmoud-mabrouk-judge-the-judge-gepa-2026.
Thesis¶
Off-the-shelf LLM-as-a-judge is eval theater. The fix is calibrating the judge against human annotations using GEPA (genetic-Pareto prompt optimization).
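Concretely, "calibration" here means scoring the judge's verdicts against human labels on the same traces. A minimal sketch, assuming a hypothetical two-label annotation schema (compliant / non_compliant) matching the case study; the field names are invented:

```python
# Minimal sketch of what "calibrating a judge" measures: agreement between
# judge verdicts and human annotations on the same traces. Field names and
# label set are hypothetical, mirroring the airline case study.

def calibration_report(traces: list[dict]) -> dict:
    """Each trace: {"judge": label, "human": label}."""
    agree = sum(t["judge"] == t["human"] for t in traces)
    flagged = [t for t in traces if t["judge"] == "non_compliant"]
    true_flags = sum(t["human"] == "non_compliant" for t in flagged)
    return {
        "accuracy": agree / len(traces),
        # Precision on the non-compliant class -- the 0% -> 64% jump
        # reported under "Key results" below.
        "non_compliant_precision": true_flags / len(flagged) if flagged else 0.0,
    }

print(calibration_report([
    {"judge": "non_compliant", "human": "compliant"},  # false alarm
    {"judge": "compliant", "human": "compliant"},
]))  # {'accuracy': 0.5, 'non_compliant_precision': 0.0}
```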
Structure¶
- Problem: hallucination-judges and similar are confidently wrong; same model class = same blind spots
- Why we need calibrated judges: offline eval speed, online monitoring, data flywheel
- Case study: Tau-Bench airline customer-support agent
- Data curation + annotation
- GEPA algorithm deep-dive (see the loop sketch after this list)
- optimize_anything API walkthrough
- Live-notebook walkthrough + results
- Failure modes and practical lessons
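For orientation before the deep-dive: the loop below is a compressed, hypothetical sketch of GEPA's genetic-Pareto structure (reflective mutation of a candidate pool, per-example Pareto selection). `score` and `reflect_and_rewrite` are assumed callables, the sketch omits the real algorithm's minibatch gating and candidate merging, and it is not the gepa library's optimize_anything API, which the talk walks through directly.

```python
import random
from typing import Callable

def gepa_loop(
    seed_prompt: str,
    trainset: list,
    score: Callable[[str, object], int],             # 1 if prompt judges example correctly
    reflect_and_rewrite: Callable[[str, list], str], # LLM rewrites prompt from failing traces
    budget: int = 50,
) -> list[str]:
    """Compressed, simplified sketch of a genetic-Pareto optimization loop."""
    candidates = [seed_prompt]
    scores = {seed_prompt: [score(seed_prompt, ex) for ex in trainset]}
    for _ in range(budget):
        # Pareto selection: a candidate survives if it is best on at least one
        # training example, so niche strengths aren't averaged away by a
        # single global metric.
        best = [max(scores[c][i] for c in candidates) for i in range(len(trainset))]
        frontier = [c for c in candidates
                    if any(scores[c][i] == best[i] for i in range(len(trainset)))]
        parent = random.choice(frontier)
        failures = [ex for ex, s in zip(trainset, scores[parent]) if s == 0]
        if not failures:
            break  # parent is already perfect on train; nothing to reflect on
        # Reflective mutation: show the LLM a few failing traces and ask it
        # to rewrite the judge prompt to fix them.
        child = reflect_and_rewrite(parent, random.sample(failures, min(3, len(failures))))
        child_scores = [score(child, ex) for ex in trainset]
        if sum(child_scores) >= sum(scores[parent]):  # keep non-regressing children
            candidates.append(child)
            scores[child] = child_scores
    return candidates
```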
Concepts introduced¶
- llm-as-a-judge — the pattern + why the default fails (minimal call sketched after this list)
- llm-judge-calibration — the end-to-end workflow
- gepa — the optimization algorithm with Pareto-frontier selection
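The un-calibrated baseline pattern is just a single classification call. A minimal sketch, assuming the openai client and a placeholder model name (the talk's actual judge models were Gemini and Grok, per the lessons below); only the prompt shape matters here:

```python
# LLM-as-a-judge in its simplest form: one classification call per trace.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are auditing an airline customer-support agent.
Read the conversation and decide whether the agent followed policy.
Answer with exactly one word: compliant or non_compliant.

Conversation:
{trace}"""

def judge(trace: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; swap in your judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(trace=trace)}],
    )
    return resp.choices[0].message.content.strip().lower()
```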
Entities¶
- mahmoud-mabrouk — speaker
- agenta — his LLMOps platform, used to instrument the GEPA run
Key results¶
- Accuracy 69% → 74% (train), +14 points (validation)
- Non-compliant precision 0% → 64% (bias removal is the real win)
- Pareto-frontier accuracy hits 100% — but merging candidates into one unified prompt remains hard (see the sketch after this list)
- Cost: ~$200-300 per run, ~1 hour wall-clock
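The 100% frontier number is a union over the pool, not any single prompt's score: an example counts as covered if at least one candidate gets it right. A sketch, assuming a hypothetical per-candidate correctness matrix from the run:

```python
# Why "Pareto-frontier accuracy" can hit 100% while no single prompt does.
# correct[c][i] is True iff candidate prompt c judged example i correctly
# (hypothetical data layout).

def frontier_accuracy(correct: dict[str, list[bool]]) -> float:
    n = len(next(iter(correct.values())))
    # An example is covered if ANY candidate gets it right; the open problem
    # is merging those candidates into one prompt that covers everything.
    covered = sum(any(col[i] for col in correct.values()) for i in range(n))
    return covered / n

correct = {"p1": [True, False, True], "p2": [False, True, False]}
print(frontier_accuracy(correct))  # 1.0: every example covered by some candidate
```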
Practical lessons¶
- Custom reflection templates beat defaults — you have to encode domain priors (illustrative template after this list)
- Smaller/cheaper models fail — Gemini + Grok combo worked; nano/mini-class don't have the capacity for complex-policy reflection
- Don't leak the policy into the seed — creates local-minima trap
- Overfit first, then generalize — standard ML hygiene
- Instrument everything — look at reasoning, check candidates mid-run, don't just watch the final number
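As an illustration of the first lesson, a custom reflection template that bakes in domain priors might look like the sketch below. The priors and wording are invented for illustration, not the template used in the talk:

```python
# Illustrative only: a reflection template with hand-written domain priors.
REFLECTION_TEMPLATE = """You are improving a judge prompt for an airline
customer-support agent.

Current judge prompt:
{current_prompt}

Traces the judge got wrong (judge verdict vs. human annotation):
{failed_examples}

Domain priors to respect:
- A refusal is often the policy-correct answer; do not equate refusing
  with non-compliance.
- Ground every verdict in a specific policy clause, not general impressions.

Rewrite the judge prompt so these failures are fixed without breaking
cases it already handles. Return only the new prompt text."""
```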
Cross-ingest links¶
- verifiable-systems-for-agents — parallel problem in eric-zakariasson's software-factory: agents need to self-check, and judges-that-don't-work break that whole layer
- jagged-intelligence — explains why judges fail on the same dimensions the judged system fails
- dspy — the parent optimization framework
- data-flywheel — the final destination; calibrated judges are the gating dependency
- emergent-cursor-rules — human-intuition version of GEPA's emergent-improvement loop
Open questions¶
- Does GEPA scale to subjective metrics (tone, helpfulness) where there's no ground-truth policy to infer?
- Is there a cheaper substitute — can smaller distilled judges reach the same calibrated accuracy?
- How often should calibration be re-run? (Does the judge drift when the underlying agent changes?)
- What's the annotation-volume sweet spot? Demo used 480 training traces; could it work with 50?