Mahmoud Mabrouk — Judge the Judge: LLM Evaluators That Actually Work with GEPA

Source index. ~41-min talk/workshop. Raw at mahmoud-mabrouk-judge-the-judge-gepa-2026.

Thesis

Off-the-shelf LLM-as-a-judge is eval theater. The fix is calibrating the judge against human annotations using GEPA (Genetic-Pareto prompt optimization).
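What "judge" and "calibration" mean operationally, as a minimal sketch. The prompt template, `call_llm` stub, and policy text are all hypothetical, not the talk's artifacts:

```python
# A hypothetical LLM judge plus a calibration check against human labels.
# call_llm is a stand-in for any real chat-completion client.

JUDGE_PROMPT = """You are auditing an airline customer-support agent.
Given the policy and a conversation trace, answer with exactly one word:
"compliant" or "non-compliant".

Policy:
{policy}

Trace:
{trace}
"""

def call_llm(prompt: str) -> str:
    # Stub so the example runs; swap in a real model call.
    return "compliant"

def judge(policy: str, trace: str) -> str:
    return call_llm(JUDGE_PROMPT.format(policy=policy, trace=trace)).strip().lower()

def agreement(policy, traces, human_labels):
    # Calibration signal: fraction of traces where judge and annotator agree.
    preds = [judge(policy, t) for t in traces]
    return sum(p == h for p, h in zip(preds, human_labels)) / len(human_labels)

print(agreement("No refunds on basic economy.",
                ["trace A", "trace B"],
                ["compliant", "non-compliant"]))  # 0.5 with this always-compliant stub
```

The always-compliant stub scoring 50% is exactly the failure mode calibration targets: agreement against human annotations is the only number that exposes it.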

Structure

  1. Problem: hallucination-judges and similar are confidently wrong; same model class = same blind spots
  2. Why we need calibrated judges: offline eval speed, online monitoring, data flywheel
  3. Case study: Tau-Bench airline customer-support agent
  4. Data curation + annotation
  5. GEPA algorithm deep-dive
  6. optimize_anything API walkthrough
  7. Live-notebook walkthrough + results
  8. Failure modes and practical lessons
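Items 5-6 are the technical core. A toy version of the Genetic-Pareto selection loop, with stub `score` and `reflect` functions in place of real LLM calls; this illustrates the mechanism only and is not the gepa library's API:

```python
import random

# Toy Genetic-Pareto loop: keep every candidate prompt that is best on at
# least one training example, mutate a sampled survivor via "reflection",
# repeat. Real GEPA scores with an LLM judge eval and mutates with an LLM
# that reads failure feedback; both are stubbed here.

random.seed(0)

def score(prompt: str, example: int) -> float:
    # Toy scorer: even examples reward longer prompts, odd ones reward shorter.
    return len(prompt) if example % 2 == 0 else -len(prompt)

def reflect(prompt: str) -> str:
    # Stub mutation; real GEPA has an LLM rewrite the prompt from failures.
    return prompt + " Be strict about policy violations."

def pareto_frontier(pool, examples):
    scores = {p: [score(p, e) for e in examples] for p in pool}
    best = [max(s[i] for s in scores.values()) for i in range(len(examples))]
    # A candidate survives if it is best on at least one training example.
    return [p for p in pool if any(scores[p][i] == best[i] for i in range(len(examples)))]

examples = [0, 1, 2, 3]
pool = ["Judge the trace."]
for _ in range(3):
    parent = random.choice(pool)
    pool = pareto_frontier(list(dict.fromkeys(pool + [reflect(parent)])), examples)

print(pool)
```

The per-example frontier is why the pool preserves specialists instead of collapsing to one average prompt, and also why merging frontier candidates into a single unified prompt is a separate, harder step.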

Concepts introduced

  • verifiable-systems-for-agents — parallel problem in eric-zakariasson's software-factory: agents need to self-check, and judges that don't work break that whole layer
  • jagged-intelligence — explains why judges fail on the same dimensions the judged system fails on
  • dspy — the parent optimization framework
  • data-flywheel — the final destination; calibrated judges are the gating dependency
  • emergent-cursor-rules — the human-intuition analogue of GEPA's emergent-improvement loop

Entities

Key results

  • Accuracy 69% → 74% (train), +14 points (validation)
  • Non-compliant precision 0% → 64% (bias removal is the real win)
  • Pareto-frontier accuracy hits 100% — but merging candidates into one unified prompt remains hard
  • Cost: ~$200-300 per run, ~1 hour wall-clock

Practical lessons

  • Custom reflection templates beat defaults — you have to encode domain priors
  • Smaller/cheaper models fail — a Gemini + Grok combo worked; nano/mini-class models lack the capacity for complex-policy reflection
  • Don't leak the policy into the seed prompt — it creates a local-minima trap
  • Overfit first, then generalize — standard ML hygiene
  • Instrument everything — read the reasoning, inspect candidates mid-run, don't just watch the final number
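The precision number in Key results is worth unpacking: a compliance-biased judge can never earn a true positive on the non-compliant class, so its precision there is 0% even while overall accuracy looks passable. A toy computation with made-up labels, not the talk's data:

```python
# Why "non-compliant precision 0% -> 64%" is the headline result:
# an uncalibrated judge with a compliance bias catches no violations.

def precision(preds, golds, cls):
    tp = sum(p == g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls != g for p, g in zip(preds, golds))
    return tp / (tp + fp) if tp + fp else 0.0  # 0 when cls is never predicted

golds      = ["c", "c", "c", "c", "n", "n"]   # human annotations
biased     = ["c", "c", "c", "c", "c", "c"]   # judge that always says compliant
calibrated = ["c", "c", "c", "n", "n", "n"]   # post-optimization judge

print(precision(biased, golds, "n"))      # 0.0: ~67% accurate yet catches nothing
print(precision(calibrated, golds, "n"))  # 2/3: one false flag, both violations caught
```

This is why the write-up calls bias removal, not the accuracy delta, the real win.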

Open questions

  • Does GEPA scale to subjective metrics (tone, helpfulness) where there's no ground-truth policy to infer?
  • Is there a cheaper substitute — can smaller distilled judges reach the same calibrated accuracy?
  • How often should calibration be re-run? (Judge drift when the underlying agent changes?)
  • What's the annotation-volume sweet spot? Demo used 480 training traces; could it work with 50?
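The annotation-volume question could be probed empirically with a data-budget sweep: subsample the annotated traces at several sizes and re-run the optimization at each budget. A hedged sketch, with a stub `optimize_judge` standing in for a full GEPA run:

```python
import random

# Hypothetical experiment design, not from the talk: measure how optimized-
# judge quality scales with the number of human-annotated training traces.

random.seed(0)

def optimize_judge(train_subset):
    # Stub: returns a fake validation accuracy that saturates with data.
    # Replace with a real GEPA run + held-out evaluation.
    n = len(train_subset)
    return round(0.60 + 0.20 * n / (n + 100), 3)

traces = list(range(480))  # the demo's 480 annotated traces
for budget in (50, 120, 240, 480):
    subset = random.sample(traces, budget)
    print(budget, optimize_judge(subset))
```

Where the curve flattens is the annotation sweet spot; repeating the sweep after the underlying agent changes would also put a number on the re-calibration cadence question.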