Mahmoud Mabrouk — Judge the Judge: LLM Evaluators That Actually Work with GEPA

Source index. ~41-min talk/workshop. Raw at mahmoud-mabrouk-judge-the-judge-gepa-2026.

Thesis

Off-the-shelf LLM-as-a-judge is eval theater. The fix is calibrating the judge against human annotations using GEPA (Genetic-Pareto prompt optimization).
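What "judge" and "calibration" mean operationally, as a minimal sketch. The prompt template, `call_llm` stub, and policy text are all hypothetical, not the talk's artifacts:

```python
# A hypothetical LLM judge plus a calibration check against human labels.
# call_llm is a stand-in for any real chat-completion client.

JUDGE_PROMPT = """You are auditing an airline customer-support agent.
Given the policy and a conversation trace, answer with exactly one word:
"compliant" or "non-compliant".

Policy:
{policy}

Trace:
{trace}
"""

def call_llm(prompt: str) -> str:
    # Stub so the example runs; swap in a real model call.
    return "compliant"

def judge(policy: str, trace: str) -> str:
    return call_llm(JUDGE_PROMPT.format(policy=policy, trace=trace)).strip().lower()

def agreement(policy, traces, human_labels):
    # Calibration signal: fraction of traces where judge and annotator agree.
    preds = [judge(policy, t) for t in traces]
    return sum(p == h for p, h in zip(preds, human_labels)) / len(human_labels)

print(agreement("No refunds on basic economy.",
                ["trace A", "trace B"],
                ["compliant", "non-compliant"]))  # 0.5 with this always-compliant stub
```

The always-compliant stub scoring 50% is exactly the failure mode calibration targets: agreement against human annotations is the only number that exposes it.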

Structure

  1. Problem: hallucination-judges and similar are confidently wrong; same model class = same blind spots
  2. Why we need calibrated judges: offline eval speed, online monitoring, data flywheel
  3. Case study: Tau-Bench airline customer-support agent
  4. Data curation + annotation
  5. GEPA algorithm deep-dive
  6. optimize_anything API walkthrough
  7. Live-notebook walkthrough + results
  8. Failure modes and practical lessons
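Items 5-6 are the technical core. A toy version of the Genetic-Pareto selection loop, with stub `score` and `reflect` functions in place of real LLM calls; this illustrates the mechanism only and is not the gepa library's API:

```python
import random

# Toy Genetic-Pareto loop: keep every candidate prompt that is best on at
# least one training example, mutate a sampled survivor via "reflection",
# repeat. Real GEPA scores with an LLM judge eval and mutates with an LLM
# that reads failure feedback; both are stubbed here.

random.seed(0)

def score(prompt: str, example: int) -> float:
    # Toy scorer: even examples reward longer prompts, odd ones reward shorter.
    return len(prompt) if example % 2 == 0 else -len(prompt)

def reflect(prompt: str) -> str:
    # Stub mutation; real GEPA has an LLM rewrite the prompt from failures.
    return prompt + " Be strict about policy violations."

def pareto_frontier(pool, examples):
    scores = {p: [score(p, e) for e in examples] for p in pool}
    best = [max(s[i] for s in scores.values()) for i in range(len(examples))]
    # A candidate survives if it is best on at least one training example.
    return [p for p in pool if any(scores[p][i] == best[i] for i in range(len(examples)))]

examples = [0, 1, 2, 3]
pool = ["Judge the trace."]
for _ in range(3):
    parent = random.choice(pool)
    pool = pareto_frontier(list(dict.fromkeys(pool + [reflect(parent)])), examples)

print(pool)
```

The per-example frontier is why the pool preserves specialists instead of collapsing to one average prompt, and also why merging frontier candidates into a single unified prompt is a separate, harder step.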

Concepts introduced

  • verifiable-systems-for-agents — parallel problem in eric-zakariasson's software-factory: agents need to self-check, and judges that don't work break that whole layer
  • jagged-intelligence — explains why judges fail on the same dimensions the judged system fails on
  • dspy — the parent optimization framework
  • data-flywheel — the final destination; calibrated judges are the gating dependency
  • emergent-cursor-rules — the human-intuition analogue of GEPA's emergent-improvement loop

Entities

Key results

  • Accuracy 69% → 74% (train), +14 points (validation)
  • Non-compliant precision 0% → 64% (bias removal is the real win)
  • Pareto-frontier accuracy hits 100% — but merging candidates into one unified prompt remains hard
  • Cost: ~$200-300 per run, ~1 hour wall-clock

Practical lessons

  • Custom reflection templates beat defaults — you have to encode domain priors
  • Smaller/cheaper models fail — a Gemini + Grok combo worked; nano/mini-class models lack the capacity for complex-policy reflection
  • Don't leak the policy into the seed prompt — it creates a local-minima trap
  • Overfit first, then generalize — standard ML hygiene
  • Instrument everything — read the reasoning, inspect candidates mid-run, don't just watch the final number
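The precision number in Key results is worth unpacking: a compliance-biased judge can never earn a true positive on the non-compliant class, so its precision there is 0% even while overall accuracy looks passable. A toy computation with made-up labels, not the talk's data:

```python
# Why "non-compliant precision 0% -> 64%" is the headline result:
# an uncalibrated judge with a compliance bias catches no violations.

def precision(preds, golds, cls):
    tp = sum(p == g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls != g for p, g in zip(preds, golds))
    return tp / (tp + fp) if tp + fp else 0.0  # 0 when cls is never predicted

golds      = ["c", "c", "c", "c", "n", "n"]   # human annotations
biased     = ["c", "c", "c", "c", "c", "c"]   # judge that always says compliant
calibrated = ["c", "c", "c", "n", "n", "n"]   # post-optimization judge

print(precision(biased, golds, "n"))      # 0.0: ~67% accurate yet catches nothing
print(precision(calibrated, golds, "n"))  # 2/3: one false flag, both violations caught
```

This is why the write-up calls bias removal, not the accuracy delta, the real win.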

Open questions

  • Does GEPA scale to subjective metrics (tone, helpfulness) where there's no ground-truth policy to infer?
  • Is there a cheaper substitute — can smaller distilled judges reach the same calibrated accuracy?
  • How often should calibration be re-run? (Judge drift when the underlying agent changes?)
  • What's the annotation-volume sweet spot? Demo used 480 training traces; could it work with 50?
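The annotation-volume question could be probed empirically with a data-budget sweep: subsample the annotated traces at several sizes and re-run the optimization at each budget. A hedged sketch, with a stub `optimize_judge` standing in for a full GEPA run:

```python
import random

# Hypothetical experiment design, not from the talk: measure how optimized-
# judge quality scales with the number of human-annotated training traces.

random.seed(0)

def optimize_judge(train_subset):
    # Stub: returns a fake validation accuracy that saturates with data.
    # Replace with a real GEPA run + held-out evaluation.
    n = len(train_subset)
    return round(0.60 + 0.20 * n / (n + 100), 3)

traces = list(range(480))  # the demo's 480 annotated traces
for budget in (50, 120, 240, 480):
    subset = random.sample(traces, budget)
    print(budget, optimize_judge(subset))
```

Where the curve flattens is the annotation sweet spot; repeating the sweep after the underlying agent changes would also put a number on the re-calibration cadence question.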