LLM Judge Calibration¶
End-to-end workflow for turning a naive llm-as-a-judge into one that actually correlates with human quality judgment. Mabrouk's recipe from the Tau-Bench airline-agent demo.
Recipe¶
- Design the metric. What are you actually scoring? In the demo: policy adherence for an airline customer-support agent (compliant vs non-compliant).
- Curate data + annotations. Hundreds of human-labeled traces split into training (480 traces in demo) + validation (112). Avoid cross-contamination. Demo used AI-generated annotations as a stand-in — weaker signal, but works for illustration.
- Choose a seed judge prompt. Simple is fine: "Evaluate whether this customer service agent violated policy. Assume compliant unless evidence otherwise." (See the first sketch after this list.)
- Write a domain-aware reflection template. This is the critical step: the default GEPA template couldn't infer that it should "learn the policy". You need to tell the reflection LLM explicitly to compare the judge's verdict against ground truth, extract policy rules, restructure the existing rules, and reward clarity (second sketch after this list).
- Run GEPA. Call optimize_anything with an evaluator that logs trajectories + annotations (third sketch after this list). ~200-300 iterations. Cost: $200-300. Takes ~1 hour.
- Validate on held-out set (fourth sketch). Demo lifted accuracy 69% → 74% on train, ~14 points on validation; crucially, removed the compliance bias (non-compliant precision 0% → 64%).
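A minimal sketch of the seed judge from step 3. The prompt text is the one quoted above; the OpenAI client, model choice, one-word output instruction, and verdict parsing are illustrative assumptions, not part of the demo.

```python
from openai import OpenAI

client = OpenAI()

# Seed prompt from step 3, plus an assumed one-word output instruction for parsing.
SEED_JUDGE_PROMPT = (
    "Evaluate whether this customer service agent violated policy. "
    "Assume compliant unless evidence otherwise. "
    "Answer with exactly one word: COMPLIANT or NON-COMPLIANT."
)

def judge(trace: str, prompt: str = SEED_JUDGE_PROMPT) -> str:
    """Return a COMPLIANT / NON-COMPLIANT verdict for one agent trace."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # see pitfalls below: cheap models fail as judges
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": trace},
        ],
    )
    text = resp.choices[0].message.content.strip().upper()
    return "NON-COMPLIANT" if "NON-COMPLIANT" in text else "COMPLIANT"
```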
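The domain-aware reflection template from step 4 might look like the following. The numbered instructions mirror the recipe (compare verdict vs ground truth, extract policy rules, restructure, reward clarity); the exact phrasing and placeholder names are assumptions.

```python
# Domain-aware reflection template (step 4). Placeholders: {current_prompt} is
# the judge prompt being refined; {examples} is a batch of cases, each with the
# trace, the judge's verdict, and the human ground-truth annotation.
REFLECTION_TEMPLATE = """You are improving an LLM judge that scores policy
compliance for an airline customer-support agent.

Current judge prompt:
{current_prompt}

Labeled cases (trace, judge verdict, ground-truth annotation):
{examples}

For each case where the judge verdict disagrees with the ground truth:
1. Infer the policy rule the judge missed or misapplied.
2. Add that rule to the prompt, stated explicitly.
3. Restructure the existing rules so the full set stays consistent.
4. Prefer short, unambiguous phrasing over verbose rules.

Return the complete improved judge prompt and nothing else."""
```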
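And a hedged sketch of the GEPA run from step 5. Only the optimize_anything name comes from the demo; the import path, keyword names, and return value shown here are assumptions, so check the gepa docs for the real interface. The evaluator is ours: it scores a candidate prompt against the annotations and logs the trajectory so the reflection LLM can see it.

```python
# ASSUMPTION: illustrative import path and signature, not the real gepa API.
from gepa import optimize_anything

# train_traces / val_traces: lists of {"trace": str, "label": str} dicts,
# split 480 / 112 per step 2 (loading omitted).

def evaluate(candidate_prompt: str, example: dict) -> dict:
    """Score one candidate prompt on one labeled trace, logging what the
    reflection LLM needs: the trajectory and the human annotation."""
    verdict = judge(example["trace"], prompt=candidate_prompt)
    return {
        "score": float(verdict == example["label"]),  # 1.0 iff judge agrees
        "trajectory": example["trace"],
        "annotation": example["label"],
    }

optimized_prompt = optimize_anything(
    seed=SEED_JUDGE_PROMPT,
    trainset=train_traces,
    evaluator=evaluate,
    reflection_prompt=REFLECTION_TEMPLATE,
    max_iterations=250,  # demo: ~200-300 iterations, ~$200-300, ~1 hour
)
```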
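Step 6's validation can be plain scikit-learn: accuracy for the headline number, plus precision on the non-compliant class, which is the statistic that exposes compliance bias. `optimized_prompt` and `val_traces` are carried over from the sketches above.

```python
# Held-out validation (step 6). Accuracy is the headline number; precision on
# the NON-COMPLIANT class is what exposed the bias (0% -> 64% in the demo).
from sklearn.metrics import accuracy_score, precision_score

def validate(prompt: str, valset: list[dict]) -> None:
    preds = [judge(ex["trace"], prompt=prompt) for ex in valset]
    gold = [ex["label"] for ex in valset]
    acc = accuracy_score(gold, preds)
    prec = precision_score(
        gold, preds, pos_label="NON-COMPLIANT", zero_division=0
    )
    print(f"accuracy={acc:.2%}  non-compliant precision={prec:.2%}")

validate(optimized_prompt, val_traces)  # 112 held-out traces in the demo
```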
Pitfalls¶
- Cheap models fail as judge or refiner on anything non-trivial. Budget for expensive models in the loop.
- Don't pre-load the agent's policy into the judge prompt. Mabrouk found that seed prompts with the policy baked in trapped the optimizer in a local minimum — the optimizer couldn't diverge from them. Seeds without the policy (but with annotations that describe it) explored more freely.
- A Pareto frontier hitting 100% is not success. It means you have a set of candidates that collectively cover all tasks — but merging them into one prompt that solves everything is still hard.
- Trajectory length dominates cost. Long agent traces mean lots of input tokens, which makes refinement expensive. Consider truncation / summarization before judge calls in production (sketch below).
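A crude way to do that truncation: keep the head and tail of the trace and elide the middle. Character-based here for simplicity; a token-based cut or an LLM summary of the middle is the obvious upgrade.

```python
def truncate_trace(trace: str, max_chars: int = 8000) -> str:
    """Keep the start and end of a long trace, eliding the middle."""
    if len(trace) <= max_chars:
        return trace
    half = max_chars // 2
    omitted = len(trace) - 2 * half
    return trace[:half] + f"\n[... {omitted} chars truncated ...]\n" + trace[-half:]
```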
Outputs¶
- A calibrated rubric — the optimized judge prompt now encodes policy-like rules (flight cancellation handling, flight-modification communication protocol, etc.) it inferred from annotations.
- Less-biased judge — the pre-optimization judge labeled nearly everything "compliant" (98% compliance bias); post-optimization, non-compliant precision reached 64% — real discrimination.
Why this matters¶
Without calibration, online evals and offline evals are both theater. Judges generate plausible scores that correlate with nothing. Teams "move fast but don't go anywhere." Calibration converts the eval loop from noise to signal, which is what unlocks the real flywheel: observation → new eval → re-optimize → improve.
Cross-references¶
- llm-as-a-judge — the naive form this fixes
- gepa — the optimization algorithm
- verifiable-systems-for-agents — related need in coding-agent context
- data-flywheel — the ultimate goal (concept gap; could add page)