GEPA¶
GEPA (Genetic-Pareto) is a prompt-optimization algorithm: genetic-algorithm-style search with a Pareto-frontier selection twist. Introduced in an academic paper; popularized as a technique by dspy; now available as a standalone open-source gepa library with an optimize_anything API.
The loop¶
Three steps per iteration, run until compute budget is exhausted:
- Sample candidates — from the current bag of prompts, generate new ones via:
- Prompt mutation — run the current judge, observe failures, reflect on them, emit an improved prompt. The reflection step uses LLM intelligence to infer what's wrong.
- Merge strategy — combine two parent prompts (e.g. pick guidelines from prompt A + prompt B).
- Evaluate — run each new candidate against minibatches (not the full set) of the eval data. Keep candidates that improve over their parent.
- Pareto-frontier filter — don't just pick highest-average candidates. For each test case, find the best candidate; keep a diverse set that collectively covers all test cases. This preserves diversity across generations.
The Pareto step is the key innovation: average-best selection collapses the pool onto a single local optimum; per-task-best selection preserves exploration paths.
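To make the sampling and selection mechanics concrete, here is a minimal Python sketch of the loop. All names (mutate, merge, evaluate_on) and knobs (the 0.3 merge probability, the minibatch size of 8) are illustrative assumptions, not the gepa library's internals.

import random

def pareto_frontier(pool, scores):
    # Keep every candidate that is best on at least one test case.
    # scores[c][i] is the score of candidate c on test case i.
    frontier = set()
    n_cases = len(next(iter(scores.values())))
    for i in range(n_cases):
        frontier.add(max(scores, key=lambda c: scores[c][i]))
    return [c for c in pool if c in frontier]

def gepa_loop(seed, eval_cases, budget, mutate, merge, evaluate_on):
    pool = [seed]
    scores = {seed: [evaluate_on(seed, case) for case in eval_cases]}
    for _ in range(budget):
        # 1. Sample: draw a parent from the Pareto frontier, then either
        #    mutate (reflect on failures) or merge two parents' guidelines.
        frontier = pareto_frontier(pool, scores)
        parent = random.choice(frontier)
        if len(frontier) > 1 and random.random() < 0.3:
            other = random.choice([c for c in frontier if c != parent])
            child = merge(parent, other)
        else:
            child = mutate(parent)
        # 2. Evaluate on a minibatch first; keep the child only if it
        #    beats its parent there, then score it on the full set.
        batch = random.sample(range(len(eval_cases)), min(8, len(eval_cases)))
        child_score = sum(evaluate_on(child, eval_cases[i]) for i in batch)
        parent_score = sum(scores[parent][i] for i in batch)
        if child not in scores and child_score > parent_score:
            pool.append(child)
            scores[child] = [evaluate_on(child, case) for case in eval_cases]
        # 3. The Pareto filter is applied implicitly at the next sampling step.
    return pareto_frontier(pool, scores)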
optimize_anything API¶
optimize_anything(
seed_candidate=..., # prompt, dict, chain — whatever config you're optimizing
evaluator=..., # runs system + logs diagnostics (output, reasoning, errors)
reflection_template=..., # LLM prompt used during the mutation step
...
)
Not limited to single prompts — can optimize temperatures, chains, any structured config.
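A hedged usage sketch against the signature above. The evaluator body, its (score, diagnostics) return shape, and the helper run_llm_judge are hypothetical stand-ins for the judge-calibration case; check the gepa docs for the real contract.

def judge_evaluator(candidate_prompt, example):
    # Run the judge with the candidate prompt; log rich diagnostics so the
    # reflection step has concrete failures to reason about.
    label, reasoning = run_llm_judge(candidate_prompt, example["input"])  # hypothetical helper
    score = float(label == example["ground_truth"])
    diagnostics = {
        "output": label,
        "reasoning": reasoning,
        "ground_truth": example["ground_truth"],
    }
    return score, diagnostics

result = optimize_anything(
    seed_candidate="You are a strict judge. Given a transcript, return PASS or FAIL.",
    evaluator=judge_evaluator,
    reflection_template=REFLECTION_TEMPLATE,  # domain-aware; see the example further down
)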
Practical notes (from Mabrouk's run)¶
- Default reflection template is usually not enough. You have to write a domain-aware one. In the customer-support case, generic reflection couldn't infer "learn the policy"; a custom template that explicitly names the judge verdict, the ground truth, and the policy-rule synthesis step fixed it (see the sketch after this list).
- Smaller models fail. GPT-4o (old), mini/nano, DeepSeek — all failed as judge or refiner for complex policy-adherence. Best results: Gemini for reflection, Grok for judge. GPT-5 mini for both was also workable.
- Overfit first, then generalize. The standard ML move: verify the optimizer can drive the training-set Pareto frontier to 100% before expecting validation lift.
- Cost is real. Mabrouk's small experiments: $200–$300 per run in tokens (long trajectories + expensive models for refinement).
- Tune slowly. 200–300 iterations per experiment; batch size and sample size matter; start small, visualize, then scale.
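To illustrate what domain-aware means here, a hypothetical reflection template for the customer-support case. The wording and placeholder names are invented; the substantive point, per the note above, is that it explicitly names the judge verdict and the ground truth and demands policy-rule synthesis, which the generic default does not.

# Hypothetical domain-aware reflection template; placeholders are invented.
REFLECTION_TEMPLATE = """\
You are improving a judge prompt for customer-support policy adherence.

For each failed case you are given the agent transcript, the judge's
verdict ({judge_verdict}), and the ground-truth label ({ground_truth}).

1. Explain why the verdict disagrees with the ground truth.
2. Infer the policy rule the judge is missing and state it as an
   explicit, checkable guideline.
3. Rewrite the judge prompt to include the synthesized rules.

Failed cases:
{failure_examples}

Return only the improved judge prompt.
"""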
Where it fits¶
- llm-judge-calibration — the primary use case shown
- dspy — the original popularizer
- emergent-cursor-rules — same underlying insight (improvement emerging from observed failures), but as the manual, human-driven version
Cross-references¶
- llm-as-a-judge — what you're optimizing
- llm-judge-calibration — the end-to-end workflow
- dspy — parent framework
- mahmoud-mabrouk — source