GEPA

A prompt-optimization algorithm — genetic-algorithm-style search with a Pareto-frontier selection twist. Introduced in an academic paper; popularized as a technique by dspy; now available as a standalone open-source gepa library with an optimize_anything API.

The loop

Three steps per iteration, run until the compute budget is exhausted:

  1. Sample candidates — from the current pool of prompts, generate new ones via one of two moves:
       • Prompt mutation — run the current candidate, collect its failures (with the judge's verdicts), reflect on them, and emit an improved prompt. The reflection step uses LLM intelligence to infer what went wrong.
       • Merge — combine two parent prompts (e.g. take some guidelines from prompt A and some from prompt B).
  2. Evaluate — run each new candidate against batches (not the full set) of the eval data. Keep those that improve over the seed.
  3. Pareto-frontier filter — don't just keep the candidates with the highest average score. For each test case, find the best candidate on that case; keep the diverse set that collectively covers all test cases. This preserves diversity across generations.

The Pareto step is the key innovation: average-best selection collapses the population onto a single local optimum; per-task-best selection preserves exploration paths.
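The per-task-best rule is small enough to sketch directly. A minimal illustration (hypothetical score matrix; not the gepa library's actual implementation):

```python
def pareto_frontier(scores):
    """Keep every candidate that is best on at least one test case.

    scores: dict mapping candidate name -> list of per-test-case scores.
    Returns the set of candidates that collectively cover all cases.
    """
    n_cases = len(next(iter(scores.values())))
    keep = set()
    for case in range(n_cases):
        best = max(scores[c][case] for c in scores)
        # every candidate tied for best on this case survives
        keep |= {c for c in scores if scores[c][case] == best}
    return keep

# A wins case 0, B wins case 1, and they tie on case 2.
# C has the highest *average* (~0.53) but is best nowhere, so it's dropped.
scores = {
    "A": [0.9, 0.1, 0.5],
    "B": [0.1, 0.9, 0.5],
    "C": [0.6, 0.6, 0.4],
}
survivors = pareto_frontier(scores)  # {"A", "B"}
```

Average-best selection would keep only C; the Pareto filter keeps A and B, whose distinct strengths remain available for later mutation and merging.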

optimize_anything API

optimize_anything(
  seed_candidate=...,   # prompt, dict, chain — whatever config you're optimizing
  evaluator=...,        # runs system + logs diagnostics (output, reasoning, errors)
  reflection_template=..., # LLM prompt used during the mutation step
  ...
)

Not limited to single prompts — can optimize temperatures, chains, any structured config.
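The evaluator plugs in as a plain callable that runs the system and logs what the reflection step will later read. A toy sketch of that shape (run_system, the dict fields, and the refund example are all hypothetical, not the real gepa contract):

```python
def run_system(prompt, text):
    """Stand-in for the system under optimization (hypothetical)."""
    return "REFUND" if "refund" in text.lower() else "OTHER"

def evaluator(candidate, example):
    """Score one eval example and record the diagnostics that the
    reflection/mutation step will read when proposing a new prompt."""
    output = run_system(candidate["prompt"], example["input"])
    return {
        "score": 1.0 if output == example["expected"] else 0.0,
        "output": output,
        "error": None,  # capture exceptions/tracebacks here in a real run
    }

candidate = {"prompt": "Classify the support ticket."}
result = evaluator(candidate, {"input": "I want a refund", "expected": "REFUND"})
```

The point of returning diagnostics rather than a bare score is that the mutation step reflects on *why* a case failed, not just that it did.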

Practical notes (from Mabrouk's run)

  • Default reflection template is usually not enough. You have to write a domain-aware one. In the customer-support case, generic reflection couldn't infer "learn the policy"; a custom template that explicitly names the judge verdict, ground truth, and policy-rule synthesis fixed it.
  • Smaller models fail. GPT-4o (older), the mini/nano tiers, DeepSeek — all failed as judge or refiner on complex policy-adherence tasks. Best results: Gemini for reflection, Grok for the judge. GPT-5 mini for both was also workable.
  • Overfit first, then generalize. The ML-standard move: verify the algorithm can drive the Pareto frontier to 100% on the training set before expecting any validation lift.
  • Cost is real. Mabrouk's small experiments: $200–$300 per run in tokens (long trajectories + expensive models for refinement).
  • Tune slowly. 200–300 iterations per experiment; batch size and sample size matter; start small, visualize, then scale.
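A domain-aware reflection template of the kind the first note describes might look like this — explicitly surfacing the judge verdict, ground truth, and a policy-rule-synthesis instruction. The wording and field names are illustrative, not Mabrouk's actual template:

```python
# Hypothetical domain-aware reflection template for the mutation step.
# {current_prompt} and {failures} are filled in by the optimization loop.
REFLECTION_TEMPLATE = """You are improving a customer-support prompt.

Current prompt:
{current_prompt}

Failed cases (input, model output, judge verdict, ground truth):
{failures}

For each failure: (1) identify which support-policy rule the output
violated, (2) state that rule explicitly, (3) rewrite the prompt so the
rule is spelled out and the failure mode is addressed.
Return only the new prompt."""

filled = REFLECTION_TEMPLATE.format(
    current_prompt="Answer the customer's question.",
    failures="- input: ... / output: ... / verdict: FAIL / truth: ...",
)
```

Compared with a generic "reflect on the failures" instruction, naming the verdict and ground truth fields gives the refiner model enough signal to synthesize the underlying policy instead of patching surface symptoms.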

Where it fits

Cross-references