GEPA¶
GEPA (Genetic-Pareto) is a prompt-optimization algorithm: genetic-algorithm-style search with a Pareto-frontier selection twist. Introduced in an academic paper; popularized as a technique by dspy; now available as a standalone open-source gepa library with an optimize_anything API.
The loop¶
Three steps per iteration, run until compute budget is exhausted:
- Sample candidates — from the current bag of prompts, generate new ones via:
- Prompt mutation — run the current judge, observe failures, reflect on them, emit an improved prompt. The reflection step uses LLM intelligence to infer what's wrong.
- Merge strategy — combine two parent prompts (e.g. pick guidelines from prompt A + prompt B).
- Evaluate — run each new candidate against minibatches (not the full set) of the eval data. Keep candidates that improve over their parent.
- Pareto-frontier filter — don't just pick highest-average candidates. For each test case, find the best candidate; keep a diverse set that collectively covers all test cases. This preserves diversity across generations.
The Pareto step is the key innovation: average-best selection collapses the pool onto a single local optimum; per-task-best selection preserves exploration paths.
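To make the sampling and selection mechanics concrete, here is a minimal Python sketch of the loop. All names (mutate, merge, evaluate_on) and knobs (the 0.3 merge probability, the minibatch size of 8) are illustrative assumptions, not the gepa library's internals.

import random

def pareto_frontier(pool, scores):
    # Keep every candidate that is best on at least one test case.
    # scores[c][i] is the score of candidate c on test case i.
    frontier = set()
    n_cases = len(next(iter(scores.values())))
    for i in range(n_cases):
        frontier.add(max(scores, key=lambda c: scores[c][i]))
    return [c for c in pool if c in frontier]

def gepa_loop(seed, eval_cases, budget, mutate, merge, evaluate_on):
    pool = [seed]
    scores = {seed: [evaluate_on(seed, case) for case in eval_cases]}
    for _ in range(budget):
        # 1. Sample: draw a parent from the Pareto frontier, then either
        #    mutate (reflect on failures) or merge two parents' guidelines.
        frontier = pareto_frontier(pool, scores)
        parent = random.choice(frontier)
        if len(frontier) > 1 and random.random() < 0.3:
            other = random.choice([c for c in frontier if c != parent])
            child = merge(parent, other)
        else:
            child = mutate(parent)
        # 2. Evaluate on a minibatch first; keep the child only if it
        #    beats its parent there, then score it on the full set.
        batch = random.sample(range(len(eval_cases)), min(8, len(eval_cases)))
        child_score = sum(evaluate_on(child, eval_cases[i]) for i in batch)
        parent_score = sum(scores[parent][i] for i in batch)
        if child not in scores and child_score > parent_score:
            pool.append(child)
            scores[child] = [evaluate_on(child, case) for case in eval_cases]
        # 3. The Pareto filter is applied implicitly at the next sampling step.
    return pareto_frontier(pool, scores)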
optimize_anything API¶
optimize_anything(
seed_candidate=..., # prompt, dict, chain — whatever config you're optimizing
evaluator=..., # runs system + logs diagnostics (output, reasoning, errors)
reflection_template=..., # LLM prompt used during the mutation step
...
)
Not limited to single prompts — can optimize temperatures, chains, any structured config.
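A hedged usage sketch against the signature above. The evaluator body, its (score, diagnostics) return shape, and the helper run_llm_judge are hypothetical stand-ins for the judge-calibration case; check the gepa docs for the real contract.

def judge_evaluator(candidate_prompt, example):
    # Run the judge with the candidate prompt; log rich diagnostics so the
    # reflection step has concrete failures to reason about.
    label, reasoning = run_llm_judge(candidate_prompt, example["input"])  # hypothetical helper
    score = float(label == example["ground_truth"])
    diagnostics = {
        "output": label,
        "reasoning": reasoning,
        "ground_truth": example["ground_truth"],
    }
    return score, diagnostics

result = optimize_anything(
    seed_candidate="You are a strict judge. Given a transcript, return PASS or FAIL.",
    evaluator=judge_evaluator,
    reflection_template=REFLECTION_TEMPLATE,  # domain-aware; see the example further down
)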
Practical notes (from Mabrouk's run)¶
- Default reflection template is usually not enough. You have to write a domain-aware one. In the customer-support case, generic reflection couldn't infer "learn the policy"; a custom template that explicitly names the judge verdict, the ground truth, and the policy-rule synthesis step fixed it (see the sketch after this list).
- Smaller models fail. GPT-4o (old), mini/nano, DeepSeek — all failed as judge or refiner for complex policy-adherence. Best results: Gemini for reflection, Grok for judge. GPT-5 mini for both was also workable.
- Overfit first, then generalize. The standard ML move: verify the optimizer can drive the training-set Pareto frontier to 100% before expecting validation lift.
- Cost is real. Mabrouk's small experiments: $200–$300 per run in tokens (long trajectories + expensive models for refinement).
- Tune slowly. 200–300 iterations per experiment; batch size and sample size matter; start small, visualize, then scale.
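To illustrate what domain-aware means here, a hypothetical reflection template for the customer-support case. The wording and placeholder names are invented; the substantive point, per the note above, is that it explicitly names the judge verdict and the ground truth and demands policy-rule synthesis, which the generic default does not.

# Hypothetical domain-aware reflection template; placeholders are invented.
REFLECTION_TEMPLATE = """\
You are improving a judge prompt for customer-support policy adherence.

For each failed case you are given the agent transcript, the judge's
verdict ({judge_verdict}), and the ground-truth label ({ground_truth}).

1. Explain why the verdict disagrees with the ground truth.
2. Infer the policy rule the judge is missing and state it as an
   explicit, checkable guideline.
3. Rewrite the judge prompt to include the synthesized rules.

Failed cases:
{failure_examples}

Return only the improved judge prompt.
"""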
Where it fits¶
- llm-judge-calibration — the primary use case shown
- dspy — the original popularizer
- emergent-cursor-rules — same underlying insight (improvement emerging from observed failures), but as the manual, human-driven version
Cross-references¶
- llm-as-a-judge — what you're optimizing
- llm-judge-calibration — the end-to-end workflow
- dspy — parent framework
- mahmoud-mabrouk — source