RL Curriculum via Opponent Skill¶
A curriculum learning technique for RL environments in which the difficulty of the training task is controlled by parameterising an opponent's (or world's) skill level — rather than constructing separate task tiers or datasets. Demonstrated by stefano-fiorucci in the tic-tac-toe training experiment.
"If the opponent is too perfect too early, the model might never see a win and fail to learn. We can do so by introducing a probability for the opponent to choose a random move instead of the optimal one."
The core mechanism¶
An optimal opponent (minimax algorithm) is augmented with a random-move probability parameter p ∈ [0, 1]:
- p = 0 → fully optimal opponent (hardest; model rarely wins, sparse reward signal)
- p = 1 → fully random opponent (easiest; model can win but learns no strategy against skilled play)
- p ∈ (0.2, 0.7) → productive zone: enough wins to learn attack; enough defence challenges to learn blocking
The training configuration exposes min_random_move_prob and max_random_move_prob, spanning the desired curriculum range across the dataset.
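The sketch below illustrates the mechanism for tic-tac-toe. The function names (`opponent_move`, `sample_skill`, `minimax`) and the board encoding are illustrative assumptions, not the experiment's actual code; only the two config keys above come from the source.

```python
import random

# Board: tuple of 9 cells, each 'X', 'O', or None.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell is None]

def minimax(board, to_play, me):
    """Return (score from `me`'s perspective, best move) for the player to move."""
    w = winner(board)
    if w is not None:
        return (1, None) if w == me else (-1, None)
    moves = legal_moves(board)
    if not moves:
        return 0, None
    scored = []
    for m in moves:
        nxt = list(board)
        nxt[m] = to_play
        score, _ = minimax(tuple(nxt), 'O' if to_play == 'X' else 'X', me)
        scored.append((score, m))
    return max(scored) if to_play == me else min(scored)

def sample_skill(rng, min_random_move_prob, max_random_move_prob):
    """Draw a per-example random-move probability p from the configured curriculum range."""
    return rng.uniform(min_random_move_prob, max_random_move_prob)

def opponent_move(board, mark, p, rng):
    """Optimal (minimax) move with probability 1 - p, uniformly random legal move with probability p."""
    if rng.random() < p:
        return rng.choice(legal_moves(board))
    return minimax(board, mark, mark)[1]
```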
Stratified sampling for curriculum stability¶
In GRPO training, batch composition matters: if a small batch happens to contain only very-hard or very-easy opponents, the average reward fluctuates wildly, destabilising weight updates.
Fiorucci's solution: stratified sampling that forces every batch to contain a perfectly balanced mix of opponent difficulty across the configured range. This keeps the gradient signal consistent regardless of batch size.
"This forces every batch to contain a perfectly balanced mix of opponent difficulty spanning the chosen range."
Noise reduction within rollouts¶
Beyond batch-level curriculum control, per-turn determinism is enforced using seeds:
- An example seed is assigned to each dataset example (determines starting player, etc.)
- A turn seed is derived from example seed + current board state
- Guarantee: if two rollouts reach the same board position, the opponent always responds identically — reward differences reflect model behaviour, not environment randomness
This is critical for GRPO, which compares rollouts from the same starting point to compute advantages.
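A sketch of the seeding scheme, reusing `opponent_move` from the snippet above; the hash construction and board serialisation here are assumptions:

```python
import hashlib
import random

def turn_rng(example_seed, board):
    """Derive a deterministic RNG from the example seed plus the current board state,
    so the same position always produces the same opponent behaviour."""
    key = f"{example_seed}|{''.join(cell or '.' for cell in board)}"
    digest = hashlib.sha256(key.encode()).hexdigest()
    return random.Random(int(digest, 16))

def deterministic_opponent_move(board, mark, p, example_seed):
    # Whether the opponent plays randomly, and which random move it picks,
    # depend only on (example_seed, board), not on the rollout history.
    return opponent_move(board, mark, p, turn_rng(example_seed, board))
```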
Failure mode: collapsing to over-defensive play¶
In a later training run (higher temperature, tighter opponent skill range), Fiorucci observed:
"The model became overly defensive and failed to exploit errors when tested against random players."
This occurred when the opponent skill range was narrowed to fully optimal play (100% optimal). The curriculum collapsed: the model optimised for surviving optimal play but lost the ability to exploit weaker opponents. The fix is to always maintain some spread of difficulty in the opponent range.
Cross-references¶
- rl-environment-engineering — the environment design within which this curriculum operates
- llm-wandering — exploration dynamics that the curriculum scaffolds
- synthetic-sft-bootstrap — SFT warm-up precedes curriculum RL training
- rl-with-verifiable-rewards — the training algorithm the curriculum feeds signal into