RL Curriculum via Opponent Skill

A curriculum learning technique for RL environments in which the difficulty of the training task is controlled by parameterising an opponent's (or world's) skill level, rather than by constructing separate task tiers or datasets. Demonstrated by stefano-fiorucci in the tic-tac-toe training experiment.

"If the opponent is too perfect too early, the model might never see a win and fail to learn. We can do so by introducing a probability for the opponent to choose a random move instead of the optimal one."

The core mechanism

An optimal opponent (minimax algorithm) is augmented with a random-move probability parameter p ∈ [0, 1]:

  • p = 0 → fully optimal opponent (hardest; model rarely wins, sparse reward signal)
  • p = 1 → fully random opponent (easiest; model can win but learns no strategy against skilled play)
  • p ∈ (0.2, 0.7) → productive zone: enough wins to learn attacking play; enough defensive challenges to learn blocking

The training configuration exposes min_random_move_prob and max_random_move_prob, spanning the desired curriculum range across the dataset.
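In code, the mechanism can be sketched as below. This is a minimal illustration rather than the experiment's actual implementation: the board representation (a list of nine cells holding "X", "O", or None) and all helper names are assumptions.

```python
import random

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell is None]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, to_move, me):
    # Returns +1 if "me" can force a win from here, -1 if it loses, 0 for a draw.
    w = winner(board)
    if w is not None:
        return 1 if w == me else -1
    moves = legal_moves(board)
    if not moves:
        return 0
    other = "O" if to_move == "X" else "X"
    scores = []
    for m in moves:
        board[m] = to_move
        scores.append(minimax(board, other, me))
        board[m] = None
    return max(scores) if to_move == me else min(scores)

def opponent_move(board, player, p, rng):
    # With probability p, play a uniformly random legal move;
    # otherwise play a minimax-optimal move.
    moves = legal_moves(board)
    if rng.random() < p:
        return rng.choice(moves)
    other = "O" if player == "X" else "X"
    def score(m):
        board[m] = player
        s = minimax(board, other, player)
        board[m] = None
        return s
    return max(moves, key=score)
```

Each dataset example would then draw its own p from [min_random_move_prob, max_random_move_prob]; the next section covers how those draws are balanced within a batch.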

Stratified sampling for curriculum stability

In GRPO training, batch composition matters: if a small batch happens to contain only very-hard or very-easy opponents, the average reward fluctuates wildly, destabilising weight updates.

Fiorucci's solution: stratified sampling that forces every batch to contain a perfectly balanced mix of opponent difficulty across the configured range. This keeps the gradient signal consistent even at small batch sizes.

"This forces every batch to contain a perfectly balanced mix of opponent difficulty spanning the chosen range."

Noise reduction within rollouts

Beyond batch-level curriculum control, per-turn determinism is enforced using seeds:

  • An example seed is assigned to each dataset example (determines starting player, etc.)
  • A turn seed is derived from example seed + current board state
  • Guarantee: if two rollouts reach the same board position, the opponent always responds identically — reward differences reflect model behaviour, not environment randomness

This is critical for GRPO, which compares rollouts from the same starting point to compute advantages.
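A sketch of the seed derivation, feeding the resulting RNG into the opponent-move function from the sketch above. The hash-based construction is one plausible way to realise the guarantee, not necessarily the experiment's exact scheme.

```python
import hashlib
import random

def turn_rng(example_seed, board):
    # Derive a deterministic RNG from the example seed plus the current
    # board state, so identical positions always yield the identical reply.
    key = f"{example_seed}:{''.join(cell or '.' for cell in board)}"
    digest = hashlib.sha256(key.encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# Usage: the random-vs-optimal coin flip and any random move choice are
# now reproducible for a given (example_seed, board) pair, e.g.:
# move = opponent_move(board, "O", p, turn_rng(example_seed, board))
```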

Failure mode: collapsing to over-defensive play

In a later training run (higher temperature, tighter opponent skill range), Fiorucci observed:

"The model became overly defensive and failed to exploit errors when tested against random players."

This occurred because the opponent skill range had been narrowed to a single point: the opponent always played optimally (p = 0). The curriculum collapsed; the model optimised for surviving optimal play but lost the ability to exploit weaker opponents. The fix: always maintain some spread of difficulty.
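For illustration, the collapsed configuration versus one with spread, using the min_random_move_prob/max_random_move_prob parameters named earlier (the dict format and the healthy values, taken from the productive zone above, are assumptions):

```python
# Collapsed range that reproduced the failure: the opponent always plays
# optimally, so the model only ever learns to survive.
collapsed = {"min_random_move_prob": 0.0, "max_random_move_prob": 0.0}

# A range with spread, covering the productive zone described above:
# the model keeps seeing both strong and exploitable opponents.
spread = {"min_random_move_prob": 0.2, "max_random_move_prob": 0.7}
```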

Cross-references