
LLM Wandering (Exploration vs Exploitation)

The deliberate encouragement of exploratory behaviour in LLMs during RL training — allowing the model to try new strategies rather than over-committing to learned patterns. The title of stefano-fiorucci's talk ("Let LLMs Wander") is a direct reference to this dynamic.

Classic RL tension: an agent must balance exploration (trying new actions to discover better strategies) with exploitation (using actions already known to work). Applied to language models in RL training, this manifests in several practical decisions.
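As a toy illustration of the trade-off (not from the talk), the classic epsilon-greedy rule makes the balance concrete: explore with probability epsilon, otherwise exploit the best-known action.

```python
import random

def epsilon_greedy(q_values: dict[str, float], epsilon: float = 0.1) -> str:
    """Pick a random action with probability epsilon (explore),
    otherwise pick the highest-value action (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: try something new
    return max(q_values, key=q_values.get)     # exploit: use what already works

# With epsilon = 0.1 the agent mostly plays "b" but still samples "a" and "c".
print(epsilon_greedy({"a": 0.2, "b": 0.9, "c": 0.5}, epsilon=0.1))
```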

Practical dimensions of LLM wandering

1. Temperature as exploration dial

Fiorucci ran a second training phase after the model plateaued, deliberately raising temperature to force exploration:

"We want the model to experiment with new approaches. And temperature is the right parameter to tweak. But this is a bit risky. If the temperature is too high, the model can start generating gibberish."

The effect was observable in training curves: a significant initial reward drop (the model trying random strategies) followed by recovery and improvement to new highs.
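A rough sketch of why temperature works as an exploration dial: temperature-scaled softmax over a toy set of logits. The values below are illustrative, not the talk's settings. Higher temperature flattens the distribution, pushing probability mass onto continuations the model would otherwise ignore; push it too far and sampling approaches uniform noise.

```python
import numpy as np

def sample_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Divide logits by temperature before softmax; higher T flattens the distribution.
    scaled = logits / temperature
    scaled -= scaled.max()          # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])
for t in (0.7, 1.0, 1.5):           # illustrative temperatures
    print(t, np.round(sample_probs(logits, t), 3))
```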

2. Opponent skill as exploration scaffold

Training only against optimal opponents collapsed the model into over-defensive behaviour. Training only against random opponents produced no signal. The solution: a curriculum (see rl-curriculum-opponent-skill) that keeps the model in a productive exploration zone.
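A minimal sketch of how such an adjustable-skill opponent could be wired, assuming a turn-based game with a known optimal move per state. The names `opponent_move` and `skill_for_step` are hypothetical, not the talk's code.

```python
import random

def opponent_move(legal_moves, optimal_move, skill: float):
    """Play the optimal move with probability `skill`, otherwise a random legal move."""
    if random.random() < skill:
        return optimal_move            # strong play: forces the model to defend
    return random.choice(legal_moves)  # weak play: lets the model find winning lines

def skill_for_step(step: int, warmup_steps: int = 500, max_skill: float = 0.9) -> float:
    """Linear ramp from fully random to near-optimal play over the warmup period."""
    return min(max_skill, step / warmup_steps * max_skill)

# Mid-warmup, the opponent mixes optimal and random moves.
print(opponent_move(["a1", "b2", "c3"], optimal_move="b2", skill=skill_for_step(250)))
```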

3. Reasoning traces enable exploration

Asking the model to produce <think> chain-of-thought traces before answers makes the exploration visible and reinforceable. The thinking trace is not just an inference-time aid — it is a training-time exploration scaffold.
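One way to make the trace reinforceable is a format check inside the reward function. The sketch below assumes a `<think>...</think>` convention and illustrative reward values; it is not the talk's exact setup.

```python
import re

# Completion must open with a <think>...</think> trace before the final answer.
THINK_RE = re.compile(r"^\s*<think>(.+?)</think>\s*(.+)$", re.DOTALL)

def format_reward(completion: str) -> float:
    match = THINK_RE.match(completion)
    if match is None:
        return 0.0                     # no visible reasoning trace: nothing to reinforce
    trace, answer = match.groups()
    return 0.1 if trace.strip() and answer.strip() else 0.0

print(format_reward("<think>Centre is open, take it.</think> My move: 5"))
```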

4. Watching for memorisation vs strategy

A hidden failure mode: the model stops exploring and memorises a specific opponent's pattern. Fiorucci discovered this when a biased minimax algorithm (always picking the first optimal move) led to great benchmark results but a clueless model:

"I basically was training my model against a specific type of optimal player. Over many games, the model simply memorized it."

The Karpathy framing

Fiorucci quotes Andrej Karpathy to motivate the whole paradigm:

"They give the LLM an opportunity to actually interact, take actions, see outcomes. This means you can hope to do a lot better than statistical expert imitation."

Wandering — genuine trial-and-error — is what separates RL from SFT. SFT is imitation; RL is experience.

Practical advice

  • Let it run. RL is slow. Fiorucci: "start training and go for a walk." Premature stopping kills slow-bloom improvements.
  • Raise temperature carefully after a plateau — too high causes format collapse.
  • Inspect rollouts, not just metrics. Reward curves can look stable while the model memorises a brittle policy; see the sketch after this list.
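A minimal sketch of what inspecting rollouts can look like in practice (helper and field names are hypothetical): periodically print a few raw completions with their rewards, so degenerate or memorised behaviour is visible by eye even when the aggregate reward looks healthy.

```python
import json
import random

def dump_rollouts(step: int, rollouts: list[dict], every: int = 50, sample: int = 3):
    """Every `every` steps, print a small random sample of raw rollouts for manual review."""
    if step % every != 0:
        return
    for r in random.sample(rollouts, min(sample, len(rollouts))):
        print(json.dumps({"step": step,
                          "prompt": r["prompt"][:80],
                          "completion": r["completion"][:200],
                          "reward": r["reward"]}, ensure_ascii=False))
```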

Cross-references