Synthetic SFT Bootstrap (Before RL)

A two-phase training strategy in which a small model first undergoes a short supervised fine-tuning (SFT) warm-up — using synthetically generated data from a larger model — to learn format and basic task validity, before transitioning to RL training for deeper capability development.

Demonstrated by stefano-fiorucci in the tic-tac-toe training experiment (AI Engineer 2026).

"We can use supervised fine-tuning for a warm-up phase, where we teach the model the format and valid moves syntax. We can then use reinforcement learning to build deeper capabilities."

Why the bootstrap is needed

A small model starting RL training with no task-specific tuning may generate:

  - Wrong output format (doesn't follow the expected XML/JSON tags, no <think> block)
  - Invalid actions (illegal moves, malformed tool calls)
  - Highly noisy rollouts that dilute the reward signal

Invalid and malformed completions yield no useful learning signal for RL. The SFT phase establishes a floor on format compliance, ensuring RL training gets clean reward signal from the start.
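
To make this concrete, here is a minimal sketch of the kind of format-and-validity gate an RL reward function can apply. The function name, tag layout, and `legal_moves` argument are hypothetical illustrations, not the code from the talk; the point is that completions failing the gate earn zero reward, which is why an untuned small model produces mostly wasted rollouts.

```python
import re

def format_and_validity_reward(completion: str, legal_moves: set[str]) -> float:
    """Return 0.0 for malformed or illegal completions, 1.0 otherwise.

    Hypothetical sketch: a real environment's reward would also score the
    game outcome; this only illustrates the format/validity gate.
    """
    # Require a <think> block followed by a <move> tag, per the expected format.
    match = re.search(r"<think>.*?</think>\s*<move>(.*?)</move>", completion, re.DOTALL)
    if match is None:
        return 0.0  # wrong format: nothing for RL to learn from
    move = match.group(1).strip()
    if move not in legal_moves:
        return 0.0  # syntactically fine, but an illegal move
    return 1.0
```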

How to generate the synthetic data

  1. Take a stronger model (Fiorucci used GPT-4o mini) that already follows format correctly.
  2. Run it through the RL environment to generate rollouts.
  3. Filter out losing games, keeping only winning or drawing trajectories, so that the teacher's suboptimal strategies are not baked into the student.
  4. A small corpus suffices: Fiorucci generated only 200 examples.
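
A minimal sketch of that pipeline, assuming a chat-style teacher client and an environment object with reset/step-style hooks (these names are hypothetical, not the actual API used in the experiment):

```python
import json

def generate_sft_corpus(env, teacher, n_games: int = 200, out_path: str = "sft_warmup.jsonl") -> str:
    """Roll the stronger teacher model through the environment and keep only
    trajectories ending in a win or a draw (hypothetical API names)."""
    kept = 0
    with open(out_path, "w") as f:
        while kept < n_games:
            messages, done = env.reset(), False
            while not done:
                reply = teacher.complete(messages)         # e.g. GPT-4o mini
                messages, done = env.step(messages, reply)
            if env.result in ("win", "draw"):              # drop losing games
                f.write(json.dumps({"messages": messages}) + "\n")
                kept += 1
    return out_path
```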

"Once you have a good environment, generating data requires a single command."

The environment itself does the heavy lifting — the synthetic data generation is a byproduct of having a working evaluator.

SFT tooling

Fiorucci used Prime RL for the SFT run. Training completed in minutes on a 96 GB GPU (smaller GPUs are also viable). The outcome: near-perfect format compliance, a reduction in invalid moves, and a modest improvement in game performance, but strategic depth was still missing. That is what the RL phase delivers next.
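
Fiorucci's run used Prime RL, but any SFT stack handles a warm-up this small. As an illustration only (not his configuration), a comparable run with Hugging Face TRL might look like the sketch below, assuming the filtered corpus was saved as chat-formatted JSONL by the generation step above; the model name and hyperparameters are placeholders.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# "messages"-style JSONL produced by the generation step (assumed filename).
dataset = load_dataset("json", data_files="sft_warmup.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder small model, not the one from the talk
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-warmup", num_train_epochs=3),
)
trainer.train()
```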

Broader pattern

This is an instance of the more general distill-to-small-task-model pattern: use a large model to bootstrap synthetic data, fine-tune a small model on it, then train the small model further until it outperforms the teacher. Fiorucci's final result: his small RL-trained model outperformed GPT-4o mini (the teacher) against an optimal opponent.

Cross-references