Query Index — Stefano Fiorucci · Let LLMs Wander: Engineering RL Environments

Source: https://youtu.be/71V3fTaUp2Q · Event: AI Engineer Conference · uploaded 2026-04-08 · ~40m23s · Speaker: stefano-fiorucci (Deepset / Haystack)

New entity: stefano-fiorucci

New concepts (6)

Training & post-training

  • faye-zhang-subagents-posttraining-2026 — complementary angle: post-training pipelines using sub-agents at Pinterest; both address the post-SFT improvement layer
  • distill-to-small-task-model — Fiorucci's end result is a canonical instance: small LFM-2 trained to outperform GPT-4o mini (the teacher) on tic-tac-toe
  • skill-distillation — the synthetic-SFT bootstrap is a distillation step; RL then extends beyond it (a minimal sketch follows this list)
  • scaling-laws-plural — Fiorucci cites Ilya Sutskever's NeurIPS 2024 talk: pre-training scaling limits motivate the shift to RL training and test-time compute
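
A minimal sketch of that bootstrap, assuming an OpenAI-compatible client and a chat-style SFT format; the system prompt, the board source (`sample_boards`), and the JSONL layout are illustrative assumptions, not the talk's exact setup.

```python
# Hedged sketch of the synthetic-SFT bootstrap: collect teacher
# (GPT-4o-mini) moves on tic-tac-toe boards, then write chat-format
# records for supervised fine-tuning of the small student model.
# Prompt wording and the board source are illustrative assumptions.
import json

from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint


def teacher_move(board: str) -> str:
    """Ask the teacher model for its next move on a rendered board."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You play tic-tac-toe as X. Reply with one cell index 0-8."},
            {"role": "user", "content": board},
        ],
    )
    return resp.choices[0].message.content.strip()


def to_sft_record(board: str, move: str) -> dict:
    """One SFT example: board state in, teacher move out."""
    return {"messages": [
        {"role": "user", "content": board},
        {"role": "assistant", "content": move},
    ]}


# Placeholder board set; the talk's warm-up used ~200 full synthetic games.
sample_boards = [" " * 9, "X   O    "]

with open("sft_warmup.jsonl", "w") as f:
    for board in sample_boards:
        f.write(json.dumps(to_sft_record(board, teacher_move(board))) + "\n")
```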

Evaluation & verification

  • eval-lifecycle-pre-to-production — the Verifiers evaluation run (rollout → reward → stats) is an RL-native eval loop; complements Konstanty's eval lifecycle framing (sketched after this list)
  • verifiability-frontier — Fiorucci's verifiable rewards paradigm requires tasks where outcomes can be automatically checked; the frontier is the same boundary (correct answers, won games, passed tests)
  • verifiable-systems-for-agents — Zakariasson's agent self-verification and Fiorucci's verifiable reward tasks are two faces of the same principle: deterministic correctness signals
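
A minimal sketch of that loop, assuming Verifiers exposes `load_environment` and `env.evaluate` roughly as below; the argument names and the environment id are illustrative, so check the library docs for exact signatures.

```python
# rollout -> reward -> stats: the RL-native eval loop. Entry points and
# argument names below are assumptions about the Verifiers API; the
# environment id is illustrative.
from openai import OpenAI

import verifiers as vf

client = OpenAI()  # any OpenAI-compatible endpoint

# Load an environment package (e.g. one published on the Environments Hub).
env = vf.load_environment("tic-tac-toe")

# Sample rollouts, score each with the environment's reward functions,
# and aggregate the per-rollout rewards into summary statistics.
results = env.evaluate(
    client=client,
    model="gpt-4o-mini",
    num_examples=50,         # prompts to roll out
    rollouts_per_example=4,  # repeats per prompt to estimate variance
)
print(results)
```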

Agents & harnesses

  • harness-engineering — Lopopolo's harnesses operate at inference time; Fiorucci's environments operate at training time — both are the scaffolding around model action; direct quote resonance: Fiorucci says environments include "data, harnesses, and scoring rules"
  • agent-as-junior-engineer — Fiorucci's small-model-beats-large-model result is the task-specialist flip side of the junior-engineer framing
  • agentic-engineering — environments as "natural gyms for LLM agents that can use tools, run code, and solve multi-step tasks"

Tools & frameworks

  • llm-as-a-judge — contrasted: Verifiers uses deterministic reward functions, not LLM judges; avoids the calibration problems Mabrouk covers (minimal example after this list)
  • subagent-architecture — ToolEnv / MCPEnv in Verifiers are training-time counterparts to inference-time subagent patterns
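
For contrast, a minimal sketch of what a deterministic reward function looks like; the flat signatures are a simplification of the richer rollout context a real Verifiers reward function receives.

```python
# Deterministic, verifiable rewards: score by programmatic check, not by
# an LLM judge, so results are cheap, reproducible, and free of judge
# calibration drift. Signatures are simplified for illustration.

def exact_match_reward(completion: str, answer: str) -> float:
    """1.0 if the model's final answer matches the reference, else 0.0."""
    return 1.0 if completion.strip().lower() == answer.strip().lower() else 0.0


def game_outcome_reward(winner: str) -> float:
    """Outcome reward for a game task: win 1.0, draw 0.5, loss 0.0."""
    return {"agent": 1.0, "draw": 0.5}.get(winner, 0.0)
```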

Talk structure

  1. RL refresher — agent/env/state/action/reward/trajectory/rollout (made concrete in the sketch after this list)
  2. LLM training pipeline — pre-training → SFT → RL; O1 and DeepSeek R1 as case studies for RL with verifiable rewards
  3. Verifiers library — SingleTurnEnv, MultiTurnEnv, ToolEnv, MCPEnv, StatefulToolEnv, Environments Hub
  4. Tic-tac-toe experiment — LFM-2 (Liquid AI): SFT warm-up (200 synthetic GPT-4o-mini games) → CISPO RL training → outperforms teacher model against optimal opponent
  5. Lessons learned — batch size, hidden opponent bias, model selection, temperature for exploration, patience with RL
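
To make items 1 and 4 concrete, a library-agnostic toy version of the tic-tac-toe loop; the talk's experiment used Verifiers' MultiTurnEnv and a stronger scripted opponent, so treat this as vocabulary rather than the actual setup.

```python
# Toy environment tying the RL vocabulary together: the agent plays X,
# a random opponent plays O, and rewards are verifiable game outcomes.
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


class TicTacToeEnv:
    def reset(self):
        self.board = [" "] * 9  # state: nine cells
        return self.board

    def _wins(self, mark):
        return any(all(self.board[i] == mark for i in line) for line in WINS)

    def step(self, action):
        """Apply the agent's action, then the opponent's move.

        Returns (state, reward, done); rewards are verifiable outcomes.
        """
        if self.board[action] != " ":
            return self.board, -1.0, True   # illegal move: penalize, end episode
        self.board[action] = "X"
        if self._wins("X"):
            return self.board, 1.0, True    # won game
        empty = [i for i, c in enumerate(self.board) if c == " "]
        if not empty:
            return self.board, 0.5, True    # draw
        self.board[random.choice(empty)] = "O"
        if self._wins("O"):
            return self.board, 0.0, True    # lost game
        return self.board, 0.0, False


def rollout(env, policy):
    """One trajectory: the (state, action, reward) sequence of an episode."""
    state, done, trajectory = env.reset(), False, []
    while not done:
        action = policy(state)              # the agent chooses from the state
        state, reward, done = env.step(action)
        trajectory.append((list(state), action, reward))
    return trajectory


# Usage: a random policy over empty cells, just to exercise the loop.
random_policy = lambda s: random.choice([i for i, c in enumerate(s) if c == " "])
print(rollout(TicTacToeEnv(), random_policy))
```

An RL trainer, like the talk's CISPO run, would score many such rollouts and push the policy toward higher-reward trajectories.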

Key quotes

"We did not just show the model how to play. We gave it a space to play and guided it through rewards."

"If you can define a clear reward signal, you can build an environment and train a small, specialized model to beat a large closed model on a specific task at a fraction of the cost."

"Start training and go for a walk." (RL is slow; resist premature hyperparameter tweaking)