Query Index — Stefano Fiorucci · Let LLMs Wander: Engineering RL Environments¶
Source: https://youtu.be/71V3fTaUp2Q · Event: AI Engineer Conference · uploaded 2026-04-08 · ~40m23s · Speaker: stefano-fiorucci (Deepset / Haystack)
New entity¶
New concepts (6)¶
- rl-environment-engineering — building RL environments as distributable Python packages
- rl-with-verifiable-rewards — DeepSeek R1 / o1 paradigm: deterministic outcome checking as training signal (see the reward-check sketch after this list)
- llm-wandering — exploration vs exploitation in LLM RL training; temperature as exploration dial
- verifiers-library — Prime Intellect's open-source library for RL environment construction
- rl-curriculum-opponent-skill — parameterised opponent difficulty + stratified sampling for curriculum learning
- synthetic-sft-bootstrap — SFT warm-up phase using teacher-model rollouts before RL
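A minimal sketch of the rl-with-verifiable-rewards idea, assuming a tic-tac-toe board encoded as a 9-character string over "XO." (an illustrative encoding, not from the talk or the Verifiers API): the reward is a deterministic rules check, with no judge model involved.

```python
# Deterministic outcome check as a training signal (illustrative):
# the board is scored by rules, never by an LLM judge.

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winner(board: str) -> str | None:
    """board is a 9-char string over 'XO.', e.g. 'XOX.O..X.'."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def reward(board: str, agent_mark: str = "X") -> float:
    """Verifiable reward: +1 win, -1 loss, 0 draw or game still running."""
    w = winner(board)
    if w is None:
        return 0.0
    return 1.0 if w == agent_mark else -1.0
```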
Cross-links to existing wiki pages¶
Training & post-training¶
- faye-zhang-subagents-posttraining-2026 — complementary angle: post-training pipelines using sub-agents at Pinterest; both address the post-SFT improvement layer
- distill-to-small-task-model — Fiorucci's end result is a canonical instance: small LFM-2 trained to outperform GPT-4o mini (the teacher) on tic-tac-toe
- skill-distillation — the synthetic-SFT-bootstrap is a distillation step; RL then extends beyond it
- scaling-laws-plural — Fiorucci cites Ilya Sutskever's NeurIPS 2024 talk: the limits of pre-training scaling motivate the shift to RL training and test-time compute
Evaluation & verification¶
- eval-lifecycle-pre-to-production — the Verifiers evaluation run (rollout → reward → stats) is an RL-native eval loop; complements Konstanty's eval lifecycle framing
- verifiability-frontier — Fiorucci's verifiable rewards paradigm requires tasks where outcomes can be automatically checked; the frontier is the same boundary (correct answers, won games, passed tests)
- verifiable-systems-for-agents — Zakariasson's agent self-verification and Fiorucci's verifiable reward tasks are two faces of the same principle: deterministic correctness signals
Agents & harnesses¶
- harness-engineering — Lopopolo's harnesses operate at inference time, Fiorucci's environments at training time; both are the scaffolding around model action. Direct quote resonance: Fiorucci says environments include "data, harnesses, and scoring rules"
- agent-as-junior-engineer — Fiorucci's small-model-beats-large-model result is the task-specialist flip side of the junior-engineer framing
- agentic-engineering — environments as "natural gyms for LLM agents that can use tools, run code, and solve multi-step tasks"
Tools & frameworks¶
- llm-as-a-judge — contrasted: Verifiers uses deterministic reward functions, not LLM judges; avoids the calibration problems Mabrouk covers
- subagent-architecture — ToolEnv / MCPEnv in Verifiers are training-time counterparts to inference-time subagent patterns
Talk structure¶
- RL refresher — agent/env/state/action/reward/trajectory/rollout (trajectory sketch after this list)
- LLM training pipeline — pre-training → SFT → RL; o1 and DeepSeek R1 as case studies for RL with verifiable rewards
- Verifiers library — SingleTurnEnv, MultiTurnEnv, ToolEnv, MCPEnv, StatefulToolEnv; Environments Hub
- Tic-tac-toe experiment — LFM-2 (Liquid AI): SFT warm-up (200 synthetic GPT-4o-mini games) → CISPO RL training → outperforms teacher model against optimal opponent (SFT-bootstrap sketch after this list)
- Lessons learned — batch size, hidden opponent bias, model selection, temperature for exploration, patience with RL (opponent-curriculum sketch after this list)
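To ground the RL-refresher vocabulary, a schematic trajectory/rollout structure plus temperature-scaled sampling, the exploration dial from the lessons above; names and shapes here are illustrative assumptions, not the Verifiers data model.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Step:
    state: str     # observation shown to the model (e.g. rendered board)
    action: str    # the move the model produced
    reward: float  # per-step reward; often 0 until the episode ends

@dataclass
class Trajectory:
    """One rollout: the full state/action/reward sequence of an episode."""
    steps: list[Step] = field(default_factory=list)

    @property
    def ret(self) -> float:
        return sum(s.reward for s in self.steps)  # undiscounted return

def sample_action(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Temperature as the exploration dial: higher values flatten the
    distribution, so the model 'wanders' more during RL training."""
    scaled = {a: v / max(temperature, 1e-6) for a, v in logits.items()}
    z = max(scaled.values())  # subtract the max for numerical stability
    weights = {a: math.exp(v - z) for a, v in scaled.items()}
    total = sum(weights.values())
    actions = list(weights)
    return random.choices(actions, [weights[a] / total for a in actions])[0]
```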
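The rl-curriculum-opponent-skill idea, sketched for tic-tac-toe under the assumption that opponent strength is a sampled parameter rather than one fixed policy (the hidden-opponent-bias lesson); the heuristic and skill buckets are invented for illustration.

```python
import random

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]

def strong_move(board: list[str], mark: str) -> int:
    """Take a winning move if one exists, else block the opponent's win,
    else prefer the centre, else any empty cell."""
    empty = [i for i, c in enumerate(board) if c == "."]
    other = "O" if mark == "X" else "X"
    for who in (mark, other):  # complete own line first, then block
        for a, b, c in WIN_LINES:
            line = [board[a], board[b], board[c]]
            if line.count(who) == 2 and line.count(".") == 1:
                return (a, b, c)[line.index(".")]
    return 4 if board[4] == "." else random.choice(empty)

def opponent_move(board: list[str], mark: str, skill: float) -> int:
    """Parameterised difficulty: play the strong heuristic with
    probability `skill`, otherwise a uniformly random legal move."""
    empty = [i for i, c in enumerate(board) if c == "."]
    if random.random() < skill:
        return strong_move(board, mark)
    return random.choice(empty)

def sample_skill(buckets=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Stratified sampling over skill buckets so every training batch
    mixes easy and hard opponents (curriculum, and a guard against
    training on a single hidden opponent)."""
    return random.choice(buckets)
```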
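Finally, a hypothetical shape for synthetic-sft-bootstrap: roll out teacher games and flatten them into chat-format pairs for SFT before RL begins. teacher and play_game are stand-ins for GPT-4o-mini and the environment loop; the JSONL chat schema is a common convention, not confirmed by the talk.

```python
import json

def collect_sft_dataset(teacher, play_game, n_games: int = 200) -> list[dict]:
    """Roll out n_games teacher games; each trajectory is assumed to be
    a list of (state, action) pairs from the environment loop."""
    examples = []
    for _ in range(n_games):
        for state, action in play_game(teacher):
            examples.append({
                "messages": [
                    {"role": "user", "content": state},        # rendered board
                    {"role": "assistant", "content": action},  # teacher's move
                ]
            })
    return examples

def dump_jsonl(examples: list[dict], path: str) -> None:
    """Write one chat-format example per line, ready for an SFT trainer."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```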
Key quotes¶
"We did not just show the model how to play. We gave it a space to play and guided it through rewards."
"If you can define a clear reward signal, you can build an environment and train a small, specialized model to beat a large closed model on a specific task at a fraction of the cost."
"Start training and go for a walk." (RL is slow; resist premature hyperparameter tweaking)