
RL with Verifiable Rewards

A training paradigm for language models in which the reward signal comes from automatic, deterministic verification of outcomes — rather than from human preference labels or SFT imitation. Popularised by DeepSeek R1 and OpenAI o1; explained and demonstrated by stefano-fiorucci (AI Engineer 2026).

"The underlying idea is more general — it does where the outcome can be verified automatically, like a correct answer, a won game, a successful tool call can serve as a training signal. And this is fundamentally different from supervised fine-tuning."

How it works

  1. The model receives a question or task prompt.
  2. The model generates a reasoning trace (chain of thought) + final answer.
  3. The answer is checked against a known-correct outcome by a deterministic verifier (exact match, game-outcome logic, unit tests, etc.).
  4. The resulting scalar reward is fed into an RL algorithm (GRPO, CISPO, PPO) to update model weights.

No human annotator is in the loop for individual examples. The verifier is the label.
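
To make the loop concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than from the talk: `generate` stands in for whatever sampling call your model stack provides, and the `Answer:` convention is just one way to make the final answer extractable.

```python
import re

def verify_math_answer(completion: str, gold: str) -> float:
    """Deterministic verifier: extract the final answer and exact-match it
    against the known-correct outcome. Assumes the model was prompted to
    end its output with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(\S+)", completion)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1) == gold else 0.0

def collect_rewards(prompt: str, gold: str, generate, n_samples: int = 8):
    """Sample several completions for one prompt and score each one.
    The (completion, reward) pairs are what the RL update step consumes."""
    completions = [generate(prompt) for _ in range(n_samples)]
    return [(c, verify_math_answer(c, gold)) for c in completions]
```

A real pipeline batches this over many prompts and feeds the scalar rewards into the policy-gradient step; the key point is that the verifier, not a human annotator, supplies the label.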

Why it beats SFT alone

In supervised fine-tuning, the model learns by statistical imitation: completions stay close to the training distribution. This caps performance at human-demonstration quality, and curated reasoning traces are expensive to produce at scale.

In RL with verifiable rewards, the model explores trajectories from its own distribution and learns to favour those that score higher. It can discover reasoning strategies beyond any human example:

"The model is no longer limited by the quality of human examples. Through trial and error, it can discover more efficient reasoning strategies."

Canonical verifiable reward types

  • Correct final answer — math, coding, factual QA with known answers
  • Won game — tic-tac-toe, chess, competitive environments
  • Successful tool call — API call returns expected result
  • Format compliance — regex check on XML/JSON structure
  • Passing tests — code that passes a unit-test suite
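
As a rough illustration, each reward type above reduces to a small deterministic function of the model's output. The two sketches below (format compliance and passing tests) use illustrative conventions: the `<think>`/`<answer>` tag pattern popularised by R1-style training, and a naive subprocess runner that a real system would sandbox.

```python
import re
import subprocess
import tempfile

def format_reward(completion: str) -> float:
    """Format compliance: 1.0 if the output wraps its reasoning and answer
    in the expected tags, 0.0 otherwise (an R1-style convention)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def unit_test_reward(code: str, test_code: str) -> float:
    """Passing tests: run the generated code together with a test script and
    reward 1.0 only on a clean exit. Sandbox and time-limit this in practice."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=10)
    return 1.0 if result.returncode == 0 else 0.0
```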

Relation to DeepSeek R1 / o1

DeepSeek R1 avoided expensive SFT curation for reasoning by using GRPO + verifiable rewards. OpenAI o1 similarly used large-scale RL to make chain-of-thought reasoning effective. Both showed that test-time compute (longer thinking) compounds with RL train-time compute.
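
GRPO's central simplification, per the DeepSeek papers, is to drop the learned value function: for each prompt it samples a group of completions, verifies each one, and normalises the rewards within the group. A minimal sketch of that advantage computation (NumPy; the function name is mine):

```python
import numpy as np

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantage: score each completion against its siblings
    sampled from the same prompt, removing the need for a critic network."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. eight samples for one math prompt, verified 0/1 by exact answer match
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
# Correct completions receive positive advantage, incorrect ones negative;
# these advantages weight the policy-gradient update on each sampled token.
```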

Cross-references