Chapter 06
Post-Training
The report's post-training pipeline is not just "alignment happened here." It separates supervised tuning, teacher-driven reasoning distillation (used to transfer reflection, verification, and multi-step solution behavior), rule-based and model-based rewards, group-relative RL, and self-improvement through filtered generations.
1. Supervised tuning
The report uses SFT to shape behavior before any reward-driven optimization begins
The SFT stage mixes reasoning-heavy and non-reasoning data. The teaching point is that the report is not only trying to make the model think more. It is trying to make the model answer in a way that remains useful, legible, and controlled.
Used to keep style, format, usefulness, and non-chain-of-thought behavior strong.
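The mixing described above can be sketched as weighted sampling from separate data pools. The pool names, mixture weights, and example records below are illustrative assumptions, not ratios from the report:

```python
import random

# Hypothetical mixture weights -- the report does not publish exact ratios.
MIX = {"reasoning": 0.6, "non_reasoning": 0.4}

def sample_sft_batch(pools, batch_size, weights, rng):
    """Draw an SFT batch mixing reasoning-heavy and non-reasoning examples.

    pools: dict mapping pool name -> list of {"prompt", "target"} records.
    weights: dict mapping pool name -> sampling probability.
    """
    names = list(weights)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        # Pick a pool according to the mixture, then a record within it.
        pool = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(pools[pool]))
    return batch

pools = {
    "reasoning": [{"prompt": "Solve 12*7", "target": "<think>...</think> 84"}],
    "non_reasoning": [{"prompt": "Summarize this email", "target": "Short summary."}],
}
batch = sample_sft_batch(pools, 8, MIX, random.Random(0))
```

Each sampled record would then be trained with the ordinary next-token cross-entropy loss; the point of the mixture is only that reward-free tuning sees both behaviors before RL begins.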
2. Reward models
The report splits reward design by whether correctness is directly verifiable
For tasks with exact answers or executable validation, the report prefers rule-based reward signals. For open-ended tasks, it uses model-based reward estimation. That split matters because reward design is treated as an engineering choice, not an ideological preference.
Math answers, code execution, or exact-match outcomes.
Use rule-based reward because it is grounded and harder to exploit.
Fall back to a learned reward model for open-ended quality judgment.
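The split above amounts to a dispatch on task type. This is a minimal sketch, assuming an exact-match rule for verifiable tasks and treating the learned reward model as an opaque scoring callable; the task names and function signatures are illustrative:

```python
import re

def rule_based_reward(prediction, reference):
    """Grounded reward for verifiable tasks: exact match after normalization.

    Real pipelines would also run code against tests or check math numerically;
    exact match is the simplest instance of a rule-based signal.
    """
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return 1.0 if norm(prediction) == norm(reference) else 0.0

def model_based_reward(prediction, prompt, reward_model):
    """Learned reward for open-ended quality; reward_model is a stand-in scorer."""
    return reward_model(prompt, prediction)

def reward(task, prediction, *, reference=None, prompt=None, reward_model=None):
    """Prefer grounded rules where correctness is checkable; fall back to a model."""
    if task in {"math", "code", "exact_match"}:
        return rule_based_reward(prediction, reference)
    return model_based_reward(prediction, prompt, reward_model)

r1 = reward("math", " 84 ", reference="84")
r2 = reward("chat", "hello", prompt="greet", reward_model=lambda p, c: 0.5)
```

The design choice is that rule-based rewards are harder to exploit because the model cannot flatter its way past an exact check, so the learned model is reserved for tasks where no such check exists.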
3. Group-relative RL
The report avoids a heavyweight critic by comparing sampled outputs within a group
Group-relative RL computes advantages from the relative ranking of outputs for the same prompt. The mental model is: sample several responses, score them, then push probability mass toward the better ones without fitting a full critic network of the same scale.
Advantage is computed from how a sample performs against its peer responses for the same prompt.
This reduces training overhead while keeping the update tied to directly comparable generations.
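The group-relative advantage can be written in a few lines: score each sample, then normalize against the statistics of its own group. This is a sketch of the advantage computation only (GRPO-style mean/std normalization); the surrounding policy-gradient update is omitted:

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Advantages from within-group ranking: (r - group mean) / group std.

    rewards: scores for several responses sampled from the SAME prompt.
    No critic network is fit; the group itself supplies the baseline.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples above the group mean get positive advantage and gain probability mass; samples below it get negative advantage, all without training a value function of the policy's scale.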
4. Self-rewarding and refinement
The report treats model-generated data as both a risk and a useful loop
After RL-style improvement, the report uses model generations as another source of training data, filtered through reward and rejection criteria. It also studies using the model itself as a generative reward model and discusses self-rewarding behavior explicitly.
Create multiple answers with the current improved policy.
Use rules, reward models, or voting to keep only high-signal examples.
Feed the kept examples back into later supervised or preference-driven stages.
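The three steps above form a generate-filter-keep loop. A minimal sketch, assuming a callable policy, a scalar scoring function standing in for the rules/reward-model/voting filter, and a best-of-k selection rule (all illustrative choices, not the report's exact procedure):

```python
def refine_dataset(prompts, policy, score_fn, threshold, k=4):
    """Rejection-sampling refinement: keep only high-signal generations.

    policy:   callable prompt -> response (the current improved policy).
    score_fn: callable (prompt, response) -> float (rule, RM, or vote score).
    Keeps the best of k candidates per prompt, and only if it clears threshold.
    """
    kept = []
    for prompt in prompts:
        candidates = [policy(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: score_fn(prompt, c))
        if score_fn(prompt, best) >= threshold:
            kept.append({"prompt": prompt, "response": best})
    return kept

# Stub policy and scorer to show the filtering behavior.
stub_policy = lambda p: f"answer to {p}"
stub_score = lambda p, c: 1.0 if p == "easy" else 0.1
data = refine_dataset(["easy", "hard"], stub_policy, stub_score, threshold=0.5)
```

The kept examples then re-enter later supervised or preference stages, which is why the filter matters: it is the only thing standing between the loop and training on the model's own mistakes.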