Chapter 06

Post-Training

The report's post-training pipeline is not just "alignment happened here." It separates supervised tuning, teacher-driven reasoning distillation, rule-based and model-based rewards, group-relative RL, and self-improvement through filtered generations.

1. Supervised tuning

The report uses SFT to shape behavior before any reward-driven optimization begins

The SFT stage mixes reasoning-heavy and non-reasoning data. The teaching point is that the report is not only trying to make the model think more. It is trying to make the model answer in a way that remains useful, legible, and controlled.

Reasoning data

Used to transfer reflection, verification, and multi-step solution behavior.

Non-reasoning data

Used to keep style, format, usefulness, and non-chain-of-thought behavior strong.
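The mixing of the two data sources can be sketched as a simple batch sampler. This is a minimal illustration, not the report's actual pipeline: the pool names, the `reasoning_fraction` knob, and the fixed batch size are all assumptions for the sake of the example.

```python
import random

def mix_sft_batch(reasoning_pool, non_reasoning_pool,
                  reasoning_fraction=0.5, batch_size=8, seed=0):
    # Draw a fixed fraction of each SFT batch from the reasoning pool
    # (reflection / multi-step solutions) and the rest from the
    # non-reasoning pool (style, format, general usefulness).
    rng = random.Random(seed)
    k = round(batch_size * reasoning_fraction)
    batch = rng.sample(reasoning_pool, k) + \
            rng.sample(non_reasoning_pool, batch_size - k)
    rng.shuffle(batch)  # avoid ordering the two data types
    return batch
```

Holding the fraction fixed per batch (rather than sampling it) keeps the gradient signal for both behaviors present in every update.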

2. Reward models

The report splits reward design by whether correctness is directly verifiable

For tasks with exact answers or executable validation, the paper prefers rule-based reward signals. For open-ended tasks, it uses model-based reward estimation. That split matters because reward design is treated as an engineering choice, not an ideological preference.

Verifiable task?

Math answers, code execution, or exact-match outcomes.

->
Yes

Use rule-based reward because it is grounded and harder to exploit.

->
No

Fall back to a learned reward model for open-ended quality judgment.
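The branch above amounts to a reward router. The sketch below is illustrative only: the task schema (`answer`, `tests` fields), the `normalize` helper, and the `reward_model` callable are hypothetical stand-ins, not the report's interfaces.

```python
def normalize(ans: str) -> str:
    # Crude canonicalization for exact-match grading (illustrative only).
    return ans.strip().lower().rstrip(".")

def compute_reward(task: dict, response: str, reward_model=None) -> float:
    # Verifiable by exact answer (e.g. math): rule-based, hard to exploit.
    if task.get("answer") is not None:
        return 1.0 if normalize(response) == normalize(task["answer"]) else 0.0
    # Verifiable by execution (e.g. code): run the checks directly.
    if task.get("tests") is not None:
        return 1.0 if all(t(response) for t in task["tests"]) else 0.0
    # Open-ended: fall back to a learned scalar reward model.
    return reward_model(task["prompt"], response)
```

The point of the split is that the grounded branches are preferred whenever they exist; the learned model is the fallback, not the default.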

3. Group-relative RL

The report avoids a heavyweight critic by comparing sampled outputs within a group

Group-relative RL computes advantages from the relative ranking of outputs for the same prompt. The mental model is: sample several responses, score them, then push probability mass toward the better ones without fitting a full critic network of the same scale.

Group advantage intuition

Advantage is computed from how a sample performs against its peer responses for the same prompt.

Why it matters

This reduces training overhead while keeping the update tied to directly comparable generations.
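The group-relative advantage can be written in a few lines: standardize each response's reward against the other responses sampled for the same prompt. This is a sketch of the core statistic under that intuition, not a full RL update; the epsilon term is an assumed numerical guard.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    # Advantage of each sampled response relative to its own group:
    # subtract the group mean and divide by the group std deviation.
    # No learned critic is needed; the group itself is the baseline.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses scoring above the group mean get positive advantage (probability mass pushed toward them); below-mean responses get negative advantage.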

4. Self-rewarding and refinement

The report treats model-generated data as both a risk and a useful loop

After RL-style improvement, the report uses model generations as another source of training data, filtered through reward and rejection criteria. It also studies using the model itself as a generative reward model and discusses self-rewarding behavior explicitly.

Generate candidates

Create multiple answers with the current improved policy.

->
Score and filter

Use rules, reward models, or voting to keep only high-signal examples.

->
Refine again

Feed the kept examples back into later supervised or preference-driven stages.
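The three steps above can be sketched as one rejection-sampling round. The `generate` and `score` callables and the fixed `threshold` are assumptions for illustration; the report's actual filtering criteria (rules, reward models, voting) would plug in behind `score`.

```python
def self_improvement_round(prompts, generate, score, threshold=0.8, k=4):
    # One round of filtered self-generation: sample k candidates per
    # prompt, keep only the best candidate if it clears the reward
    # threshold, and return the kept pairs for later SFT/preference data.
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best_score, best = max((score(prompt, c), c) for c in candidates)
        if best_score >= threshold:
            kept.append({"prompt": prompt, "response": best})
    return kept
```

The threshold is what makes the loop safe to iterate: low-signal generations never re-enter training, so the model is only ever tuned on its own filtered best behavior.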