Chapter 04
This chapter translates the paper's pre-training section into concrete questions: what data was constructed, which hyper-parameters matter, how context length was extended, and how ablations are used to argue for architecture and routing decisions.
1. Data construction
The paper emphasizes a high-quality, diverse token mixture with strong coverage of math, code, multilingual data, and long documents. The learning point is that model architecture does not stand alone: the data mixture determines what specialization can emerge at all.
- Improve token usefulness before scaling token count.
- Use sequence space efficiently rather than wasting context on padding and fragmentation.
- Shape downstream reasoning and code performance by changing what the model repeatedly sees.
- Lay out training data so that later long-context extension is feasible by design, not by accident.
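As a toy illustration of the mixture point above, here is a minimal sketch of weighted domain sampling. The domain names and proportions are hypothetical, chosen only to show the mechanism; they are not the paper's actual mixture.

```python
import random

# Hypothetical domain weights for a pre-training mixture. These numbers are
# illustrative only; the report's real proportions are not reproduced here.
MIXTURE = {
    "web": 0.50,
    "code": 0.20,
    "math": 0.15,
    "multilingual": 0.10,
    "long_documents": 0.05,
}

def sample_domain(rng: random.Random) -> str:
    """Pick a source domain in proportion to the mixture weights."""
    r = rng.random()
    cumulative = 0.0
    for domain, weight in MIXTURE.items():
        cumulative += weight
        if r < cumulative:
            return domain
    return list(MIXTURE)[-1]  # guard against floating-point rounding

# Draw 10,000 samples and count how often each domain is chosen.
rng = random.Random(0)
counts = {d: 0 for d in MIXTURE}
for _ in range(10_000):
    counts[sample_domain(rng)] += 1
```

Changing the weights here is the lever the chapter is pointing at: shifting mass toward code or math changes what the model repeatedly sees, before any architecture decision is made.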
2. Hyper-parameters
Model hyper-parameters describe capacity and routing shape. Training hyper-parameters describe how aggressively that capacity is exercised. When reading the paper, think of hyper-parameters as the contract between architecture and optimization.
| Category | What it tells you |
|---|---|
| Model shape | How much dense vs sparse capacity exists, and what the active compute path looks like. |
| Training schedule | How fast the optimizer moves and when the run is allowed to stabilize. |
| Batching / accumulation | How the report trades memory against throughput and update stability. |
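The batching / accumulation row above can be sketched as a toy gradient-accumulation loop: micro-batch gradients are averaged so the effective update matches one large batch, trading per-step memory for update stability. All numbers here are illustrative, not from the report.

```python
# Toy gradient accumulation: gradients are scalars for clarity; in a real
# trainer they would be parameter tensors summed across micro-batches.

def accumulate_gradients(micro_grads: list[float],
                         accumulation_steps: int) -> list[float]:
    """Average groups of micro-batch gradients into effective-batch updates."""
    updates = []
    for start in range(0, len(micro_grads), accumulation_steps):
        chunk = micro_grads[start:start + accumulation_steps]
        updates.append(sum(chunk) / len(chunk))
    return updates

# Eight micro-batch gradients combined four at a time -> two optimizer steps.
grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
updates = accumulate_gradients(grads, accumulation_steps=4)
```

The design choice being traded off: more accumulation steps mean a larger effective batch with less memory per step, at the cost of fewer (but smoother) optimizer updates per token processed.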
3. Long-context extension
Context length is grown after the base model is already trained. The report uses a staged extension strategy with YaRN-style scaling to preserve behavior while stretching usable context much further.
- Train the model in a shorter, more stable context regime first.
- Adjust the positional treatment so longer ranges remain numerically usable.
- Measure whether the longer context actually improves usable recall and behavior.
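The second step above (adjusting the positional treatment) can be sketched as YaRN-style scaling of RoPE inverse frequencies: long-wavelength dimensions are interpolated by the context scale factor, short-wavelength dimensions are kept intact, and a linear ramp blends the two regimes. This is a simplified sketch assuming generic thresholds `alpha` and `beta`, not the report's exact recipe or constants.

```python
import math

def yarn_scaled_freqs(dim: int, base: float, scale: float, orig_ctx: int,
                      alpha: float = 1.0, beta: float = 32.0) -> list[float]:
    """Scale RoPE inverse frequencies for context extension, YaRN-style.

    Dimensions whose wavelength exceeds the original context window are
    fully interpolated (divided by `scale`); dimensions with wavelengths
    far shorter than the window are left untouched; in between, a linear
    ramp blends the interpolated and original frequencies.
    """
    freqs = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)          # standard RoPE inverse frequency
        wavelength = 2 * math.pi / freq
        ratio = orig_ctx / wavelength      # rotations within the old window
        if ratio < alpha:                  # long wavelength: fully interpolate
            freqs.append(freq / scale)
        elif ratio > beta:                 # short wavelength: keep as-is
            freqs.append(freq)
        else:                              # blend between the two regimes
            t = (ratio - alpha) / (beta - alpha)
            freqs.append((freq / scale) * (1 - t) + freq * t)
    return freqs

# Example: extend a 4k-context model by 4x with a 128-dim rotary embedding.
scaled = yarn_scaled_freqs(dim=128, base=10000.0, scale=4.0, orig_ctx=4096)
```

The intent matches the chapter's framing: high-frequency dimensions that encode local structure are preserved, so short-range behavior stays stable while low-frequency dimensions are stretched to cover the longer range.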
4. Evaluation and ablations
The paper uses ablations to compare multi-token prediction, balancing strategies, and different views of load control. These matter because they show that the final system is not one single clever idea. It is a stack of interacting choices.
- Show whether denser supervision improves pre-training rather than just adding complexity.
- Compare routing-control strategies and whether they preserve specialization.
- Clarify why sequence-level balance can stabilize the traffic pattern seen by the system.
- Ask whether the measured deltas justify the added engineering cost in production.
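To make the load-control comparison concrete, here is a toy metric for expert load imbalance: the busiest expert's load relative to a perfectly uniform split. The router choices below are fabricated integers, not outputs from a real model, and the metric is illustrative rather than the paper's.

```python
from collections import Counter

def max_load_ratio(expert_choices: list[int], n_experts: int) -> float:
    """Ratio of the busiest expert's token count to a uniform share.

    A value of 1.0 means perfectly balanced traffic; larger values mean
    one expert is handling a disproportionate share of tokens.
    """
    counts = Counter(expert_choices)
    uniform = len(expert_choices) / n_experts
    return max(counts.values()) / uniform

# A skewed batch vs a balanced one across 4 experts.
skewed = [0, 0, 0, 0, 0, 0, 1, 2]    # expert 0 handles 6 of 8 tokens
balanced = [0, 1, 2, 3, 0, 1, 2, 3]  # uniform traffic
```

A balancing strategy that keeps this ratio near 1.0 per sequence smooths the traffic pattern the serving system sees, which is the stabilization argument the ablations are probing.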