Chapter 04

Pre-Training

This chapter translates the paper's pre-training section into concrete questions: how the training data was constructed, which hyper-parameters matter, how context length was extended, and how ablations are used to justify architecture and routing decisions.

1. Data construction

The report treats data quality and data shape as first-class system choices

The paper emphasizes a high-quality, diverse token mixture with strong coverage of math, code, multilingual data, and long documents. The broader lesson is that model architecture does not stand alone: the data mixture determines which specializations can emerge at all.

Quality filtering

Improve token usefulness before scaling token count.
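The filtering idea can be sketched with cheap document-level heuristics. This is an illustrative sketch only, not the paper's pipeline: the rules and thresholds below are invented for demonstration, standing in for the cascade of heuristic and model-based filters such systems typically use.

```python
# Hypothetical quality heuristics: cheap document-level filters of the
# kind run before any expensive model-based scoring. All thresholds
# here are made up for illustration.
def passes_heuristics(text: str) -> bool:
    words = text.split()
    if len(words) < 50:  # too short to carry useful signal
        return False
    if len(set(words)) / len(words) < 0.3:  # highly repetitive boilerplate
        return False
    return True

kept = passes_heuristics(" ".join(str(i) for i in range(60)))   # diverse, long enough
dropped = passes_heuristics("spam " * 100)                      # repetitive
```

The point of cheap filters like these is throughput: they discard obvious junk before any per-document model inference is spent.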

Document packing

Use sequence space efficiently rather than wasting context on padding and fragmentation.
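The packing idea can be sketched as a greedy bin-fill: place whole documents into a fixed-length training sequence and start a new sequence when the next document would not fit. This is a minimal sketch, not the paper's actual pipeline; real packers also split oversized documents and track attention boundaries between packed documents.

```python
# Hypothetical greedy packing: each training sequence holds whole
# documents up to max_len tokens. Oversized documents are truncated
# here for simplicity; real pipelines usually split them instead.
def pack_documents(doc_lengths, max_len):
    sequences, current, used = [], [], 0
    for length in doc_lengths:
        length = min(length, max_len)  # truncate oversized docs
        if used + length > max_len:    # next doc would overflow: start fresh
            sequences.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        sequences.append(current)
    return sequences

packed = pack_documents([1000, 3000, 2000, 4000, 500], max_len=4096)
# -> [[1000, 3000], [2000], [4000], [500]]
```

Even this toy version shows the payoff: the first sequence carries two documents with almost no padding, instead of two mostly empty sequences.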

Domain mix

Shape downstream reasoning and code performance by changing what the model repeatedly sees.
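One common way to shape a domain mix is temperature-scaled sampling, sketched below. This is a generic technique, not a claim about the paper's exact recipe; the domain names and token counts are placeholders.

```python
# Hypothetical domain mixture via temperature scaling: temperature = 1.0
# reproduces the raw token proportions, while temperature < 1.0 upsamples
# smaller domains (math, code) relative to dominant web text.
def mixture_weights(token_counts, temperature=0.7):
    scaled = {d: c ** temperature for d, c in token_counts.items()}
    total = sum(scaled.values())
    return {d: s / total for d, s in scaled.items()}

weights = mixture_weights({"web": 8e11, "code": 1e11, "math": 2e10})
```

With the placeholder counts above, code's sampling weight ends up noticeably higher than its raw token share, which is exactly the "shape what the model repeatedly sees" lever.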

Long-context readiness

Training data layout has to make later long-context extension feasible, not accidental.

2. Hyper-parameters

The report's hyper-parameters matter because they reveal what the system is optimized for

Model hyper-parameters describe capacity and routing shape. Training hyper-parameters describe how aggressively that capacity is exercised. When reading the paper, think of hyper-parameters as the contract between architecture and optimization.

Reading hyper-parameters
Category | What it tells you
Model shape | How much dense vs sparse capacity exists, and what the active compute path looks like.
Training schedule | How fast the optimizer moves and when the run is allowed to stabilize.
Batching / accumulation | How the report trades memory against throughput and update stability.
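The three categories can be pictured as a config split. Every value below is a placeholder chosen for illustration, not a number from the paper; the point is only that the grouping mirrors the table.

```python
# Illustrative config only -- placeholder values, not the paper's.
model = {"layers": 32, "hidden": 4096, "experts": 64, "active_experts": 6}
schedule = {"peak_lr": 3e-4, "warmup_steps": 2000, "decay": "cosine"}
batching = {"micro_batch": 4, "grad_accum": 64, "seq_len": 4096}

# Effective tokens per optimizer step (per data-parallel rank) falls out
# of the batching block alone: micro-batch x accumulation x sequence length.
tokens_per_step = batching["micro_batch"] * batching["grad_accum"] * batching["seq_len"]
```

Reading a report this way makes the trade-offs explicit: `model` fixes capacity, `schedule` fixes how hard it is pushed, and `batching` fixes what each update actually averages over.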

3. Long-context extension

The report extends context in stages rather than pretending long windows are free

Context length is grown after the base model is already trained. The report uses a staged extension strategy with YaRN-style scaling to preserve behavior while stretching usable context much further.

Base context

Train the model in a shorter, more stable context regime first.

->
Extension phase

Adjust the positional treatment so longer ranges remain numerically usable.

->
Long-window validation

Measure whether the longer context actually improves usable recall and behavior.
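The extension phase above can be sketched through RoPE frequency scaling. The sketch below implements plain position interpolation (dividing every rotary frequency by a scale factor), which is a simplified stand-in: YaRN additionally interpolates per-frequency band and rescales attention, which is omitted here.

```python
# Simplified sketch of RoPE scaling for context extension. Uniformly
# dividing frequencies by `scale` maps an 8x-longer window onto the
# position range the base model saw in training. This is plain position
# interpolation, not the full YaRN formula.
def rope_frequencies(dim, base=10000.0, scale=1.0):
    return [1.0 / (scale * base ** (2 * i / dim)) for i in range(dim // 2)]

short = rope_frequencies(dim=128, scale=1.0)      # base-context regime
extended = rope_frequencies(dim=128, scale=8.0)   # 8x context extension
```

The staged recipe in the flow above then amounts to: train with `scale=1.0`, switch to a larger scale, continue training briefly, and validate that recall over the long window actually improved.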

4. Evaluation and ablations

The ablations are where the report explains which design choices really carried weight

The paper uses ablations to compare multi-token prediction, balancing strategies, and different views of load control. These matter because they show that the final system is not one single clever idea. It is a stack of interacting choices.

MTP ablations

Show whether denser supervision improves pre-training rather than just adding complexity.
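The "denser supervision" claim can be made concrete with a target-construction sketch: under multi-token prediction, each position is trained against several future tokens instead of one. This is an illustrative sketch of the labeling scheme only, not the paper's module design.

```python
# Hypothetical MTP target construction: at position t, supervise
# against tokens t+1 .. t+depth. depth=1 recovers ordinary
# next-token prediction; depth>1 gives denser signal per position.
def mtp_targets(tokens, depth=2):
    return [
        [tokens[t + k] for k in range(1, depth + 1)]
        for t in range(len(tokens) - depth)
    ]

targets = mtp_targets([5, 9, 2, 7, 3], depth=2)
# -> [[9, 2], [2, 7], [7, 3]]
```

The ablation question then becomes measurable: does training against the extra targets improve the base objective enough to justify the added heads and complexity?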

Balance ablations

Compare routing-control strategies and whether they preserve specialization.

Batch-wise vs sequence-wise

Clarify why sequence-level balance can stabilize the traffic pattern seen by the system.
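The batch-wise vs sequence-wise distinction can be shown with a toy load calculation, assuming a simple two-expert router. The example is constructed so that the same token-to-expert assignments look perfectly balanced in aggregate while every individual sequence is skewed.

```python
# Toy illustration: the same routing counts aggregated at two
# granularities. Batch-wise balance can mask per-sequence skew.
def expert_load(assignments, num_experts):
    counts = [0] * num_experts
    for e in assignments:
        counts[e] += 1
    total = max(1, len(assignments))
    return [c / total for c in counts]

# Each sequence is 75/25 skewed, but in opposite directions.
seq_a, seq_b = [0, 0, 0, 1], [1, 1, 1, 0]
batch_wise = expert_load(seq_a + seq_b, num_experts=2)       # [0.5, 0.5]
per_sequence = [expert_load(s, 2) for s in (seq_a, seq_b)]   # both skewed
```

A batch-wise balance loss sees `[0.5, 0.5]` and applies no pressure, while a sequence-wise loss penalizes both sequences, which is why sequence-level balance can stabilize the traffic pattern each forward pass actually produces.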

Critical reading

Ask whether the measured deltas justify the added engineering cost in production.