Chapter 03
This chapter explains why the report is as much a systems paper as a model paper. It covers cluster topology, overlap-heavy execution, all-to-all communication, FP8 training, memory-saving policies, and the hardware gaps the report calls out explicitly.
1. Compute cluster
Sparse expert models shift the bottleneck toward dispatch and combine traffic, so the cluster is described not just by GPU count but by how tokens travel across nodes and inside nodes. Interconnect shape is part of the model design.
- Use fast local interconnects to move activation shards between GPUs within a node.
- Use the cross-node fabric for expert dispatch whenever routing leaves the local node (the sketch after this list shows how traffic splits between the two).
- Hide that movement behind useful work so communication does not become exposed idle time.
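The report bounds cross-node traffic by limiting how many nodes a token's experts may span. A minimal numpy sketch of how dispatch volume splits between the intra-node interconnect and the cross-node fabric; the topology constants and the contiguous expert layout are illustrative assumptions, not the report's actual configuration:

```python
import numpy as np

# Hypothetical topology: 8 GPUs per node, experts laid out contiguously,
# so an expert id maps directly to a destination node.
GPUS_PER_NODE = 8
EXPERTS_PER_GPU = 4
EXPERTS_PER_NODE = GPUS_PER_NODE * EXPERTS_PER_GPU

def dispatch_traffic(topk_expert_ids: np.ndarray, src_node: int):
    """Split expert assignments into intra-node and cross-node sends.

    topk_expert_ids: (num_tokens, k) routed expert ids per token.
    Returns counts of activation shards that stay on the fast local
    interconnect vs. those that must cross the inter-node fabric.
    """
    dest_node = topk_expert_ids // EXPERTS_PER_NODE
    local = int((dest_node == src_node).sum())
    remote = dest_node.size - local
    return local, remote

rng = np.random.default_rng(0)
tokens = rng.integers(0, 256, size=(1024, 8))  # 256 experts, top-8 routing
print(dispatch_traffic(tokens, src_node=0))
```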
2. Overlap and DualPipe
DualPipe is the key scheduling idea. Forward and backward chunks are arranged so attention, dispatch, MLP work, and combine operations can overlap across the pipeline. The win is not fewer messages on paper. It is less exposed waiting.
- Attention -> dispatch -> expert compute -> combine is the per-chunk phase order.
- Gradient pieces run while communication for adjacent chunks is still in flight (see the schematic after this list).
- This only works because routing was already constrained at the architecture level.
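A schematic of the scheduling idea in plain Python: communication for one chunk is launched asynchronously while compute for an adjacent chunk runs underneath it. Real systems use async collectives (e.g. an NCCL all-to-all) rather than a thread pool, and the sleeps stand in for real kernel time; everything here is illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def all_to_all(chunk):   # stand-in for expert dispatch/combine traffic
    time.sleep(0.01)
    return chunk

def compute(chunk):      # stand-in for attention / expert MLP / gradient work
    time.sleep(0.01)
    return chunk

def pipeline(chunks):
    # One background worker models the communication stream.
    with ThreadPoolExecutor(max_workers=1) as comm:
        inflight = comm.submit(all_to_all, chunks[0])  # dispatch chunk 0
        for nxt in chunks[1:]:
            ready = inflight.result()                  # wait for chunk's comm
            inflight = comm.submit(all_to_all, nxt)    # launch next comm...
            compute(ready)                             # ...and compute under it
        compute(inflight.result())

pipeline(list(range(8)))  # comm for chunk i+1 hides behind compute for chunk i
```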
3. FP8 framework
The report's low-precision framework depends on multiple linked decisions: fine-grained quantization, higher-precision accumulation, online quantization, low-precision storage, and selective promotion to higher precision for fragile states.
| Component | Why it can or cannot stay low precision |
|---|---|
| Dense GEMMs | High arithmetic intensity makes FP8 attractive if scale factors are handled carefully. |
| Accumulation paths | Need higher precision because tiny numerical drift compounds over long training. |
| Stored activations | Can often be compressed if the system recomputes fragile intermediates later. |
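To make fine-grained quantization concrete: one scale per small tile stops an outlier in one tile from crushing the dynamic range of every other value sharing its scale. A minimal numpy sketch of tile-wise (1x128) scaling against the E4M3 maximum of 448; the rounding step merely simulates the FP8 cast, and the function names are illustrative:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3
TILE = 128        # per-tile scaling granularity (1x128 for activations)

def quantize_tilewise(x: np.ndarray):
    """One scale per 1xTILE tile; round() stands in for the FP8 cast."""
    rows, cols = x.shape
    assert cols % TILE == 0
    tiles = x.reshape(rows, cols // TILE, TILE)
    scales = np.abs(tiles).max(axis=-1, keepdims=True) / E4M3_MAX
    scales = np.maximum(scales, 1e-12)   # guard against all-zero tiles
    return np.round(tiles / scales), scales

def dequantize(q: np.ndarray, scales: np.ndarray):
    return (q * scales).reshape(q.shape[0], -1)

x = np.random.randn(4, 512).astype(np.float32)
x[0, 0] = 100.0                      # an outlier only hurts its own tile
q, s = quantize_tilewise(x)
print(np.abs(dequantize(q, s) - x).max())
```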
4. Memory savings
Three examples matter most: recomputing RMSNorm and MLA up-projections in the backward pass, storing the EMA of the weights on CPU instead of GPU, and sharing the embedding/output structure with the MTP path instead of duplicating it. The policies behind these moves, with a short sketch after the list:
- Spend extra compute later to avoid storing every activation now.
- Keep a tracking copy of the weights without consuming expensive GPU memory.
- Reuse parameterized components so the auxiliary prediction path stays cheap.
- Compress what can be stored approximately while keeping fragile states protected.
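A short PyTorch sketch of the first two policies, assuming a recent PyTorch (torch.nn.RMSNorm arrived in 2.4); the report's implementation is custom kernels, so the module shapes and decay value here are illustrative:

```python
import torch
from torch.utils.checkpoint import checkpoint

norm = torch.nn.RMSNorm(1024)        # recomputed in backward, never stored
up_proj = torch.nn.Linear(1024, 4096)

def block(x):
    # checkpoint() drops the intermediate activations in forward and
    # recomputes them during backward: extra FLOPs traded for memory.
    return checkpoint(lambda t: up_proj(norm(t)), x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)
block(x).sum().backward()

# EMA tracking copy kept in host memory, so it never occupies GPU memory.
decay = 0.999
ema = {k: v.detach().to("cpu", copy=True) for k, v in up_proj.state_dict().items()}
with torch.no_grad():
    for k, v in up_proj.state_dict().items():
        ema[k].mul_(decay).add_(v.detach().cpu(), alpha=1 - decay)
```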
5. Hardware design asks
One unusual strength of the paper is that it closes by asking for better hardware: communication offload, stronger native FP8 accumulation, support for finer-grained quantization, fused online quantization, and dataflow support that reduces the amount of manual software orchestration.
- Move more dispatch/combine work off the general-purpose compute path.
- Improve native support for the precision formats the training recipe actually needs.
- Support tile- and block-level scaling instead of assuming one coarse scale is enough.
- Reduce the extra copies and reorder steps required for transposed and mixed-precision workloads (illustrated below).
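To see what the last two asks would remove, here is the software-side pattern they point at, sketched in numpy with illustrative function names: the scale needs its own reduction pass over the tensor before the cast can happen, and a transposed operand today can mean a dequantize/transpose/requantize round trip:

```python
import numpy as np

def online_quantize(x: np.ndarray, fmax: float = 448.0):
    scale = max(np.abs(x).max() / fmax, 1e-12)  # pass 1: reduction for scale
    return np.round(x / scale), scale           # pass 2: the cast (simulated)

def transpose_for_backward(q: np.ndarray, scale: float):
    # Without native transposed-layout support, the quantized tensor is
    # dequantized, transposed, and requantized: extra copies and traffic
    # that fused online quantization and dataflow support could eliminate.
    return online_quantize((q * scale).T)

q, s = online_quantize(np.random.randn(256, 512))
qt, st = transpose_for_backward(q, s)
print(q.shape, qt.shape)
```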