Chapter 03

Infrastructure

This chapter explains why the report is as much a systems paper as a model paper. It covers cluster topology, overlap-heavy execution, all-to-all communication, FP8 training, memory-saving policies, and the hardware gaps the report calls out explicitly.

1. Compute cluster

The report organizes the training problem around communication topology first

Sparse expert models shift the bottleneck toward dispatch and combine traffic, so the cluster is described not just by GPU count but by how tokens travel across nodes and inside nodes. Interconnect shape is part of the model design.

Within node: use fast local interconnects to move activation shards between GPUs.
->
Across nodes: use cross-node fabric for expert dispatch whenever routing leaves the local node.
->
Back to compute: hide that movement behind useful work so communication does not become exposed idle time.
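The pressure this loop describes can be put in back-of-envelope form: a token's activation must reach every node hosting one of its routed experts, so capping the node fan-out caps worst-case dispatch traffic. A minimal sketch, with illustrative numbers (hidden size, expert count, and node cap here are assumptions for the arithmetic, not figures quoted from the report):

```python
# Back-of-envelope estimate of cross-node dispatch traffic per token.
# All concrete numbers below are illustrative assumptions.

def dispatch_bytes_per_token(hidden_dim: int, bytes_per_elem: int,
                             routed_experts: int, max_nodes: int) -> int:
    """Upper bound on cross-node dispatch bytes for one token.

    The token's hidden vector is sent once to each distinct node that
    hosts one of its routed experts; node-limited routing caps that
    fan-out at max_nodes.
    """
    fanout = min(routed_experts, max_nodes)
    return fanout * hidden_dim * bytes_per_elem

# Unrestricted routing: 8 routed experts could land on 8 distinct nodes.
unrestricted = dispatch_bytes_per_token(7168, 1, routed_experts=8, max_nodes=8)
# Node-limited routing: at most 4 nodes, halving the worst case.
limited = dispatch_bytes_per_token(7168, 1, routed_experts=8, max_nodes=4)
```

The point is that the node cap is a property of the routing rule, which is why interconnect shape is part of the model design.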

2. Overlap and DualPipe

The report's scheduling story is about hiding all-to-all, not eliminating it

DualPipe is the key scheduling idea. Forward and backward chunks are arranged so attention, dispatch, MLP work, and combine operations can overlap across the pipeline. The win is not fewer messages on paper. It is less exposed waiting.

Forward chunk: attention -> dispatch -> expert compute -> combine.
<->
Backward chunk: gradient pieces run while communication for adjacent chunks is still in flight.

This only works because routing was already constrained at the architecture level.
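The accounting behind "less exposed waiting" can be sketched in a few lines. This is not DualPipe itself, just the underlying idea: communication overlapped with an adjacent chunk's compute only contributes its excess over that compute to the critical path.

```python
# Toy accounting for overlap-heavy scheduling (illustrative, not the
# actual DualPipe schedule). Exposed time = communication that finds no
# concurrent compute to hide behind.

def exposed_comm(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Communication time left on the critical path for one chunk."""
    if overlap:
        # All-to-all for chunk i runs under compute of an adjacent chunk;
        # only the part that outlasts that compute is exposed.
        return max(0.0, comm_ms - compute_ms)
    return comm_ms  # serialized: every communication millisecond is exposed

def step_time(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    return compute_ms + exposed_comm(compute_ms, comm_ms, overlap)

serialized = step_time(3.0, 2.0, overlap=False)  # compute + full comm
overlapped = step_time(3.0, 2.0, overlap=True)   # comm fully hidden
```

Note that if communication outlasts compute, the excess still shows up as idle time, which is why the architecture-level routing constraints matter.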

3. FP8 framework

FP8 is presented as a system, not a single dtype toggle

The report's low-precision framework depends on multiple linked decisions: fine-grained quantization, better accumulation behavior, online quantization, low-precision storage, and selective promotion to higher precision for fragile states.

Precision map

Component          | Why it can or cannot stay low precision
-------------------|----------------------------------------
Dense GEMMs        | High arithmetic intensity makes FP8 attractive if scale factors are handled carefully.
Accumulation paths | Need higher precision because tiny numerical drift compounds over long training.
Stored activations | Can often be compressed if the system recomputes fragile intermediates later.
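Why fine-grained quantization matters can be shown with a toy experiment. In this hedged sketch an int8-style symmetric grid stands in for FP8 (the point is the scaling granularity, not the exact number format), and the outlier value is contrived:

```python
import numpy as np

# Per-block scaling vs one tensor-wide scale. A symmetric 8-bit grid
# stands in for FP8; a single outlier inflates the coarse scale and
# degrades every other element, while block-wise scales contain the damage.

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    levels = 127  # number of positive grid steps
    q = np.clip(np.round(x / scale * levels), -levels, levels)
    return q * scale / levels  # dequantized values

def blockwise_quantize(x: np.ndarray, block: int) -> np.ndarray:
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        chunk = x[i:i + block]
        out[i:i + block] = quantize(chunk, np.abs(chunk).max() + 1e-12)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x[0] = 50.0  # one outlier forces a huge tensor-wide scale

coarse_err = np.abs(quantize(x, np.abs(x).max()) - x).mean()
fine_err = np.abs(blockwise_quantize(x, block=128) - x).mean()
# fine_err is far smaller: only the outlier's block pays for its range
```

Only the block containing the outlier loses resolution; every other block keeps a tight scale, which is the argument for tile- and block-level scale factors.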

4. Memory savings

The report saves memory by deciding what not to keep resident

Four techniques matter most: recomputing RMSNorm and MLA up-projections in the backward pass, storing the EMA of weights on CPU instead of GPU, sharing the embedding/output structure with the MTP path instead of duplicating it, and keeping compressible states in low-precision storage.

Recompute

Spend extra compute later to avoid storing every activation now.

CPU-side EMA

Keep a tracking copy of weights without consuming expensive GPU memory.

Shared output structure

Reuse parameterized components so auxiliary prediction stays cheaper.

Low-precision storage

Compress what can be stored approximately while keeping fragile states protected.
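The CPU-side EMA idea reduces to a shadow copy of the weights kept off the device. A minimal sketch, with NumPy arrays standing in for host-resident tensors and all class and parameter names hypothetical:

```python
import numpy as np

# Sketch of a CPU-resident EMA: a smoothed shadow copy of the weights
# that costs no GPU memory. In a real system the shadow lives in host
# RAM and the update is overlapped with the next training step.

class CpuEma:
    def __init__(self, params: dict, decay: float = 0.999):
        self.decay = decay
        # Host-side shadow copy, initialized from the current weights.
        self.shadow = {name: p.copy() for name, p in params.items()}

    def update(self, params: dict) -> None:
        """shadow <- decay * shadow + (1 - decay) * current weights."""
        d = self.decay
        for name, p in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * p

params = {"w": np.zeros(4)}
ema = CpuEma(params, decay=0.9)
params["w"] += 1.0  # one optimizer step moves the live weights
ema.update(params)  # shadow["w"] is now 0.1 everywhere
```

The tracking copy never needs to sit on the accelerator, which is the whole memory saving.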

5. Hardware design asks

The report explicitly says which hardware bottlenecks are still too software-heavy

One unusual strength of the paper is that it closes by asking for better hardware: communication offload, stronger native FP8 accumulation, support for finer-grained quantization, fused online quantization, and dataflow support that reduces the amount of manual software orchestration.

Communication hardware

Move more dispatch/combine work off the general-purpose compute path.

Compute hardware

Improve native support for the precision formats the training recipe actually needs.

Quantization hardware

Support tile- and block-level scaling instead of assuming one coarse scale is enough.

Data movement hardware

Reduce the extra copies and reorder steps required for transposed and mixed-precision workloads.