Chapter 02
Architecture
This chapter explains the main structural claims in the report: RoPE-enhanced attention, latent KV compression, sparse experts, auxiliary-loss-free balancing, sequence-wise auxiliary control, node-limited routing, and multi-token prediction.
1. Transformer block
The report is still a transformer story, but every expensive piece of state is rethought.
A token enters through embeddings, moves through attention, sparse FFN computation, and residual updates, then exits through a shared output head. The report's major move is to re-design the attention state and FFN capacity so training and inference costs scale better than a fully dense baseline.
Map token ids into the residual stream.
Use compressed latent KV state plus a dedicated positional path.
Send each token through shared and selected routed experts.
Project the final residual stream back into vocabulary logits.
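The four steps above can be sketched as a minimal dense block. This is a baseline illustration, not the report's architecture: attention is single-head, the FFN is dense rather than sparse, and all shapes, the weight scale, and the tied output head are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, T = 50, 8, 4
scale = 0.1

W_emb = rng.normal(size=(vocab, d)) * scale   # token embeddings
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * scale for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * scale      # FFN up-projection
W2 = rng.normal(size=(4 * d, d)) * scale      # FFN down-projection
W_out = W_emb                                  # tied output head (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

ids = np.array([1, 5, 9, 2])
h = W_emb[ids]                                 # 1. embed into the residual stream

# 2. attention sub-layer, added back into the residual stream
q, k, v = h @ Wq, h @ Wk, h @ Wv
scores = q @ k.T / np.sqrt(d)
mask = np.triu(np.ones((T, T)), k=1) * -1e9    # causal mask
h = h + softmax(scores + mask) @ v @ Wo

# 3. FFN sub-layer (dense stand-in for the sparse expert layer)
h = h + np.maximum(h @ W1, 0) @ W2

logits = h @ W_out.T                           # 4. back to vocabulary logits
```

The report's changes replace the K/V state in step 2 with a compressed latent and the dense FFN in step 3 with shared-plus-routed experts; the surrounding residual skeleton stays the same.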
2. Attention and RoPE
RoPE turns position into rotations on query and key pairs
Attention still starts from scaled dot products: scores = (QK^T) / sqrt(d). The division by sqrt(d) keeps score variance from exploding as the head dimension grows. RoPE modifies the query and key vectors before this dot product by rotating 2D coordinate pairs through a position-dependent angle.
For one coordinate pair (x1, x2) at position m with per-pair frequency theta, position applies a 2D rotation: (x1, x2) -> (x1 cos(m*theta) - x2 sin(m*theta), x1 sin(m*theta) + x2 cos(m*theta)).
Build Q and K from the residual stream.
Apply position-dependent rotations to paired dimensions.
Take the dot product after rotation so relative offsets affect similarity.
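These three steps can be checked numerically. The sketch below applies the standard RoPE frequency schedule (base^(-2i/d), with base 10000 assumed) and then verifies the key property from step 3: after rotation, the dot product depends only on the relative offset between positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one angle rate per pair
    theta = pos[:, None] * freqs[None, :]       # (T, half) rotation angles
    x1, x2 = x[..., 0::2], x[..., 1::2]         # split dims into 2D pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[..., 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# same relative offset (2) at different absolute positions
a = rope(q, np.array([3])) @ rope(k, np.array([1])).T
b = rope(q, np.array([10])) @ rope(k, np.array([8])).T
```

Because each pair is rotated by R(m*theta) and R(n*theta), the dot product reduces to q^T R((n-m)*theta) k, so `a` and `b` agree despite the shifted absolute positions.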
The report's special twist is that positional information is carried through a dedicated RoPE path while the main attention state is compressed elsewhere.
3. Latent KV compression
The report compresses what must be remembered across tokens
Instead of storing full keys and values for every head and every token, the report compresses attention state into a latent representation. This makes the cache smaller while reconstructing the pieces needed for attention at compute time.
Map the hidden state into a smaller latent KV representation.
Store the compressed representation instead of full dense K/V tensors.
Reconstruct the attention operands during computation.
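A minimal sketch of the cache arithmetic, assuming a single shared down-projection and separate up-projections for keys and values (the dimensions and projection names here are illustrative, not the report's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 64, 8, 16                     # hidden dim, latent dim (r << d), tokens
W_down = rng.normal(size=(d, r)) * 0.1  # compress hidden state to latent
W_uk = rng.normal(size=(r, d)) * 0.1    # reconstruct keys from the latent
W_uv = rng.normal(size=(r, d)) * 0.1    # reconstruct values from the latent

h = rng.normal(size=(T, d))             # per-token hidden states
latent = h @ W_down                     # cache this: (T, r) per token
k = latent @ W_uk                       # rebuilt only at compute time
v = latent @ W_uv

full_cache = 2 * T * d                  # dense K and V floats
compressed = T * r                      # latent floats actually stored
ratio = compressed / full_cache         # r / (2d) = 0.0625 with these shapes
```

Only `latent` persists across tokens, so cache size scales with r rather than 2d; the K/V reconstruction cost is paid at compute time instead of memory time.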
This is the architecture choice that later dominates inference economics.
4. Sparse routing
Shared experts absorb common work while routed experts specialize
The report uses a sparse FFN design with always-on shared experts and top-k routed experts. Routing is constrained so each token only talks to a limited number of nodes, which makes expert parallelism more tractable.
Auxiliary-loss-free balancing
The report updates routing bias terms directly instead of relying only on a large per-token balancing penalty in the main objective.
Sequence-wise auxiliary control
A complementary sequence-level term nudges whole sequences away from pathological imbalance.
Node-limited routing
Tokens can only activate experts from a restricted node set, which caps communication spread.
No token-dropping
The report prefers guaranteed processing over dropping overloaded tokens, then solves balance elsewhere.
The main mental model is: route just enough specialization to gain capacity, but constrain the communication graph so the system remains trainable and deployable.
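The bias-adjusted selection can be sketched as follows. This is an illustration of the mechanism, not the report's implementation: the sigmoid gate, the sign-based bias update, and the learning rate are assumptions, and the node-limited constraint is omitted for brevity. Note that the bias influences which experts are selected but not the final mixing weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16
W_gate = rng.normal(size=(d, n_experts)) * 0.1
bias = np.zeros(n_experts)              # per-expert bias, updated out-of-band

def route(h, bias):
    """Pick top-k experts by biased affinity; weight by unbiased scores."""
    affinity = 1 / (1 + np.exp(-(h @ W_gate)))        # sigmoid gate scores
    chosen = np.argsort(affinity + bias, axis=-1)[:, -top_k:]
    weights = np.take_along_axis(affinity, chosen, axis=-1)
    weights /= weights.sum(axis=-1, keepdims=True)
    return chosen, weights

def update_bias(bias, counts, lr=0.01):
    # auxiliary-loss-free balancing: nudge bias down for overloaded
    # experts and up for underloaded ones, outside the main loss
    return bias - lr * np.sign(counts - counts.mean())

tokens = rng.normal(size=(32, d))
chosen, weights = route(tokens, bias)
counts = np.bincount(chosen.ravel(), minlength=n_experts)
bias = update_bias(bias, counts)
```

Because no token is dropped and the penalty lives in the bias update rather than the loss, the gradient signal for the main objective stays clean while load still equalizes over steps.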
5. Multi-token prediction module
MTP is both a training signal and an inference lever
The report treats multi-token prediction as a small attached prediction stack rather than a replacement for the main head. During training it adds future-token supervision. During inference it can support speculation.
Produce the ordinary next-token logits.
Attach MTP modules that look further ahead in token space.
Reuse embeddings and output heads so the extra prediction path stays cheap.
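The three steps above can be sketched with one extra lookahead depth. Everything here is a stand-in: the trunk is replaced by random hidden states, and the single ReLU projection for the MTP module is an assumption chosen only to show the weight-sharing pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, T = 50, 16, 6
W_emb = rng.normal(size=(vocab, d)) * 0.1
W_head = W_emb                          # shared output head (tied weights)
W_mtp = rng.normal(size=(d, d)) * 0.1   # one lightweight attached MTP module

ids = rng.integers(0, vocab, size=T)
h = W_emb[ids]                          # stand-in for trunk hidden states

logits_next = h @ W_head.T              # ordinary next-token logits (t+1)
h_mtp = np.maximum(h @ W_mtp, 0)        # attached module looks one step
logits_next2 = h_mtp @ W_head.T         # further ahead (t+2), reusing the head
```

During training the two logit streams are supervised against targets shifted by one and two positions; during inference the lookahead logits can seed speculative decoding. Sharing `W_emb` and `W_head` is what keeps the extra path cheap.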