Chapter 02

Architecture

This chapter explains the main structural claims in the report: RoPE-enhanced attention, latent KV compression, sparse experts, auxiliary-loss-free balancing, sequence-wise auxiliary control, node-limited routing, and multi-token prediction.

1. Transformer block

The report is still a transformer story, but every expensive state is being rethought

A token enters through embeddings, moves through attention, sparse FFN computation, and residual updates, then exits through a shared output head. The report's major move is to redesign the attention state and FFN capacity so that training and inference costs scale better than in a fully dense baseline.

Embedding

Map token ids into the residual stream.

->
Attention

Use compressed latent KV state plus a dedicated positional path.

->
Sparse FFN

Send each token through shared and selected routed experts.

->
Output head

Project the final residual stream back into vocabulary logits.
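The token path above can be sketched end to end. This is a minimal dense stand-in, not the report's architecture: sizes are illustrative, the attention and FFN bodies are placeholders for the compressed and sparse versions described later, and weight tying between the embedding and output head is an assumption for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 16, 8  # illustrative sizes, not the report's

# Embedding: map token ids into the residual stream.
E = rng.normal(0, 0.02, (vocab, d))

def attention(x):
    # Stand-in single-head attention; identity projections keep the sketch tiny.
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def ffn(x):
    # Dense stand-in for the shared/routed expert layer of section 4.
    W1 = rng.normal(0, 0.02, (d, 4 * d))
    W2 = rng.normal(0, 0.02, (4 * d, d))
    return np.maximum(x @ W1, 0.0) @ W2

tokens = np.array([1, 5, 3])
h = E[tokens]            # embed
h = h + attention(h)     # attention + residual update
h = h + ffn(h)           # sparse FFN slot + residual update
logits = h @ E.T         # shared output head (weight tying assumed)
print(logits.shape)      # one logit row per token
```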

2. Attention and RoPE

RoPE turns position into rotations on query and key pairs

Attention still starts from scaled dot products: scores = QK^T / sqrt(d). RoPE modifies the query and key vectors before this dot product by rotating 2D coordinate pairs with a position-dependent angle.

Scaled attention

The division by sqrt(d) keeps score variance from exploding as head dimension grows.
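The variance claim is easy to check numerically. The sketch below draws unit-variance query/key entries and measures the variance of raw versus scaled dot products; the head dimension of 256 is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256  # head dimension (illustrative)

# With unit-variance entries, a raw dot product is a sum of d
# unit-variance terms, so its variance grows like d.
q = rng.normal(size=(10_000, d))
k = rng.normal(size=(10_000, d))
raw = (q * k).sum(axis=1)        # one dot product per row
scaled = raw / np.sqrt(d)

print(raw.var())     # ~ d: explodes as head dimension grows
print(scaled.var())  # ~ 1: stable regardless of d
```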

RoPE intuition

For one coordinate pair, position applies a 2D rotation:

[x', y'] = [x cos(theta) - y sin(theta), x sin(theta) + y cos(theta)]
Project

Build Q and K from the residual stream.

->
Rotate

Apply position-dependent rotations to paired dimensions.

->
Score

Take the dot product after rotation so relative offsets affect similarity.

The report's special twist is that positional information is carried through a dedicated RoPE path while the main attention state is compressed elsewhere.
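The project-rotate-score sequence can be sketched directly from the 2D rotation formula above. This sketch uses the split-half pairing convention and the common 10000 frequency base; both are implementation details that vary, not the report's specification. The final check demonstrates why rotating before the dot product works: shifting both positions by the same amount leaves the score unchanged, so only the relative offset matters.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate consecutive coordinate pairs of x by position-dependent angles.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one angle rate per pair
    theta = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    # [x', y'] = [x cos(t) - y sin(t), x sin(t) + y cos(t)] per pair
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# Same relative offset (4), different absolute positions: same score.
s1 = rope(q, pos=3) @ rope(k, pos=7)
s2 = rope(q, pos=103) @ rope(k, pos=107)
print(np.isclose(s1, s2))  # True
```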

3. Latent KV compression

The report compresses what must be remembered across tokens

Instead of storing full keys and values for every head and every token, the report compresses attention state into a latent representation. This shrinks the cache while still allowing the pieces needed for attention to be reconstructed at compute time.

Down-project

Map the hidden state into a smaller latent KV representation.

->
Cache latent state

Store the compressed representation instead of full dense K/V tensors.

->
Up-project when needed

Reconstruct the attention operands during computation.

This is the architecture choice that later dominates inference economics.
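The down-project/cache/up-project loop can be sketched in a few lines. All dimensions and projection matrices here are hypothetical placeholders chosen only to show the shape of the idea: the cache holds one small latent vector per token, and keys and values are rebuilt from those latents when attention runs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_head = 64, 8, 16  # illustrative; latent << full K/V

# Hypothetical projections: one shared down-projection, separate
# up-projections to reconstruct the attention operands.
W_down = rng.normal(0, 0.02, (d_model, d_latent))
W_up_k = rng.normal(0, 0.02, (d_latent, d_head))
W_up_v = rng.normal(0, 0.02, (d_latent, d_head))

cache = []  # the KV cache stores only latent vectors

def step(h):
    c = h @ W_down          # down-project the hidden state
    cache.append(c)         # cache the small latent, not full K/V
    latents = np.stack(cache)
    # Up-project all cached latents into keys/values at compute time.
    return latents @ W_up_k, latents @ W_up_v

for _ in range(5):
    k, v = step(rng.normal(size=d_model))

# Each cached entry is d_latent wide instead of 2 * d_head per head.
print(len(cache), cache[0].shape)
```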

4. Sparse routing

Shared experts absorb common work while routed experts specialize

The report uses a sparse FFN design with always-on shared experts and top-k routed experts. Routing is constrained so each token only talks to a limited number of nodes, which makes expert parallelism more tractable.

Auxiliary-loss-free balancing

The report updates routing bias terms directly instead of relying only on a large per-token balancing penalty in the main objective.

Sequence-wise auxiliary control

A complementary sequence-level term nudges whole sequences away from pathological imbalance.

Node-limited routing

Tokens can only activate experts from a restricted node set, which caps communication spread.

No token-dropping

The report prefers guaranteed processing over dropping overloaded tokens, then solves balance elsewhere.

The main mental model is: route just enough specialization to gain capacity, but constrain the communication graph so the system remains trainable and deployable.
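The mental model above can be made concrete in one small sketch that combines three of the pieces: node-limited candidate selection, top-k routing over bias-adjusted scores, and a bias update applied outside the loss. Everything here is an assumption for illustration: the expert count, node layout, node limit, and the sign-based bias step are placeholders, not the report's exact rule, and no token is ever dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16       # illustrative sizes
nodes = np.repeat([0, 1], 4)         # experts 0-3 on node 0, 4-7 on node 1
max_nodes = 1                        # hypothetical per-token node limit

W_gate = rng.normal(0, 0.02, (d, n_experts))
bias = np.zeros(n_experts)           # balancing bias, updated outside the loss

def route(x):
    scores = x @ W_gate
    # Node-limited routing: keep only experts on the best-scoring nodes,
    # which caps how far a token's computation spreads.
    node_best = [scores[nodes == n].max() for n in range(nodes.max() + 1)]
    allowed = np.argsort(node_best)[-max_nodes:]
    masked = np.where(np.isin(nodes, allowed), scores + bias, -np.inf)
    # Top-k selection uses biased scores; every token gets processed.
    return np.argsort(masked)[-top_k:]

# Auxiliary-loss-free balancing: after routing a batch, nudge biases
# toward under-loaded experts instead of adding a big loss penalty.
loads = np.zeros(n_experts)
for _ in range(100):
    for e in route(rng.normal(size=d)):
        loads[e] += 1
    bias -= 0.01 * np.sign(loads - loads.mean())

selected = route(rng.normal(size=d))
print(selected, nodes[selected])     # both experts live on one node
```

The bias only steers selection; it is not part of the gradient path, which is what "auxiliary-loss-free" buys: balance pressure without distorting the main objective.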

5. Multi-token prediction module

MTP is both a training signal and an inference lever

The report treats multi-token prediction as a small attached prediction stack rather than a replacement for the main head. During training it adds future-token supervision. During inference it can support speculation.

Main residual stream

Produce the ordinary next-token logits.

->
Extra prediction depth

Attach MTP modules that look further ahead in token space.

->
Shared output structure

Reuse embeddings and output heads so the extra prediction path stays cheap.
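The three-step flow above can be sketched as a main head plus one attached look-ahead module. The combiner matrix and the exact way the residual state is mixed with the next token's embedding are assumptions for illustration; the point preserved from the report is that the extra path reuses the shared embedding/output matrix rather than adding a second full head.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 16, 8  # illustrative sizes

E = rng.normal(0, 0.02, (vocab, d))      # shared embedding / output matrix
W_mtp = rng.normal(0, 0.02, (2 * d, d))  # hypothetical MTP combiner

def main_head(h):
    # Ordinary next-token logits from the main residual stream.
    return h @ E.T

def mtp_head(h, next_emb):
    # Attached module: mix the residual state with the next token's
    # embedding to predict one step further ahead, reusing E as the head.
    return np.concatenate([h, next_emb], axis=-1) @ W_mtp @ E.T

h = rng.normal(size=d)                   # residual state at position t
logits_t1 = main_head(h)                 # training target: token t+1
logits_t2 = mtp_head(h, E[3])            # extra supervision: token t+2
print(logits_t1.shape, logits_t2.shape)
```

At inference the same attached path can propose draft tokens for speculation, which the main head then verifies.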