Chapter 05
Inference
The report's inference section explains why deployment is shaped by the same architectural decisions introduced earlier. Latent KV state, sparse routing, hot experts, and multi-token prediction all show up again once the model starts generating tokens in real time.
1. Prefill vs decode
The report treats inference as two workloads because their bottlenecks are different
Prefill ingests the prompt; decode generates one token at a time. Prefill has larger parallel chunks and higher arithmetic intensity: more compute-heavy, larger batches, easier to amortize communication. Decode repeatedly touches growing cache state and becomes sensitive to memory traffic and routing overhead: more memory-sensitive, smaller steps, a tighter latency budget, more exposed cache cost.
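The contrast can be made concrete with a toy arithmetic-intensity estimate for one attention layer. This is a minimal sketch with illustrative dimensions (the head width, prompt length, and 2-byte cache format are assumptions, not numbers from the report):

```python
# Toy cost model for one attention layer, contrasting prefill and decode.
# All dimensions are illustrative, not taken from the report.

def attention_flops(q_tokens: int, kv_tokens: int, d: int) -> int:
    # Two matmuls (QK^T and attn @ V), each roughly 2*q*kv*d FLOPs.
    return 2 * 2 * q_tokens * kv_tokens * d

def kv_cache_bytes(kv_tokens: int, d: int, dtype_bytes: int = 2) -> int:
    # K and V, each kv_tokens x d, stored in a 2-byte format.
    return 2 * kv_tokens * d * dtype_bytes

d_head, prompt_len = 128, 1024

# Prefill: all prompt tokens attend in one large, parallel chunk of work.
prefill_intensity = (attention_flops(prompt_len, prompt_len, d_head)
                     / kv_cache_bytes(prompt_len, d_head))

# Decode: a single query token per step, reading the whole cache each time.
decode_intensity = (attention_flops(1, prompt_len, d_head)
                    / kv_cache_bytes(prompt_len, d_head))

print(f"prefill FLOPs per cache byte: {prefill_intensity:.0f}")  # high: compute-bound
print(f"decode  FLOPs per cache byte: {decode_intensity:.2f}")   # low: memory-bound
```

The ratio between the two intensities grows with prompt length, which is why the same hardware can be compute-bound during prefill and memory-bound during decode.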
2. Cache economics
Latent attention state is mainly an inference decision disguised as an architecture decision
Standard dense attention stores full K and V for every token, so cache memory scales directly with sequence length and concurrency. The report's latent compression changes this by reducing what must stay resident between decode steps.
Dense cache: more tokens means proportionally more full K/V state in memory.
Latent cache: store a smaller carrier state, then reconstruct the expensive view only when computing attention.
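The carrier-state idea can be sketched in a few lines. This is an illustration of the compress-then-reconstruct pattern, not the report's actual projection shapes: the widths, the random matrices, and the `decode_step` helper are all assumptions made for the example.

```python
import random

random.seed(0)
D_MODEL, D_LATENT = 64, 16   # assumed widths; the latent carrier is 4x smaller

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vecmat(v, m):
    """Row vector (len r) times matrix (r x c) -> vector of len c."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

# Down-projection compresses the hidden state; up-projections rebuild K and V.
W_down = rand_matrix(D_MODEL, D_LATENT)
W_up_k = rand_matrix(D_LATENT, D_MODEL)
W_up_v = rand_matrix(D_LATENT, D_MODEL)

latent_cache = []  # the only per-token state kept resident between decode steps

def decode_step(hidden):
    """Store the small carrier; reconstruct full-width K/V only for attention."""
    latent_cache.append(vecmat(hidden, W_down))
    keys = [vecmat(c, W_up_k) for c in latent_cache]
    values = [vecmat(c, W_up_v) for c in latent_cache]
    return keys, values

for _ in range(10):
    keys, values = decode_step([random.gauss(0.0, 1.0) for _ in range(D_MODEL)])

dense_floats = 10 * 2 * D_MODEL   # what a full K/V cache would hold
latent_floats = 10 * D_LATENT     # what the latent cache actually holds
print(f"dense cache floats:  {dense_floats}")
print(f"latent cache floats: {latent_floats}")
```

The extra up-projection work happens per attention call, while the memory saving applies to everything held resident across decode steps, which is the trade the section is describing.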
3. Sparse serving
Hot experts make serving a load-balancing problem, not only a compute problem
Sparse models create uneven load online because some experts are chosen much more often than others. The report therefore discusses redundancy and balancing strategies for serving, not just for training.
Route through the sparse FFN gate.
Some experts attract much higher online traffic than others.
Duplicate high-pressure experts and re-balance requests to avoid overload.
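The three steps above can be sketched as a small simulation. The traffic skew, the 2x-mean replication threshold, and the round-robin split are all assumptions for illustration, not the report's serving policy:

```python
import random
from collections import Counter

random.seed(1)
NUM_EXPERTS = 8

# Step 1: route tokens through the gate. Here the gate is simulated with a
# skewed distribution in which experts 0 and 1 are "hot" (assumed pattern).
weights = [8, 8, 1, 1, 1, 1, 1, 1]
traffic = random.choices(range(NUM_EXPERTS), weights=weights, k=10_000)

# Step 2: observe that some experts attract far more online traffic.
load = Counter(traffic)

# Step 3: duplicate high-pressure experts (here, load above 2x the mean)
# and split their requests round-robin across the copies.
mean_load = len(traffic) / NUM_EXPERTS
replicas = {e: (2 if load[e] > 2 * mean_load else 1) for e in range(NUM_EXPERTS)}
per_copy = {e: load[e] / replicas[e] for e in range(NUM_EXPERTS)}

print(f"worst expert load without replication: {max(load.values())}")
print(f"worst per-copy load with replication:  {max(per_copy.values()):.0f}")
```

Replication does not reduce total work; it caps the load on any single copy, which is what keeps the hottest expert from becoming the serving bottleneck.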
4. Speculation
Multi-token prediction pays off again by offering a path to faster generation
Because the report already trained extra future-token prediction structure, it can connect that structure back to speculative decoding. The intuition is simple: if the model is already learning to look ahead, some of that signal can be reused to propose tokens earlier.
Generate the current token and hidden state.
Use extra prediction structure to propose future tokens.
Keep proposals that match the main path and discard the rest.