Chapter 05
Inference
The report's inference section explains why deployment is shaped by the same architectural decisions introduced earlier. Latent KV state, sparse routing, hot experts, and multi-token prediction all show up again once the model starts generating tokens in real time.
1. Prefill vs decode
The report treats inference as two workloads because their bottlenecks are different
Prefill ingests the prompt; decode generates one token at a time. Prefill has larger parallel chunks and higher arithmetic intensity: more compute-heavy, larger batches, easier to amortize communication. Decode repeatedly touches growing cache state and becomes sensitive to memory traffic and routing overhead: more memory-sensitive, smaller steps, a tighter latency budget, more exposed cache cost.
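The contrast can be made concrete with a toy arithmetic-intensity estimate for one attention layer. This is a minimal sketch with illustrative dimensions (the head width, prompt length, and 2-byte cache format are assumptions, not numbers from the report):

```python
# Toy cost model for one attention layer, contrasting prefill and decode.
# All dimensions are illustrative, not taken from the report.

def attention_flops(q_tokens: int, kv_tokens: int, d: int) -> int:
    # Two matmuls (QK^T and attn @ V), each roughly 2*q*kv*d FLOPs.
    return 2 * 2 * q_tokens * kv_tokens * d

def kv_cache_bytes(kv_tokens: int, d: int, dtype_bytes: int = 2) -> int:
    # K and V, each kv_tokens x d, stored in a 2-byte format.
    return 2 * kv_tokens * d * dtype_bytes

d_head, prompt_len = 128, 1024

# Prefill: all prompt tokens attend in one large, parallel chunk of work.
prefill_intensity = (attention_flops(prompt_len, prompt_len, d_head)
                     / kv_cache_bytes(prompt_len, d_head))

# Decode: a single query token per step, reading the whole cache each time.
decode_intensity = (attention_flops(1, prompt_len, d_head)
                    / kv_cache_bytes(prompt_len, d_head))

print(f"prefill FLOPs per cache byte: {prefill_intensity:.0f}")  # high: compute-bound
print(f"decode  FLOPs per cache byte: {decode_intensity:.2f}")   # low: memory-bound
```

The ratio between the two intensities grows with prompt length, which is why the same hardware can be compute-bound during prefill and memory-bound during decode.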
2. Cache economics
Latent attention state is mainly an inference decision disguised as an architecture decision
Standard dense attention stores full K and V for every token, so cache memory scales directly with sequence length and concurrency. The report's latent compression changes this by reducing what must stay resident between decode steps.
Dense cache: more tokens means proportionally more full K/V state in memory.
Latent cache: store a smaller carrier state, then reconstruct the expensive view only when computing attention.
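The carrier-state idea can be sketched in a few lines. This is an illustration of the compress-then-reconstruct pattern, not the report's actual projection shapes: the widths, the random matrices, and the `decode_step` helper are all assumptions made for the example.

```python
import random

random.seed(0)
D_MODEL, D_LATENT = 64, 16   # assumed widths; the latent carrier is 4x smaller

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vecmat(v, m):
    """Row vector (len r) times matrix (r x c) -> vector of len c."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

# Down-projection compresses the hidden state; up-projections rebuild K and V.
W_down = rand_matrix(D_MODEL, D_LATENT)
W_up_k = rand_matrix(D_LATENT, D_MODEL)
W_up_v = rand_matrix(D_LATENT, D_MODEL)

latent_cache = []  # the only per-token state kept resident between decode steps

def decode_step(hidden):
    """Store the small carrier; reconstruct full-width K/V only for attention."""
    latent_cache.append(vecmat(hidden, W_down))
    keys = [vecmat(c, W_up_k) for c in latent_cache]
    values = [vecmat(c, W_up_v) for c in latent_cache]
    return keys, values

for _ in range(10):
    keys, values = decode_step([random.gauss(0.0, 1.0) for _ in range(D_MODEL)])

dense_floats = 10 * 2 * D_MODEL   # what a full K/V cache would hold
latent_floats = 10 * D_LATENT     # what the latent cache actually holds
print(f"dense cache floats:  {dense_floats}")
print(f"latent cache floats: {latent_floats}")
```

The extra up-projection work happens per attention call, while the memory saving applies to everything held resident across decode steps, which is the trade the section is describing.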
3. Sparse serving
Hot experts make serving a load-balancing problem, not only a compute problem
Sparse models create uneven load online because some experts are chosen much more often than others. The report therefore discusses redundancy and balancing strategies for serving, not just for training.
Route through the sparse FFN gate.
Some experts attract much higher online traffic than others.
Duplicate high-pressure experts and re-balance requests to avoid overload.
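The three steps above can be sketched as a small simulation. The traffic skew, the 2x-mean replication threshold, and the round-robin split are all assumptions for illustration, not the report's serving policy:

```python
import random
from collections import Counter

random.seed(1)
NUM_EXPERTS = 8

# Step 1: route tokens through the gate. Here the gate is simulated with a
# skewed distribution in which experts 0 and 1 are "hot" (assumed pattern).
weights = [8, 8, 1, 1, 1, 1, 1, 1]
traffic = random.choices(range(NUM_EXPERTS), weights=weights, k=10_000)

# Step 2: observe that some experts attract far more online traffic.
load = Counter(traffic)

# Step 3: duplicate high-pressure experts (here, load above 2x the mean)
# and split their requests round-robin across the copies.
mean_load = len(traffic) / NUM_EXPERTS
replicas = {e: (2 if load[e] > 2 * mean_load else 1) for e in range(NUM_EXPERTS)}
per_copy = {e: load[e] / replicas[e] for e in range(NUM_EXPERTS)}

print(f"worst expert load without replication: {max(load.values())}")
print(f"worst per-copy load with replication:  {max(per_copy.values()):.0f}")
```

Replication does not reduce total work; it caps the load on any single copy, which is what keeps the hottest expert from becoming the serving bottleneck.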
4. Speculation
Multi-token prediction pays off again by offering a path to faster generation
Because the report already trained extra future-token prediction structure, it can connect that structure back to speculative decoding. The intuition is simple: if the model is already learning to look ahead, some of that signal can be reused to propose tokens earlier.
Generate the current token and hidden state.
Use extra prediction structure to propose future tokens.
Keep proposals that match the main path and discard the rest.