Overview

Read a frontier LLM report as one connected system.

This site is a structured companion for one modern large-model report. It walks from the core loss and optimization math into attention, RoPE, sparse routing, FP8 training, long-context extension, inference, post-training, and evaluation.

Prediction math

Logits, cross-entropy, AdamW, precision, and multi-token prediction.

Architecture

RoPE, latent KV compression, sparse experts, routing, balancing.

Infrastructure

Compute clusters, DualPipe, all-to-all overlap, FP8, memory policy.

Pre-training

Data construction, hyper-parameters, long-context extension, ablations.

Inference

Prefill vs decode, cache economics, expert serving, speculative decoding.

Post-training

SFT, reward models, group-relative RL, self-rewarding and refinement.

Evaluation

Benchmarks, open-ended review, appendix deep dives, limits and critique.

The goal is to understand how each layer of the report constrains the next.

Section Summary

Seven chapters that explain the report end to end

01 Foundations

Core math and optimization

Learn the parts of the report that rely on basic training math: cross-entropy, perplexity, AdamW, precision formats, recomputation, and why multi-token prediction gives denser learning signals.
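The cross-entropy and perplexity relationship at the heart of this chapter can be sketched in a few lines. This is a generic illustration (the function names and toy logits are ours, not the report's):

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target token under a softmax over the logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

# Average loss over a toy two-token sequence.
losses = [cross_entropy([2.0, 0.5, -1.0], 0), cross_entropy([0.1, 1.5, 0.3], 1)]
avg_loss = sum(losses) / len(losses)

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(avg_loss)
```

A uniform prediction over a vocabulary of size V gives a loss of log V and a perplexity of exactly V, which is why perplexity is often read as an "effective branching factor."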

02 Architecture

Attention, RoPE, latent KV, and sparse experts

This chapter explains the model mechanics: transformer blocks, RoPE, latent attention state, sparse FFN design, routing, sequence-wise balance, and why the report avoids token dropping.
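As a minimal illustration of the rotary-position idea covered in this chapter, the sketch below rotates consecutive dimension pairs of a query or key vector by position-dependent angles. It is a generic RoPE sketch, not the report's exact configuration:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of a query/key vector by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # rotation frequency falls off across pairs
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

Because each pair is only rotated, the vector's norm is unchanged, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n, which is what makes the encoding relative.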

03 Infrastructure

How the training stack is made economical

Focus on cluster layout, overlap-heavy scheduling, all-to-all communication, FP8 training, memory-saving tricks, and what the report says current hardware still does poorly.

04 Pre-Training

Data, hyper-parameters, and long-context extension

Covers the report's data pipeline, training mixture, hyper-parameter choices, context-window growth, long-context extension, and the pre-training ablations used to justify design choices.

05 Inference

Prefill, decode, cache pressure, and sparse serving

Explains the deployment side of the report: why prompt ingestion and token-by-token generation behave differently, how cache size drives cost, and how sparse expert serving needs balancing and redundancy.
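To see why cache size drives cost, a back-of-envelope sketch helps. The model shape below is a toy dense-attention example of our own, not the report's actual configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Per-request KV cache: a key and a value for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Toy example: a 32-layer model with 8 KV heads of width 128,
# holding a 32k-token context in a 2-byte format.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768)
print(f"{cache / 2**30:.1f} GiB per request")  # → 4.0 GiB per request
```

Gigabytes of cache per concurrent request is exactly the pressure that motivates techniques like the latent KV compression discussed in the architecture chapter.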

06 Post-Training

SFT, reward models, group-relative RL, and self-rewarding

This chapter covers the report's alignment pipeline: supervised tuning, reasoning distillation, rule-based and model-based rewards, group-relative RL, generative reward-model usage, and refinement through model-generated data.
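The group-relative idea can be illustrated with a tiny sketch: each sampled response is scored against the mean and spread of its own group, so no separate value network is needed to estimate a baseline. This is a simplified normalization of our own, not the report's full objective:

```python
def group_relative_advantages(rewards):
    """Score each sampled response relative to its own group's mean and spread."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Four sampled answers to the same prompt, scored by a reward model.
adv = group_relative_advantages([0.2, 0.9, 0.4, 0.9])
```

Responses better than their group's average get positive advantages and the rest get negative ones, so the policy update pushes probability toward the relatively better samples.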

07 Evaluations

Benchmarks, appendix deep dives, and how to read the claims

The final chapter interprets what the report measures, what its open-ended evaluations do and do not prove, and how the appendix-style deep dives on low precision and expert specialization support the main claims.