Core math and optimization
Learn the parts of the report that rely on basic training math: cross-entropy loss, perplexity, AdamW, numeric precision formats, activation recomputation, and why multi-token prediction provides a denser learning signal.
- Answers: what the model optimizes, and why training in low precision can still be numerically stable.
- Key technical ideas: CE loss, AdamW, FP8/BF16/FP32, accumulation, MTP.
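The link between cross-entropy and perplexity is worth internalizing before reading the report: perplexity is just the exponential of the mean per-token cross-entropy. A minimal sketch (toy probabilities, hypothetical vocabulary; not from the report):

```python
import math

def cross_entropy(probs, target_index):
    # CE for one token: negative log of the probability the model
    # assigned to the true next token.
    return -math.log(probs[target_index])

# Toy next-token distribution over a 4-token vocabulary (made-up values).
probs = [0.1, 0.6, 0.2, 0.1]

# Suppose the true token is index 1; average CE over the (single) token.
losses = [cross_entropy(probs, 1)]
mean_ce = sum(losses) / len(losses)

# Perplexity = exp(mean cross-entropy); lower is better.
perplexity = math.exp(mean_ce)
print(round(perplexity, 4))  # equals 1 / 0.6 here, since only one token
```

A model that always assigned probability 1 to the true token would have CE 0 and perplexity 1; uniform guesses over a vocabulary of size V give perplexity V.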