Chapter 07
Evaluations and Limits
Does the trained system compete across known public tasks?
The final chapter explains how to read the report's evidence. That means looking at standard benchmarks, open-ended evaluations, appendix deep dives on low precision and expert specialization, and the limits of what the report can claim.
1. Benchmark reading
Benchmarks tell you where the report is strong, not whether every design choice was necessary
Standard benchmarks are useful for broad capability comparison, but they do not by themselves prove that any specific scheduling trick or balancing strategy was the decisive cause. Read them as outcome evidence, not as direct causal proof of every engineering choice.
Would a simpler system with more compute have achieved the same result?
2. Open-ended evaluation
Open-ended review matters because some of the report's claims are about behavior, not only scores
The report supplements standard metrics with more subjective or generative review, including reward-model-style use. This matters because response quality, reasoning style, and usefulness are not fully captured by exact-match tasks.
Why open-ended review exists
Some claims are about interaction quality, style, and reasoning behavior.
Why it is weaker evidence
Judgment quality depends on prompt choice, evaluator quality, and reward model design.
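One common way to turn such pairwise judgments (from a human rater or a reward-model-style judge) into comparable scores is a Bradley-Terry fit. The sketch below is illustrative only: the system names, win counts, and the choice of Bradley-Terry itself are assumptions for this example, not the report's stated procedure.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns normalized strengths; a higher value means the judge
    preferred that system more often, head-to-head.
    """
    wins = defaultdict(int)    # total wins per system
    games = defaultdict(int)   # comparison counts per unordered pair
    systems = set()
    for w, l in comparisons:
        wins[w] += 1
        games[frozenset((w, l))] += 1
        systems.update((w, l))
    p = {s: 1.0 for s in systems}
    for _ in range(iters):  # minorization-maximization updates
        new_p = {}
        for s in systems:
            denom = 0.0
            for o in systems:
                if o == s:
                    continue
                n = games[frozenset((s, o))]
                if n:
                    denom += n / (p[s] + p[o])
            new_p[s] = wins[s] / denom if denom else p[s]
        total = sum(new_p.values())
        p = {s: v / total for s, v in new_p.items()}
    return p

# Hypothetical judge verdicts: A preferred 3 of 4 times against B.
scores = bradley_terry([("A", "B")] * 3 + [("B", "A")])
# A's normalized strength converges to 0.75, its head-to-head win rate.
```

Pairwise aggregation like this tolerates sparse comparison graphs, which is one reason pairwise judging scales better than asking an evaluator for absolute scores — but the fitted scores inherit every bias of the underlying judge.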
3. Appendix deep dives
The appendices fill in where the report's main claims would otherwise feel under-argued
Three appendix-style themes matter most when studying the paper: FP8 vs BF16 training behavior, finer-grained quantization choices, and the emergence of expert specialization under the chosen routing and balancing strategy.
FP8 vs BF16
Shows whether the low-precision framework actually preserves training quality.
Block-wise quantization
Explains why a single coarse, per-tensor scale squanders local magnitude variation, especially around outliers, that finer block-wise scales can exploit.
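A toy simulation makes the point concrete. The sketch below compares one scale for a whole tensor against one scale per block, using an int8-style symmetric grid as a stand-in for real FP8 formats; the data, block size of 32, and outlier pattern are all invented for illustration.

```python
import random

LEVELS = 127  # int8-style symmetric grid; a stand-in for real FP8 formats

def fake_quantize(xs, scale):
    """Scale, round, clip to the grid, then dequantize to simulate precision loss."""
    return [max(-LEVELS, min(LEVELS, round(x / scale))) * scale for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def per_tensor_error(xs):
    # One scale for everything: outliers force a coarse grid on small values.
    scale = max(abs(x) for x in xs) / LEVELS
    return mse(xs, fake_quantize(xs, scale))

def block_wise_error(xs, block=32):
    # One scale per block: blocks without outliers keep a fine grid.
    out = []
    for i in range(0, len(xs), block):
        blk = xs[i:i + block]
        m = max(abs(x) for x in blk)
        out.extend(fake_quantize(blk, m / LEVELS if m else 1.0))
    return mse(xs, out)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]
for i in range(0, len(x), 512):   # rare large outliers, as activations often show
    x[i] *= 50.0
```

On data like this, block-wise error comes out far lower, because only the few blocks that actually contain an outlier pay the coarse-grid penalty.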
Expert specialization
Shows whether sparse experts truly separate roles or merely emulate dense capacity with routing overhead.
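One simple diagnostic for specialization is to compare per-domain expert usage from routing logs: low routing entropy within a domain plus low overlap across domains suggests genuinely separated roles. The routing logs, expert count, and domains below are entirely hypothetical; this only sketches the kind of measurement such an appendix might report.

```python
import math
from collections import Counter

def expert_usage(assignments, num_experts):
    """Fraction of a domain's tokens routed to each expert."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def entropy(dist):
    """Shannon entropy in bits; low means routing concentrates on few experts."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical routing logs: each entry is the expert id chosen for one token.
code_tokens = [0, 0, 1, 0, 1, 0, 0, 1]    # code prompts favor experts 0 and 1
prose_tokens = [2, 3, 2, 2, 3, 3, 2, 3]   # prose prompts favor experts 2 and 3

code_dist = expert_usage(code_tokens, num_experts=4)
prose_dist = expert_usage(prose_tokens, num_experts=4)
# Histogram overlap: 0.0 means perfectly disjoint expert roles across domains.
overlap = sum(min(a, b) for a, b in zip(code_dist, prose_dist))
```

The failure mode is visible with the same metric: if every domain's usage histogram looks alike, the experts are emulating dense capacity with routing overhead rather than specializing.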
Why these belong here
They strengthen the causal story behind the headline results.
4. Limits and critique
A good reading ends by separating what the report proves from what it strongly suggests
- The paper strongly supports the claim that sparse training and deployment can be made economical with careful systems design.
- It suggests, but does not rigorously isolate, the contribution of each individual trick within a production-quality stack.
- It gives unusually concrete hardware suggestions, which is one of its strongest systems contributions.
- Its post-training sections show useful behavior gains, but open-ended quality remains harder to measure cleanly than benchmark accuracy.
- The best way to read the report is as a tightly integrated stack rather than a menu of independent tricks.