Chapter 07
Evaluations and Limits
Does the trained system compete across known public tasks?
The final chapter explains how to read the report's evidence. That means looking at standard benchmarks, open-ended evaluations, appendix deep dives on low precision and expert specialization, and the limits of what the report can claim.
1. Benchmark reading
Benchmarks tell you where the report is strong, not whether every design choice was necessary
Standard benchmarks are useful for broad capability comparison, but they do not by themselves prove that any specific scheduling trick or balancing strategy was the decisive cause. Read them as outcome evidence, not as direct causal proof of every engineering choice.
Would a simpler system with more compute have achieved the same result?
2. Open-ended evaluation
Open-ended review matters because some of the report's claims are about behavior, not only scores
The report supplements standard metrics with more subjective or generative review, including reward-model-style use. This matters because response quality, reasoning style, and usefulness are not fully captured by exact-match tasks.
Why open-ended review exists
Some claims are about interaction quality, style, and reasoning behavior.
Why it is weaker evidence
Judgment quality depends on prompt choice, evaluator quality, and reward model design.
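One common way to turn such pairwise judgments (from a human rater or a reward-model-style judge) into comparable scores is a Bradley-Terry fit. The sketch below is illustrative only: the system names, win counts, and the choice of Bradley-Terry itself are assumptions for this example, not the report's stated procedure.

```python
from collections import defaultdict

def bradley_terry(comparisons, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns normalized strengths; a higher value means the judge
    preferred that system more often, head-to-head.
    """
    wins = defaultdict(int)    # total wins per system
    games = defaultdict(int)   # comparison counts per unordered pair
    systems = set()
    for w, l in comparisons:
        wins[w] += 1
        games[frozenset((w, l))] += 1
        systems.update((w, l))
    p = {s: 1.0 for s in systems}
    for _ in range(iters):  # minorization-maximization updates
        new_p = {}
        for s in systems:
            denom = 0.0
            for o in systems:
                if o == s:
                    continue
                n = games[frozenset((s, o))]
                if n:
                    denom += n / (p[s] + p[o])
            new_p[s] = wins[s] / denom if denom else p[s]
        total = sum(new_p.values())
        p = {s: v / total for s, v in new_p.items()}
    return p

# Hypothetical judge verdicts: A preferred 3 of 4 times against B.
scores = bradley_terry([("A", "B")] * 3 + [("B", "A")])
# A's normalized strength converges to 0.75, its head-to-head win rate.
```

Pairwise aggregation like this tolerates sparse comparison graphs, which is one reason pairwise judging scales better than asking an evaluator for absolute scores — but the fitted scores inherit every bias of the underlying judge.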
3. Appendix deep dives
The appendices fill in where the report's main claims would otherwise feel under-argued
Three appendix-style themes matter most when studying the paper: FP8 vs BF16 training behavior, finer-grained quantization choices, and the emergence of expert specialization under the chosen routing and balancing strategy.
FP8 vs BF16
Shows whether the low-precision framework actually preserves training quality.
Block-wise quantization
Explains why a single coarse, per-tensor scale squanders local magnitude variation, especially around outliers, that finer block-wise scales can exploit.
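A toy simulation makes the point concrete. The sketch below compares one scale for a whole tensor against one scale per block, using an int8-style symmetric grid as a stand-in for real FP8 formats; the data, block size of 32, and outlier pattern are all invented for illustration.

```python
import random

LEVELS = 127  # int8-style symmetric grid; a stand-in for real FP8 formats

def fake_quantize(xs, scale):
    """Scale, round, clip to the grid, then dequantize to simulate precision loss."""
    return [max(-LEVELS, min(LEVELS, round(x / scale))) * scale for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def per_tensor_error(xs):
    # One scale for everything: outliers force a coarse grid on small values.
    scale = max(abs(x) for x in xs) / LEVELS
    return mse(xs, fake_quantize(xs, scale))

def block_wise_error(xs, block=32):
    # One scale per block: blocks without outliers keep a fine grid.
    out = []
    for i in range(0, len(xs), block):
        blk = xs[i:i + block]
        m = max(abs(x) for x in blk)
        out.extend(fake_quantize(blk, m / LEVELS if m else 1.0))
    return mse(xs, out)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]
for i in range(0, len(x), 512):   # rare large outliers, as activations often show
    x[i] *= 50.0
```

On data like this, block-wise error comes out far lower, because only the few blocks that actually contain an outlier pay the coarse-grid penalty.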
Expert specialization
Shows whether sparse experts truly separate roles or merely emulate dense capacity with routing overhead.
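One simple diagnostic for specialization is to compare per-domain expert usage from routing logs: low routing entropy within a domain plus low overlap across domains suggests genuinely separated roles. The routing logs, expert count, and domains below are entirely hypothetical; this only sketches the kind of measurement such an appendix might report.

```python
import math
from collections import Counter

def expert_usage(assignments, num_experts):
    """Fraction of a domain's tokens routed to each expert."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

def entropy(dist):
    """Shannon entropy in bits; low means routing concentrates on few experts."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical routing logs: each entry is the expert id chosen for one token.
code_tokens = [0, 0, 1, 0, 1, 0, 0, 1]    # code prompts favor experts 0 and 1
prose_tokens = [2, 3, 2, 2, 3, 3, 2, 3]   # prose prompts favor experts 2 and 3

code_dist = expert_usage(code_tokens, num_experts=4)
prose_dist = expert_usage(prose_tokens, num_experts=4)
# Histogram overlap: 0.0 means perfectly disjoint expert roles across domains.
overlap = sum(min(a, b) for a, b in zip(code_dist, prose_dist))
```

The failure mode is visible with the same metric: if every domain's usage histogram looks alike, the experts are emulating dense capacity with routing overhead rather than specializing.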
Why these belong here
They strengthen the causal story behind the headline results.
4. Limits and critique
A good reading ends by separating what the report proves from what it strongly suggests
- The paper strongly supports the claim that sparse training and deployment can be made economical with careful systems design.
- It suggests, but does not rigorously isolate, the contribution of each individual trick within a production-quality stack.
- It gives unusually concrete hardware suggestions, which is one of its strongest systems contributions.
- Its post-training sections show useful behavior gains, but open-ended quality remains harder to measure cleanly than benchmark accuracy.
- The best way to read the report is as a tightly integrated stack rather than a menu of independent tricks.