Key Moments
- NVIDIA reports up to 48% higher training throughput for reinforcement learning using an end-to-end FP8 precision recipe versus BF16 baselines.
- Validation accuracy for Llama 3.1 8B Instruct with full FP8 usage reached 0.613, closely tracking the BF16 result of 0.616.
- Dynamic recalibration of quantization scales at every training step adds an estimated 2-3% overhead, a cost the recipe's overall speedup outweighs across models, including Qwen3-30B.
Aligning Precision Across RL Generation and Training
NVIDIA has introduced a detailed FP8 precision methodology for reinforcement learning that targets up to 48% faster training throughput while maintaining accuracy comparable to BF16-based workflows. The approach, outlined in a technical blog by NVIDIA’s Guyue Huang, is positioned as a lever for reducing AI infrastructure costs and improving the economics of GPU compute.
The work focuses on a longstanding challenge in reinforcement learning pipelines: discrepancies between the generation and training stages when they operate at different numerical precisions and rely on separate execution engines. In conventional RL setups, vLLM is used for rollouts and Megatron Core for training, each employing distinct CUDA kernels. At lower precision formats, these implementation differences can accumulate and lead to notable numerical divergence, which has historically limited the practical use of FP8.
Unified FP8 Strategy and Accuracy Outcomes
NVIDIA’s strategy is to apply FP8 uniformly across both the generation and training components instead of mixing precision levels between them. In experiments with the Llama 3.1 8B Instruct model, end-to-end FP8 usage produced a validation accuracy of 0.613, compared with 0.616 for a BF16 configuration. By contrast, enabling FP8 only for the generation phase lowered accuracy to 0.586, underscoring the impact of consistent precision management throughout the RL loop.
The FP8 recipe employs block-wise quantized FP8 in the E4M3 format. Quantization granularity is set to 128×128 for weights and 1×128 for activations. Under this configuration, linear layers are executed with FP8 arithmetic and are designed to reach 2x theoretical peak throughput relative to BF16. Other components such as attention mechanisms, normalization layers, and non-linear operations remain in BF16 precision.
| Component / Setting | Precision | Details / Result |
|---|---|---|
| Llama 3.1 8B Instruct – BF16 | BF16 | Validation accuracy 0.616 |
| Llama 3.1 8B Instruct – End-to-end FP8 | FP8 (E4M3) | Validation accuracy 0.613 |
| Llama 3.1 8B Instruct – FP8 generation only | Mixed (FP8/BF16) | Validation accuracy 0.586 |
| Linear layers | FP8 | Up to 2x theoretical throughput vs BF16 |
| Attention, normalization, non-linear ops | BF16 | Remain in BF16 precision |
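The block-wise scheme above (128×128 weight blocks, 1×128 activation blocks, E4M3 with a maximum finite magnitude of 448) can be sketched as follows. This is an illustrative NumPy mock-up, not the NeMo RL implementation, and the integer rounding step is a crude stand-in for true E4M3 mantissa rounding:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def blockwise_scales(weight, block=(128, 128)):
    """One scale per weight block: block amax divided by the FP8 max."""
    br, bc = block
    nr, nc = weight.shape[0] // br, weight.shape[1] // bc
    scales = np.empty((nr, nc), dtype=np.float32)
    for i in range(nr):
        for j in range(nc):
            blk = weight[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            scales[i, j] = np.abs(blk).max() / FP8_E4M3_MAX
    return scales

def fake_quantize(weight, scales, block=(128, 128)):
    """Scale each block into FP8 range, round, and scale back.
    Integer rounding here only approximates E4M3's real rounding."""
    br, bc = block
    out = np.empty_like(weight)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            s = scales[i, j]
            blk = weight[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            q = np.clip(np.round(blk / s), -FP8_E4M3_MAX, FP8_E4M3_MAX)
            out[i * br:(i + 1) * br, j * bc:(j + 1) * bc] = q * s
    return out

# 1x128 activation scaling is the same idea with block=(1, 128).
w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
s = blockwise_scales(w)        # shape (2, 2): one scale per 128x128 block
w_dq = fake_quantize(w, s)
```

Finer block granularity limits how far one outlier value can inflate the scale shared by its neighbors, which is why weights and activations get different block shapes.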
Performance Characteristics and Overheads
When FP8 is applied only to the linear layers, measured throughput improves by 15-25%. The gap between the 2x theoretical peak and these realized gains is attributed to attention layers remaining in BF16 and to the cost of the quantization kernels themselves.
When FP8 is extended beyond linear layers to cover key-value (KV) cache and attention operations, NVIDIA reports an overall performance improvement of approximately 48% relative to BF16-based baselines. In reinforcement learning, however, policy weights are continuously updated, which necessitates dynamic recalibration of quantization scales at every training step. According to NVIDIA, this recalibration introduces around 2-3% overhead, a trade-off characterized as modest in light of the aggregate acceleration.
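Because RL updates the policy weights at every step, scales computed for the previous weights no longer describe the new value range. A minimal sketch of per-step recalibration, using hypothetical names and a per-tensor scale rather than block-wise for brevity:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def recalibrate_scale(weight):
    """Refresh the quantization scale from the weight's current amax.
    In RL this must run every training step: the policy weights change,
    so a stale scale would clip or waste the FP8 dynamic range."""
    return float(np.abs(weight).max()) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64)).astype(np.float32)

scales = []
for step in range(3):
    # Simulated optimizer update: the weight distribution drifts...
    weight += 0.1 * rng.standard_normal(weight.shape).astype(np.float32)
    # ...so the scale is recomputed before the next FP8 rollout.
    scales.append(recalibrate_scale(weight))
```

The extra amax reductions over all quantized tensors are the source of the reported 2-3% per-step overhead.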
NVIDIA also evaluated the method on Qwen3-30B, described as a mixture-of-experts model. The FP8 and BF16 configurations exhibited closely aligned accuracy curves, indicating that the technique is applicable across different model architectures.
Implications for Training Costs and Model Quality
Reinforcement learning for models designed for advanced reasoning, such as systems underlying sophisticated AI assistants, entails substantial computational requirements. Within that context, a roughly 48% training speedup directly reduces GPU-hours and power consumption for organizations developing such models.
A key component of this FP8 methodology is an importance sampling technique aimed at preserving accuracy. By adjusting for distribution mismatches between the generation and training models on a per-token basis, it allows the system to adopt more aggressive precision reduction while keeping model quality intact.
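The article does not spell out the exact correction, but per-token importance sampling generally reweights each token by the ratio of training-policy to generation-policy probability. A hedged sketch, in which the function name and the clipping threshold are illustrative choices, not NVIDIA's:

```python
import numpy as np

def token_importance_weights(logp_train, logp_gen, clip=5.0):
    """Per-token ratio pi_train / pi_gen, computed from log-probs.
    Clipping bounds the variance introduced when the two engines'
    numerics diverge sharply on a token."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_gen))
    return np.clip(ratio, 1.0 / clip, clip)

# Tokens where the FP8 generator agrees with the trainer get weight ~1;
# mismatched tokens are re-weighted before the policy-gradient loss.
weights = token_importance_weights([-1.0, -2.0], [-1.0, -2.5])
```

Correcting the mismatch statistically, rather than forcing bit-identical kernels, is what lets the recipe tolerate aggressive precision reduction in the generation engine.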
Implementation in NeMo RL and Customization Options
NVIDIA states that the complete implementation is available within the open-source NeMo RL library. The release includes pre-configured FP8 recipes for Llama 3.1 8B and Moonlight 16B models, allowing users to deploy the approach with minimal integration work.
For more advanced users, the design supports several customization levers. Practitioners can, for example, preserve selected transformer layers in BF16 if desired, or choose power-of-2 scaling factors to further optimize behavior. These options are intended to help tailor the trade-off between performance and numerical robustness to specific workloads.
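As an illustration of the second lever, a power-of-2 scaling factor can be obtained by rounding the ideal scale up to the next power of two, which makes the division exact in binary floating point. This helper is a sketch of the general technique, not NeMo RL's API:

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def pow2_scale(amax, fp8_max=FP8_E4M3_MAX):
    """Smallest power-of-two scale that still fits amax into FP8 range.
    Powers of two divide exactly in binary floating point, trading a
    little dynamic range for better numerical reproducibility."""
    return 2.0 ** math.ceil(math.log2(amax / fp8_max))

# A tensor with amax 500 gets scale 2.0, since 500 / 2.0 = 250 <= 448.
```

Exact, reproducible scaling is attractive precisely because the generation and training engines use different kernels: it removes one source of divergence between them.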
For AI infrastructure operators confronting rising compute demands as model complexity increases, NVIDIA positions this FP8 reinforcement learning recipe as a meaningful efficiency enhancement that does not depend on new hardware. Instead, it seeks to extract more value from existing H100 deployments through more efficient use of FP8 capabilities.