Key Moments
- NVIDIA reports up to 48% higher training throughput for reinforcement learning using an end-to-end FP8 precision recipe versus BF16 baselines.
- Validation accuracy for Llama 3.1 8B Instruct with full FP8 usage reached 0.613, closely tracking the BF16 result of 0.616.
- Dynamic recalibration of quantization scales at every training step adds an estimated 2-3% overhead, a cost the recipe's overall speedup outweighs across models, including Qwen3-30B.
Aligning Precision Across RL Generation and Training
NVIDIA has introduced a detailed FP8 precision methodology for reinforcement learning that targets up to 48% faster training throughput while maintaining accuracy comparable to BF16-based workflows. The approach, outlined in a technical blog by NVIDIA’s Guyue Huang, is positioned as a lever for reducing AI infrastructure costs and improving the economics of GPU compute.
The work focuses on a longstanding challenge in reinforcement learning pipelines: discrepancies between the generation and training stages when they operate at different numerical precisions and rely on separate execution engines. In conventional RL setups, vLLM is used for rollouts and Megatron Core for training, each employing distinct CUDA kernels. At lower precision formats, these implementation differences can accumulate and lead to notable numerical divergence, which has historically limited the practical use of FP8.
Unified FP8 Strategy and Accuracy Outcomes
NVIDIA’s strategy is to apply FP8 uniformly across both the generation and training components instead of mixing precision levels between them. In experiments with the Llama 3.1 8B Instruct model, end-to-end FP8 usage produced a validation accuracy of 0.613, compared with 0.616 for a BF16 configuration. By contrast, enabling FP8 only for the generation phase lowered accuracy to 0.586, underscoring the impact of consistent precision management throughout the RL loop.
The FP8 recipe employs block-wise quantized FP8 in the E4M3 format. Quantization granularity is set to 128×128 for weights and 1×128 for activations. Under this configuration, linear layers are executed with FP8 arithmetic and are designed to reach 2x theoretical peak throughput relative to BF16. Other components such as attention mechanisms, normalization layers, and non-linear operations remain in BF16 precision.
| Component / Setting | Precision | Details / Result |
|---|---|---|
| Llama 3.1 8B Instruct – BF16 | BF16 | Validation accuracy 0.616 |
| Llama 3.1 8B Instruct – End-to-end FP8 | FP8 (E4M3) | Validation accuracy 0.613 |
| Llama 3.1 8B Instruct – FP8 generation only | Mixed (FP8/BF16) | Validation accuracy 0.586 |
| Linear layers | FP8 | Up to 2x theoretical throughput vs BF16 |
| Attention, normalization, non-linear ops | BF16 | Remain in BF16 precision |
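The block-wise scheme above (128×128 weight blocks, 1×128 activation blocks, E4M3 with a maximum finite magnitude of 448) can be sketched as follows. This is an illustrative NumPy mock-up, not the NeMo RL implementation, and the integer rounding step is a crude stand-in for true E4M3 mantissa rounding:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def blockwise_scales(weight, block=(128, 128)):
    """One scale per weight block: block amax divided by the FP8 max."""
    br, bc = block
    nr, nc = weight.shape[0] // br, weight.shape[1] // bc
    scales = np.empty((nr, nc), dtype=np.float32)
    for i in range(nr):
        for j in range(nc):
            blk = weight[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            scales[i, j] = np.abs(blk).max() / FP8_E4M3_MAX
    return scales

def fake_quantize(weight, scales, block=(128, 128)):
    """Scale each block into FP8 range, round, and scale back.
    Integer rounding here only approximates E4M3's real rounding."""
    br, bc = block
    out = np.empty_like(weight)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            s = scales[i, j]
            blk = weight[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            q = np.clip(np.round(blk / s), -FP8_E4M3_MAX, FP8_E4M3_MAX)
            out[i * br:(i + 1) * br, j * bc:(j + 1) * bc] = q * s
    return out

# 1x128 activation scaling is the same idea with block=(1, 128).
w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
s = blockwise_scales(w)        # shape (2, 2): one scale per 128x128 block
w_dq = fake_quantize(w, s)
```

Finer block granularity limits how far one outlier value can inflate the scale shared by its neighbors, which is why weights and activations get different block shapes.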
Performance Characteristics and Overheads
When FP8 is applied only to the linear layers, measured throughput improves by 15-25%. The gap between the 2x theoretical peak and these realized gains is attributed to attention layers remaining in BF16 and to the cost of the quantization kernels themselves.
When FP8 is extended beyond linear layers to cover key-value (KV) cache and attention operations, NVIDIA reports an overall performance improvement of approximately 48% relative to BF16-based baselines. In reinforcement learning, however, policy weights are continuously updated, which necessitates dynamic recalibration of quantization scales at every training step. According to NVIDIA, this recalibration introduces around 2-3% overhead, a trade-off characterized as modest in light of the aggregate acceleration.
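Because RL updates the policy weights at every step, scales computed for the previous weights no longer describe the new value range. A minimal sketch of per-step recalibration, using hypothetical names and a per-tensor scale rather than block-wise for brevity:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def recalibrate_scale(weight):
    """Refresh the quantization scale from the weight's current amax.
    In RL this must run every training step: the policy weights change,
    so a stale scale would clip or waste the FP8 dynamic range."""
    return float(np.abs(weight).max()) / FP8_E4M3_MAX

rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64)).astype(np.float32)

scales = []
for step in range(3):
    # Simulated optimizer update: the weight distribution drifts...
    weight += 0.1 * rng.standard_normal(weight.shape).astype(np.float32)
    # ...so the scale is recomputed before the next FP8 rollout.
    scales.append(recalibrate_scale(weight))
```

The extra amax reductions over all quantized tensors are the source of the reported 2-3% per-step overhead.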
NVIDIA also evaluated the method on Qwen3-30B, described as a mixture-of-experts model. The FP8 and BF16 configurations exhibited closely aligned accuracy curves, indicating that the technique is applicable across different model architectures.
Implications for Training Costs and Model Quality
Reinforcement learning for models designed for advanced reasoning, such as systems underlying sophisticated AI assistants, entails substantial computational requirements. Within that context, a roughly 48% training speedup directly reduces GPU-hours and power consumption for organizations developing such models.
A key component of this FP8 methodology is an importance sampling technique aimed at preserving accuracy. By adjusting for distribution mismatches between the generation and training models on a per-token basis, it allows the system to adopt more aggressive precision reduction while keeping model quality intact.
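The article does not spell out the exact correction, but per-token importance sampling generally reweights each token by the ratio of training-policy to generation-policy probability. A hedged sketch, in which the function name and the clipping threshold are illustrative choices, not NVIDIA's:

```python
import numpy as np

def token_importance_weights(logp_train, logp_gen, clip=5.0):
    """Per-token ratio pi_train / pi_gen, computed from log-probs.
    Clipping bounds the variance introduced when the two engines'
    numerics diverge sharply on a token."""
    ratio = np.exp(np.asarray(logp_train) - np.asarray(logp_gen))
    return np.clip(ratio, 1.0 / clip, clip)

# Tokens where the FP8 generator agrees with the trainer get weight ~1;
# mismatched tokens are re-weighted before the policy-gradient loss.
weights = token_importance_weights([-1.0, -2.0], [-1.0, -2.5])
```

Correcting the mismatch statistically, rather than forcing bit-identical kernels, is what lets the recipe tolerate aggressive precision reduction in the generation engine.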
Implementation in NeMo RL and Customization Options
NVIDIA states that the complete implementation is available within the open-source NeMo RL library. The release includes pre-configured FP8 recipes for Llama 3.1 8B and Moonlight 16B models, allowing users to deploy the approach with minimal integration work.
For more advanced users, the design supports several customization levers. Practitioners can, for example, preserve selected transformer layers in BF16 if desired, or choose power-of-2 scaling factors to further optimize behavior. These options are intended to help tailor the trade-off between performance and numerical robustness to specific workloads.
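As an illustration of the second lever, a power-of-2 scaling factor can be obtained by rounding the ideal scale up to the next power of two, which makes the division exact in binary floating point. This helper is a sketch of the general technique, not NeMo RL's API:

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def pow2_scale(amax, fp8_max=FP8_E4M3_MAX):
    """Smallest power-of-two scale that still fits amax into FP8 range.
    Powers of two divide exactly in binary floating point, trading a
    little dynamic range for better numerical reproducibility."""
    return 2.0 ** math.ceil(math.log2(amax / fp8_max))

# A tensor with amax 500 gets scale 2.0, since 500 / 2.0 = 250 <= 448.
```

Exact, reproducible scaling is attractive precisely because the generation and training engines use different kernels: it removes one source of divergence between them.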
For AI infrastructure operators confronting rising compute demands as model complexity increases, NVIDIA positions this FP8 reinforcement learning recipe as a meaningful efficiency enhancement that does not depend on new hardware. Instead, it seeks to extract more value from existing H100 deployments through more efficient use of FP8 capabilities.