NVBlock: NVIDIA NVFP4 Block Format for GPU Inference
NVIDIA’s AI accelerators (Hopper, Blackwell) need a quantization format optimized for their hardware architecture. While OCP MX uses 32-element blocks with a power-of-two scale (e8m0), NVIDIA’s NVFP4 format uses 16-element blocks with a fractional scale (e4m3) and an additional tensor-level scale. This design achieves finer granularity than MX’s power-of-two scaling, resulting in lower quantization error for the same compression ratio, at the cost of slightly more complex scale computation.
The nvblock type implements NVIDIA’s two-level scaling architecture, enabling software validation and benchmarking of NVFP4-quantized models before deployment on GPU hardware.
nvblock<ElementType, BlockSize, ScaleType> is a two-level block-scaled format:
| Parameter | Type | Default | Description |
|---|---|---|---|
| ElementType | typename | e2m1 | Element type (e2m1 for NVFP4) |
| BlockSize | size_t | 16 | Elements per block (NVIDIA default: 16) |
| ScaleType | typename | e4m3 | Block scale type (fractional, not power-of-two) |
Standard Configuration
```cpp
using nvfp4 = nvblock<e2m1, 16, e4m3>;  // Canonical NVFP4
```

Two-Level Scaling Architecture
```
Tensor scale (float32, external)
└─── Block scale (e4m3, per block of 16 elements)
     └─── Elements (e2m1, per value)
```

- Tensor scale: float32 value, one per tensor or per layer
- Block scale: e4m3 value (fractional precision), one per 16 elements
- Elements: e2m1 values (4-bit), one per weight/activation
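A minimal numeric sketch of this three-level decomposition, holding all levels in plain floats (the concrete values are invented for illustration; in storage the block scale is an e4m3 and the element an e2m1):

```cpp
#include <iostream>

int main() {
    // hypothetical values for one stored weight
    float tensor_scale = 0.25f;  // per tensor/layer, kept in float32
    float block_scale  = 1.75f;  // per 16-element block, stored as e4m3
    float element      = 4.0f;   // per value, stored as e2m1 (4 bits)

    // every stored value is reconstructed as the product of all three levels
    float value = tensor_scale * block_scale * element;
    std::cout << value << '\n';  // prints 1.75
}
```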
Key Differences from OCP MX
| Property | OCP MX (mxblock) | NVIDIA (nvblock) |
|---|---|---|
| Block size | 32 | 16 |
| Scale type | e8m0 (power-of-two) | e4m3 (fractional) |
| Scale precision | 255 power-of-two values | ~256 fractional values |
| Scale computation | floor(log2(amax)) | round_to_e4m3(amax / elem_max) |
| External scale | None | Tensor-level float32 |
| RMSE for same data | Higher | Lower (finer granularity) |
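To make the scale-computation row of the table above concrete, here is a float-only sketch of the two derivations; round_to_e4m3 is elided (the fractional scale is returned as-is), elem_max = 6.0 is the largest e2m1 magnitude, and amax is assumed positive:

```cpp
#include <cmath>

constexpr float elem_max = 6.0f;  // largest representable e2m1 magnitude

// OCP MX: snap the scale to a power of two (coarse grid of scale values)
float mx_scale(float amax) {
    return std::exp2(std::floor(std::log2(amax)));
}

// NVFP4: fractional scale, rounded to e4m3 in the real implementation
float nv_scale(float amax) {
    float raw_scale = amax / elem_max;
    return raw_scale;  // real code: round_to_e4m3(raw_scale)
}
```

Flooring to a power of two can leave the scale off by nearly a factor of two, whereas e4m3's three mantissa bits bound the relative rounding error of the scale to roughly 6% over most of its range.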
Key Properties
- Fractional block scale: e4m3 provides finer granularity than power-of-two
- Two-level scaling: tensor scale + block scale = better dynamic range
- Smaller blocks (16): better spatial locality for GPU cache
- NVIDIA hardware-native: aligned with TensorRT and GPU warp operations
- Lower quantization error: consistently lower RMSE than MX at same bit width
How It Works
Quantization (float → nvblock)
Section titled “Quantization (float → nvblock)”- Pre-divide inputs by tensor_scale:
prescaled[i] = input[i] / tensor_scale - Find absolute maximum of the prescaled block:
amax = max(|prescaled[i]|) - Compute raw scale:
raw_scale = amax / elem_max(where elem_max is the maximum e2m1 value) - Round to nearest e4m3:
block_scale = round_to_e4m3(raw_scale)(not floor!) - Scale each element:
element[i] = round_to_e2m1(prescaled[i] / block_scale)
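The following is a float-only sketch of these five steps, not the library's internal implementation: round_to_e4m3 and round_to_e2m1 are elided, so the block scale stays fractional and elements are only clamped to the e2m1 range rather than rounded onto its grid; elem_max = 6.0 is the largest e2m1 magnitude:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> quantize_block(const std::vector<float>& input,
                                  float tensor_scale,
                                  float& block_scale) {
    constexpr float elem_max = 6.0f;  // largest e2m1 magnitude

    // step 1: pre-divide inputs by the tensor scale
    std::vector<float> prescaled(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        prescaled[i] = input[i] / tensor_scale;

    // step 2: absolute maximum of the prescaled block
    float amax = 0.0f;
    for (float v : prescaled) amax = std::max(amax, std::fabs(v));

    // steps 3-4: raw scale, rounded to e4m3 in the real implementation
    block_scale = amax / elem_max;
    if (block_scale == 0.0f) block_scale = 1.0f;  // all-zero block guard

    // step 5: scale each element (round_to_e2m1 modeled here as a clamp)
    std::vector<float> elements(input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
        elements[i] = std::clamp(prescaled[i] / block_scale,
                                 -elem_max, elem_max);
    return elements;
}
```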
Dequantization (nvblock → float)
output[i] = tensor_scale × block_scale × element[i]

Dot Product
result = tensor_scale_a × tensor_scale_b × Σ(block_scale_a × block_scale_b × element_a[i] × element_b[i])

The tensor scales can be factored out of the inner loop for efficiency, as the sketch below shows.
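A sketch of one block's contribution with the scales factored out of the element loop; block_dot is an illustrative name, not the library API (that is block_a.dot(block_b, ts_a, ts_b), shown below), and element values are held as plain floats:

```cpp
#include <cstddef>
#include <vector>

float block_dot(const std::vector<float>& elem_a, float block_scale_a, float ts_a,
                const std::vector<float>& elem_b, float block_scale_b, float ts_b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < elem_a.size(); ++i)
        acc += elem_a[i] * elem_b[i];  // inner loop touches elements only
    // all four scales are applied once, outside the element loop
    return ts_a * ts_b * block_scale_a * block_scale_b * acc;
}
```

In a multi-block dot product, the block-scale product is applied once per block and the tensor-scale product once per result, so the innermost loop stays pure element arithmetic.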
How to Use It
Include

```cpp
#include <universal/number/nvblock/nvblock.hpp>
#include <vector>   // std::vector is used in the examples below

using namespace sw::universal;
```

Quantize and Dequantize
```cpp
nvblock<e2m1, 16, e4m3> block;
float tensor_scale = 1.0f;  // Often computed per-layer

std::vector<float> weights(16);
// ... fill weights ...
block.quantize(weights, tensor_scale);

std::vector<float> reconstructed(16);
block.dequantize(reconstructed, tensor_scale);
```

Tensor-Scaled Dot Product
```cpp
nvblock<e2m1, 16, e4m3> block_a, block_b;
float ts_a = 1.0f, ts_b = 1.0f;  // Tensor scales

std::vector<float> a(16), b(16);
// ... fill a and b ...
block_a.quantize(a, ts_a);
block_b.quantize(b, ts_b);

float dot = block_a.dot(block_b, ts_a, ts_b);
```

Comparing with OCP MX
```cpp
#include <universal/number/mxfloat/mxfloat.hpp>
#include <universal/number/nvblock/nvblock.hpp>

std::vector<float> data(32);
// ... fill with neural network weights ...

// OCP MX: 32 elements, e8m0 power-of-two scale
mxblock<e2m1, 32> mx_block;
mx_block.quantize(data);

// NVIDIA: 16 elements, e4m3 fractional scale (process two blocks)
nvblock<e2m1, 16, e4m3> nv_block_lo, nv_block_hi;
std::vector<float> lo(data.begin(), data.begin() + 16);
std::vector<float> hi(data.begin() + 16, data.end());
nv_block_lo.quantize(lo, 1.0f);
nv_block_hi.quantize(hi, 1.0f);

// NVIDIA typically achieves lower RMSE due to fractional scale granularity
```
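To quantify that claim on your own data, round-trip both formats and measure the reconstruction error. A minimal sketch (rmse is a hypothetical helper; it assumes dequantize calls analogous to those shown earlier for nvblock):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square error between the original data and its
// quantize + dequantize round trip.
float rmse(const std::vector<float>& original,
           const std::vector<float>& reconstructed) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < original.size(); ++i) {
        float d = original[i] - reconstructed[i];
        sum += d * d;
    }
    return std::sqrt(sum / original.size());
}
```

Dequantize mx_block into a 32-element vector and the two nvblocks into the two 16-element halves, then compare rmse(data, mx_out) against rmse(data, nv_out).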
Problems It Solves

| Problem | How nvblock Solves It |
|---|---|
| MX’s power-of-two scale is too coarse | e4m3 fractional scale provides finer granularity |
| Need NVIDIA GPU-native quantization format | Direct implementation of NVFP4 specification |
| Single-level scale can’t capture both layer-wide and local dynamics | Two-level scaling (tensor + block) |
| 32-element blocks don’t match GPU warp size | 16-element blocks align with GPU cache lines |
| Software validation of quantized models before GPU deployment | Bit-exact emulation of hardware quantization |
| Quantization RMSE too high with MX format | Fractional scale consistently produces lower error |