
E8M0: Exponent-Only Power-of-Two Scale Factor

Block-scaled number formats (like OCP MX and NVIDIA NVFP4) need a compact scale factor that can be applied to a group of low-precision elements. This scale factor must be:

  • Compact: 1 byte per block (not per element)
  • Efficient to apply: scaling by a power of 2 is a simple bit shift, not a multiply
  • Wide range: must cover the full dynamic range of the data being scaled

The e8m0 type is an 8-bit exponent-only format: no sign bit, no mantissa, pure power-of-two encoding. It represents values of the form 2^(encoding - 127), covering a range from 2^-127 (~5.9 × 10^-39) to 2^127 (~1.7 × 10^38). Encoding 0xFF is reserved for NaN, and zero is not representable. Scaling a floating-point value by an e8m0 value is an exponent adjustment (a bit shift for fixed-point data), not a floating-point multiply.

e8m0 is a fixed-format 8-bit exponent-only type:

| Property | Value |
|---|---|
| Total bits | 8 |
| Sign bits | 0 (always positive) |
| Mantissa bits | 0 (no fractional part) |
| Exponent bits | 8 |
| Bias | 127 |
| Value | 2^(encoding - 127) |
| Range | 2^-127 to 2^127 |
| Special | 0xFF = NaN |

  • Pure power-of-two: every value is an exact power of 2
  • No sign bit: always positive (scale factors are magnitudes)
  • No mantissa: maximum dynamic range for a single byte
  • Shift-based arithmetic: scaling = exponent addition (integer add on bytes)
  • Trivially copyable: single uint8_t storage
  • NaN marker: encoding 0xFF is reserved for NaN

| Encoding | Value |
|---|---|
| 0 | 2^-127 ≈ 5.88 × 10^-39 |
| 64 | 2^-63 ≈ 1.08 × 10^-19 |
| 127 | 2^0 = 1.0 |
| 128 | 2^1 = 2.0 |
| 190 | 2^63 ≈ 9.22 × 10^18 |
| 254 | 2^127 ≈ 1.70 × 10^38 |
| 255 | NaN |

The encoding is simple: the stored byte minus the bias (127) gives the exponent.

value = 2^(stored_byte - 127)
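The decode rule above can be sketched in standalone C++ without the library. The helper names `e8m0_decode` and `e8m0_encode_exponent` are hypothetical and exist only for this illustration; they are not part of the Universal API. The sketch decodes to `double` so that 2^-127 stays a normal number, and clamping out-of-range exponents is an assumption of this sketch.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only (not the Universal API).
constexpr int kE8m0Bias = 127;
constexpr std::uint8_t kE8m0NaN = 0xFF;

// Decode: value = 2^(stored_byte - 127); encoding 0xFF is NaN.
inline double e8m0_decode(std::uint8_t bits) {
    if (bits == kE8m0NaN) return std::numeric_limits<double>::quiet_NaN();
    return std::ldexp(1.0, static_cast<int>(bits) - kE8m0Bias);
}

// Encode an unbiased exponent as a stored byte.
// Assumption: out-of-range exponents are clamped to [-127, 127].
inline std::uint8_t e8m0_encode_exponent(int exponent) {
    if (exponent < -127) exponent = -127;
    if (exponent > 127) exponent = 127;
    return static_cast<std::uint8_t>(exponent + kE8m0Bias);
}
```

For example, `e8m0_decode(135)` yields 256.0, and `e8m0_encode_exponent(8)` yields the stored byte 135.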

Multiplying one e8m0 by another is addition of their stored bytes, minus one bias (127) so the result stays biased. Applying an e8m0 scale to a floating-point value adds the e8m0 exponent to the float's exponent field, which is equivalent to ldexp(value, exponent); for fixed-point values it is a simple bit shift.
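A minimal sketch of both operations on raw stored bytes follows. The function names `e8m0_mul` and `e8m0_apply` are hypothetical, and saturating at the ends of the finite range is an assumption of this sketch rather than documented library behavior.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Product of two e8m0 encodings, working on the stored bytes:
// 2^(a-127) * 2^(b-127) = 2^((a + b - 127) - 127).
// Assumption: results saturate to the finite encodings [0, 254]; NaN (0xFF) propagates.
inline std::uint8_t e8m0_mul(std::uint8_t a, std::uint8_t b) {
    if (a == 0xFF || b == 0xFF) return 0xFF;
    int sum = static_cast<int>(a) + static_cast<int>(b) - 127;
    if (sum < 0) sum = 0;
    if (sum > 254) sum = 254;
    return static_cast<std::uint8_t>(sum);
}

// Apply an e8m0 scale to a float by adjusting its exponent with ldexp.
inline float e8m0_apply(float value, std::uint8_t scale_bits) {
    if (scale_bits == 0xFF) return std::numeric_limits<float>::quiet_NaN();
    return std::ldexp(value, static_cast<int>(scale_bits) - 127);
}
```

For example, `e8m0_mul(128, 135)` gives 136, since 2 × 256 = 512 = 2^9, and `e8m0_apply(1.5f, 135)` gives 384.0f.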

    #include <iostream>
    #include <universal/number/e8m0/e8m0.hpp>
    using namespace sw::universal;

    // Construct scale factors from power-of-two values
    e8m0 scale(1.0f);    // Encoding 127, value = 2^0 = 1.0
    e8m0 scale2(256.0f); // Encoding 135, value = 2^8 = 256.0
    std::cout << "scale: " << scale << " encoding: " << to_binary(scale) << std::endl;
    std::cout << "scale2: " << scale2 << " encoding: " << to_binary(scale2) << std::endl;

    // e8m0 is the scale type for OCP MX blocks
    #include <vector>
    #include <universal/number/mxfloat/mxfloat.hpp>

    // MX block: 32 elements sharing one e8m0 scale
    // The block internally uses e8m0 as the shared scale factor
    mxblock<e4m3, 32> block;

    // Quantize float data into the block
    std::vector<float> data(32);
    // ... fill data ...
    block.quantize(data);
    // The e8m0 scale captures the block's dynamic range
    // Elements are scaled relative to this power-of-two
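How a shared scale can "capture the block's dynamic range" is sketched below, following the conversion suggested in the OCP MX v1.0 specification: the shared exponent is floor(log2(max |v|)) minus the largest exponent of the element format. This is an illustration under stated assumptions, not necessarily what `mxblock::quantize` does internally. The name `mx_shared_scale` is hypothetical, inputs are assumed finite, and returning 2^0 for an all-zero block is an arbitrary choice of this sketch.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Largest exponent of FP8 E4M3: its maximum normal value is 448 = 1.75 * 2^8.
constexpr int kE4M3Emax = 8;

// Hypothetical helper: pick the e8m0 stored byte for one block of floats.
// Assumes all inputs are finite.
inline std::uint8_t mx_shared_scale(const std::vector<float>& block,
                                    int elem_emax = kE4M3Emax) {
    // Largest magnitude in the block
    float amax = 0.0f;
    for (float v : block) amax = std::fmax(amax, std::fabs(v));

    // Assumption: any scale represents an all-zero block; pick 2^0.
    if (amax == 0.0f) return 127;

    // shared_exp = floor(log2(amax)) - emax of the element format
    int shared_exp = std::ilogb(amax) - elem_emax;

    // Clamp to the finite e8m0 exponent range, then add the bias
    if (shared_exp < -127) shared_exp = -127;
    if (shared_exp > 127) shared_exp = 127;
    return static_cast<std::uint8_t>(shared_exp + 127);
}
```

For a block whose largest magnitude is 448, the shared exponent is 0 and the stored byte is 127; for a largest magnitude of 1.0, the shared exponent is -8 and the stored byte is 119.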

    // Enumerate all 256 encodings
    e8m0 val;
    for (unsigned i = 0; i < 256; ++i) {
        val.setbits(i);
        if (i < 255) {
            std::cout << "encoding " << i << ": 2^" << (int(i) - 127)
                      << " = " << val << std::endl;
        } else {
            std::cout << "encoding 255: NaN" << std::endl;
        }
    }

| Problem | How e8m0 Solves It |
|---|---|
| Block formats need a compact scale factor | 1 byte per block, covers about 5.9 × 10^-39 to 1.7 × 10^38 |
| Floating-point multiply for scaling is expensive | Power-of-2 scaling is an exponent add or bit shift |
| Scale factor must cover full data dynamic range | 8-bit exponent spans 2^-127 to 2^127: 254 binary orders of magnitude (about 76 decimal) |
| Hardware needs simple, fixed-format scale encoding | No sign, no mantissa, trivial decode |
| OCP MX specification compliance | Direct implementation of OCP MX v1.0 scale type |
| Memory-efficient metadata for quantized tensors | 1 byte overhead per 32 elements |
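
The per-block overhead in the last row can be made concrete with a small calculation: one 8-bit scale shared by a block adds 8 / block_size bits to every element. The function name `effective_bits_per_element` is hypothetical and used only for this illustration.

```cpp
// Effective storage cost per element when one 8-bit e8m0 scale
// is shared by a block of block_size elements.
constexpr double effective_bits_per_element(int element_bits, int block_size) {
    return element_bits + 8.0 / block_size;
}

// 8-bit elements in blocks of 32: 8 + 8/32 = 8.25 bits per element
static_assert(effective_bits_per_element(8, 32) == 8.25, "8-bit elements");
// 4-bit elements in blocks of 32: 4 + 8/32 = 4.25 bits per element
static_assert(effective_bits_per_element(4, 32) == 4.25, "4-bit elements");
```

With 32-element blocks, the scale adds 0.25 bits per element regardless of the element width.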