E8M0: Exponent-Only Power-of-Two Scale Factor

Why

Block-scaled number formats (like OCP MX and NVIDIA NVFP4) need a compact scale factor that can be applied to a group of low-precision elements. This scale factor must be:

Compact: 1 byte per block (not per element)
Efficient to apply: scaling by a power of 2 only adjusts the exponent of a floating-point value (or shifts a fixed-point value), so no multiplier is needed
Wide range: must cover the full dynamic range of the data being scaled

The e8m0 type is an 8-bit exponent-only format: no sign bit, no mantissa, pure power-of-two encoding. It represents values of the form 2^(encoding - 127). Encodings 0 through 254 cover 2^-127 (~5.9 × 10^-39) to 2^127 (~1.7 × 10^38), and encoding 255 is reserved for NaN. Applying an e8m0 scale to a floating-point value adds to its exponent field (for fixed-point data it is a bit shift), so no floating-point multiply is required.

What

e8m0 is a fixed-format 8-bit exponent-only type:

Property	Value
Total bits	8
Sign bits	0 (always positive)
Mantissa bits	0 (no fractional part)
Exponent bits	8
Bias	127
Value	2^(encoding - 127)
Range	2^-127 to 2^127
Special	0xFF = NaN

Key Properties

Pure power-of-two: every value is an exact power of 2
No sign bit: always positive (scale factors are magnitudes)
No mantissa: maximum dynamic range for a single byte
Shift-based arithmetic: scaling = exponent addition (integer add on bytes)
Trivially copyable: single uint8_t storage
NaN marker: encoding 0xFF is reserved for NaN

Value Table (Selected)

Encoding	Value
0	2^-127 ≈ 5.88 × 10^-39
64	2^-63 ≈ 1.08 × 10^-19
127	2^0 = 1.0
128	2^1 = 2.0
190	2^63 ≈ 9.22 × 10^18
254	2^127 ≈ 1.70 × 10^38
255	NaN

How It Works

Decoding takes a single subtraction: the stored byte minus the bias (127) gives the exponent.

value = 2^(stored_byte - 127)
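
The decode rule above can be sketched in standard C++ without the library. The helper names (e8m0_decode, e8m0_encode_exact) and constants are hypothetical and are not part of the Universal API. The encoder here only accepts exact powers of two and maps everything else to the NaN encoding, which is an assumption of this sketch; a production type would more likely round.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only; not the Universal library API.
constexpr int kE8M0Bias = 127;
constexpr std::uint8_t kE8M0NaN = 0xFF;

// Decode a stored byte to 2^(byte - 127). Encoding 0xFF decodes to NaN.
inline double e8m0_decode(std::uint8_t bits) {
    if (bits == kE8M0NaN) return std::numeric_limits<double>::quiet_NaN();
    return std::ldexp(1.0, static_cast<int>(bits) - kE8M0Bias);
}

// Encode an exact power of two in [2^-127, 2^127] to its stored byte.
// Sketch assumption: zero, negatives, NaN, infinity, out-of-range values,
// and non-powers-of-two all map to the NaN encoding instead of being rounded.
inline std::uint8_t e8m0_encode_exact(double x) {
    if (!(x > 0.0) || std::isinf(x)) return kE8M0NaN;
    int exp = 0;
    const double frac = std::frexp(x, &exp);  // x = frac * 2^exp, frac in [0.5, 1)
    if (frac != 0.5) return kE8M0NaN;         // not an exact power of two
    const int biased = (exp - 1) + kE8M0Bias; // x = 2^(exp - 1)
    if (biased < 0 || biased > 254) return kE8M0NaN;
    return static_cast<std::uint8_t>(biased);
}
```

With these helpers, e8m0_decode(127) yields 1.0 and e8m0_encode_exact(256.0) yields 135, matching the value table.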

Multiplying two e8m0 values adds their stored bytes and subtracts the bias once: encoding(a × b) = encoding(a) + encoding(b) - 127, provided the result stays within 0 to 254. Applying an e8m0 scale to a floating-point value adds the e8m0 exponent to the float's exponent field, which is equivalent to ldexp(value, exponent). For fixed-point values the same operation is a bit shift.
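
Both operations can be sketched on raw bytes in standard C++. The function names (e8m0_mul_bits, e8m0_apply) are hypothetical, and returning the NaN encoding when a product leaves the representable range is a choice made for this sketch; an implementation could equally saturate.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only; not the Universal library API.

// Product of two e8m0 encodings:
//   2^(a - 127) * 2^(b - 127) = 2^((a + b - 127) - 127)
// Returns 0xFF (NaN) if either input is NaN or the result falls outside 0..254.
inline std::uint8_t e8m0_mul_bits(std::uint8_t a, std::uint8_t b) {
    if (a == 0xFF || b == 0xFF) return 0xFF;
    const int sum = static_cast<int>(a) + static_cast<int>(b) - 127;
    if (sum < 0 || sum > 254) return 0xFF;  // sketch assumption: no saturation
    return static_cast<std::uint8_t>(sum);
}

// Apply an e8m0 scale to a float. std::ldexp adjusts the exponent directly,
// so no floating-point multiply is involved.
inline float e8m0_apply(float value, std::uint8_t scale_bits) {
    if (scale_bits == 0xFF) return std::numeric_limits<float>::quiet_NaN();
    return std::ldexp(value, static_cast<int>(scale_bits) - 127);
}
```

For example, e8m0_mul_bits(128, 135) combines 2^1 and 2^8 into encoding 136 (2^9), and e8m0_apply(1.5f, 135) scales 1.5 by 2^8 to give 384.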

How to Use It

Include

#include <iostream>   // std::cout, used in the examples below
#include <universal/number/e8m0/e8m0.hpp>
using namespace sw::universal;

Basic Usage

e8m0 scale(1.0f);     // Encoding 127, value = 2^0 = 1.0
e8m0 scale2(256.0f);  // Encoding 135, value = 2^8 = 256.0

std::cout << "scale: " << scale << " encoding: " << to_binary(scale) << std::endl;
std::cout << "scale2: " << scale2 << " encoding: " << to_binary(scale2) << std::endl;

As Block Scale Factor

// e8m0 is the scale type for OCP MX blocks
#include <universal/number/mxfloat/mxfloat.hpp>
#include <vector>

// MX block: 32 elements sharing one e8m0 scale
mxblock<e4m3, 32> block;
// The block internally uses e8m0 as the shared scale factor

// Quantize float data into the block
std::vector<float> data(32);
// ... fill data ...
block.quantize(data);
// The e8m0 scale captures the block's dynamic range
// Elements are scaled relative to this power-of-two
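
How a shared scale can be derived from a block's data is sketched below in standard C++. It follows the commonly described MX recipe of taking floor(log2(max |x|)) and subtracting the largest exponent of the element format. The function name choose_block_scale and the elem_emax parameter are hypothetical, the handling of all-zero and non-finite blocks is an assumption of this sketch, and none of this is claimed to be what mxblock does internally.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration only; not the mxblock implementation.
// elem_emax is the exponent of the largest normal value of the element
// format (8 for e4m3, whose maximum is 448 = 1.75 * 2^8).
inline std::uint8_t choose_block_scale(const std::vector<float>& block, int elem_emax) {
    float max_abs = 0.0f;
    for (float x : block) {
        if (!std::isfinite(x)) return 0xFF;  // sketch assumption: NaN/inf poisons the block
        max_abs = std::max(max_abs, std::fabs(x));
    }
    if (max_abs == 0.0f) return 127;         // all zeros: any scale works, pick 2^0

    // std::ilogb returns floor(log2(|x|)) for finite nonzero x.
    int shared_exp = std::ilogb(max_abs) - elem_emax;
    shared_exp = std::max(-127, std::min(127, shared_exp));  // clamp to e8m0 range
    return static_cast<std::uint8_t>(shared_exp + 127);      // add the bias
}
```

With elem_emax = 8, a block whose largest magnitude is 448 gets encoding 127 (scale 1.0), while a largest magnitude of 1000 gets encoding 128 (scale 2.0). In the second case 1000 / 2 = 500 still exceeds 448, so the element conversion would need to saturate or round that value.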

Dynamic Range Inspection

e8m0 val;
for (unsigned i = 0; i < 256; ++i) {
    val.setbits(i);
    if (i < 255) {
        std::cout << "encoding " << i << ": 2^" << (int(i) - 127)
                  << " = " << val << std::endl;
    } else {
        std::cout << "encoding 255: NaN" << std::endl;
    }
}

Problems It Solves

Problem	How e8m0 Solves It
Block formats need a compact scale factor	1 byte per block, covers roughly 5.9 × 10^-39 to 1.7 × 10^38
Floating-point multiply for scaling is expensive	Power-of-2 scaling is an exponent add (or a bit shift for fixed-point data)
Scale factor must cover full data dynamic range	8-bit exponent spans 254 binades (about 76 decimal orders of magnitude)
Hardware needs simple, fixed-format scale encoding	No sign, no mantissa, trivial decode
OCP MX specification compliance	Direct implementation of OCP MX v1.0 scale type
Memory-efficient metadata for quantized tensors	1 byte overhead per 32 elements
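
The metadata overhead in the last row is easy to quantify. The sketch below computes the effective storage per element for a block that carries one 8-bit e8m0 scale; the function name bits_per_element is hypothetical.

```cpp
// Hypothetical helper for illustration only.
// Effective bits per element when block_size elements of elem_bits each
// share a single 8-bit e8m0 scale.
constexpr double bits_per_element(int elem_bits, int block_size) {
    return (static_cast<double>(elem_bits) * block_size + 8.0) / block_size;
}
```

For 32-element blocks this gives 8.25 bits per element with 8-bit elements and 4.25 bits per element with 4-bit elements, so the shared scale adds 0.25 bits per element in both cases.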