# MX Float: OCP Microscaling Block Format for AI
Large language models contain billions of parameters. Storing each one as a 32-bit float costs hundreds of gigabytes, far more than fits in GPU memory. Naive per-element quantization to 4 or 8 bits can lose too much precision, especially for the outlier values that are critical to model accuracy.
The OCP (Open Compute Project) Microscaling (MX) format solves this by sharing a single scale factor across a block of elements. One e8m0 scale value (a power-of-two exponent in a single byte) captures the block’s magnitude, while each element stores only its relative value in 4-8 bits. This achieves 4-8x compression with controlled precision loss: the shared scale preserves the overall magnitude, and individual elements capture the relative distribution within each block.
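For a sense of scale (an illustrative calculation, not a benchmark of any specific model): a 7-billion-parameter model needs about 28 GB in FP32, while mxfp4 spends 4 bits per element plus one 8-bit scale per 32-element block, about 4.25 bits per parameter, or roughly 3.7 GB.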
`mxblock<ElementType, BlockSize>` is a block-scaled number format:

| Parameter | Type | Default | Description |
|---|---|---|---|
| ElementType | typename | — | Microfloat element type (e2m1, e2m3, e3m2, e4m3, e5m2, or int8_t) |
| BlockSize | size_t | 32 | Elements per block (OCP default: 32) |
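For example, instantiating the template directly with the defaults from this table (using the include shown later in "How to Use It"):

```cpp
#include <universal/number/mxfloat/mxfloat.hpp>
using namespace sw::universal;

// 32 e4m3 elements sharing a single e8m0 scale (the OCP default block size)
mxblock<e4m3, 32> block;
```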
## Standard MX Format Aliases

| Alias | Element | Block Size | Bits/Element | Compression vs FP32 |
|---|---|---|---|---|
| mxfp4 | e2m1 | 32 | 4 | ~8x |
| mxfp6 | e3m2 | 32 | 6 | ~5x |
| mxfp6e2m3 | e2m3 | 32 | 6 | ~5x |
| mxfp8 | e4m3 | 32 | 8 | ~4x |
| mxfp8e5m2 | e5m2 | 32 | 8 | ~4x |
| mxint8 | int8_t | 32 | 8 | ~4x |
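Assuming the aliases are usable as ordinary types (a reasonable reading of the table, though not stated explicitly above), declarations read as:

```cpp
#include <universal/number/mxfloat/mxfloat.hpp>
using namespace sw::universal;

mxfp4 w4;   // 4-bit e2m1 elements: maximum compression (~8x vs FP32)
mxfp8 w8;   // 8-bit e4m3 elements: balanced precision and size (~4x)
```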
## Block Structure

```text
[e8m0 scale : 8 bits] [element[0] : N bits] [element[1] : N bits] ... [element[31] : N bits]
```

Total storage per block: `8 + BlockSize × element_bits` bits.
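Worked example for mxfp4 (4-bit e2m1 elements, BlockSize = 32): 8 + 32 × 4 = 136 bits, about 17 bytes per block, versus 32 × 32 = 1024 bits (128 bytes) for the same 32 values in FP32, a 7.5x reduction that the alias table above rounds to ~8x.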
## Key Properties

- Shared scale factor: one e8m0 per block, amortized over 32 elements
- Power-of-two scale: `floor(log2(amax))` makes scaling a bit shift, not a multiply (see the sketch below)
- OCP MX v1.0 compliant: industry standard for AI model quantization
- Block-level quantization: preserves relative precision within each block
- Deterministic: quantize/dequantize is bit-exact and reproducible
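The power-of-two property can be demonstrated with standard float machinery. In this minimal standalone sketch (not library code, names are illustrative), `std::frexp` recovers `floor(log2(amax))` without a log call, and `std::ldexp` applies the scale by adjusting exponent bits only:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    float amax = 3.14f;

    // frexp decomposes amax as m * 2^e with m in [0.5, 1),
    // so floor(log2(amax)) is simply e - 1
    int e = 0;
    std::frexp(amax, &e);
    int shared_exp = e - 1;                        // floor(log2(3.14)) == 1

    // ldexp scales by a power of two exactly: only the exponent changes
    float scaled = std::ldexp(amax, -shared_exp);  // 1.57, now in [1, 2)

    std::printf("shared_exp = %d, scaled = %g\n", shared_exp, scaled);
    return 0;
}
```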
## How It Works

### Quantization (float → mxblock)

1. Find the absolute maximum of the input block: `amax = max(|input[i]|)`
2. Compute the shared exponent: `shared_exp = floor(log2(amax))`
3. Clamp it to the e8m0 range: `shared_exp = clamp(shared_exp, -127, 127)`
4. Scale each element: `element[i] = input[i] / 2^shared_exp`
5. Round each scaled element to the nearest ElementType value
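Condensed into code, the five steps look like the following standalone sketch. This is illustrative C++, not the library's implementation: elements stay as plain `float` (a real mxblock would round them to the 4-8 bit ElementType in step 5), and the hypothetical `MxBlockF` type holds the e8m0 scale as an unbiased int for readability.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of an MX block: a shared exponent plus scaled elements.
struct MxBlockF {
    int shared_exp;               // e8m0 exponent, kept unbiased here
    std::vector<float> elements;  // would be 4-8 bit microfloats in a real mxblock
};

MxBlockF quantize(const std::vector<float>& input) {
    // Step 1: absolute maximum of the block
    float amax = 0.0f;
    for (float v : input) amax = std::max(amax, std::abs(v));

    // Step 2: shared exponent, floor(log2(amax)); guard the amax == 0 case
    int shared_exp = (amax > 0.0f) ? static_cast<int>(std::floor(std::log2(amax))) : 0;

    // Step 3: clamp to the e8m0 exponent range
    shared_exp = std::clamp(shared_exp, -127, 127);

    // Step 4: scale each element by 2^-shared_exp (an exponent adjustment)
    MxBlockF block{shared_exp, {}};
    block.elements.reserve(input.size());
    for (float v : input)
        block.elements.push_back(std::ldexp(v, -shared_exp));  // step 5 rounding omitted
    return block;
}
```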
### Dequantization (mxblock → float)

`output[i] = scale × element[i]`, where `scale = 2^shared_exp` is the decoded e8m0 value (a power of two).
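Continuing the same hypothetical sketch, dequantization is one exponent adjustment per element:

```cpp
// Dequantize the sketch block: output[i] = element[i] * 2^shared_exp
std::vector<float> dequantize(const MxBlockF& block) {
    std::vector<float> output;
    output.reserve(block.elements.size());
    for (float e : block.elements)
        output.push_back(std::ldexp(e, block.shared_exp));
    return output;
}
```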
### Dot Product

The dot product of two mxblocks accumulates in FP32:

`result = scale_a × scale_b × Σ(element_a[i] × element_b[i])`
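In the same sketch, both scales are applied once, outside the loop, while the element products accumulate in FP32:

```cpp
#include <cstddef>

// Dot product of two sketch blocks: scale_a * scale_b * sum of element products
float dot(const MxBlockF& a, const MxBlockF& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.elements.size(); ++i)
        acc += a.elements[i] * b.elements[i];
    return std::ldexp(acc, a.shared_exp + b.shared_exp);  // acc * 2^(ea + eb)
}
```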
## How to Use It

### Include
```cpp
#include <universal/number/mxfloat/mxfloat.hpp>

using namespace sw::universal;
```

### Quantize and Dequantize
```cpp
mxblock<e4m3, 32> block;

// Quantize float data
std::vector<float> weights = { 0.5f, -1.25f, 3.14f, /* ... 32 values */ };
block.quantize(weights);

// Dequantize back to float
std::vector<float> reconstructed(32);
block.dequantize(reconstructed);

// Measure quantization error
for (size_t i = 0; i < 32; ++i) {
    float error = std::abs(weights[i] - reconstructed[i]);
    std::cout << "element " << i << ": error = " << error << std::endl;
}
```

### Block Dot Product
```cpp
mxblock<e4m3, 32> block_a, block_b;

std::vector<float> a(32), b(32);
// ... fill a and b with weight/activation data ...
block_a.quantize(a);
block_b.quantize(b);

float dot = block_a.dot(block_b);
std::cout << "Block dot product: " << dot << std::endl;
```

### Comparing MX Format Variants
```cpp
std::vector<float> data(32);
// ... fill with neural network weights ...

mxblock<e2m1, 32> mx4;  // 4-bit: maximum compression
mxblock<e4m3, 32> mx8;  // 8-bit: balanced precision/size
mxblock<e5m2, 32> mx8w; // 8-bit: wider range, less precision

mx4.quantize(data);
mx8.quantize(data);
mx8w.quantize(data);

// Compare RMSE for each format
std::cout << "mxfp4 RMSE:     " << mx4.rmse(data) << std::endl;
std::cout << "mxfp8 RMSE:     " << mx8.rmse(data) << std::endl;
std::cout << "mxfp8e5m2 RMSE: " << mx8w.rmse(data) << std::endl;
```

## Problems It Solves
| Problem | How mxfloat Solves It |
|---|---|
| LLM weights don’t fit in GPU memory | 4-8x compression with shared block scale |
| Per-element quantization loses outlier values | Block scale captures magnitude, elements capture distribution |
| Memory bandwidth limits inference throughput | Fewer bytes per element = higher effective bandwidth |
| Need industry-standard quantization for accelerator compatibility | OCP MX v1.0 compliant |
| Quantization error varies unpredictably | Deterministic quantize/dequantize, measurable RMSE |
| Model deployment on edge devices with limited memory | Extreme compression (mxfp4 = 8x vs FP32) |