# Classic Floating-Point (cfloat): Configurable IEEE-754 Compatible Arithmetic
IEEE-754 defines a handful of fixed formats: half (16-bit), single (32-bit), double (64-bit), and quad (128-bit). But modern workloads — especially deep learning, DSP, and mixed-precision HPC — need floating-point formats that don't exist in the standard: 8-bit floats for inference, 24-bit floats for AMD GPUs, formats that reclaim the maximum-exponent encodings as numeric values for extended range, or saturating arithmetic instead of infinity.

The Universal cfloat type is a fully parameterized floating-point type that can emulate any IEEE-754 format and extend beyond it. You configure the total bit width, exponent size, subnormal behavior, max-exponent value behavior, and overflow semantics at compile time. The result is a type that behaves exactly like hardware floating-point but with the precision and range you choose.
`cfloat<nbits, es, bt, hasSubnormals, hasMaxExpValues, isSaturating>` is a configurable floating-point type:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `nbits` | unsigned | — | Total bits (4 to 256+) |
| `es` | unsigned | — | Exponent bits |
| `bt` | typename | `uint8_t` | Storage block type |
| `hasSubnormals` | bool | — | Enable gradual underflow |
| `hasMaxExpValues` | bool | — | Reclaim max-exponent encodings as numeric values |
| `isSaturating` | bool | — | Overflow saturates instead of producing infinity |
## Encoding

Standard sign-exponent-fraction layout:

```
[sign : 1 bit] [exponent : es bits] [fraction : nbits - 1 - es bits]
```

- Exponent bias: `2^(es-1) - 1`
- Hidden bit: 1 for normal numbers, 0 for subnormals
- Infinity: exponent all-1s, fraction all-0s (when `hasMaxExpValues` is false)
- NaN: exponent all-1s, fraction non-zero (when `hasMaxExpValues` is false)
- When `hasMaxExpValues` is true: the all-1s exponent binade encodes numeric values; only 4 encodings are reserved for +-inf and signalling/quiet NaN
- When `isSaturating` is true: overflow produces maxpos/maxneg instead of infinity
## Standard IEEE-754 Aliases

```cpp
using fp8e2m5 = cfloat<8, 2, uint8_t, true, false, false>;     // quarter precision
using fp8e3m4 = cfloat<8, 3, uint8_t, true, false, false>;     // ML training
using fp8e4m3 = cfloat<8, 4, uint8_t, true, false, false>;     // NVIDIA FP8
using fp8e5m2 = cfloat<8, 5, uint8_t, true, false, false>;     // NVIDIA FP8
using half    = cfloat<16, 5, uint16_t, true, false, false>;   // IEEE-754 binary16
using single  = cfloat<32, 8, uint32_t, true, false, false>;   // IEEE-754 binary32
using duble   = cfloat<64, 11, uint32_t, true, false, false>;  // IEEE-754 binary64 (double is a C++ keyword)
using quad    = cfloat<128, 15, uint32_t, true, false, false>; // IEEE-754 binary128
```

## Deep Learning Format Aliases

```cpp
using bfloat_t = cfloat<16, 8, uint16_t, true, false, false>;  // Google Brain float
using msfp8    = cfloat<8, 2, uint8_t, false, false, false>;   // Microsoft FP8
using amd24    = cfloat<24, 8, uint32_t, true, false, false>;  // AMD 24-bit
```

## How It Works
Arithmetic is performed using a blocktriple intermediate representation that carries sufficient precision to compute exact products and sums before rounding. The pipeline is:
- Decode operands into blocktriple (sign, scale, significand)
- Compute in extended precision (the blocktriple is wider than the target)
- Round using round-to-nearest-even (or configurable rounding mode)
- Encode back into cfloat format
This mirrors how hardware FPUs work internally but is fully parameterized at the template level.
## How to Use It

### Include

```cpp
#include <universal/number/cfloat/cfloat.hpp>

using namespace sw::universal;
```

### Custom 8-bit Float for Deep Learning
Section titled “Custom 8-bit Float for Deep Learning”// FP8 with 4 exponent bits, 3 fraction bits, saturatingusing fp8 = cfloat<8, 4, uint8_t, true, false, true>;
fp8 weight(0.5f);fp8 activation(0.75f);fp8 result = weight * activation; // Saturates on overflow, no infinity
// Explore the encodingstd::cout << to_binary(result) << " = " << result << std::endl;std::cout << "maxpos: " << fp8::maxpos() << std::endl;std::cout << "minpos: " << fp8::minpos() << std::endl;Mixed-Precision Algorithm Development
```cpp
template<typename HighPrec, typename LowPrec>
HighPrec mixed_precision_dot(const std::vector<LowPrec>& a, const std::vector<LowPrec>& b) {
    HighPrec sum(0);
    for (size_t i = 0; i < a.size(); ++i) {
        sum += HighPrec(a[i]) * HighPrec(b[i]); // accumulate in high precision
    }
    return sum;
}

using FP8  = cfloat<8, 4, uint8_t, true, false, false>;
using FP32 = cfloat<32, 8, uint32_t, true, false, false>;

std::vector<FP8> weights = { FP8(0.5), FP8(0.25), FP8(0.125) };
std::vector<FP8> inputs  = { FP8(1.0), FP8(2.0), FP8(3.0) };
FP32 result = mixed_precision_dot<FP32>(weights, inputs);
```

### Full encoding efficiency with `hasMaxExpValues`
```cpp
// IEEE-754 configuration
constexpr bool hasSubnormals   = true;
constexpr bool hasMaxExpValues = false;
constexpr bool isSaturating    = false;
// cfloat<nbits, es, BlockType, hasSubnormals, hasMaxExpValues, isSaturating>

// Standard: subnormals, overflow -> infinity, last binade used for special encodings: inf and NaN
using Standard = cfloat<8, 4>; // equivalent to cfloat<8, 4, uint8_t, true, false, false>

// Expanded: subnormals, overflow -> infinity, last binade encodes values extending the normal range
using Expanded = cfloat<8, 4, uint8_t, true, true, false>;

// Saturating: subnormals, no overflow, last binade encodes values extending the normal range
using Saturated = cfloat<8, 4, uint8_t, true, true, true>;

Standard  s(200.0f); // may produce infinity
Expanded  e(200.0f); // uses max-exponent value encodings, can still produce infinity
Saturated c(200.0f); // uses max-exponent value encodings, stays finite
```

## Problems It Solves
| Problem | How cfloat Solves It |
|---|---|
| Need 8-bit float for ML inference but IEEE has no 8-bit format | cfloat<8, es> with configurable exponent width |
| Infinity corrupts neural network training | isSaturating=true clamps to maxpos |
| Exploring precision/range trade-offs for new hardware | Any combination of nbits and es |
| Need IEEE-754 semantics in software for testing | Exact emulation of half/single/double/quad |
| Custom float for FPGA/ASIC design exploration | Parameterized at compile time, matches hardware encoding |
| Subnormal handling is too slow on some hardware | hasSubnormals=false to disable gradual underflow |
| Need wider-than-quad precision (128-bit+) | cfloat<256, 19> for octuple precision |