Posit: Tapered Floating-Point (UNUM Type III)
IEEE-754 floating-point distributes its precision uniformly across the entire exponent range. A 32-bit float has the same number of fraction bits whether the value is 10^-30 or 10^30. But most real-world computations cluster their values near 1.0 — neural network weights, probabilities, normalized sensor readings, physical quantities in natural units. IEEE-754 wastes precision on extreme exponents that are rarely used while providing no more precision where it matters most.
Posits, invented by John Gustafson as UNUM Type III, solve this with tapered precision: values near 1.0 get the most fraction bits (highest precision), while values at the extremes of the range trade fraction bits for exponent bits (widest range). The result is a format that provides more useful precision than IEEE-754 at the same bit width, with a cleaner encoding that eliminates the redundancy and complexity of NaN, denormals, and signed zeros.
posit<nbits, es, bt> is a tapered-precision floating-point:
| Parameter | Type | Default | Description |
|---|---|---|---|
nbits | unsigned | — | Total bits (2 to 256+) |
es | unsigned | — | Maximum exponent bits (0 to 9+) |
bt | typename | uint8_t | Storage block type |
Encoding
Section titled “Encoding”A posit encodes four fields in a variable-length format:
[sign : 1] [regime : variable] [exponent : ≤ es] [fraction : remaining]- Sign: 0 = positive, 1 = negative (2’s complement for the entire encoding)
- Regime: run-length encoded scale factor
- Run of k ones followed by a zero: regime value = k (scale = useed^k)
- Run of k zeros followed by a one: regime value = -k (scale = useed^(-k))
- Exponent: up to
esbits of binary exponent - Fraction: remaining bits encode the significand
The useed = 2^(2^es) is the regime scale factor:
es=0: useed = 2 (each regime step doubles/halves)es=1: useed = 4es=2: useed = 16es=3: useed = 256
Tapered Precision
Section titled “Tapered Precision”For values near 1.0, the regime is short (1-2 bits), leaving most bits for fraction (high precision). For extreme values, the regime consumes most bits, providing enormous dynamic range at the cost of precision:
posit<32, 2>: Near 1.0: regime=2 bits, exponent=2 bits, fraction=27 bits (~8 decimal digits) Near max: regime=29 bits, exponent=2 bits, fraction=0 bits (range only)Key Properties
Section titled “Key Properties”- Two special values only: all-zeros = 0, MSB-only = NaR (Not a Real)
- No NaN, no signed zero, no denormals: cleaner than IEEE-754
- Reciprocal symmetry: posit(x) and posit(1/x) have the same precision
- Exact integers: small integers near 1.0 are represented exactly
- More useful precision: typically 1-2 extra decimal digits vs same-size IEEE float
- Fused operations: atomic fma(), fam(), fmma() using blocktriple intermediate
Dynamic Range
Section titled “Dynamic Range”| Configuration | Dynamic Range |
|---|---|
posit<8, 0> | 2^(-6) to 2^6 |
posit<16, 1> | 2^(-28) to 2^28 |
posit<32, 2> | 2^(-120) to 2^120 |
posit<64, 3> | 2^(-496) to 2^496 |
How It Works
Section titled “How It Works”The regime field is the key innovation. It uses run-length encoding to create a variable-width exponent:
- Decode regime: count the run of identical bits after the sign bit
- Compute regime scale: useed^regime_value
- Decode fixed exponent: next
esbits (if available) - Decode fraction: remaining bits form the significand with an implicit leading 1
The value is: (-1)^sign × useed^regime × 2^exponent × (1 + fraction)
This variable-length structure is what creates tapered precision: when the regime is short (value near 1), more bits remain for fraction; when the regime is long (extreme value), precision is sacrificed for range.
Arithmetic uses a blocktriple intermediate representation that carries sufficient precision for exact computation before rounding back to the target posit format.
How to Use It
Section titled “How to Use It”Include
Section titled “Include”#include <universal/number/posit/posit.hpp>using namespace sw::universal;Basic Usage
Section titled “Basic Usage”posit<32, 2> a(3.14159265358979);posit<32, 2> b(2.71828182845905);posit<32, 2> c = a * b;
std::cout << "pi * e = " << c << std::endl;std::cout << "encoding: " << to_binary(c) << std::endl;std::cout << "components: " << components(c) << std::endl;Comparing Precision with IEEE-754
Section titled “Comparing Precision with IEEE-754”// posit<32,2> has more useful precision than float32 near 1.0float f = 1.0f;posit<32, 2> p(1.0);
for (int i = 0; i < 10; ++i) { f += 1.0e-7f; p += posit<32, 2>(1.0e-7);}// posit preserves more digits near 1.0 due to tapered precisionMixed-Precision Deep Learning
Section titled “Mixed-Precision Deep Learning”template<typename Real>Real sigmoid(Real x) { return Real(1) / (Real(1) + exp(-x));}
// 8-bit posit for inference: more precision near 0-1 range than fp8using Inference = posit<8, 0>;Inference weight(0.5);Inference input(0.3);Inference output = sigmoid(weight * input);Plug-in Replacement
Section titled “Plug-in Replacement”template<typename Real>Real newton_sqrt(Real x, int iterations = 10) { Real guess = x / Real(2); for (int i = 0; i < iterations; ++i) { guess = (guess + x / guess) / Real(2); } return guess;}
// Compare convergence: posit vs floatauto sqrt_f = newton_sqrt(2.0f);auto sqrt_p = newton_sqrt(posit<32, 2>(2.0));// posit converges to more accurate result at same bit widthQuire: Exact Dot Products
Section titled “Quire: Exact Dot Products”#include <universal/number/posit/posit.hpp>// The quire provides exact accumulation with no intermediate roundingusing Posit = posit<32, 2>;
// Fused dot product: exact accumulation in quire, single rounding at the endPosit a[] = { Posit(1.0), Posit(2.0), Posit(3.0) };Posit b[] = { Posit(4.0), Posit(5.0), Posit(6.0) };
quire<32, 2> q;for (int i = 0; i < 3; ++i) { q += quire_mul(a[i], b[i]); // No rounding between multiplications}Posit dot_product;convert(q.to_value(), dot_product); // Single rounding at the endProblems It Solves
Section titled “Problems It Solves”| Problem | How posit Solves It |
|---|---|
| IEEE float wastes precision on extreme values | Tapered precision: most bits near 1.0 where values cluster |
| NaN propagation corrupts entire computation | Only one non-real value (NaR), explicit and predictable |
| Signed zero and multiple NaN encodings waste bit patterns | Two special values only: 0 and NaR |
| 8-bit float has too little precision for ML inference | posit<8,0> has more precision near 0-1 than fp8 |
| Dot product rounding errors accumulate | Quire provides exact accumulation, single final rounding |
| Need better precision without increasing bit width | posit<32,2> typically provides 1-2 more decimal digits than float32 |
| Reproducible linear algebra across platforms | Quire + posit = same result on every platform |