Skip to content

The Kulisch Super-Accumulator: Exact Dot Products for Reliable Numerical Computing

A Practitioner’s Guide to Reproducible Linear Algebra with the Universal Numbers Library

Theodore Omtzigt, Stillwater Supercomputing, Inc.


The dot product is the most fundamental operation in numerical computing. It underlies matrix multiplication, iterative solvers, neural-network inference, signal processing, and virtually every algorithm in computational science. Yet in IEEE 754 floating-point arithmetic, the dot product is also the primary source of precision loss, non-reproducibility, and numerical failure.

In the 1970s, Ulrich Kulisch and Willard Miranker showed that a single architectural addition — a fixed-point super-accumulator wide enough to hold the exact sum of any number of floating-point products — eliminates these problems at their root. Their insight was implemented in the XSC family of languages and in dedicated hardware, but never adopted by mainstream processors or programming languages. As a result, the insight has been largely forgotten by a generation of computational scientists.

This document reconstructs the why, what, and how of the Kulisch super-accumulator, provides the historical context that motivated it, and shows how the Universal Numbers Library’s generalized quire makes exact dot products available today — as a header-only C++ template that works with any number system: IEEE-754 floats, posits, fixed-point, logarithmic, and double-base number systems.


1. The Problem: Why Floating-Point Dot Products Fail

Section titled “1. The Problem: Why Floating-Point Dot Products Fail”

Every floating-point dot product suffers from three interrelated pathologies:

  1. Catastrophic cancellation. When large terms of opposite sign nearly cancel, the small residual that remains is dominated by the rounding errors accumulated in the large terms. The true answer is lost.

  2. Non-reproducibility. Because floating-point addition is not associative, reordering the summation — by parallelism, vectorization, or even different compiler optimization levels — changes the result. The same program on different hardware produces different answers.

  3. Stalled convergence in iterative refinement. Algorithms that improve an approximate solution by computing the residual r = b − Ax depend on the residual being accurate. When the residual itself is computed with a lossy dot product, refinement stalls or diverges, and ill-conditioned systems become unsolvable.

1.2 A Concrete Example: Catastrophic Cancellation

Section titled “1.2 A Concrete Example: Catastrophic Cancellation”

Consider two vectors that produce large intermediate products which nearly cancel:

a = [ 3.2×10⁸, 1, −1, 8×10⁷ ]
b = [ 4.0×10⁷, 1, −1, −1.6×10⁸ ]

The exact dot product is:

a·b = (3.2×10⁸)(4.0×10⁷) + (1)(1) + (−1)(−1) + (8×10⁷)(−1.6×10⁸)
= 1.28×10¹⁶ + 1 + 1 − 1.28×10¹⁶
= 2

But the two dominant products, 1.28×10¹⁶ and −1.28×10¹⁶, consume all available significand bits when accumulated. The residual “+2” is below the rounding threshold and is lost. IEEE single-precision reports 0; IEEE double-precision may report 0 or a value far from 2, depending on evaluation order.

This is not an academic curiosity. It is the normal operating condition of iterative solvers, Gram-Schmidt orthogonalization, and any algorithm that computes differences of nearly-equal quantities.

The most consequential real-world failure occurs in iterative refinement for solving linear systems Ax = b:

  1. Compute an approximate solution x̃ (e.g., via LU factorization).
  2. Compute the residual r = b − Ax̃.
  3. Solve Ad = r for the correction.
  4. Update x̃ ← x̃ + d.
  5. Repeat until convergence.

The critical step is (2): computing the residual. This is a dot product of each row of A with x̃, subtracted from b. For ill-conditioned systems (condition number κ >> 1), Ax̃ is very close to b, and the residual is a small difference of large quantities — precisely the catastrophic cancellation scenario.

With standard floating-point, the residual is inaccurate, refinement stalls, and convergence fails for condition numbers above ~10⁸ in single precision or ~10¹⁶ in double precision. With an exact dot product, the residual is computed without any rounding error, and convergence proceeds even for arbitrarily ill-conditioned systems.

The Hilbert matrix — where H(i,j) = 1/(i+j−1) — is the canonical test case. Its condition number grows super-exponentially:

DimensionCondition Number
5~4.8×10⁵
10~1.6×10¹³
15~3.7×10¹⁷
20~1.1×10²⁸

Standard double-precision iterative refinement fails for n ≥ 12. With exact residual computation via the super-accumulator, refinement converges for any dimension.


2. The Solution: Kulisch’s Exact Dot Product

Section titled “2. The Solution: Kulisch’s Exact Dot Product”

The idea that computer arithmetic should guarantee mathematical correctness — not just approximate it — was articulated by Ulrich W. Kulisch beginning in the late 1960s at the University of Karlsruhe (now Karlsruhe Institute of Technology).

Key milestones:

YearEvent
1969Kulisch begins work on complete arithmetic at Karlsruhe
1976Kulisch & Miranker, “Arithmetic operations of digital computers with a more complete number system” — lays theoretical foundation
1981Kulisch & Miranker, Computer Arithmetic in Theory and Practice (Academic Press) — the definitive monograph
1983Kulisch, “A new arithmetic for scientific computation” — calls for the exact dot product as a hardware primitive
1987PASCAL-XSC released — first language with built-in exact dot product via the dotprecision type
1990C-XSC development begins at Karlsruhe and Wuppertal — C++ class library for extended scientific computing
1992XPA 3233 vector arithmetic coprocessor — first hardware implementation of the Kulisch accumulator, built at Karlsruhe
1997C-XSC 2.0 — mature library with verified solvers for linear systems, nonlinear equations, and optimization
2008Kulisch, Computer Arithmetic and Validity (2nd ed., de Gruyter) — comprehensive treatment of theory and applications
2011Kulisch & Snyder, “The exact dot product as basic tool for long interval arithmetic” — specifies a 4288-bit accumulator for IEEE binary64
2016Gustafson, The End of Error: Unum Computing — posit arithmetic adopts the super-accumulator as the “quire”
2023–presentUniversal Numbers Library: generalized quire works with any number system

Kulisch’s insight is deceptively simple:

If you never round intermediate results, the final answer can be correctly rounded in a single step.

For a dot product x·y = Σ xᵢ·yᵢ, this requires:

  1. Exact multiplication. The product of two p-bit significands is exactly 2p bits wide. No bits are lost if we keep the full product.

  2. Exact accumulation. A fixed-point register wide enough to span the full dynamic range of the number system can hold the exact sum of any number of products — the carry bits propagate exactly.

  3. Single final rounding. Only when the accumulated result is converted back to a floating-point number does rounding occur — and it is correctly rounded (to nearest, or faithfully rounded within 1 ULP).

The super-accumulator (Kulisch accumulator) is a fixed-point register whose width is determined by the dynamic range of the number system:

|<--- capacity --->|<------- upper range ------->.<------- lower range ------->|
overflow guard integer part radix fractional part

For IEEE 754 binary64 (double precision):

  • Exponent range: [−1022, +1023]
  • Product exponent range: [−2044, +2046] → 4091 bits
  • Plus sign bit and overflow capacity → ~4288 bits total

For posit<64,3>:

  • Dynamic range: 8 × (4·64 − 8) = 1984 bits
  • Plus capacity → ~2014 bits total

For posit<32,2>:

  • Dynamic range: 4 × (4·32 − 8) = 480 bits
  • Plus capacity → ~510 bits total

The posit number system’s efficient encoding of dynamic range makes the quire dramatically smaller than the IEEE equivalent — a 64-bit posit quire is less than half the size of the IEEE double quire.

Despite its elegance and proven utility, the Kulisch accumulator was never integrated into mainstream processor architectures or the IEEE 754 standard. The reasons were primarily economic and institutional:

  1. Silicon cost. In the 1980s–1990s, a 4096-bit register was considered extravagant for a single functional unit.

  2. Standards inertia. IEEE 754 was standardized in 1985 without the exact dot product. Kulisch lobbied for its inclusion in the 2008 revision, but the committee chose not to mandate it.

  3. Software workarounds. Compensated summation (Kahan, 1965) and error-free transformations (Dekker, Knuth, Priest) provided partial mitigation, reducing urgency.

  4. Diminishing visibility. As the XSC tools remained academic and never gained mainstream adoption, the next generation of computational scientists learned floating-point arithmetic without knowing the super-accumulator alternative existed.

Today, the silicon cost argument is obsolete. A 4096-bit register is trivial on a modern chip with billions of transistors. The real barrier is now awareness — which this document aims to address.


3. The Generalized Quire in the Universal Numbers Library

Section titled “3. The Generalized Quire in the Universal Numbers Library”

The Universal Numbers Library elevates Kulisch’s concept from a posit-specific feature to a number-system-agnostic super-accumulator. The generalized quire works with any scalar type that provides:

  1. A compile-time dynamic range bound (via quire_traits<T>)
  2. An unrounded product representation (via blocktriple)
  3. Functions quire_mul() and quire_resolve() for the type

This means the same accumulator architecture serves native IEEE-754 floats (float, double), configurable IEEE-style floats (cfloat), posits (posit), fixed-point (fixpnt), logarithmic (lns), double-base (dbns), and integer (integer) number systems.

#include <universal/number/quire/quire.hpp>
template<typename NumberType,
unsigned capacity = quire_traits<NumberType>::capacity,
typename LimbType = uint32_t>
class quire;
ParameterDescription
NumberTypeThe scalar type (cfloat, posit, fixpnt, lns, dbns, integer, float, double)
capacityOverflow guard bits (default 30, allows ~2³⁰ accumulations)
LimbTypeUnderlying unsigned type for carry propagation (uint32_t or uint64_t)

Each number system specializes quire_traits<T> to compute the accumulator width at compile time:

Number SystemRange FormulaExample ConfigurationQuire Bits
cfloat<32,8>2·(bias + fbits) + 2·bias + 2 + 30IEEE single-precision584
posit<32,2>2^es · (4·nbits - 8) + 30Standard 32-bit posit510
fixpnt<16,8>2·nbits + 3016-bit fixed-point62
lns<16,8>2·2^(nbits-1-rbits) + 3016-bit log number system286
floatsame as cfloat<32,8>Native IEEE single584
doublesame as cfloat<64,11>Native IEEE double4226

Each number system provides three functions:

// Full-precision multiply → unrounded product for quire accumulation
blocktriple<...> quire_mul(const Scalar& a, const Scalar& b);
// Extract final result from quire → single rounding step
Scalar quire_resolve(const quire<Scalar>& q);
// Vector dot product: accumulate all products, resolve once
Scalar fdp(const std::vector<Scalar>& x, const std::vector<Scalar>& y);

The fdp() function is the user-facing API. It creates a quire, accumulates all products via quire_mul, and resolves the result with a single rounding operation.

The library extends exact dot products to higher-level BLAS operations:

#include <blas/ext/fdp_blas.hpp>
// Fused matrix-vector multiply (BLAS Level 2)
// Each element of the result vector is an exact dot product of a row of A with x
vector<Scalar> y = fmv(A, x);
// Fused matrix-matrix multiply (BLAS Level 3)
// Each element of the result matrix is an exact dot product of a row of A with a column of B
matrix<Scalar> C = fmm(A, B);

These build on the same quire infrastructure: each output element is accumulated exactly in a quire and resolved with a single rounding step.

Step 1: quire_mul(a, b)
┌─────────────────────────────────────────────────┐
│ Multiply significands exactly (2p bits) │
│ Compute product scale = scale(a) + scale(b) │
│ Return unrounded blocktriple │
└─────────────────────────────────────────────────┘
Step 2: quire += blocktriple
┌─────────────────────────────────────────────────┐
│ Align product to quire radix point using scale │
│ Scatter product bits into accumulator limbs │
│ Propagate carries (exact integer addition) │
│ No rounding occurs │
└─────────────────────────────────────────────────┘
▼ (repeat for all products)
Step 3: quire_resolve(quire)
┌─────────────────────────────────────────────────┐
│ Find MSB position in accumulator │
│ Extract significand bits at target precision │
│ Apply rounding mode (round to nearest) │
│ Return correctly-rounded result │
└─────────────────────────────────────────────────┘

4. Demonstration: Exact Dot Products in Practice

Section titled “4. Demonstration: Exact Dot Products in Practice”

4.1 Catastrophic Cancellation — Resolved

Section titled “4.1 Catastrophic Cancellation — Resolved”

Using the Universal library, we can demonstrate that the quire resolves catastrophic cancellation that defeats IEEE arithmetic:

#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/cfloat/fdp.hpp>
using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
// 511 pairs of (+1e8, −1e8) that cancel, plus a small residual
std::vector<Scalar> x(1024), y(1024);
for (int i = 0; i < 511; ++i) {
x[2*i] = Scalar(1e8f); y[2*i] = Scalar(1.0f);
x[2*i + 1] = Scalar(-1e8f); y[2*i + 1] = Scalar(1.0f);
}
x[1022] = Scalar(0.5f); y[1022] = Scalar(1.0f);
x[1023] = Scalar(0.25f); y[1023] = Scalar(1.0f);
// Naive fp32 accumulation: loses the 0.75 residual
float naive = 0.0f;
for (int i = 0; i < 1024; ++i)
naive += float(x[i]) * float(y[i]);
// naive == 0.0 (WRONG)
// Fused dot product via quire: exact accumulation
Scalar result = sw::universal::fdp(x, y);
// result == 0.75 (CORRECT)

4.2 Reproducibility Across Evaluation Order

Section titled “4.2 Reproducibility Across Evaluation Order”

The quire guarantees identical results regardless of summation order:

#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/cfloat/fdp.hpp>
#include <algorithm>
#include <random>
using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
// Random vectors
std::vector<Scalar> x(1024), y(1024);
std::mt19937 rng(42);
std::uniform_real_distribution<float> dist(-100.0f, 100.0f);
for (int i = 0; i < 1024; ++i) {
x[i] = Scalar(dist(rng));
y[i] = Scalar(dist(rng));
}
// Forward order
Scalar result1 = sw::universal::fdp(x, y);
// Reverse order
std::vector<Scalar> xr(x.rbegin(), x.rend());
std::vector<Scalar> yr(y.rbegin(), y.rend());
Scalar result2 = sw::universal::fdp(xr, yr);
assert(double(result1) == double(result2)); // Always true

With naive floating-point accumulation, the forward and reverse results would typically differ.

Products spanning the full dynamic range accumulate correctly:

// a[i] = 2^(i-512), b[i] = 2^(512-i), so every product = 1.0
// Sum of 1024 ones = 1024.0
std::vector<Scalar> x(1024), y(1024);
for (int i = 0; i < 1024; ++i) {
int exp_x = std::max(-60, std::min(60, i - 512));
x[i] = Scalar(std::ldexp(1.0f, exp_x));
y[i] = Scalar(std::ldexp(1.0f, -exp_x));
}
Scalar result = sw::universal::fdp(x, y);
// result == 1024.0 (exact, despite 120+ orders of magnitude variation)

4.4 Number-System Agnostic: Same Algorithm, Any Type

Section titled “4.4 Number-System Agnostic: Same Algorithm, Any Type”

The generalized quire works identically across all number systems:

// Posit FDP
#include <universal/number/posit/posit.hpp>
#include <universal/number/posit/fdp.hpp>
{
using Scalar = sw::universal::posit<32, 2>;
std::vector<Scalar> x = { 3.2e8, 1, -1, 8e7 };
std::vector<Scalar> y = { 4.0e7, 1, -1, -1.6e8 };
Scalar result = sw::universal::fdp(x, y); // == 2
}
// Fixed-point FDP
#include <universal/number/fixpnt/fixpnt.hpp>
#include <universal/number/fixpnt/fdp.hpp>
{
using Scalar = sw::universal::fixpnt<16, 8>;
std::vector<Scalar> x = { 1.5, 2.25, 3.125 };
std::vector<Scalar> y = { 4.0, 5.0, 6.0 };
Scalar result = sw::universal::fdp(x, y);
}
// Logarithmic number system FDP
#include <universal/number/lns/lns.hpp>
#include <universal/number/lns/fdp.hpp>
{
using Scalar = sw::universal::lns<16, 8>;
std::vector<Scalar> x = { 1.0, 2.0 };
std::vector<Scalar> y = { 3.0, 4.0 };
Scalar result = sw::universal::fdp(x, y); // == 11
}

4.5 Quire Continuation: Batched Accumulation

Section titled “4.5 Quire Continuation: Batched Accumulation”

For streaming or batched computation, the quire supports continuation — accumulate across multiple calls without resolving:

using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
sw::universal::quire<Scalar> q;
// Batch 1: from sensor A
std::vector<Scalar> x1 = { /* ... */ }, y1 = { /* ... */ };
sw::universal::fdp_qc(q, x1.size(), x1, 1, y1, 1);
// Batch 2: from sensor B
std::vector<Scalar> x2 = { /* ... */ }, y2 = { /* ... */ };
sw::universal::fdp_qc(q, x2.size(), x2, 1, y2, 1);
// Resolve once at the end
Scalar result = sw::universal::quire_resolve(q);

5. Application: Verified Solution of Ill-Conditioned Linear Systems

Section titled “5. Application: Verified Solution of Ill-Conditioned Linear Systems”

The classical application of the Kulisch accumulator — and the one that motivated much of the XSC development — is verified iterative refinement for linear systems:

Algorithm: Verified Iterative Refinement with Exact Residual
Input: A ∈ ℝⁿˣⁿ, b ∈ ℝⁿ
Output: Verified enclosure [x] containing the true solution
1. Compute approximate inverse R ≈ A⁻¹ (double precision)
2. Compute C = I − RA (exact dot product via quire)
3. Compute z = Rb (exact dot product via quire)
4. Set [x] = z
5. Repeat:
[y] = C · [x] + z (interval arithmetic + exact dot product)
if [y] ⊂ [x]: (contraction verified)
return [y] (proven enclosure of true solution)
[x] = [y]

The key insight: step 2 and subsequent residual computations use the exact dot product, so the residual C = I − RA is computed to full accuracy. This guarantees that the iteration contracts, even for matrices with condition numbers exceeding 10²⁰.

The Universal library includes a Hilbert matrix test that demonstrates this application:

#include <universal/number/posit/posit.hpp>
#include <blas/blas.hpp>
#include <blas/generators.hpp>
template<typename Scalar>
void HilbertMatrixTest(size_t N = 5) {
using Matrix = sw::numeric::containers::matrix<Scalar>;
Matrix H(N, N), Hinv(N, N);
sw::blas::GenerateHilbertMatrix<Scalar>(H, false);
sw::blas::GenerateHilbertMatrixInverse<Scalar>(Hinv);
// With standard arithmetic, Hinv * H ≠ I for large N
// With FDP-based matrix multiply, Hinv * H → I
std::cout << "Validation: Hinv * H =>\n" << Hinv * H << '\n';
}

6. Broader Impact: Where Exact Dot Products Matter

Section titled “6. Broader Impact: Where Exact Dot Products Matter”

Mixed-precision training (FP16/BF16 forward, FP32 backward) loses gradient information through accumulation error. An exact accumulator for the dot products in matrix multiplication preserves gradient fidelity, enabling training with lower precision operands without loss scaling hacks.

MPI-based parallel reductions produce non-reproducible results because the reduction tree order varies across runs. FDP guarantees the same result regardless of the number of processes or reduction order — essential for debugging, verification, and regulatory compliance.

FIR and IIR filter computations are dot products of coefficient and sample vectors. Accumulated rounding error manifests as filter distortion. FDP eliminates this error source entirely.

Regulatory requirements (Basel III/IV, MiFID II) demand reproducible risk calculations. Non-deterministic floating-point accumulation creates audit failures. FDP provides bit-exact reproducibility.

Geometric predicates (orientation tests, in-circle tests) reduce to sign-of-determinant computations — which are dot products. A wrong sign means a wrong topological decision. FDP guarantees correct predicates without the complexity of adaptive precision arithmetic (Shewchuk’s approach).


ApproachAccuracyPerformanceGeneralityComplexity
Kulisch/Quire (exact)Exact (1 rounding)~1.5× naiveAny number systemLow (library call)
Kahan compensated sum~2× precision~2× naiveFloating-point onlyModerate
Error-free transforms (Ogita-Rump-Oishi)Faithful (~K rounds)~4–10× naiveIEEE 754 onlyHigh
Higher precision (double → quad)~2× precision~10–100× naiveLimited availabilityLow
Arbitrary precision (MPFR)Arbitrary~100–1000× naiveFloating-point onlyModerate

The Kulisch accumulator is unique in providing exact results with near-native performance and minimal code complexity. It is the only approach that guarantees one rounding operation for the entire dot product.


  • C++20-compliant compiler (GCC 11+, Clang 14+, MSVC 2022+)
  • CMake 3.22+
  • No external dependencies (header-only library)
Terminal window
git clone https://github.com/stillwater-sc/universal.git
cd universal
mkdir build && cd build
# Build quire tests and FDP examples
cmake -DUNIVERSAL_BUILD_NUMBER_STATICS=ON ..
make -j4
# Run the FDP tests
./static/quire/api/quire_cfloat_fdp
./static/quire/api/quire_fixpnt_fdp
./static/quire/api/quire_lns_fdp
./static/quire/api/quire_dbns_fdp
# Run the educational example
./education/quire/edu_quire_quires
# Run the application examples
./applications/reproducibility/blas/blas_l1_fdp # FDP vs naive dot product
./applications/reproducibility/blas/blas_l2_fmv # FDP matrix-vector multiply
./applications/reproducibility/blas/blas_l3_fmm # FDP matrix-matrix multiply
./applications/reproducibility/blas/blas_hilbert # Hilbert matrix verified solver
find_package(UNIVERSAL CONFIG REQUIRED)
target_link_libraries(my_app universal::universal)
#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/cfloat/fdp.hpp>
using Real = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
std::vector<Real> x = { /* your data */ };
std::vector<Real> y = { /* your data */ };
Real result = sw::universal::fdp(x, y); // exact dot product

  1. Kulisch, U. W. and Miranker, W. L. (1981). Computer Arithmetic in Theory and Practice. Academic Press. — The foundational monograph on complete arithmetic.

  2. Kulisch, U. W. (2008). Computer Arithmetic and Validity: Theory, Implementation, and Applications. 2nd ed., de Gruyter Studies in Mathematics, vol. 33. — Comprehensive treatment of the theory and its applications to verified computing.

  3. Kulisch, U. W. and Snyder, V. (2011). “The exact dot product as basic tool for long interval arithmetic.” Computing, 91(3), 307–313. — Specifies the 4288-bit accumulator for IEEE binary64.

  4. Klatte, R., Kulisch, U., Wiethoff, A., Lawo, C., and Rauch, M. (1993). C-XSC: A C++ Class Library for Extended Scientific Computing. Springer-Verlag. — The C++ implementation of the XSC paradigm.

  5. Klatte, R., Kulisch, U., Neaga, M., Ratz, D., and Ullrich, Ch. (1992). PASCAL-XSC: Language Reference with Examples. Springer-Verlag. — The first language with built-in exact dot product.

  1. Biancolin, D., Katerman, J., Lee, H., and Wawrzynek, J. (2005). “A Hardware Accelerator for Computing an Exact Dot Product.” Proceedings of the 17th IEEE Symposium on Computer Arithmetic (ARITH-17).

  2. Uguen, Y. and de Dinechin, F. (2017). “Design-space exploration for the Kulisch accumulator.” HAL archives-ouvertes, hal-01488916v2.

9.3 Compensated Algorithms (Alternative Approaches)

Section titled “9.3 Compensated Algorithms (Alternative Approaches)”
  1. Ogita, T., Rump, S. M., and Oishi, S. (2005). “Accurate sum and dot product.” SIAM J. Sci. Comput., 26(6), 1955–1988.

  2. Rump, S. M. (2011). “Error estimation of floating-point summation and dot product.” BIT Numerical Mathematics, 52(2), 541–562.

  1. Gustafson, J. L. (2017). The End of Error: Unum Computing. CRC Press. — Introduces the quire as part of the posit number system.

  2. Gustafson, J. L. and Yonemoto, I. T. (2017). “Beating Floating Point at its Own Game: Posit Arithmetic.” Supercomputing Frontiers and Innovations, 4(2), 71–86.

9.5 Verified Computing and Interval Arithmetic

Section titled “9.5 Verified Computing and Interval Arithmetic”
  1. Rump, S. M. (1999). “INTLAB - INTerval LABoratory.” Developments in Reliable Computing, Kluwer, 77–104.

  2. Krämer, W., Bantle, A., and Grimmer, M. (2009). “Fast (Parallel) Dense Linear System Solvers in C-XSC Using Error Free Transformations and BLAS.” Springer LNCS, 5947, 230–249.

  1. Kahan, W. (1965). “Further remarks on reducing truncation errors.” Communications of the ACM, 8(1), 40.

  2. Goldberg, D. (1991). “What every computer scientist should know about floating-point arithmetic.” ACM Computing Surveys, 23(1), 5–48.


The Kulisch super-accumulator is not a theoretical curiosity — it is a solved problem that the computing industry chose not to deploy. The arithmetic is exact, the hardware cost is negligible on modern silicon, and the software is available today as a header-only C++ library.

The Universal Numbers Library’s generalized quire makes this capability accessible to any computational scientist:

  • One header to include (<universal/number/TYPE/fdp.hpp>)
  • One function to call (fdp(x, y))
  • One rounding operation for the entire dot product
  • Any number system — native float/double, IEEE-configurable floats, posits, fixed-point, logarithmic, double-base

The forgotten arithmetic of Ulrich Kulisch deserves to be rediscovered. Every matrix multiply, every iterative solver, every gradient computation in every neural network is a collection of dot products. Making those dot products exact is not a luxury — it is the straightforward, elegant, and now easily accessible solution to the most fundamental problem in numerical computing.


Number SystemConfigurationDynamic Range (bits)Quire Width (bits)
IEEE halfcfloat<16,5>94124
IEEE singlecfloat<32,8>554584
IEEE doublecfloat<64,11>41964226
bfloat16cfloat<16,8>530560
Native floatfloat554584
Native doubledouble41964226
posit<8,0>8-bit posit2454
posit<16,1>16-bit posit112142
posit<32,2>32-bit posit480510
posit<64,3>64-bit posit19842014
fixpnt<16,8>16-bit fixed3262
fixpnt<32,16>32-bit fixed6494
lns<16,8>16-bit log256286
lns<32,16>32-bit log6553665566
dbns<16,8>16-bit double-base20482078

universal/
├── include/sw/universal/
│ ├── number/quire/
│ │ ├── quire.hpp # Umbrella header
│ │ ├── quire_impl.hpp # Generalized quire implementation
│ │ └── exceptions.hpp # Quire-specific exceptions
│ ├── number/cfloat/fdp.hpp # cfloat FDP (quire_mul, quire_resolve, fdp)
│ ├── number/posit/fdp.hpp # posit FDP
│ ├── number/fixpnt/fdp.hpp # fixpnt FDP
│ ├── number/lns/fdp.hpp # lns FDP
│ ├── number/dbns/fdp.hpp # dbns FDP
│ ├── number/integer/integer_impl.hpp # integer normalize (quire input)
│ ├── native/fdp.hpp # Native float/double FDP
│ └── traits/quire_traits.hpp # Compile-time sizing for all types
├── static/quire/api/
│ ├── cfloat_fdp.cpp # Comprehensive cfloat FDP tests
│ ├── fixpnt_fdp.cpp # Fixed-point FDP tests
│ ├── lns_fdp.cpp # LNS FDP tests
│ ├── dbns_fdp.cpp # DBNS FDP tests
│ └── generalized_quire.cpp # Cross-type quire demonstration
├── education/quire/
│ └── quires.cpp # Educational quire examples
├── applications/reproducibility/blas/
│ ├── l1_fdp.cpp # FDP vs naive dot product
│ ├── l2_fmv.cpp # FDP matrix-vector multiply
│ ├── l3_fmm.cpp # FDP matrix-matrix multiply
│ └── hilbert.cpp # Hilbert matrix verified solver
└── include/sw/blas/
├── cg_fdp_solvers.hpp # Conjugate gradient with FDP
└── blas_l1.hpp # BLAS Level 1 with FDP support

This document accompanies the Universal Numbers Library (github.com/stillwater-sc/universal), an open-source, header-only C++20 template library for custom arithmetic. Licensed under the MIT License.