The Kulisch Super-Accumulator: Exact Dot Products for Reliable Numerical Computing
A Practitioner’s Guide to Reproducible Linear Algebra with the Universal Numbers Library
Theodore Omtzigt, Stillwater Supercomputing, Inc.
Abstract
Section titled “Abstract”The dot product is the most fundamental operation in numerical computing. It underlies matrix multiplication, iterative solvers, neural-network inference, signal processing, and virtually every algorithm in computational science. Yet in IEEE 754 floating-point arithmetic, the dot product is also the primary source of precision loss, non-reproducibility, and numerical failure.
In the 1970s, Ulrich Kulisch and Willard Miranker showed that a single architectural addition — a fixed-point super-accumulator wide enough to hold the exact sum of any number of floating-point products — eliminates these problems at their root. Their insight was implemented in the XSC family of languages and in dedicated hardware, but never adopted by mainstream processors or programming languages. As a result, the insight has been largely forgotten by a generation of computational scientists.
This document reconstructs the why, what, and how of the Kulisch super-accumulator, provides the historical context that motivated it, and shows how the Universal Numbers Library’s generalized quire makes exact dot products available today — as a header-only C++ template that works with any number system: IEEE-754 floats, posits, fixed-point, logarithmic, and double-base number systems.
1. The Problem: Why Floating-Point Dot Products Fail
Section titled “1. The Problem: Why Floating-Point Dot Products Fail”1.1 The Three Pathologies
Section titled “1.1 The Three Pathologies”Every floating-point dot product suffers from three interrelated pathologies:
-
Catastrophic cancellation. When large terms of opposite sign nearly cancel, the small residual that remains is dominated by the rounding errors accumulated in the large terms. The true answer is lost.
-
Non-reproducibility. Because floating-point addition is not associative, reordering the summation — by parallelism, vectorization, or even different compiler optimization levels — changes the result. The same program on different hardware produces different answers.
-
Stalled convergence in iterative refinement. Algorithms that improve an approximate solution by computing the residual r = b − Ax depend on the residual being accurate. When the residual itself is computed with a lossy dot product, refinement stalls or diverges, and ill-conditioned systems become unsolvable.
1.2 A Concrete Example: Catastrophic Cancellation
Section titled “1.2 A Concrete Example: Catastrophic Cancellation”Consider two vectors that produce large intermediate products which nearly cancel:
a = [ 3.2×10⁸, 1, −1, 8×10⁷ ]b = [ 4.0×10⁷, 1, −1, −1.6×10⁸ ]The exact dot product is:
a·b = (3.2×10⁸)(4.0×10⁷) + (1)(1) + (−1)(−1) + (8×10⁷)(−1.6×10⁸) = 1.28×10¹⁶ + 1 + 1 − 1.28×10¹⁶ = 2But the two dominant products, 1.28×10¹⁶ and −1.28×10¹⁶, consume all available significand bits when accumulated. The residual “+2” is below the rounding threshold and is lost. IEEE single-precision reports 0; IEEE double-precision may report 0 or a value far from 2, depending on evaluation order.
This is not an academic curiosity. It is the normal operating condition of iterative solvers, Gram-Schmidt orthogonalization, and any algorithm that computes differences of nearly-equal quantities.
1.3 The Iterative Refinement Failure
Section titled “1.3 The Iterative Refinement Failure”The most consequential real-world failure occurs in iterative refinement for solving linear systems Ax = b:
- Compute an approximate solution x̃ (e.g., via LU factorization).
- Compute the residual r = b − Ax̃.
- Solve Ad = r for the correction.
- Update x̃ ← x̃ + d.
- Repeat until convergence.
The critical step is (2): computing the residual. This is a dot product of each row of A with x̃, subtracted from b. For ill-conditioned systems (condition number κ >> 1), Ax̃ is very close to b, and the residual is a small difference of large quantities — precisely the catastrophic cancellation scenario.
With standard floating-point, the residual is inaccurate, refinement stalls, and convergence fails for condition numbers above ~10⁸ in single precision or ~10¹⁶ in double precision. With an exact dot product, the residual is computed without any rounding error, and convergence proceeds even for arbitrarily ill-conditioned systems.
The Hilbert matrix — where H(i,j) = 1/(i+j−1) — is the canonical test case. Its condition number grows super-exponentially:
| Dimension | Condition Number |
|---|---|
| 5 | ~4.8×10⁵ |
| 10 | ~1.6×10¹³ |
| 15 | ~3.7×10¹⁷ |
| 20 | ~1.1×10²⁸ |
Standard double-precision iterative refinement fails for n ≥ 12. With exact residual computation via the super-accumulator, refinement converges for any dimension.
2. The Solution: Kulisch’s Exact Dot Product
Section titled “2. The Solution: Kulisch’s Exact Dot Product”2.1 Historical Context
Section titled “2.1 Historical Context”The idea that computer arithmetic should guarantee mathematical correctness — not just approximate it — was articulated by Ulrich W. Kulisch beginning in the late 1960s at the University of Karlsruhe (now Karlsruhe Institute of Technology).
Key milestones:
| Year | Event |
|---|---|
| 1969 | Kulisch begins work on complete arithmetic at Karlsruhe |
| 1976 | Kulisch & Miranker, “Arithmetic operations of digital computers with a more complete number system” — lays theoretical foundation |
| 1981 | Kulisch & Miranker, Computer Arithmetic in Theory and Practice (Academic Press) — the definitive monograph |
| 1983 | Kulisch, “A new arithmetic for scientific computation” — calls for the exact dot product as a hardware primitive |
| 1987 | PASCAL-XSC released — first language with built-in exact dot product via the dotprecision type |
| 1990 | C-XSC development begins at Karlsruhe and Wuppertal — C++ class library for extended scientific computing |
| 1992 | XPA 3233 vector arithmetic coprocessor — first hardware implementation of the Kulisch accumulator, built at Karlsruhe |
| 1997 | C-XSC 2.0 — mature library with verified solvers for linear systems, nonlinear equations, and optimization |
| 2008 | Kulisch, Computer Arithmetic and Validity (2nd ed., de Gruyter) — comprehensive treatment of theory and applications |
| 2011 | Kulisch & Snyder, “The exact dot product as basic tool for long interval arithmetic” — specifies a 4288-bit accumulator for IEEE binary64 |
| 2016 | Gustafson, The End of Error: Unum Computing — posit arithmetic adopts the super-accumulator as the “quire” |
| 2023–present | Universal Numbers Library: generalized quire works with any number system |
2.2 The Core Insight
Section titled “2.2 The Core Insight”Kulisch’s insight is deceptively simple:
If you never round intermediate results, the final answer can be correctly rounded in a single step.
For a dot product x·y = Σ xᵢ·yᵢ, this requires:
-
Exact multiplication. The product of two p-bit significands is exactly 2p bits wide. No bits are lost if we keep the full product.
-
Exact accumulation. A fixed-point register wide enough to span the full dynamic range of the number system can hold the exact sum of any number of products — the carry bits propagate exactly.
-
Single final rounding. Only when the accumulated result is converted back to a floating-point number does rounding occur — and it is correctly rounded (to nearest, or faithfully rounded within 1 ULP).
2.3 The Super-Accumulator
Section titled “2.3 The Super-Accumulator”The super-accumulator (Kulisch accumulator) is a fixed-point register whose width is determined by the dynamic range of the number system:
|<--- capacity --->|<------- upper range ------->.<------- lower range ------->| overflow guard integer part radix fractional partFor IEEE 754 binary64 (double precision):
- Exponent range: [−1022, +1023]
- Product exponent range: [−2044, +2046] → 4091 bits
- Plus sign bit and overflow capacity → ~4288 bits total
For posit<64,3>:
- Dynamic range: 8 × (4·64 − 8) = 1984 bits
- Plus capacity → ~2014 bits total
For posit<32,2>:
- Dynamic range: 4 × (4·32 − 8) = 480 bits
- Plus capacity → ~510 bits total
The posit number system’s efficient encoding of dynamic range makes the quire dramatically smaller than the IEEE equivalent — a 64-bit posit quire is less than half the size of the IEEE double quire.
2.4 Why It Was Not Adopted
Section titled “2.4 Why It Was Not Adopted”Despite its elegance and proven utility, the Kulisch accumulator was never integrated into mainstream processor architectures or the IEEE 754 standard. The reasons were primarily economic and institutional:
-
Silicon cost. In the 1980s–1990s, a 4096-bit register was considered extravagant for a single functional unit.
-
Standards inertia. IEEE 754 was standardized in 1985 without the exact dot product. Kulisch lobbied for its inclusion in the 2008 revision, but the committee chose not to mandate it.
-
Software workarounds. Compensated summation (Kahan, 1965) and error-free transformations (Dekker, Knuth, Priest) provided partial mitigation, reducing urgency.
-
Diminishing visibility. As the XSC tools remained academic and never gained mainstream adoption, the next generation of computational scientists learned floating-point arithmetic without knowing the super-accumulator alternative existed.
Today, the silicon cost argument is obsolete. A 4096-bit register is trivial on a modern chip with billions of transistors. The real barrier is now awareness — which this document aims to address.
3. The Generalized Quire in the Universal Numbers Library
Section titled “3. The Generalized Quire in the Universal Numbers Library”3.1 Design Philosophy
Section titled “3.1 Design Philosophy”The Universal Numbers Library elevates Kulisch’s concept from a posit-specific feature to a number-system-agnostic super-accumulator. The generalized quire works with any scalar type that provides:
- A compile-time dynamic range bound (via
quire_traits<T>) - An unrounded product representation (via
blocktriple) - Functions
quire_mul()andquire_resolve()for the type
This means the same accumulator architecture serves native IEEE-754
floats (float, double), configurable IEEE-style floats (cfloat),
posits (posit), fixed-point (fixpnt), logarithmic (lns),
double-base (dbns), and integer (integer) number systems.
3.2 Template Interface
Section titled “3.2 Template Interface”#include <universal/number/quire/quire.hpp>
template<typename NumberType, unsigned capacity = quire_traits<NumberType>::capacity, typename LimbType = uint32_t>class quire;| Parameter | Description |
|---|---|
NumberType | The scalar type (cfloat, posit, fixpnt, lns, dbns, integer, float, double) |
capacity | Overflow guard bits (default 30, allows ~2³⁰ accumulations) |
LimbType | Underlying unsigned type for carry propagation (uint32_t or uint64_t) |
3.3 Compile-Time Sizing via quire_traits
Section titled “3.3 Compile-Time Sizing via quire_traits”Each number system specializes quire_traits<T> to compute the
accumulator width at compile time:
| Number System | Range Formula | Example Configuration | Quire Bits |
|---|---|---|---|
cfloat<32,8> | 2·(bias + fbits) + 2·bias + 2 + 30 | IEEE single-precision | 584 |
posit<32,2> | 2^es · (4·nbits - 8) + 30 | Standard 32-bit posit | 510 |
fixpnt<16,8> | 2·nbits + 30 | 16-bit fixed-point | 62 |
lns<16,8> | 2·2^(nbits-1-rbits) + 30 | 16-bit log number system | 286 |
float | same as cfloat<32,8> | Native IEEE single | 584 |
double | same as cfloat<64,11> | Native IEEE double | 4226 |
3.4 The Fused Dot Product API
Section titled “3.4 The Fused Dot Product API”Each number system provides three functions:
// Full-precision multiply → unrounded product for quire accumulationblocktriple<...> quire_mul(const Scalar& a, const Scalar& b);
// Extract final result from quire → single rounding stepScalar quire_resolve(const quire<Scalar>& q);
// Vector dot product: accumulate all products, resolve onceScalar fdp(const std::vector<Scalar>& x, const std::vector<Scalar>& y);The fdp() function is the user-facing API. It creates a quire,
accumulates all products via quire_mul, and resolves the result
with a single rounding operation.
3.4.1 Fused BLAS Operators
Section titled “3.4.1 Fused BLAS Operators”The library extends exact dot products to higher-level BLAS operations:
#include <blas/ext/fdp_blas.hpp>
// Fused matrix-vector multiply (BLAS Level 2)// Each element of the result vector is an exact dot product of a row of A with xvector<Scalar> y = fmv(A, x);
// Fused matrix-matrix multiply (BLAS Level 3)// Each element of the result matrix is an exact dot product of a row of A with a column of Bmatrix<Scalar> C = fmm(A, B);These build on the same quire infrastructure: each output element is accumulated exactly in a quire and resolved with a single rounding step.
3.5 How It Works: Step by Step
Section titled “3.5 How It Works: Step by Step”Step 1: quire_mul(a, b) ┌─────────────────────────────────────────────────┐ │ Multiply significands exactly (2p bits) │ │ Compute product scale = scale(a) + scale(b) │ │ Return unrounded blocktriple │ └─────────────────────────────────────────────────┘ │ ▼Step 2: quire += blocktriple ┌─────────────────────────────────────────────────┐ │ Align product to quire radix point using scale │ │ Scatter product bits into accumulator limbs │ │ Propagate carries (exact integer addition) │ │ No rounding occurs │ └─────────────────────────────────────────────────┘ │ ▼ (repeat for all products) │ ▼Step 3: quire_resolve(quire) ┌─────────────────────────────────────────────────┐ │ Find MSB position in accumulator │ │ Extract significand bits at target precision │ │ Apply rounding mode (round to nearest) │ │ Return correctly-rounded result │ └─────────────────────────────────────────────────┘4. Demonstration: Exact Dot Products in Practice
Section titled “4. Demonstration: Exact Dot Products in Practice”4.1 Catastrophic Cancellation — Resolved
Section titled “4.1 Catastrophic Cancellation — Resolved”Using the Universal library, we can demonstrate that the quire resolves catastrophic cancellation that defeats IEEE arithmetic:
#include <universal/number/cfloat/cfloat.hpp>#include <universal/number/cfloat/fdp.hpp>
using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
// 511 pairs of (+1e8, −1e8) that cancel, plus a small residualstd::vector<Scalar> x(1024), y(1024);for (int i = 0; i < 511; ++i) { x[2*i] = Scalar(1e8f); y[2*i] = Scalar(1.0f); x[2*i + 1] = Scalar(-1e8f); y[2*i + 1] = Scalar(1.0f);}x[1022] = Scalar(0.5f); y[1022] = Scalar(1.0f);x[1023] = Scalar(0.25f); y[1023] = Scalar(1.0f);
// Naive fp32 accumulation: loses the 0.75 residualfloat naive = 0.0f;for (int i = 0; i < 1024; ++i) naive += float(x[i]) * float(y[i]);// naive == 0.0 (WRONG)
// Fused dot product via quire: exact accumulationScalar result = sw::universal::fdp(x, y);// result == 0.75 (CORRECT)4.2 Reproducibility Across Evaluation Order
Section titled “4.2 Reproducibility Across Evaluation Order”The quire guarantees identical results regardless of summation order:
#include <universal/number/cfloat/cfloat.hpp>#include <universal/number/cfloat/fdp.hpp>#include <algorithm>#include <random>
using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
// Random vectorsstd::vector<Scalar> x(1024), y(1024);std::mt19937 rng(42);std::uniform_real_distribution<float> dist(-100.0f, 100.0f);for (int i = 0; i < 1024; ++i) { x[i] = Scalar(dist(rng)); y[i] = Scalar(dist(rng));}
// Forward orderScalar result1 = sw::universal::fdp(x, y);
// Reverse orderstd::vector<Scalar> xr(x.rbegin(), x.rend());std::vector<Scalar> yr(y.rbegin(), y.rend());Scalar result2 = sw::universal::fdp(xr, yr);
assert(double(result1) == double(result2)); // Always trueWith naive floating-point accumulation, the forward and reverse results would typically differ.
4.3 Multi-Scale Products
Section titled “4.3 Multi-Scale Products”Products spanning the full dynamic range accumulate correctly:
// a[i] = 2^(i-512), b[i] = 2^(512-i), so every product = 1.0// Sum of 1024 ones = 1024.0std::vector<Scalar> x(1024), y(1024);for (int i = 0; i < 1024; ++i) { int exp_x = std::max(-60, std::min(60, i - 512)); x[i] = Scalar(std::ldexp(1.0f, exp_x)); y[i] = Scalar(std::ldexp(1.0f, -exp_x));}Scalar result = sw::universal::fdp(x, y);// result == 1024.0 (exact, despite 120+ orders of magnitude variation)4.4 Number-System Agnostic: Same Algorithm, Any Type
Section titled “4.4 Number-System Agnostic: Same Algorithm, Any Type”The generalized quire works identically across all number systems:
// Posit FDP#include <universal/number/posit/posit.hpp>#include <universal/number/posit/fdp.hpp>{ using Scalar = sw::universal::posit<32, 2>; std::vector<Scalar> x = { 3.2e8, 1, -1, 8e7 }; std::vector<Scalar> y = { 4.0e7, 1, -1, -1.6e8 }; Scalar result = sw::universal::fdp(x, y); // == 2}
// Fixed-point FDP#include <universal/number/fixpnt/fixpnt.hpp>#include <universal/number/fixpnt/fdp.hpp>{ using Scalar = sw::universal::fixpnt<16, 8>; std::vector<Scalar> x = { 1.5, 2.25, 3.125 }; std::vector<Scalar> y = { 4.0, 5.0, 6.0 }; Scalar result = sw::universal::fdp(x, y);}
// Logarithmic number system FDP#include <universal/number/lns/lns.hpp>#include <universal/number/lns/fdp.hpp>{ using Scalar = sw::universal::lns<16, 8>; std::vector<Scalar> x = { 1.0, 2.0 }; std::vector<Scalar> y = { 3.0, 4.0 }; Scalar result = sw::universal::fdp(x, y); // == 11}4.5 Quire Continuation: Batched Accumulation
Section titled “4.5 Quire Continuation: Batched Accumulation”For streaming or batched computation, the quire supports continuation — accumulate across multiple calls without resolving:
using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
sw::universal::quire<Scalar> q;
// Batch 1: from sensor Astd::vector<Scalar> x1 = { /* ... */ }, y1 = { /* ... */ };sw::universal::fdp_qc(q, x1.size(), x1, 1, y1, 1);
// Batch 2: from sensor Bstd::vector<Scalar> x2 = { /* ... */ }, y2 = { /* ... */ };sw::universal::fdp_qc(q, x2.size(), x2, 1, y2, 1);
// Resolve once at the endScalar result = sw::universal::quire_resolve(q);5. Application: Verified Solution of Ill-Conditioned Linear Systems
Section titled “5. Application: Verified Solution of Ill-Conditioned Linear Systems”5.1 The Algorithm
Section titled “5.1 The Algorithm”The classical application of the Kulisch accumulator — and the one that motivated much of the XSC development — is verified iterative refinement for linear systems:
Algorithm: Verified Iterative Refinement with Exact ResidualInput: A ∈ ℝⁿˣⁿ, b ∈ ℝⁿOutput: Verified enclosure [x] containing the true solution
1. Compute approximate inverse R ≈ A⁻¹ (double precision)2. Compute C = I − RA (exact dot product via quire)3. Compute z = Rb (exact dot product via quire)4. Set [x] = z5. Repeat: [y] = C · [x] + z (interval arithmetic + exact dot product) if [y] ⊂ [x]: (contraction verified) return [y] (proven enclosure of true solution) [x] = [y]The key insight: step 2 and subsequent residual computations use the exact dot product, so the residual C = I − RA is computed to full accuracy. This guarantees that the iteration contracts, even for matrices with condition numbers exceeding 10²⁰.
5.2 Hilbert Matrix Example
Section titled “5.2 Hilbert Matrix Example”The Universal library includes a Hilbert matrix test that demonstrates this application:
#include <universal/number/posit/posit.hpp>#include <blas/blas.hpp>#include <blas/generators.hpp>
template<typename Scalar>void HilbertMatrixTest(size_t N = 5) { using Matrix = sw::numeric::containers::matrix<Scalar>; Matrix H(N, N), Hinv(N, N);
sw::blas::GenerateHilbertMatrix<Scalar>(H, false); sw::blas::GenerateHilbertMatrixInverse<Scalar>(Hinv);
// With standard arithmetic, Hinv * H ≠ I for large N // With FDP-based matrix multiply, Hinv * H → I std::cout << "Validation: Hinv * H =>\n" << Hinv * H << '\n';}6. Broader Impact: Where Exact Dot Products Matter
Section titled “6. Broader Impact: Where Exact Dot Products Matter”6.1 Deep Learning
Section titled “6.1 Deep Learning”Mixed-precision training (FP16/BF16 forward, FP32 backward) loses gradient information through accumulation error. An exact accumulator for the dot products in matrix multiplication preserves gradient fidelity, enabling training with lower precision operands without loss scaling hacks.
6.2 High-Performance Computing
Section titled “6.2 High-Performance Computing”MPI-based parallel reductions produce non-reproducible results because the reduction tree order varies across runs. FDP guarantees the same result regardless of the number of processes or reduction order — essential for debugging, verification, and regulatory compliance.
6.3 Digital Signal Processing
Section titled “6.3 Digital Signal Processing”FIR and IIR filter computations are dot products of coefficient and sample vectors. Accumulated rounding error manifests as filter distortion. FDP eliminates this error source entirely.
6.4 Financial Computing
Section titled “6.4 Financial Computing”Regulatory requirements (Basel III/IV, MiFID II) demand reproducible risk calculations. Non-deterministic floating-point accumulation creates audit failures. FDP provides bit-exact reproducibility.
6.5 Computational Geometry
Section titled “6.5 Computational Geometry”Geometric predicates (orientation tests, in-circle tests) reduce to sign-of-determinant computations — which are dot products. A wrong sign means a wrong topological decision. FDP guarantees correct predicates without the complexity of adaptive precision arithmetic (Shewchuk’s approach).
7. Comparison with Alternative Approaches
Section titled “7. Comparison with Alternative Approaches”| Approach | Accuracy | Performance | Generality | Complexity |
|---|---|---|---|---|
| Kulisch/Quire (exact) | Exact (1 rounding) | ~1.5× naive | Any number system | Low (library call) |
| Kahan compensated sum | ~2× precision | ~2× naive | Floating-point only | Moderate |
| Error-free transforms (Ogita-Rump-Oishi) | Faithful (~K rounds) | ~4–10× naive | IEEE 754 only | High |
| Higher precision (double → quad) | ~2× precision | ~10–100× naive | Limited availability | Low |
| Arbitrary precision (MPFR) | Arbitrary | ~100–1000× naive | Floating-point only | Moderate |
The Kulisch accumulator is unique in providing exact results with near-native performance and minimal code complexity. It is the only approach that guarantees one rounding operation for the entire dot product.
8. Building and Running the Examples
Section titled “8. Building and Running the Examples”8.1 Prerequisites
Section titled “8.1 Prerequisites”- C++20-compliant compiler (GCC 11+, Clang 14+, MSVC 2022+)
- CMake 3.22+
- No external dependencies (header-only library)
8.2 Build Commands
Section titled “8.2 Build Commands”git clone https://github.com/stillwater-sc/universal.gitcd universalmkdir build && cd build
# Build quire tests and FDP examplescmake -DUNIVERSAL_BUILD_NUMBER_STATICS=ON ..make -j4
# Run the FDP tests./static/quire/api/quire_cfloat_fdp./static/quire/api/quire_fixpnt_fdp./static/quire/api/quire_lns_fdp./static/quire/api/quire_dbns_fdp
# Run the educational example./education/quire/edu_quire_quires
# Run the application examples./applications/reproducibility/blas/blas_l1_fdp # FDP vs naive dot product./applications/reproducibility/blas/blas_l2_fmv # FDP matrix-vector multiply./applications/reproducibility/blas/blas_l3_fmm # FDP matrix-matrix multiply./applications/reproducibility/blas/blas_hilbert # Hilbert matrix verified solver8.3 Using FDP in Your Project
Section titled “8.3 Using FDP in Your Project”find_package(UNIVERSAL CONFIG REQUIRED)target_link_libraries(my_app universal::universal)#include <universal/number/cfloat/cfloat.hpp>#include <universal/number/cfloat/fdp.hpp>
using Real = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;
std::vector<Real> x = { /* your data */ };std::vector<Real> y = { /* your data */ };Real result = sw::universal::fdp(x, y); // exact dot product9. Related Work and Further Reading
Section titled “9. Related Work and Further Reading”9.1 Primary References
Section titled “9.1 Primary References”-
Kulisch, U. W. and Miranker, W. L. (1981). Computer Arithmetic in Theory and Practice. Academic Press. — The foundational monograph on complete arithmetic.
-
Kulisch, U. W. (2008). Computer Arithmetic and Validity: Theory, Implementation, and Applications. 2nd ed., de Gruyter Studies in Mathematics, vol. 33. — Comprehensive treatment of the theory and its applications to verified computing.
-
Kulisch, U. W. and Snyder, V. (2011). “The exact dot product as basic tool for long interval arithmetic.” Computing, 91(3), 307–313. — Specifies the 4288-bit accumulator for IEEE binary64.
-
Klatte, R., Kulisch, U., Wiethoff, A., Lawo, C., and Rauch, M. (1993). C-XSC: A C++ Class Library for Extended Scientific Computing. Springer-Verlag. — The C++ implementation of the XSC paradigm.
-
Klatte, R., Kulisch, U., Neaga, M., Ratz, D., and Ullrich, Ch. (1992). PASCAL-XSC: Language Reference with Examples. Springer-Verlag. — The first language with built-in exact dot product.
9.2 Hardware Implementations
Section titled “9.2 Hardware Implementations”-
Biancolin, D., Katerman, J., Lee, H., and Wawrzynek, J. (2005). “A Hardware Accelerator for Computing an Exact Dot Product.” Proceedings of the 17th IEEE Symposium on Computer Arithmetic (ARITH-17).
-
Uguen, Y. and de Dinechin, F. (2017). “Design-space exploration for the Kulisch accumulator.” HAL archives-ouvertes, hal-01488916v2.
9.3 Compensated Algorithms (Alternative Approaches)
Section titled “9.3 Compensated Algorithms (Alternative Approaches)”-
Ogita, T., Rump, S. M., and Oishi, S. (2005). “Accurate sum and dot product.” SIAM J. Sci. Comput., 26(6), 1955–1988.
-
Rump, S. M. (2011). “Error estimation of floating-point summation and dot product.” BIT Numerical Mathematics, 52(2), 541–562.
9.4 Posit Arithmetic and the Quire
Section titled “9.4 Posit Arithmetic and the Quire”-
Gustafson, J. L. (2017). The End of Error: Unum Computing. CRC Press. — Introduces the quire as part of the posit number system.
-
Gustafson, J. L. and Yonemoto, I. T. (2017). “Beating Floating Point at its Own Game: Posit Arithmetic.” Supercomputing Frontiers and Innovations, 4(2), 71–86.
9.5 Verified Computing and Interval Arithmetic
Section titled “9.5 Verified Computing and Interval Arithmetic”-
Rump, S. M. (1999). “INTLAB - INTerval LABoratory.” Developments in Reliable Computing, Kluwer, 77–104.
-
Krämer, W., Bantle, A., and Grimmer, M. (2009). “Fast (Parallel) Dense Linear System Solvers in C-XSC Using Error Free Transformations and BLAS.” Springer LNCS, 5947, 230–249.
9.6 Classical Summation Error Analysis
Section titled “9.6 Classical Summation Error Analysis”-
Kahan, W. (1965). “Further remarks on reducing truncation errors.” Communications of the ACM, 8(1), 40.
-
Goldberg, D. (1991). “What every computer scientist should know about floating-point arithmetic.” ACM Computing Surveys, 23(1), 5–48.
10. Conclusion
Section titled “10. Conclusion”The Kulisch super-accumulator is not a theoretical curiosity — it is a solved problem that the computing industry chose not to deploy. The arithmetic is exact, the hardware cost is negligible on modern silicon, and the software is available today as a header-only C++ library.
The Universal Numbers Library’s generalized quire makes this capability accessible to any computational scientist:
- One header to include (
<universal/number/TYPE/fdp.hpp>) - One function to call (
fdp(x, y)) - One rounding operation for the entire dot product
- Any number system — native float/double, IEEE-configurable floats, posits, fixed-point, logarithmic, double-base
The forgotten arithmetic of Ulrich Kulisch deserves to be rediscovered. Every matrix multiply, every iterative solver, every gradient computation in every neural network is a collection of dot products. Making those dot products exact is not a luxury — it is the straightforward, elegant, and now easily accessible solution to the most fundamental problem in numerical computing.
Appendix A: Quire Size Reference Table
Section titled “Appendix A: Quire Size Reference Table”| Number System | Configuration | Dynamic Range (bits) | Quire Width (bits) |
|---|---|---|---|
| IEEE half | cfloat<16,5> | 94 | 124 |
| IEEE single | cfloat<32,8> | 554 | 584 |
| IEEE double | cfloat<64,11> | 4196 | 4226 |
| bfloat16 | cfloat<16,8> | 530 | 560 |
| Native float | float | 554 | 584 |
| Native double | double | 4196 | 4226 |
| posit<8,0> | 8-bit posit | 24 | 54 |
| posit<16,1> | 16-bit posit | 112 | 142 |
| posit<32,2> | 32-bit posit | 480 | 510 |
| posit<64,3> | 64-bit posit | 1984 | 2014 |
| fixpnt<16,8> | 16-bit fixed | 32 | 62 |
| fixpnt<32,16> | 32-bit fixed | 64 | 94 |
| lns<16,8> | 16-bit log | 256 | 286 |
| lns<32,16> | 32-bit log | 65536 | 65566 |
| dbns<16,8> | 16-bit double-base | 2048 | 2078 |
Appendix B: Repository Map
Section titled “Appendix B: Repository Map”universal/├── include/sw/universal/│ ├── number/quire/│ │ ├── quire.hpp # Umbrella header│ │ ├── quire_impl.hpp # Generalized quire implementation│ │ └── exceptions.hpp # Quire-specific exceptions│ ├── number/cfloat/fdp.hpp # cfloat FDP (quire_mul, quire_resolve, fdp)│ ├── number/posit/fdp.hpp # posit FDP│ ├── number/fixpnt/fdp.hpp # fixpnt FDP│ ├── number/lns/fdp.hpp # lns FDP│ ├── number/dbns/fdp.hpp # dbns FDP│ ├── number/integer/integer_impl.hpp # integer normalize (quire input)│ ├── native/fdp.hpp # Native float/double FDP│ └── traits/quire_traits.hpp # Compile-time sizing for all types├── static/quire/api/│ ├── cfloat_fdp.cpp # Comprehensive cfloat FDP tests│ ├── fixpnt_fdp.cpp # Fixed-point FDP tests│ ├── lns_fdp.cpp # LNS FDP tests│ ├── dbns_fdp.cpp # DBNS FDP tests│ └── generalized_quire.cpp # Cross-type quire demonstration├── education/quire/│ └── quires.cpp # Educational quire examples├── applications/reproducibility/blas/│ ├── l1_fdp.cpp # FDP vs naive dot product│ ├── l2_fmv.cpp # FDP matrix-vector multiply│ ├── l3_fmm.cpp # FDP matrix-matrix multiply│ └── hilbert.cpp # Hilbert matrix verified solver└── include/sw/blas/ ├── cg_fdp_solvers.hpp # Conjugate gradient with FDP └── blas_l1.hpp # BLAS Level 1 with FDP supportThis document accompanies the Universal Numbers Library (github.com/stillwater-sc/universal), an open-source, header-only C++20 template library for custom arithmetic. Licensed under the MIT License.