The Kulisch Super-Accumulator: Exact Dot Products for Reliable Numerical Computing

A Practitioner’s Guide to Reproducible Linear Algebra with the Universal Numbers Library

Theodore Omtzigt, Stillwater Supercomputing, Inc.

Abstract

The dot product is the most fundamental operation in numerical computing. It underlies matrix multiplication, iterative solvers, neural-network inference, signal processing, and virtually every algorithm in computational science. Yet in IEEE 754 floating-point arithmetic, the dot product is also the primary source of precision loss, non-reproducibility, and numerical failure.

In the 1970s, Ulrich Kulisch and Willard Miranker showed that a single architectural addition — a fixed-point super-accumulator wide enough to hold the exact sum of any number of floating-point products — eliminates these problems at their root. Their insight was implemented in the XSC family of languages and in dedicated hardware, but never adopted by mainstream processors or programming languages. As a result, the insight has been largely forgotten by a generation of computational scientists.

This document reconstructs the why, what, and how of the Kulisch super-accumulator, provides the historical context that motivated it, and shows how the Universal Numbers Library’s generalized quire makes exact dot products available today — as a header-only C++ template that works with any number system: IEEE-754 floats, posits, fixed-point, logarithmic, and double-base number systems.

1. The Problem: Why Floating-Point Dot Products Fail

1.1 The Three Pathologies

Every floating-point dot product suffers from three interrelated pathologies:

Catastrophic cancellation. When large terms of opposite sign nearly cancel, the small residual that remains is dominated by the rounding errors accumulated in the large terms. The true answer is lost.
Non-reproducibility. Because floating-point addition is not associative, reordering the summation — by parallelism, vectorization, or even different compiler optimization levels — changes the result. The same program on different hardware produces different answers.
Stalled convergence in iterative refinement. Algorithms that improve an approximate solution by computing the residual r = b − Ax depend on the residual being accurate. When the residual itself is computed with a lossy dot product, refinement stalls or diverges, and ill-conditioned systems become unsolvable.

1.2 A Concrete Example: Catastrophic Cancellation

Consider two vectors that produce large intermediate products which nearly cancel:

a = [ 3.2×10⁸,   1,  −1,   8×10⁷ ]
b = [ 4.0×10⁷,   1,  −1,  −1.6×10⁸ ]

The exact dot product is:

a·b = (3.2×10⁸)(4.0×10⁷) + (1)(1) + (−1)(−1) + (8×10⁷)(−1.6×10⁸)
    = 1.28×10¹⁶ + 1 + 1 − 1.28×10¹⁶
    = 2

But the two dominant products, 1.28×10¹⁶ and −1.28×10¹⁶, consume all available significand bits when accumulated. The residual “+2” is below the rounding threshold and is lost. IEEE single-precision reports 0; IEEE double-precision may report 0 or a value far from 2, depending on evaluation order.

This is not an academic curiosity. It is the normal operating condition of iterative solvers, Gram-Schmidt orthogonalization, and any algorithm that computes differences of nearly-equal quantities.

The most consequential real-world failure occurs in iterative refinement for solving linear systems Ax = b:

Compute an approximate solution x̃ (e.g., via LU factorization).
Compute the residual r = b − Ax̃.
Solve Ad = r for the correction.
Update x̃ ← x̃ + d.
Repeat until convergence.

The critical step is (2): computing the residual. This is a dot product of each row of A with x̃, subtracted from b. For ill-conditioned systems (condition number κ >> 1), Ax̃ is very close to b, and the residual is a small difference of large quantities — precisely the catastrophic cancellation scenario.

With standard floating-point, the residual is inaccurate, refinement stalls, and convergence fails for condition numbers above ~10⁸ in single precision or ~10¹⁶ in double precision. With an exact dot product, the residual is computed without any rounding error, and convergence proceeds even for arbitrarily ill-conditioned systems.

The Hilbert matrix — where H(i,j) = 1/(i+j−1) — is the canonical test case. Its condition number grows super-exponentially:

Dimension	Condition Number
5	~4.8×10⁵
10	~1.6×10¹³
15	~3.7×10¹⁷
20	~1.1×10²⁸

Standard double-precision iterative refinement fails for n ≥ 12. With exact residual computation via the super-accumulator, refinement converges for any dimension.

2. The Solution: Kulisch’s Exact Dot Product

2.1 Historical Context

The idea that computer arithmetic should guarantee mathematical correctness — not just approximate it — was articulated by Ulrich W. Kulisch beginning in the late 1960s at the University of Karlsruhe (now Karlsruhe Institute of Technology).

Key milestones:

Year	Event
1969	Kulisch begins work on complete arithmetic at Karlsruhe
1976	Kulisch & Miranker, “Arithmetic operations of digital computers with a more complete number system” — lays theoretical foundation
1981	Kulisch & Miranker, Computer Arithmetic in Theory and Practice (Academic Press) — the definitive monograph
1983	Kulisch, “A new arithmetic for scientific computation” — calls for the exact dot product as a hardware primitive
1987	PASCAL-XSC released — first language with built-in exact dot product via the `dotprecision` type
1990	C-XSC development begins at Karlsruhe and Wuppertal — C++ class library for extended scientific computing
1992	XPA 3233 vector arithmetic coprocessor — first hardware implementation of the Kulisch accumulator, built at Karlsruhe
1997	C-XSC 2.0 — mature library with verified solvers for linear systems, nonlinear equations, and optimization
2008	Kulisch, Computer Arithmetic and Validity (2nd ed., de Gruyter) — comprehensive treatment of theory and applications
2011	Kulisch & Snyder, “The exact dot product as basic tool for long interval arithmetic” — specifies a 4288-bit accumulator for IEEE binary64
2016	Gustafson, The End of Error: Unum Computing — posit arithmetic adopts the super-accumulator as the “quire”
2023–present	Universal Numbers Library: generalized quire works with any number system

2.2 The Core Insight

Kulisch’s insight is deceptively simple:

If you never round intermediate results, the final answer can be correctly rounded in a single step.

For a dot product x·y = Σ xᵢ·yᵢ, this requires:

Exact multiplication. The product of two p-bit significands is exactly 2p bits wide. No bits are lost if we keep the full product.
Exact accumulation. A fixed-point register wide enough to span the full dynamic range of the number system can hold the exact sum of any number of products — the carry bits propagate exactly.
Single final rounding. Only when the accumulated result is converted back to a floating-point number does rounding occur — and it is correctly rounded (to nearest, or faithfully rounded within 1 ULP).

2.3 The Super-Accumulator

The super-accumulator (Kulisch accumulator) is a fixed-point register whose width is determined by the dynamic range of the number system:

|<--- capacity --->|<------- upper range ------->.<------- lower range ------->|
   overflow guard          integer part           radix       fractional part

For IEEE 754 binary64 (double precision):

Exponent range: [−1022, +1023]
Product exponent range: [−2044, +2046] → 4091 bits
Plus sign bit and overflow capacity → ~4288 bits total

For posit<64,3>:

Dynamic range: 8 × (4·64 − 8) = 1984 bits
Plus capacity → ~2014 bits total

For posit<32,2>:

Dynamic range: 4 × (4·32 − 8) = 480 bits
Plus capacity → ~510 bits total

The posit number system’s efficient encoding of dynamic range makes the quire dramatically smaller than the IEEE equivalent — a 64-bit posit quire is less than half the size of the IEEE double quire.

2.4 Why It Was Not Adopted

Despite its elegance and proven utility, the Kulisch accumulator was never integrated into mainstream processor architectures or the IEEE 754 standard. The reasons were primarily economic and institutional:

Silicon cost. In the 1980s–1990s, a 4096-bit register was considered extravagant for a single functional unit.
Standards inertia. IEEE 754 was standardized in 1985 without the exact dot product. Kulisch lobbied for its inclusion in the 2008 revision, but the committee chose not to mandate it.
Software workarounds. Compensated summation (Kahan, 1965) and error-free transformations (Dekker, Knuth, Priest) provided partial mitigation, reducing urgency.
Diminishing visibility. As the XSC tools remained academic and never gained mainstream adoption, the next generation of computational scientists learned floating-point arithmetic without knowing the super-accumulator alternative existed.

Today, the silicon cost argument is obsolete. A 4096-bit register is trivial on a modern chip with billions of transistors. The real barrier is now awareness — which this document aims to address.

3. The Generalized Quire in the Universal Numbers Library

3.1 Design Philosophy

The Universal Numbers Library elevates Kulisch’s concept from a posit-specific feature to a number-system-agnostic super-accumulator. The generalized quire works with any scalar type that provides:

A compile-time dynamic range bound (via quire_traits<T>)
An unrounded product representation (via blocktriple)
Functions quire_mul() and quire_resolve() for the type

This means the same accumulator architecture serves native IEEE-754 floats (float, double), configurable IEEE-style floats (cfloat), posits (posit), fixed-point (fixpnt), logarithmic (lns), double-base (dbns), and integer (integer) number systems.

3.2 Template Interface

#include <universal/number/quire/quire.hpp>

template<typename NumberType,
         unsigned capacity = quire_traits<NumberType>::capacity,
         typename LimbType = uint32_t>
class quire;

Parameter	Description
`NumberType`	The scalar type (cfloat, posit, fixpnt, lns, dbns, integer, float, double)
`capacity`	Overflow guard bits (default 30, allows ~2³⁰ accumulations)
`LimbType`	Underlying unsigned type for carry propagation (uint32_t or uint64_t)

3.3 Compile-Time Sizing via `quire_traits`

Each number system specializes quire_traits<T> to compute the accumulator width at compile time:

Number System	Range Formula	Example Configuration	Quire Bits
`cfloat<32,8>`	2·(bias + fbits) + 2·bias + 2 + 30	IEEE single-precision	584
`posit<32,2>`	2^es · (4·nbits - 8) + 30	Standard 32-bit posit	510
`fixpnt<16,8>`	2·nbits + 30	16-bit fixed-point	62
`lns<16,8>`	2·2^(nbits-1-rbits) + 30	16-bit log number system	286
`float`	same as cfloat<32,8>	Native IEEE single	584
`double`	same as cfloat<64,11>	Native IEEE double	4226

3.4 The Fused Dot Product API

Each number system provides three functions:

// Full-precision multiply → unrounded product for quire accumulation
blocktriple<...> quire_mul(const Scalar& a, const Scalar& b);

// Extract final result from quire → single rounding step
Scalar quire_resolve(const quire<Scalar>& q);

// Vector dot product: accumulate all products, resolve once
Scalar fdp(const std::vector<Scalar>& x, const std::vector<Scalar>& y);

The fdp() function is the user-facing API. It creates a quire, accumulates all products via quire_mul, and resolves the result with a single rounding operation.

3.4.1 Fused BLAS Operators

The library extends exact dot products to higher-level BLAS operations:

#include <blas/ext/fdp_blas.hpp>

// Fused matrix-vector multiply (BLAS Level 2)
// Each element of the result vector is an exact dot product of a row of A with x
vector<Scalar> y = fmv(A, x);

// Fused matrix-matrix multiply (BLAS Level 3)
// Each element of the result matrix is an exact dot product of a row of A with a column of B
matrix<Scalar> C = fmm(A, B);

These build on the same quire infrastructure: each output element is accumulated exactly in a quire and resolved with a single rounding step.

3.5 How It Works: Step by Step

Step 1: quire_mul(a, b)
  ┌─────────────────────────────────────────────────┐
  │  Multiply significands exactly (2p bits)         │
  │  Compute product scale = scale(a) + scale(b)     │
  │  Return unrounded blocktriple                    │
  └─────────────────────────────────────────────────┘
              │
              ▼
Step 2: quire += blocktriple
  ┌─────────────────────────────────────────────────┐
  │  Align product to quire radix point using scale  │
  │  Scatter product bits into accumulator limbs     │
  │  Propagate carries (exact integer addition)      │
  │  No rounding occurs                              │
  └─────────────────────────────────────────────────┘
              │
              ▼  (repeat for all products)
              │
              ▼
Step 3: quire_resolve(quire)
  ┌─────────────────────────────────────────────────┐
  │  Find MSB position in accumulator                │
  │  Extract significand bits at target precision    │
  │  Apply rounding mode (round to nearest)          │
  │  Return correctly-rounded result                 │
  └─────────────────────────────────────────────────┘

4. Demonstration: Exact Dot Products in Practice

4.1 Catastrophic Cancellation — Resolved

Using the Universal library, we can demonstrate that the quire resolves catastrophic cancellation that defeats IEEE arithmetic:

#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/cfloat/fdp.hpp>

using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;

// 511 pairs of (+1e8, −1e8) that cancel, plus a small residual
std::vector<Scalar> x(1024), y(1024);
for (int i = 0; i < 511; ++i) {
    x[2*i]     = Scalar(1e8f);   y[2*i]     = Scalar(1.0f);
    x[2*i + 1] = Scalar(-1e8f);  y[2*i + 1] = Scalar(1.0f);
}
x[1022] = Scalar(0.5f);   y[1022] = Scalar(1.0f);
x[1023] = Scalar(0.25f);  y[1023] = Scalar(1.0f);

// Naive fp32 accumulation: loses the 0.75 residual
float naive = 0.0f;
for (int i = 0; i < 1024; ++i)
    naive += float(x[i]) * float(y[i]);
// naive == 0.0  (WRONG)

// Fused dot product via quire: exact accumulation
Scalar result = sw::universal::fdp(x, y);
// result == 0.75  (CORRECT)

4.2 Reproducibility Across Evaluation Order

The quire guarantees identical results regardless of summation order:

#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/cfloat/fdp.hpp>
#include <algorithm>
#include <random>

using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;

// Random vectors
std::vector<Scalar> x(1024), y(1024);
std::mt19937 rng(42);
std::uniform_real_distribution<float> dist(-100.0f, 100.0f);
for (int i = 0; i < 1024; ++i) {
    x[i] = Scalar(dist(rng));
    y[i] = Scalar(dist(rng));
}

// Forward order
Scalar result1 = sw::universal::fdp(x, y);

// Reverse order
std::vector<Scalar> xr(x.rbegin(), x.rend());
std::vector<Scalar> yr(y.rbegin(), y.rend());
Scalar result2 = sw::universal::fdp(xr, yr);

assert(double(result1) == double(result2));  // Always true

With naive floating-point accumulation, the forward and reverse results would typically differ.

4.3 Multi-Scale Products

Products spanning the full dynamic range accumulate correctly:

// a[i] = 2^(i-512), b[i] = 2^(512-i), so every product = 1.0
// Sum of 1024 ones = 1024.0
std::vector<Scalar> x(1024), y(1024);
for (int i = 0; i < 1024; ++i) {
    int exp_x = std::max(-60, std::min(60, i - 512));
    x[i] = Scalar(std::ldexp(1.0f, exp_x));
    y[i] = Scalar(std::ldexp(1.0f, -exp_x));
}
Scalar result = sw::universal::fdp(x, y);
// result == 1024.0  (exact, despite 120+ orders of magnitude variation)

4.4 Number-System Agnostic: Same Algorithm, Any Type

The generalized quire works identically across all number systems:

// Posit FDP
#include <universal/number/posit/posit.hpp>
#include <universal/number/posit/fdp.hpp>
{
    using Scalar = sw::universal::posit<32, 2>;
    std::vector<Scalar> x = { 3.2e8, 1, -1, 8e7 };
    std::vector<Scalar> y = { 4.0e7, 1, -1, -1.6e8 };
    Scalar result = sw::universal::fdp(x, y);  // == 2
}

// Fixed-point FDP
#include <universal/number/fixpnt/fixpnt.hpp>
#include <universal/number/fixpnt/fdp.hpp>
{
    using Scalar = sw::universal::fixpnt<16, 8>;
    std::vector<Scalar> x = { 1.5, 2.25, 3.125 };
    std::vector<Scalar> y = { 4.0, 5.0, 6.0 };
    Scalar result = sw::universal::fdp(x, y);
}

// Logarithmic number system FDP
#include <universal/number/lns/lns.hpp>
#include <universal/number/lns/fdp.hpp>
{
    using Scalar = sw::universal::lns<16, 8>;
    std::vector<Scalar> x = { 1.0, 2.0 };
    std::vector<Scalar> y = { 3.0, 4.0 };
    Scalar result = sw::universal::fdp(x, y);  // == 11
}

4.5 Quire Continuation: Batched Accumulation

For streaming or batched computation, the quire supports continuation — accumulate across multiple calls without resolving:

using Scalar = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;

sw::universal::quire<Scalar> q;

// Batch 1: from sensor A
std::vector<Scalar> x1 = { /* ... */ }, y1 = { /* ... */ };
sw::universal::fdp_qc(q, x1.size(), x1, 1, y1, 1);

// Batch 2: from sensor B
std::vector<Scalar> x2 = { /* ... */ }, y2 = { /* ... */ };
sw::universal::fdp_qc(q, x2.size(), x2, 1, y2, 1);

// Resolve once at the end
Scalar result = sw::universal::quire_resolve(q);

5. Application: Verified Solution of Ill-Conditioned Linear Systems

5.1 The Algorithm

The classical application of the Kulisch accumulator — and the one that motivated much of the XSC development — is verified iterative refinement for linear systems:

Algorithm: Verified Iterative Refinement with Exact Residual
Input: A ∈ ℝⁿˣⁿ, b ∈ ℝⁿ
Output: Verified enclosure [x] containing the true solution

1. Compute approximate inverse R ≈ A⁻¹  (double precision)
2. Compute C = I − RA                    (exact dot product via quire)
3. Compute z = Rb                         (exact dot product via quire)
4. Set [x] = z
5. Repeat:
     [y] = C · [x] + z                   (interval arithmetic + exact dot product)
     if [y] ⊂ [x]:                       (contraction verified)
       return [y]                         (proven enclosure of true solution)
     [x] = [y]

The key insight: step 2 and subsequent residual computations use the exact dot product, so the residual C = I − RA is computed to full accuracy. This guarantees that the iteration contracts, even for matrices with condition numbers exceeding 10²⁰.

5.2 Hilbert Matrix Example

The Universal library includes a Hilbert matrix test that demonstrates this application:

#include <universal/number/posit/posit.hpp>
#include <blas/blas.hpp>
#include <blas/generators.hpp>

template<typename Scalar>
void HilbertMatrixTest(size_t N = 5) {
    using Matrix = sw::numeric::containers::matrix<Scalar>;
    Matrix H(N, N), Hinv(N, N);

    sw::blas::GenerateHilbertMatrix<Scalar>(H, false);
    sw::blas::GenerateHilbertMatrixInverse<Scalar>(Hinv);

    // With standard arithmetic, Hinv * H ≠ I for large N
    // With FDP-based matrix multiply, Hinv * H → I
    std::cout << "Validation: Hinv * H =>\n" << Hinv * H << '\n';
}

6. Broader Impact: Where Exact Dot Products Matter

6.1 Deep Learning

Mixed-precision training (FP16/BF16 forward, FP32 backward) loses gradient information through accumulation error. An exact accumulator for the dot products in matrix multiplication preserves gradient fidelity, enabling training with lower precision operands without loss scaling hacks.

6.2 High-Performance Computing

MPI-based parallel reductions produce non-reproducible results because the reduction tree order varies across runs. FDP guarantees the same result regardless of the number of processes or reduction order — essential for debugging, verification, and regulatory compliance.

6.3 Digital Signal Processing

FIR and IIR filter computations are dot products of coefficient and sample vectors. Accumulated rounding error manifests as filter distortion. FDP eliminates this error source entirely.

6.4 Financial Computing

Regulatory requirements (Basel III/IV, MiFID II) demand reproducible risk calculations. Non-deterministic floating-point accumulation creates audit failures. FDP provides bit-exact reproducibility.

6.5 Computational Geometry

Geometric predicates (orientation tests, in-circle tests) reduce to sign-of-determinant computations — which are dot products. A wrong sign means a wrong topological decision. FDP guarantees correct predicates without the complexity of adaptive precision arithmetic (Shewchuk’s approach).

7. Comparison with Alternative Approaches

Approach	Accuracy	Performance	Generality	Complexity
Kulisch/Quire (exact)	Exact (1 rounding)	~1.5× naive	Any number system	Low (library call)
Kahan compensated sum	~2× precision	~2× naive	Floating-point only	Moderate
Error-free transforms (Ogita-Rump-Oishi)	Faithful (~K rounds)	~4–10× naive	IEEE 754 only	High
Higher precision (double → quad)	~2× precision	~10–100× naive	Limited availability	Low
Arbitrary precision (MPFR)	Arbitrary	~100–1000× naive	Floating-point only	Moderate

The Kulisch accumulator is unique in providing exact results with near-native performance and minimal code complexity. It is the only approach that guarantees one rounding operation for the entire dot product.

8. Building and Running the Examples

8.1 Prerequisites

C++20-compliant compiler (GCC 11+, Clang 14+, MSVC 2022+)
CMake 3.22+
No external dependencies (header-only library)

8.2 Build Commands

git clone https://github.com/stillwater-sc/universal.git
cd universal
mkdir build && cd build

# Build quire tests and FDP examples
cmake -DUNIVERSAL_BUILD_NUMBER_STATICS=ON ..
make -j4

# Run the FDP tests
./static/quire/api/quire_cfloat_fdp
./static/quire/api/quire_fixpnt_fdp
./static/quire/api/quire_lns_fdp
./static/quire/api/quire_dbns_fdp

# Run the educational example
./education/quire/edu_quire_quires

# Run the application examples
./applications/reproducibility/blas/blas_l1_fdp     # FDP vs naive dot product
./applications/reproducibility/blas/blas_l2_fmv     # FDP matrix-vector multiply
./applications/reproducibility/blas/blas_l3_fmm     # FDP matrix-matrix multiply
./applications/reproducibility/blas/blas_hilbert     # Hilbert matrix verified solver

8.3 Using FDP in Your Project

find_package(UNIVERSAL CONFIG REQUIRED)
target_link_libraries(my_app universal::universal)

#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/cfloat/fdp.hpp>

using Real = sw::universal::cfloat<32, 8, uint32_t, true, false, false>;

std::vector<Real> x = { /* your data */ };
std::vector<Real> y = { /* your data */ };
Real result = sw::universal::fdp(x, y);  // exact dot product

9.1 Primary References

Kulisch, U. W. and Miranker, W. L. (1981). Computer Arithmetic in Theory and Practice. Academic Press. — The foundational monograph on complete arithmetic.
Kulisch, U. W. (2008). Computer Arithmetic and Validity: Theory, Implementation, and Applications. 2nd ed., de Gruyter Studies in Mathematics, vol. 33. — Comprehensive treatment of the theory and its applications to verified computing.
Kulisch, U. W. and Snyder, V. (2011). “The exact dot product as basic tool for long interval arithmetic.” Computing, 91(3), 307–313. — Specifies the 4288-bit accumulator for IEEE binary64.
Klatte, R., Kulisch, U., Wiethoff, A., Lawo, C., and Rauch, M. (1993). C-XSC: A C++ Class Library for Extended Scientific Computing. Springer-Verlag. — The C++ implementation of the XSC paradigm.
Klatte, R., Kulisch, U., Neaga, M., Ratz, D., and Ullrich, Ch. (1992). PASCAL-XSC: Language Reference with Examples. Springer-Verlag. — The first language with built-in exact dot product.

9.2 Hardware Implementations

Biancolin, D., Katerman, J., Lee, H., and Wawrzynek, J. (2005). “A Hardware Accelerator for Computing an Exact Dot Product.” Proceedings of the 17th IEEE Symposium on Computer Arithmetic (ARITH-17).
Uguen, Y. and de Dinechin, F. (2017). “Design-space exploration for the Kulisch accumulator.” HAL archives-ouvertes, hal-01488916v2.

9.3 Compensated Algorithms (Alternative Approaches)

Ogita, T., Rump, S. M., and Oishi, S. (2005). “Accurate sum and dot product.” SIAM J. Sci. Comput., 26(6), 1955–1988.
Rump, S. M. (2011). “Error estimation of floating-point summation and dot product.” BIT Numerical Mathematics, 52(2), 541–562.

9.4 Posit Arithmetic and the Quire

Gustafson, J. L. (2017). The End of Error: Unum Computing. CRC Press. — Introduces the quire as part of the posit number system.
Gustafson, J. L. and Yonemoto, I. T. (2017). “Beating Floating Point at its Own Game: Posit Arithmetic.” Supercomputing Frontiers and Innovations, 4(2), 71–86.

9.5 Verified Computing and Interval Arithmetic

Rump, S. M. (1999). “INTLAB - INTerval LABoratory.” Developments in Reliable Computing, Kluwer, 77–104.
Krämer, W., Bantle, A., and Grimmer, M. (2009). “Fast (Parallel) Dense Linear System Solvers in C-XSC Using Error Free Transformations and BLAS.” Springer LNCS, 5947, 230–249.

9.6 Classical Summation Error Analysis

Kahan, W. (1965). “Further remarks on reducing truncation errors.” Communications of the ACM, 8(1), 40.
Goldberg, D. (1991). “What every computer scientist should know about floating-point arithmetic.” ACM Computing Surveys, 23(1), 5–48.

10. Conclusion

The Kulisch super-accumulator is not a theoretical curiosity — it is a solved problem that the computing industry chose not to deploy. The arithmetic is exact, the hardware cost is negligible on modern silicon, and the software is available today as a header-only C++ library.

The Universal Numbers Library’s generalized quire makes this capability accessible to any computational scientist:

One header to include (<universal/number/TYPE/fdp.hpp>)
One function to call (fdp(x, y))
One rounding operation for the entire dot product
Any number system — native float/double, IEEE-configurable floats, posits, fixed-point, logarithmic, double-base

The forgotten arithmetic of Ulrich Kulisch deserves to be rediscovered. Every matrix multiply, every iterative solver, every gradient computation in every neural network is a collection of dot products. Making those dot products exact is not a luxury — it is the straightforward, elegant, and now easily accessible solution to the most fundamental problem in numerical computing.

Appendix A: Quire Size Reference Table

Number System	Configuration	Dynamic Range (bits)	Quire Width (bits)
IEEE half	`cfloat<16,5>`	94	124
IEEE single	`cfloat<32,8>`	554	584
IEEE double	`cfloat<64,11>`	4196	4226
bfloat16	`cfloat<16,8>`	530	560
Native float	`float`	554	584
Native double	`double`	4196	4226
posit<8,0>	8-bit posit	24	54
posit<16,1>	16-bit posit	112	142
posit<32,2>	32-bit posit	480	510
posit<64,3>	64-bit posit	1984	2014
fixpnt<16,8>	16-bit fixed	32	62
fixpnt<32,16>	32-bit fixed	64	94
lns<16,8>	16-bit log	256	286
lns<32,16>	32-bit log	65536	65566
dbns<16,8>	16-bit double-base	2048	2078

Appendix B: Repository Map

universal/
├── include/sw/universal/
│   ├── number/quire/
│   │   ├── quire.hpp              # Umbrella header
│   │   ├── quire_impl.hpp         # Generalized quire implementation
│   │   └── exceptions.hpp         # Quire-specific exceptions
│   ├── number/cfloat/fdp.hpp      # cfloat FDP (quire_mul, quire_resolve, fdp)
│   ├── number/posit/fdp.hpp       # posit FDP
│   ├── number/fixpnt/fdp.hpp      # fixpnt FDP
│   ├── number/lns/fdp.hpp         # lns FDP
│   ├── number/dbns/fdp.hpp        # dbns FDP
│   ├── number/integer/integer_impl.hpp  # integer normalize (quire input)
│   ├── native/fdp.hpp             # Native float/double FDP
│   └── traits/quire_traits.hpp    # Compile-time sizing for all types
├── static/quire/api/
│   ├── cfloat_fdp.cpp             # Comprehensive cfloat FDP tests
│   ├── fixpnt_fdp.cpp             # Fixed-point FDP tests
│   ├── lns_fdp.cpp                # LNS FDP tests
│   ├── dbns_fdp.cpp               # DBNS FDP tests
│   └── generalized_quire.cpp      # Cross-type quire demonstration
├── education/quire/
│   └── quires.cpp                 # Educational quire examples
├── applications/reproducibility/blas/
│   ├── l1_fdp.cpp                 # FDP vs naive dot product
│   ├── l2_fmv.cpp                 # FDP matrix-vector multiply
│   ├── l3_fmm.cpp                 # FDP matrix-matrix multiply
│   └── hilbert.cpp                # Hilbert matrix verified solver
└── include/sw/blas/
    ├── cg_fdp_solvers.hpp         # Conjugate gradient with FDP
    └── blas_l1.hpp                # BLAS Level 1 with FDP support

This document accompanies the Universal Numbers Library (github.com/stillwater-sc/universal), an open-source, header-only C++20 template library for custom arithmetic. Licensed under the MIT License.