ucalc Worked Examples
These examples demonstrate ucalc’s capabilities across precision analysis, type comparison, error forensics, quantization, and verification. Each example is self-contained and can be run directly in the ucalc REPL.
For command reference, see the main ucalc page. For step-by-step arithmetic visualization, see the dedicated guide. For AI agent integration, see the MCP server guide.
Example 1: Precision Near 1.0 — Where Posit Outshines IEEE
Posit’s tapered precision allocates more fraction bits near 1.0 than IEEE float does at the same bit width. As a result, posit32 has a smaller epsilon (7.45e-9 vs 1.19e-7) and can resolve smaller perturbations.

double> type float
Active type: float (float (IEEE-754 binary32))
float> precision
  type: float (IEEE-754 binary32)
  binary digits: 23
  decimal digits: 6.9
  epsilon: 1.1920929e-07
  minpos: 1.40129846e-45
  maxpos: 3.40282347e+38
float> type posit32
Active type: posit32 (posit< 32, 2, uint32_t>)
posit32> precision
  type: posit< 32, 2, uint32_t>
  binary digits: 27
  decimal digits: 8.1
  epsilon: 7.450580597e-09
  minpos: 7.523163845e-37
  maxpos: 1.329227996e+36

Example 2: How 0.1 Looks Across 42 Types
The decimal value 0.1 cannot be represented exactly in binary floating-point.
The compare command reveals how each type approximates it, grouped by
bit width (small <=32, medium 33-80, large >80):
double> compare 1/10
Type         Value               Binary
----------------------------------------------------------------------
float        0.100000001         0b0.01111011.10011001100110011001101
posit8       1.02e-01            0b0.01.00.101
posit16      1.0001e-01          0b0.01.00.10011001101
posit32      1.000000001e-01     0b0.01.00.100110011001100110011001101
bfloat16     0.1                 0x0.01111011.1001100
fp16         9.9976e-02          0b0.01011.1001100110
fp32         9.99999940e-02      0b0.01111011.10011001100110011001100
fp8e2m5      0.00e+00            0b0.00.00000
fp8e3m4      9.38e-02            0b0.000.0110
fp8e4m3      1.02e-01            0b0.0011.101
fp8e5m2      9.4e-02             0b0.01011.10
fixpnt16     0.10156250          0b00000000.00011010
fixpnt32     0.1000061035156250  0b0000000000000000.0001100110011010
lns8         0.10511             0b0.11100.11
int8         0                   0b00000000
int16        0                   0b0000000000000000
int32        0                   0b00000000000000000000000000000000
takum8       0.125               0b0.0.110.0.00
takum16      0.099976            0b0.0.101.11.100110011
takum32      0.09999999963       0b0.0.101.11.1001100110011001100110011
hfloat32     0.099999964         0b0.1000000.000110011001100110011001
decimal32    0.1000000           0b0.01001.011110.00000000000000000000
rational8    0.1                 0b0000'0001 / 0b0000'1010
rational16   0.1                 0b0000'0000'0000'0001 / 0b0000'0000'0000'1010
rational32   0.1                 0b0000'0000'0000'0000'0000'0000'0000'0001 / 0b0000'0000'0000'0000'0000'0000'0000'1010

Type         Value                      Binary
--------------------------------------------------------------------------------
double       0.10000000000000001        0b0.01111111011.1001100110011001100110011001100110011001100110011010
posit64      1.0000000000000000002e-01  0b0.01.00.10011001100110011001100110011001100110011001100110011001101
fp64         9.99999999999999917e-02    0b0.01111111011.1001100110011001100110011001100110011001100110011001
lns16        0.100112047230168338396    0b0.1111100.10101110
int64        0                          0b0000000000000000000000000000000000000000000000000000000000000000
takum64      0.10000000000000000555     0b0.0.101.11.100110011001100110011001100110011001100110011001101000000
dfixpnt8_4   0.1000                     0.0000000000000000.0001000000000000
dfixpnt16_8  0.10000000                 0.00000000000000000000000000000000.00010000000000000000000000000000
hfloat64     0.099999999999999992       0b0.1000000.00011001100110011001100110011001100110011001100110011001
decimal64    0.1000000000000000         0b0.01001.01111110.00000000000000000000000000000000000000000000000000

Type         Value / Binary
--------------------------------------------------------------------------------
fp128        9.99999999999999999999999999999999928e-02
             0b0.011111111111011.1001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001100110011001
lns32        0.09999987268604650092473917766255908645689487457275390625
             0b0.111111111111100.1010110110010110
dd           1.0000000000000000000000000000000e-01
             0b0.01111111011.1001100110011001100110011001100110011001100110011010|01100110011001100110011001100110011001100110011001101
dd_cascade   9.999999999999999999999999999999969e-02
             0b0.01111111011.1001100110011001100110011001100110011001100110011010|01100110011001100110011001100110011001100110011001101
td_cascade   1.0000000000000000000000000000000000000000000000002e-01
             0b0.01111111011.1001100110011001100110011001100110011001100110011010|00000000000000000000000000000000000000000000000000000|00000000000000000000000000000000000000000000000000000
qd           1.000000000000000000000000000000000000000000000000000000000000000e-01
             0b0.01111111011.1001100110011001100110011001100110011001100110011010|00000000000000000000000000000000000000000000000000000|00000000000000000000000000000000000000000000000000000|00000000000000000000000000000000000000000000000000000
qd_cascade   9.99999999999999999999999999999999999999999999999999999999999999991e-02
             0b0.01111111011.1001100110011001100110011001100110011001100110011010|00000000000000000000000000000000000000000000000000000|00000000000000000000000000000000000000000000000000000|00000000000000000000000000000000000000000000000000000

Notice that decimal32 and rational32 represent 0.1 exactly: decimal
uses base-10 encoding and rational stores the fraction 1/10 directly. Every
binary type introduces rounding error, but the magnitude differs by orders
of magnitude across types.
Now try:
double> compare 0.1

The results differ because the number systems now receive the double-precision
floating-point literal 0.1, which only approximates 1/10, and that approximation
propagates through every type.
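The difference between entering 1/10 and entering the literal 0.1 can be reproduced without ucalc. The sketch below is plain standard C++, not ucalc or Universal code, and the function names are made up for illustration. It measures the extra error from narrowing a double to binary32, and checks whether the widened double literal differs from a quotient computed directly in long double (which only says something on platforms where long double is wider than double).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Error added when a binary64 value is narrowed to binary32.
// The subtraction is exact because both operands fit in a double.
inline double narrowing_error(double d) {
    assert(std::isfinite(d));
    const float f = static_cast<float>(d);
    return static_cast<double>(f) - d;
}

// True when long double carries more significand bits than double.
constexpr bool kLongDoubleIsWider =
    std::numeric_limits<long double>::digits > std::numeric_limits<double>::digits;

// Compares "the double literal 0.1, widened" with "1/10 computed in the
// wider type". Where long double is wider, the two values differ because
// the literal was already rounded to 53 bits before widening.
inline bool literal_differs_from_quotient() {
    const long double widened  = static_cast<long double>(0.1);
    const long double quotient = 1.0L / 10.0L;
    return widened != quotient;
}
```

Calling narrowing_error(0.1) gives roughly 1.49e-9, which matches the float row of the compare table (0.100000001).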
Example 3: The Golden Ratio Identity — Measuring Arithmetic Fidelity
The golden ratio satisfies phi^2 - phi - 1 = 0. With native-precision arithmetic, each type reveals its true residual:

posit32> x = phi
1.618033990e+00
posit32> show x * x - x - 1
  value: 0.000000000e+00
  binary: 0b0.0000000000000000000000000000000..
  components: sign: +, regime: -31, exponent: 1, significand: 1
  type: posit< 32, 2, uint32_t>

Posit32 evaluates to exactly zero (lucky cancellation at this precision). IEEE single shows a residual of one ULP:

fp32> y = phi
1.61803401e+00
fp32> show y * y - y - 1
  value: 1.19209290e-07
  binary: 0b0.01101000.00000000000000000000000
  components: sign: +, scale: -23, significand: 1.000000000e+00
  type: fp32 (IEEE-754 binary32)

Double-double reveals a residual at the limit of its own precision (about -6.2e-33):

dd> z = phi
1.6180339887498948482045868343656e+00
dd> show z * z - z - 1
  value: -6.1629758220391547297791294162718e-33
  binary: 0b1.01110010100.000...0|000...0
  components: double-double: -6.16298e-33
  type: double-double

Posit32 is lucky in this expression: phi and phi^2 round in the same direction and yield values exactly 1.0 apart:

Active type: posit32 (posit< 32, 2, uint32_t>)
posit32> x = phi
1.618033990e+00
posit32> xsqr = phi * phi
2.618033990e+00
posit32> vars
  posit32 x = 1.618033990e+00
  posit32 xsqr = 2.618033990e+00

posit32> show x
  value: 1.618033990e+00
  color: 01000100111100011011101111001110
  components: sign: +, regime: 0, exponent: 1, significand: 1.61803399026393890381
  type: posit< 32, 2, uint32_t>
posit32> show xsqr
  value: 2.618033990e+00
  color: 01001010011110001101110111100111
  components: sign: +, regime: 0, exponent: 2, significand: 1.3090169951319694519
  type: posit< 32, 2, uint32_t>
posit32> show xsqr - x
  value: 1.000000000e+00
  color: 01000000000000000000000000000000
  components: sign: +, regime: 0, exponent: 1, significand: 1
  type: posit< 32, 2, uint32_t>

The Priest-based dd_cascade shows the same dynamic:

posit32> type dd_cascade
Active type: dd_cascade (double-double Priest)
dd_cascade> y = phi
1.618033988749894848204586834365637e+00
dd_cascade> ysqr = phi * phi
2.618033988749895023991527441632691e+00
dd_cascade> vars
  dd_cascade y = 1.618033988749894848204586834365637e+00
  dd_cascade ysqr = 2.618033988749895023991527441632691e+00

dd_cascade> show y
  value: 1.618033988749894848204586834365637e+00
  color: dd_cascade[ high: 1.61803, low: -5.43212e-17 ]
  components: double-double Priest: 1.61803
  type: double-double Priest
dd_cascade> show ysqr
  value: 2.618033988749895023991527441632691e+00
  color: dd_cascade[ high: 2.61803, low: 1.21466e-16 ]
  components: double-double Priest: 2.61803
  type: double-double Priest
dd_cascade> show ysqr - y
  value: 1.000000000000000000000000000000000e+00
  color: dd_cascade[ high: 1, low: 0 ]
  components: double-double Priest: 1
  type: double-double Priest

Example 4: Dynamic Range Comparison Across 16-bit Types
The range command reveals how different 16-bit types trade precision for
range:

fp16> range
fp16 (IEEE-754 binary16)
[ -6.5504e+04 ... -5.9605e-08 0 5.9605e-08 ... 6.5504e+04 ]

bfloat16> range
bfloat16
[ -3.4e+38 ... -1.2e-38 0 1.2e-38 ... 3.4e+38 ]

posit16> range
posit< 16, 2, uint16_t>
[ -7.2058e+16 ... -1.3878e-17 0 1.3878e-17 ... 7.2058e+16 ]

lns16> range
lns< 16, 8, uint16_t, Saturating>
[ -18396865112328554496 ... -5.436e-20 0 5.436e-20 ... 18396865112328554496 ]

All four are 16 bits, but their tradeoffs are dramatic:
| Type | Dynamic Range (decades) | Precision (digits) |
|---|---|---|
| fp16 | ~12 decades | 3.0 |
| bfloat16 | ~76 decades | 2.1 |
| posit16 | ~34 decades | 3.3 |
| lns16 | ~39 decades | ~2.4 |
bfloat16 matches float’s range but sacrifices precision. posit16 delivers more precision than fp16 and more range. lns16 reaches about 39 decades, more than fp16 or posit16, by encoding values as logarithms; of these four, only bfloat16 spans more.
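The decade counts in the table can be checked from the minpos and maxpos values printed by the range command. The helper below is a minimal sketch in standard C++; dynamic_range_decades is a hypothetical name, and taking log10(maxpos / minpos) as the definition of "decades" is an assumption about how the table was produced.

```cpp
#include <cassert>
#include <cmath>

// Dynamic range in decades, taken here as log10(maxpos / minpos).
// The logs are subtracted rather than dividing first, so the ratio
// cannot overflow for very wide formats.
inline double dynamic_range_decades(double minpos, double maxpos) {
    assert(minpos > 0.0 && maxpos >= minpos);
    return std::log10(maxpos) - std::log10(minpos);
}
```

With the printed limits, dynamic_range_decades(5.9605e-08, 6.5504e+04) is about 12.0 for fp16 and dynamic_range_decades(1.3878e-17, 7.2058e+16) is about 33.7 for posit16.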
Example 5: Precision Ladder — From 8-bit to 32-bit
The precision command measures each type’s effective precision at 1.0.
This reveals how bit-width translates to decimal accuracy:

fp8e4m3> precision
  type: fp8e4m3 (OFP 8-bit e4m3)
  binary digits: 3
  decimal digits: 0.9
  epsilon: 1.25e-01
  minpos: 1.95e-03
  maxpos: 4.16e+02

posit16> precision
  type: posit< 16, 2, uint16_t>
  binary digits: 11
  decimal digits: 3.3
  epsilon: 4.8828e-04
  minpos: 1.3878e-17
  maxpos: 7.2058e+16

fp32> precision
  type: fp32 (IEEE-754 binary32)
  binary digits: 23
  decimal digits: 6.9
  epsilon: 1.19209290e-07
  minpos: 1.40129846e-45
  maxpos: 3.40282347e+38

posit32> precision
  type: posit< 32, 2, uint32_t>
  binary digits: 27
  decimal digits: 8.1
  epsilon: 7.450580597e-09
  minpos: 7.523163845e-37
  maxpos: 1.329227996e+36

At 32 bits, posit delivers 27 binary digits near 1.0 compared to IEEE float’s 23, a 16x smaller epsilon. This means posit32 can resolve perturbations near 1.0 that fp32 cannot (see Example 1).
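The relationship between the "binary digits", "decimal digits", and "epsilon" rows can be written down directly. This is a small sketch in standard C++ with hypothetical helper names; it assumes that "binary digits" counts fraction bits at 1.0, which is consistent with the values printed above.

```cpp
#include <cassert>
#include <cmath>

// Epsilon for a format with the given number of fraction bits at 1.0:
// the gap between 1.0 and the next representable value, 2^-fraction_bits.
inline double epsilon_from_bits(int fraction_bits) {
    assert(fraction_bits >= 0);
    return std::ldexp(1.0, -fraction_bits);  // exact power of two
}

// Equivalent number of decimal digits: fraction_bits * log10(2).
inline double decimal_digits(int fraction_bits) {
    assert(fraction_bits >= 0);
    return fraction_bits * std::log10(2.0);
}
```

For 23 and 27 bits this reproduces the fp32 and posit32 rows: the two epsilons differ by a factor of 2^4 = 16, and the decimal digits come out near 6.9 and 8.1.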
Example 6: Catastrophic Cancellation
Subtracting nearly equal quantities destroys significant digits. The
expression (1 + 1e-8) - 1 should yield 1e-8 but exercises catastrophic
cancellation:

fp32> show (1 + 1e-8) - 1
  value: 0.00000000e+00
  binary: 0b0.00000000.00000000000000000000000
  components: sign: +, zero
  type: fp32 (IEEE-754 binary32)

IEEE single loses the 1e-8 term entirely because it is below the ULP at 1.0 (which is ~1.2e-7). posit32 preserves part of the term:

posit32> show (1 + 1e-8) - 1
  value: 7.450580597e-09
  binary: 0b0.00000001.01.000000000000000000000
  components: sign: +, regime: -7, exponent: 2, significand: 1
  type: posit< 32, 2, uint32_t>

The posit result (7.45e-9) is one posit32 ULP at 1.0: the sum 1 + 1e-8 rounds to the representable value just above 1.0, so the perturbation survives but is quantized to epsilon, about 25% below 1e-8. Double-double recovers nearly full accuracy:

dd> show (1 + 1e-8) - 1
  value: 1.0000000000000000209225608301285e-08
  binary: 0b0.01111100100.0101011110011000111011100010001100001000110000111010|000...0
  components: double-double: 1e-08
  type: double-double

This pattern matters in numerical methods, where loss of significance in intermediate results cascades into large final errors.
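The fp32 behaviour is easy to reproduce with native types. The template below is a sketch in standard C++ (perturbation_roundtrip is a hypothetical name); it only covers IEEE float and double, since posit and double-double are not built-in types.

```cpp
#include <cassert>

// Computes (1 + delta) - 1 in the precision of Real. The volatile
// store forces the intermediate sum to be rounded to Real, so the
// result shows what survives of delta after the addition.
template <typename Real>
Real perturbation_roundtrip(Real delta) {
    assert(delta > Real(0));
    const Real one = Real(1);
    volatile Real sum = one + delta;
    return sum - one;
}
```

With delta = 1e-8, the float instantiation returns exactly zero, while the double instantiation returns a value within about 1e-16 of 1e-8.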
Example 7: Faithful Rounding Verification
A result is faithfully rounded if it equals one of the two representable
values adjacent to the exact answer. The faithful command checks this
against a quad-double reference:

posit32> faithful sqrt(2)
  result: 1.414213561e+00
  reference: 1.414213562373095048801688724209698...e+00
  rounded: 1.414213561e+00
  neighbor: 1.414213568e+00
  faithful: yes

fp32> faithful sqrt(2)
  result: 1.41421354e+00
  reference: 1.414213562373095048801688724209698...e+00
  rounded: 1.41421354e+00
  neighbor: 1.41421366e+00
  faithful: yes

Both posit32 and fp32 produce faithfully rounded sqrt(2). The rounded
value is the nearest representable, and neighbor is the next representable
in the opposite direction. The result must equal one of them.
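The same check can be sketched for binary32 in standard C++. This is not how ucalc implements the faithful command: is_faithful is a hypothetical helper, and a double stands in for the quad-double reference, which is an assumption that only holds while the double is much more accurate than the float under test.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true if `result` equals one of the two binary32 values that
// bracket `reference`. If the reference is itself representable, only
// that exact value counts as faithful.
inline bool is_faithful(float result, double reference) {
    assert(std::isfinite(reference));
    assert(std::fabs(reference) <= std::numeric_limits<float>::max());
    const float  nearest   = static_cast<float>(reference);  // round to nearest
    const double nearest_d = static_cast<double>(nearest);
    if (nearest_d == reference) return result == nearest;
    // The other bracket lies on the far side of the reference.
    const float direction = (nearest_d < reference)
        ? std::numeric_limits<float>::infinity()
        : -std::numeric_limits<float>::infinity();
    const float neighbor = std::nextafter(nearest, direction);
    return result == nearest || result == neighbor;
}
```

For sqrt(2), both 1.41421354f (the rounded value) and 1.41421366f (the neighbor) pass, and any other float fails.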
Example 8: Transcendental Error Profiles with Sweep
The sweep command evaluates an expression across a range and reports
ULP error vs a double-precision reference. This reveals where type-specific
approximation errors concentrate:

posit32> sweep sin(x) for x in [0, 3.14159, 6]
x          result            double ref             ULP error
-------------------------------------------------------------------------------------
0          0.000000000e+00   0                      0.00
0.628318   5.877848230e-01   0.58778482293254253    0.04
1.256636   9.510561898e-01   0.95105618829288086    0.44
1.884954   9.510570094e-01   0.95105700829655349    0.31
2.513272   5.877869688e-01   0.58778696973054001    0.85
3.14159    2.654390983e-06   2.65358979335273e-06   20261.95

The error is sub-ULP through most of the range but explodes near pi, where argument reduction subtracts nearly equal quantities. This is a fundamental limitation shared by all binary types: the result near sin(pi) depends on how many digits of pi the type can represent.
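Part of this sensitivity comes from the argument alone, before any sine routine runs: near pi the slope of sin is about -1, so an absolute shift in x shows up unchanged in a result that is itself only about 2.65e-6. The sketch below illustrates this with binary32 as the narrow type; it is standard C++ with hypothetical function names and says nothing about how ucalc or posit32 evaluate sin.

```cpp
#include <cassert>
#include <cmath>

// How far x moves when it is rounded to binary32.
inline double argument_shift(double x) {
    assert(std::isfinite(x));
    return static_cast<double>(static_cast<float>(x)) - x;
}

// Relative change in sin(x) caused only by rounding the argument to
// binary32; the sine itself is evaluated in double both times.
inline double relative_sin_change(double x) {
    const double exact   = std::sin(x);
    const double shifted = std::sin(static_cast<double>(static_cast<float>(x)));
    assert(exact != 0.0);
    return std::fabs(shifted - exact) / std::fabs(exact);
}
```

At x = 3.14159 the argument moves by roughly 1e-7, which is a few percent of the result, whereas at x = 0.628318 the same kind of shift changes the result by less than one part in ten million.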
Example 9: Exact Decimal Arithmetic for Financial Calculations
Binary floating-point cannot represent 0.1, 0.01, or most decimal
fractions exactly. In financial software this causes accumulation errors
that violate accounting identities. Decimal fixed-point (dfixpnt)
uses BCD encoding and carries every decimal digit without rounding:

double> show 0.1 + 0.2 - 0.3
  value: 5.5511151231257827e-17
  binary: 0b0.01111001001.0000000000000000000000000000000000000000000000000000
  components: sign: +, scale: -54, significand: 1
  type: double (IEEE-754 binary64)

double> type dfixpnt16_8
Active type: dfixpnt16_8 (dfixpnt< 16, 8, BCD, Modulo, uint8_t>)
dfixpnt16_8> show 0.1 + 0.2 - 0.3
  value: 0.00000000
  binary: 0.00000000000000000000000000000000.00000000000000000000000000000000
  components: dfixpnt< 16, 8, BCD, Modulo, uint8_t>: 0
  type: dfixpnt< 16, 8, BCD, Modulo, uint8_t>

Double produces a non-zero residual (~5.55e-17) because 0.1 and 0.2
are rounded on entry. dfixpnt16_8 yields exactly zero.

This matters when totals must balance to the penny. Consider an invoice with three items at $19.99, two at $5.99, and one at $1.50, plus 7.25% sales tax:

dfixpnt16_8> tax = 0.0725
0.07250000
dfixpnt16_8> subtotal = 19.99 * 3 + 5.99 * 2 + 1.50
73.45000000
dfixpnt16_8> show subtotal + subtotal * tax
  value: 78.77512500
  binary: 0.00000000000000000000000001111000.01110111010100010010010100000000
  components: dfixpnt< 16, 8, BCD, Modulo, uint8_t>: 78.7751
  type: dfixpnt< 16, 8, BCD, Modulo, uint8_t>

The subtotal is exactly $73.45, tax is exactly $5.325125, and the grand total is exactly $78.775125. The same calculation in double:

double> tax = 0.0725
0.072499999999999995
double> subtotal = 19.99 * 3 + 5.99 * 2 + 1.50
73.450000000000003
double> show subtotal + subtotal * tax
  value: 78.775125000000003
  binary: 0b0.10000000101.0011101100011001101110100101111000110101001111111000
  components: sign: +, scale: 6, significand: 1.230861328125
  type: double (IEEE-754 binary64)

Double’s subtotal is already 73.450000000000003, off by 3e-15. These errors are invisible in a single calculation but accumulate across thousands of line items in a ledger and can eventually cause reconciliation failures. Decimal fixed-point eliminates this class of error entirely.
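The same invoice can be reproduced exactly with scaled integers. The sketch below uses that substitute technique in standard C++; it is not BCD and not how dfixpnt works internally, and the names and the micro-dollar scale are choices made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// 1 cent = 10^4 micro-dollars (1 micro-dollar = 10^-6 dollar).
constexpr std::int64_t kMicroPerCent = 10000;

// Grand total in micro-dollars. The tax rate is passed as an integer
// number of 10^-4 units, so 7.25% is 725. Cents (10^-2) times 10^-4
// gives 10^-6 dollars, so the product needs no rounding.
inline std::int64_t total_micro_dollars(std::int64_t subtotal_cents,
                                        std::int64_t tax_rate_e4) {
    assert(subtotal_cents >= 0 && tax_rate_e4 >= 0);
    const std::int64_t tax_micro = subtotal_cents * tax_rate_e4;
    return subtotal_cents * kMicroPerCent + tax_micro;
}

// The binary64 residual of 0.1 + 0.2 - 0.3; volatile keeps the
// compiler from simplifying the expression away.
inline double binary_residual() {
    volatile double a = 0.1, b = 0.2, c = 0.3;
    return a + b - c;
}
```

Passing 3 * 1999 + 2 * 599 + 150 cents and a rate of 725 yields 78775125 micro-dollars, that is $78.775125, the same grand total as the dfixpnt16_8 session.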
Example 10: Takum’s Uniform Precision Across the Dynamic Range
Posit arithmetic concentrates precision near 1.0 by using a variable-length regime field: values close to 1.0 get many fraction bits, but extreme values consume most bits on the regime, leaving few for the significand. Takum (Hunhold, 2024) replaces the variable-length regime with a bounded characteristic field, giving a more uniform precision distribution and a dramatically wider dynamic range.
At 32 bits, both types deliver identical precision near 1.0:

takum32> precision
  type: takum< 32, 3, uint32_t>
  binary digits: 27
  decimal digits: 8.1
  epsilon: 7.450580597e-09
  minpos: 1.727235358e-77
  maxpos: 5.789601701e+76

posit32> precision
  type: posit< 32, 2, uint32_t>
  binary digits: 27
  decimal digits: 8.1
  epsilon: 7.450580597e-09
  minpos: 7.523163845e-37
  maxpos: 1.329227996e+36

Same epsilon, same 27 binary digits at 1.0. But takum32 spans 10^77 while posit32 reaches only 10^36, so takum32 covers over twice the dynamic range in decades.
The difference becomes dramatic away from 1.0. Compare the ULP at increasing scales:
| Scale | takum32 ULP | posit32 ULP | Relative ULP (takum) | Relative ULP (posit) |
|---|---|---|---|---|
| 1 | 7.45e-9 | 7.45e-9 | 7.45e-9 | 7.45e-9 |
| 1e5 | 5.96e-3 | 5.96e-3 | 5.96e-8 | 5.96e-8 |
| 1e10 | 1,192 | 9,537 | 1.19e-7 | 9.54e-7 |
| 1e15 | 1.19e8 | 1.53e10 | 1.19e-7 | 1.53e-5 |
| 1e20 | 2.38e13 | 2.44e16 | 2.38e-7 | 2.44e-4 |
| 1e30 | 2.38e23 | 6.44e28 | 2.38e-7 | 6.44e-2 |
Takum’s relative ULP never exceeds about 2.4e-7 across 30 decades of scale. Posit’s degrades from 7.45e-9 at 1.0 to 0.064 at 1e30, a factor of 8.6 million. At 1e30, posit32 has barely one significant digit left.
This is visible in the representations themselves:
takum32> show 1e20
  value: 1.00000002e+20
  binary: 0b0.1.110.000011.010110101111000111011
  components: ... Characteristic : 66 Scale : 66
  type: takum< 32, 3, uint32_t>

posit32> show 1e20
  value: 1.000159405e+20
  binary: 0b0.111111111111111110.10.01011011000
  components: sign: +, regime: 16, exponent: 4, significand: 1.35546875
  type: posit< 32, 2, uint32_t>

Takum32 represents 1e20 to 8 significant digits (1.00000002e+20). Posit32 manages only 4 (1.000159405e+20, off by 1.6e16). The posit’s regime field has expanded to 18 bits, leaving 2 exponent bits and only 11 fraction bits. Takum’s characteristic field stays bounded, preserving fraction bits at every scale.
A sweep of sqrt(x) across a wide range confirms the pattern:
takum16> sweep sqrt(x) for x in [0.001, 1e12, 8]
x               result        double ref             ULP error
-------------------------------------------------------------------------------------
0.001           0.031616      0.031622776601683791   0.21
1.4285714e+11   3.7888e+05    377964.47300922842     0.31
4.2857143e+11   6.5536e+05    654653.67070797761     0.14
8.5714286e+11   9.257e+05     925820.09977255156     0.03
1e+12           9.9942e+05    1000000.0000000001     0.29

posit16> sweep sqrt(x) for x in [0.001, 1e12, 8]
x               result        double ref             ULP error
-------------------------------------------------------------------------------------
0.001           3.1616e-02    0.031622776601683791   0.43
1.4285714e+11   3.7069e+05    377964.47300922842     2.46
4.2857143e+11   6.4307e+05    654653.67070797761     2.26
8.5714286e+11   9.0931e+05    925820.09977255156     4.56
1e+12           9.7894e+05    1000000.0000000001     10.78

At 1e12, posit16’s sqrt has 10.78 ULP error vs takum16’s 0.29, a 37x improvement. Takum maintains sub-ULP accuracy across the sampled range because it does not run out of fraction bits.
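The posit side of this comparison follows from the encoding rules: the regime is a run of identical bits plus one terminating bit, and whatever is left after the sign, regime, and exponent fields holds the fraction. The sketch below computes the field widths for a standard posit at a given binary scale. It is standard C++ with hypothetical names, not code from the Universal library, and it does not model takum.

```cpp
#include <algorithm>
#include <cassert>

struct PositFields {
    int regime_bits;    // run of identical bits plus the terminator
    int fraction_bits;  // bits left over for the fraction
};

// Field widths of a posit<nbits, es> holding a value in
// [2^scale, 2^(scale+1)). The regime value k is floor(scale / 2^es).
inline PositFields posit_fields(int nbits, int es, int scale) {
    assert(nbits > 2 && es >= 0);
    const int step = 1 << es;  // scale covered by one regime increment
    const int k = (scale >= 0) ? scale / step
                               : -((-scale + step - 1) / step);  // floor division
    const int run = (k >= 0) ? k + 1 : -k;
    const int regime = std::min(run + 1, nbits - 1);
    const int fraction = std::max(0, nbits - 1 - regime - es);
    return {regime, fraction};
}
```

For posit<32, 2> this gives 27 fraction bits at scale 0 (values near 1.0) and an 18-bit regime with 11 fraction bits at scale 66, matching the precision output and the binary layout shown for 1e20.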
Example 11: Tracing Error Propagation
The trace command shows each arithmetic operation with its ULP error and
rounding direction, using quad-double as the reference.

float> trace 1/3 + 1/3 + 1/3
  step 1: 1 / 3
    result: 0.333333343
    reference: 3.333333333...e-01
    ROUNDED UP 0.50 ULP
  step 3: 0.333333343 + 0.333333343 = 0.666666687 (exact)
  step 5: 0.666666687 + 0.333333343
    result: 1
    reference: 1.0000000298...
    ROUNDED DOWN 0.25 ULP
  result: 1
  reference precision: quad-double

Example 12: Rounding Audit with Cumulative Drift
The audit command tracks signed ULP error and detects ties-to-even rounding.

float> audit 1/3 + 1/3 + 1/3
  step 1: TIES-TO-EVEN ulp: +0.50 cumulative: +0.50
  step 2: TIES-TO-EVEN ulp: +0.50 cumulative: +1.00
  step 3: exact
  step 4: TIES-TO-EVEN ulp: +0.50 cumulative: +1.50
  step 5: ROUNDED DOWN ulp: -0.25 cumulative: +1.25
  --------
  rounding events: 4 of 5 operations
  max |ulp| error: 0.50
  cumulative drift: +1.25 ULPs

Example 13: Quantization Quality for ML Weights
Compare quantization quality across formats using QSNR (dB):

for fmt in fp8e4m3 fp8e5m2 bfloat16 posit8 fp16; do
  echo -n "$fmt: "; ucalc --quiet "quantize $fmt -f weights.csv"
done
fp8e4m3: 0.0131171 31.6dB 10000
fp8e5m2: 0.0267517 25.4dB 10000
bfloat16: 0.00165137 49.6dB 10000
posit8: 0.0131249 31.6dB 10000
fp16: 0.000103695 73.7dB 10000

Example 14: Precision Heatmap

posit16> heatmap
  magnitude  sig_bits  bar
  1e-12      2.0       ######
  1e-8       6.0       ####################
  1e-4       9.0       ##############################
  1e+0       11.0      ####################################
  1e+4       9.0       ##############################
  1e+8       6.0       ####################
  1e+12      3.0       ##########

  tapered precision: peaks near 1, falls off at extremes

Example 15: Finding Type Divergence Points

ucalc> diverge sin(x) posit32 float 1ulp for x in [0, 6.28]
  first divergence at x = 0.003198...
  posit32 3.198280232e-03
  float 0.00319828046
  abs diff: 2.3283064e-10
  ulp diff: 4.885 ULPs

Example 16: Rewrite Suggestions with Verification
The suggest command identifies numerically unstable patterns in expressions
and proposes stable alternatives, verified with actual error comparison.
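As a standalone illustration of why such a rewrite helps, the sketch below evaluates sqrt(a) - sqrt(b) both directly and in the algebraically equivalent form (a - b) / (sqrt(a) + sqrt(b)) using binary32. It is standard C++ with hypothetical function names and is independent of how ucalc performs its own verification.

```cpp
#include <cassert>
#include <cmath>

// Direct form: subtracts two nearly equal square roots, so most of
// the significant digits cancel.
inline float sqrt_diff_naive(float a, float b) {
    assert(a >= 0.0f && b >= 0.0f);
    const float ra = std::sqrt(a);
    const float rb = std::sqrt(b);
    return ra - rb;
}

// Rewritten form: the subtraction a - b is exact for nearby exact
// inputs, and the remaining operations are well conditioned.
inline float sqrt_diff_stable(float a, float b) {
    assert(a >= 0.0f && b >= 0.0f && (a > 0.0f || b > 0.0f));
    const float sum = std::sqrt(a) + std::sqrt(b);
    return (a - b) / sum;
}

// Relative error of a binary32 result against a double reference.
inline double relative_error(float approx, double exact) {
    assert(exact != 0.0);
    return std::fabs((static_cast<double>(approx) - exact) / exact);
}
```

For a = 1000001 and b = 1000000, with a double-precision difference as the reference, the direct form is off by a few percent while the rewritten form is accurate to roughly single-precision rounding error.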
float> a = 1000001; b = 1000000
float> suggest sqrt(a) - sqrt(b)
  pattern: Square root difference (sqrt_diff)
  matched: (sqrt(a) - sqrt(b))
  alternative: ((a - b) / (sqrt(a) + sqrt(b)))
  condition: a, b exact inputs, a ~= b > 0
  original: 0.00048828125 (rel error: 2.3437e-02)
  rewritten: 0.000499999849 (rel error: 5.1749e-08)
  VERIFIED: 452905.6x better

The rewrites command lists all 7 available patterns:

ucalc> rewrites
  1. Square root difference (sqrt_diff) sqrt(a) - sqrt(b) -> (a - b) / (sqrt(a) + sqrt(b))
  2. Quadratic formula (unstable root) (-b + sqrt(b^2 - 4*a*c)) / (2*a) -> 2*c / (-b - sqrt(b^2 - 4*a*c))
  3. Logarithm near 1: log(1 + x) -> log1p(x)
  4. Exponential minus 1: exp(x) - 1 -> expm1(x)
  5. One minus cosine: 1 - cos(x) -> 2 * sin(x/2)^2
  6. Sine difference: sin(a) - sin(b) -> product-to-sum
  7. Cosine deviation ratio

Example 17: Expression Tree with Provenance
The ast command shows the expression tree structure with provenance tags
indicating which values are exact inputs vs computed intermediates.

ucalc> ast (-b + sqrt(b^2 - 4*a*c)) / (2*a)
`-- op:/ [computed]
    |-- op:+ [computed]
    |   |-- unary:negate [computed]
    |   |   `-- var:b [exact]
    |   `-- fn:sqrt [computed]
    |       `-- op:- [computed]
    |           |-- op:^ [computed]
    |           |   |-- var:b [exact]
    |           |   `-- 2 [exact]
    |           `-- op:* [computed]
    |               |-- op:* [computed]
    |               |   |-- 4 [exact]
    |               |   `-- var:a [exact]
    |               `-- var:c [exact]
    `-- op:* [computed]
        |-- 2 [exact]
        `-- var:a [exact]

Example 18: Oracle — Canonical Type Results
The oracle command gives the authoritative result for any expression in any
type, with rounding verification against a quad-double reference.

ucalc> oracle posit32 sin(0.1)
  type: posit< 32, 2, uint32_t>
  expression: sin(0.1)
  value: 9.983341675e-02
  binary: 0b0.01.00.100110001110101011101100110
  reference: 9.983341664682815...e-02
  rounding: correctly rounded (nearest)

ucalc> oracle decimal32 0.1 + 0.2
  type: dfloat< 7, 6, BID, uint32_t>
  value: 0.3
  reference: 3.000000000...e-01
  rounding: correctly rounded (nearest)

Example 19: Stochastic Rounding Simulation
The stochastic command simulates stochastic rounding over N trials to
understand rounding bias.

bfloat16> stochastic 0.1 + 0.2 10000
  unique results: 2
  0.2988 4024 (40.2%)
  0.3008 5976 (59.8%)
  mean: 0.2999953125
  exact: 3.000000000...e-01
  bias: -4.6875e-06

The near-zero bias is consistent with stochastic rounding being unbiased: the mean converges to the exact value over many trials.
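The rule behind those percentages is that a value between two representable neighbors rounds up with probability proportional to its distance from the lower one. The sketch below applies that rule to the two bfloat16 neighbors of 0.3 (0.298828125 and 0.30078125, printed above as 0.2988 and 0.3008). It is standard C++ with hypothetical names, rounds the exact value 0.3 once rather than replaying the 0.1 + 0.2 addition, and does not reflect ucalc's internal random source.

```cpp
#include <cassert>
#include <random>

// Rounds x onto {lo, hi}: up with probability (x - lo) / (hi - lo),
// down otherwise, so the expected value of the result equals x.
template <typename Rng>
double stochastic_round(double x, double lo, double hi, Rng& rng) {
    assert(lo < hi && lo <= x && x <= hi);
    const double p_up = (x - lo) / (hi - lo);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return uniform(rng) < p_up ? hi : lo;
}

// Mean of repeated stochastic roundings, using a seeded generator so
// runs are repeatable.
inline double mean_of_trials(double x, double lo, double hi,
                             int trials, unsigned seed) {
    assert(trials > 0);
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (int i = 0; i < trials; ++i) {
        sum += stochastic_round(x, lo, hi, rng);
    }
    return sum / trials;
}
```

For x = 0.3 the round-up probability is 0.6, in line with the 59.8% observed above, and the mean over 10000 trials lands within a small fraction of the grid spacing of 0.3.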
Example 20: Error Distribution Analysis
The errordist command evaluates a function at many points and histograms
the ULP error distribution.

posit32> errordist sin(x) for x in [0, 6.28, 1000]
  ulp_error   count  bar
  0           1
  (0, 0.5]    391    ########################################
  (0.5, 1]    189    ###################
  (1, 2]      158    ################
  (2, 4]      118    ############
  (4, 8]      68     ######
  (8, +)      75     #######

  max ULP: 176.29 at x = 3.1431431
  mean ULP: 2.97
  faithful: 58.1% (581/1000)

Example 21: Condition Number Estimation
The cond command estimates the condition number of small matrices to
predict precision loss.
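For a 2x2 matrix the 1-norm condition number can be computed in closed form, and its base-10 logarithm estimates the decimal digits lost. The sketch below does this in double precision. It is standard C++ with hypothetical names, not ucalc's implementation, so its result for a given matrix can differ slightly from what the cond command reports when the active type is float.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Row-major 2x2 matrix [[a, b], [c, d]].
struct Mat2 { double a, b, c, d; };

// Matrix 1-norm: the largest absolute column sum.
inline double norm1(const Mat2& m) {
    return std::max(std::fabs(m.a) + std::fabs(m.c),
                    std::fabs(m.b) + std::fabs(m.d));
}

// Condition number in the 1-norm: ||A|| * ||A^-1||, using the
// explicit inverse of a 2x2 matrix.
inline double cond1(const Mat2& m) {
    const double det = m.a * m.d - m.b * m.c;
    assert(det != 0.0);
    const Mat2 inv{ m.d / det, -m.b / det, -m.c / det, m.a / det };
    return norm1(m) * norm1(inv);
}

// Rough count of decimal digits lost when solving with this matrix.
inline double digits_lost(const Mat2& m) {
    return std::log10(cond1(m));
}
```

For [[1, 2], [1.0001, 2]] this gives a condition number near 60002 and about 4.8 digits lost.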
float> cond [[1, 2], [1.0001, 2]]
  condition (1-norm): 59992.04
  determinant: -0.00020003319
  WARNING: ill-conditioned
  type precision: ~6.9 decimal digits
  digits lost: ~4.8
  effective precision: ~2.1 decimal digits

Example 22: Test Vector Generation
The testvec command generates golden reference vectors for regression tests,
directly pasteable into C++ test code.

ucalc> testvec posit16 sin [0, 3.14159, 5]
// Golden reference vectors for sin(x) in posit< 16, 2, uint16_t>
// Generated by ucalc testvec
struct TestVector { double input; double expected; };
constexpr TestVector sin_posit16[] = {
    { 0, 0 }, // 0.0000e+00
    { 0.785397, 0.707031 }, // 7.0703e-01
    { 1.57079, 1 }, // 1.0000e+00
    { 2.35619, 0.707031 }, // 7.0703e-01
    { 3.14159, -8.8811e-06 } // -8.8811e-06
};