Skip to content

Mixed-Precision Algorithm Design Utilities

This document describes the utilities in include/sw/universal/utility/ that support mixed-precision algorithm design and optimization. These tools help you analyze value ranges, estimate energy consumption, explore accuracy/energy trade-offs, and generate optimized precision configurations.

  1. Overview
  2. Core Utilities
  3. Workflow Guide
  4. Complete Example

Mixed-precision computing uses different numerical precisions for different parts of an algorithm to optimize:

  • Energy efficiency: Lower-precision operations consume less power
  • Memory bandwidth: Smaller data types reduce memory traffic
  • Performance: More values fit in cache; SIMD lanes can process more elements

The challenge is selecting the right precision for each variable without sacrificing accuracy.

The Universal library provides a complete SDK for mixed-precision algorithm design:

┌─────────────────────────────────────────────────────────────────┐
│ Mixed-Precision Workflow │
├─────────────────────────────────────────────────────────────────┤
│ 1. Profile → 2. Analyze → 3. Explore → 4. Generate │
│ ─────────── ────────── ────────── ─────────── │
│ occurrence range_analyzer pareto_explorer config_gen │
│ scale_tracker type_advisor algorithm_prof │
│ instrumented memory_profiler │
└─────────────────────────────────────────────────────────────────┘

Purpose: Count arithmetic operations (add, sub, mul, div, sqrt) during algorithm execution to estimate compute costs.

What it does: A simple struct that tracks operation counts for a specific number type. Used as a building block for energy estimation.

Header: #include <universal/utility/occurrence.hpp>

#include <universal/utility/occurrence.hpp>
using namespace sw::universal;
// Create occurrence tracker
occurrence<float> ops;
// Update counts (typically done by instrumented types)
ops.add = 1000;
ops.mul = 2000;
ops.div = 100;
// Report counts
ops.report(std::cout);

Key fields:

  • load, store: Memory operations
  • add, sub, mul, div, rem: Arithmetic operations
  • sqrt: Square root operations

Use case: Foundation for energy estimation when combined with energy cost models.


scale_tracker.hpp - Exponent Distribution Tracking

Section titled “scale_tracker.hpp - Exponent Distribution Tracking”

Purpose: Track the distribution of exponents (scales) in computed values to understand dynamic range requirements.

What it does: Maintains a histogram of exponent values, counting how many values fall into each exponent bucket. Also tracks underflows and overflows.

Header: #include <universal/utility/scale_tracker.hpp>

#include <universal/utility/scale_tracker.hpp>
using namespace sw::universal;
// Track scales from 2^-10 to 2^10
scaleTracker tracker(-10, 10);
// During computation, record scales
for (double value : computed_values) {
int scale = ilogb(value); // Extract exponent
tracker.incr(scale);
}
// Report distribution
tracker.report(std::cout);
// Output shows histogram + underflow/overflow counts

Key methods:

  • incr(scale): Increment count for given exponent
  • clear(): Reset counts
  • report(ostream): Print histogram

Use case: Determine the required exponent range for your algorithm. If most values cluster in a narrow range, you can use a type with fewer exponent bits.


Purpose: Comprehensive analysis of value distributions to recommend appropriate precision.

What it does: Observes values during computation and collects statistics including min/max values, scale range, subnormal counts, and special values (inf, nan). Generates precision recommendations.

Header: #include <universal/utility/range_analyzer.hpp>

#include <universal/utility/range_analyzer.hpp>
using namespace sw::universal;
// Create analyzer for your computation type
range_analyzer<double> analyzer;
// Observe values during computation
for (const auto& result : computation_results) {
analyzer.observe(result);
}
// Get comprehensive report
analyzer.report(std::cout);
// Get precision recommendation
PrecisionRecommendation rec = analyzer.recommendPrecision();
std::cout << "Suggested type: " << rec.type_suggestion << "\n";
std::cout << "Min exponent bits: " << rec.min_exponent_bits << "\n";
std::cout << "Min fraction bits: " << rec.min_fraction_bits << "\n";

Key methods:

  • observe(value): Record a single value
  • observe(begin, end): Record a range of values
  • statistics(): Get RangeStatistics struct
  • minScale(), maxScale(): Get exponent range
  • dynamicRangeUtilization(): Fraction of type’s range actually used
  • recommendPrecision(): Get PrecisionRecommendation

Output includes:

  • Value classification (zeros, normals, denormals, inf, nan)
  • Signed value distribution
  • Value range (min, max, min absolute, max absolute)
  • Scale span (number of decades)
  • Type recommendation with rationale

Use case: Profile your algorithm’s intermediate values to understand what precision is actually needed vs. what you’re using.


Purpose: Recommend optimal Universal number types based on range analysis and accuracy requirements.

What it does: Takes range analysis results and accuracy requirements, then scores and ranks all available Universal types (cfloat, posit, fixpnt, lns) to recommend the best match.

Header: #include <universal/utility/type_advisor.hpp>

#include <universal/utility/type_advisor.hpp>
using namespace sw::universal;
// First analyze your data
range_analyzer<double> analyzer;
for (auto v : data) analyzer.observe(v);
// Create type advisor
TypeAdvisor advisor;
// Specify accuracy requirements
AccuracyRequirement accuracy;
accuracy.relative_error = 1e-4; // 0.01% tolerance
accuracy.require_inf = false; // Don't need infinity
accuracy.require_nan = false; // Don't need NaN
// Get recommendations
auto recommendations = advisor.recommend(analyzer, accuracy);
// Print report
advisor.report(std::cout, analyzer, accuracy);
// Get best recommendation
TypeRecommendation best = advisor.bestType(analyzer, accuracy);
std::cout << "Best type: " << best.type.name << "\n";
std::cout << "Score: " << best.suitability_score << "%\n";
std::cout << "Energy factor: " << best.estimated_energy << "x\n";

Key structures:

TypeCharacteristics: Properties of each number type

  • name, family, total_bits, exponent_bits, fraction_bits
  • max_value, min_positive, epsilon
  • energy_per_fma (picojoules estimate)
  • has_subnormals, has_inf, has_nan

TypeRecommendation: Scored recommendation

  • type: The TypeCharacteristics
  • suitability_score: 0-100 (higher = better match)
  • meets_accuracy, meets_range: Boolean checks
  • estimated_energy: Relative to FP32
  • rationale: Why this type was recommended

Known types in database:

  • IEEE: half, bfloat16, float, double
  • Posit: posit<8,0>, posit<16,1>, posit<32,2>, posit<64,3>
  • Fixed-point: fixpnt<16,8>, fixpnt<32,16>
  • LNS: lns<16,8>, lns<32,16>

Use case: When you know your accuracy requirements, get an objective ranking of which Universal types best fit your needs.


memory_profiler.hpp - Memory Access Profiling

Section titled “memory_profiler.hpp - Memory Access Profiling”

Purpose: Model memory access patterns and estimate memory-related energy costs.

What it does: Tracks memory reads and writes, estimates cache behavior based on working set size, and calculates energy costs across the memory hierarchy (L1, L2, L3, DRAM).

Header: #include <universal/utility/memory_profiler.hpp>

#include <universal/utility/memory_profiler.hpp>
using namespace sw::universal;
// Create profiler with cache configuration
MemoryProfiler profiler(CacheConfig::intel_skylake());
// Record memory accesses
size_t matrix_bytes = M * N * sizeof(float);
profiler.recordRead(matrix_bytes, AccessPattern::Sequential);
profiler.recordWrite(result_bytes, AccessPattern::Sequential);
profiler.setWorkingSetSize(total_working_set);
// Get report
profiler.report(std::cout);
// Get energy estimate
double energy_pj = profiler.estimateEnergyPJ();
double energy_uj = profiler.estimateEnergyUJ();
// Check cache tier
MemoryTier tier = profiler.estimatePrimaryTier();
// Returns: L1_Cache, L2_Cache, L3_Cache, or DRAM

Pre-built BLAS profiles:

// Profile common linear algebra operations
auto gemm_profile = profileGEMM(M, N, K, sizeof(float));
auto dot_profile = profileDotProduct(N, sizeof(float));
auto gemv_profile = profileGEMV(M, N, sizeof(float));

Access patterns:

  • AccessPattern::Sequential: Linear traversal (best cache utilization)
  • AccessPattern::Strided: Regular stride (e.g., column access)
  • AccessPattern::Random: Irregular access (worst case)
  • AccessPattern::Reuse: Repeated access to same data

Cache configurations:

  • CacheConfig::intel_skylake(): 32KB L1, 256KB L2, 8MB L3
  • CacheConfig::arm_cortex_a76(): 64KB L1, 512KB L2, 4MB L3
  • CacheConfig::arm_cortex_a55(): 32KB L1, 128KB L2, 2MB L3

Use case: Estimate memory energy costs for your algorithm and understand if you’re memory-bound or compute-bound.


algorithm_profiler.hpp - Unified Algorithm Profiling

Section titled “algorithm_profiler.hpp - Unified Algorithm Profiling”

Purpose: Combine operation counting, memory profiling, and energy estimation into a single analysis framework.

What it does: Provides AlgorithmProfile structs that capture all dimensions of algorithm cost: operation counts, memory access, value ranges, and energy estimates.

Header: #include <universal/utility/algorithm_profiler.hpp>

#include <universal/utility/algorithm_profiler.hpp>
using namespace sw::universal;
// Profile a GEMM operation
AlgorithmProfile profile = AlgorithmProfiler::profileGEMM(
1024, 1024, 1024, // M, N, K dimensions
"float", 32, // precision name, bit width
CacheConfig::intel_skylake()
);
// Print comprehensive report
profile.report(std::cout);
// Access individual metrics
std::cout << "FMAs: " << profile.fmas << "\n";
std::cout << "Working set: " << profile.working_set_bytes << " bytes\n";
std::cout << "Arithmetic intensity: " << profile.ops_per_byte << " ops/byte\n";
std::cout << "Compute energy: " << profile.compute_energy_pj << " pJ\n";
std::cout << "Memory energy: " << profile.memory_energy_pj << " pJ\n";
std::cout << "Total energy: " << profile.total_energy_pj << " pJ\n";

Pre-built profilers:

// Dense linear algebra
auto gemm = AlgorithmProfiler::profileGEMM(M, N, K, "float", 32);
auto gemv = AlgorithmProfiler::profileGEMV(M, N, "float", 32);
auto dot = AlgorithmProfiler::profileDotProduct(N, "float", 32);
// Deep learning
auto conv = AlgorithmProfiler::profileConv2D(
224, 224, // H, W
3, 64, // C_in, C_out
3, // kernel size
"float", 32
);

Comparing precisions:

auto fp32 = AlgorithmProfiler::profileGEMM(1024, 1024, 1024, "float", 32);
auto fp16 = AlgorithmProfiler::profileGEMM(1024, 1024, 1024, "half", 16);
PrecisionComparison cmp = AlgorithmProfiler::compare(fp32, fp16);
cmp.report(std::cout);
// Shows: ops ratio, memory ratio, energy savings percentage
// Compare multiple precisions
std::vector<AlgorithmProfile> profiles = {fp32, fp16, bf16, int8};
AlgorithmProfiler::compareMultiple(std::cout, profiles);

Use case: Get a complete picture of your algorithm’s resource requirements across multiple precisions.


pareto_explorer.hpp - Accuracy/Energy Trade-off Analysis

Section titled “pareto_explorer.hpp - Accuracy/Energy Trade-off Analysis”

Purpose: Find Pareto-optimal precision configurations that balance accuracy, energy, and memory bandwidth.

What it does: Given a set of precision configurations with their accuracy/energy/bandwidth characteristics, computes the Pareto frontier—the set of configurations where no other configuration is better in all dimensions.

Header: #include <universal/utility/pareto_explorer.hpp>

#include <universal/utility/pareto_explorer.hpp>
using namespace sw::universal;
// Create explorer (comes pre-loaded with standard types)
ParetoExplorer explorer;
// Add custom configuration if needed
explorer.addConfiguration("my_custom_type", 24, 1e-5, 0.8, 0.75);
// Parameters: name, bits, accuracy, energy_factor, bandwidth_factor
// Compute 2D Pareto frontier (accuracy vs energy)
ParetoResult result = explorer.computeFrontier();
// Compute 3D Pareto frontier (accuracy vs energy vs bandwidth)
ParetoResult result3d = explorer.computeFrontier3D();
// Query recommendations
PrecisionConfig best_for_accuracy = result.bestForAccuracy(1e-4);
PrecisionConfig best_for_energy = result.bestForEnergy(0.5); // 50% of FP32 energy
// For specific algorithm characteristics
AlgorithmCharacteristics gemm = ParetoExplorer::profileGEMM(1024, 1024, 1024);
PrecisionConfig best = result.bestForAlgorithm(1e-4, gemm);
// Generate comprehensive report
explorer.report(std::cout);
// Visualize frontier (ASCII art)
explorer.plotFrontier(std::cout);
// Roofline-style analysis
explorer.rooflineAnalysis(std::cout, 100.0); // 100 GB/s bandwidth

Pre-loaded configurations:

TypeBitsAccuracyEnergyBandwidth
FP64642.2e-163.53x2.0x
FP32321.2e-71.0x1.0x
FP16169.8e-40.31x0.5x
BF16167.8e-30.31x0.5x
posit<32,2>327.5e-90.5x1.0x
posit<16,1>162.4e-40.15x0.5x
INT883.9e-30.13x0.25x

Algorithm profiles:

// Common profiles for different algorithm types
auto dot_product = ParetoExplorer::profileDotProduct(1000000);
auto gemm_large = ParetoExplorer::profileGEMM(1024, 1024, 1024);
auto conv2d = ParetoExplorer::profileConv2D(224, 224, 3, 64, 3);

Use case: Systematically explore the precision design space and identify optimal configurations for your accuracy and resource requirements.


precision_config_generator.hpp - Code Generation

Section titled “precision_config_generator.hpp - Code Generation”

Purpose: Generate C++ header files with type aliases for mixed-precision algorithm implementations.

What it does: Based on Pareto analysis results, generates a ready-to-use header file defining InputType, ComputeType, AccumulatorType, and OutputType for your algorithm.

Header: #include <universal/utility/precision_config_generator.hpp>

#include <universal/utility/precision_config_generator.hpp>
using namespace sw::universal;
// Configure the generator
PrecisionConfigGenerator gen;
gen.setAlgorithm("GEMM");
gen.setAccuracyRequirement(1e-4);
gen.setEnergyBudget(0.5); // 50% of FP32 energy
gen.setProblemSize("1024x1024");
// Generate C++ header
std::string header = gen.generateConfigHeader();
std::cout << header;
// Generate example usage code
std::string example = gen.generateExampleCode();
// Generate comparison report
std::string report = gen.generateComparisonReport();
// Print full analysis
gen.printAnalysis(std::cout);

Generated header example:

// Auto-generated mixed-precision configuration
// Algorithm: GEMM
// Generated: 2024-02-07 15:30:00
//
// Requirements:
// Accuracy: 1.0e-04
// Energy budget: 50% of FP32
//
// Estimated energy: 35% of all-FP32
//
#pragma once
#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/posit/posit.hpp>
namespace gemm_config {
// Input precision - for loading data
using InputType = sw::universal::half;
// Compute precision - for arithmetic operations
using ComputeType = sw::universal::posit<16,1>;
// Accumulator precision - for reductions and dot products
using AccumulatorType = float;
// Output precision - for storing results
using OutputType = float;
// Configuration metadata
constexpr double target_accuracy = 1.0e-04;
constexpr double estimated_energy_factor = 0.35;
} // namespace gemm_config

Use case: Automate the generation of precision configuration headers for your mixed-precision implementations.


instrumented.hpp - Operation Instrumentation

Section titled “instrumented.hpp - Operation Instrumentation”

Purpose: Transparently wrap any number type to count all operations performed on it.

What it does: The instrumented<T> wrapper behaves exactly like type T but increments global atomic counters for every arithmetic operation. Thread-safe for parallel algorithms.

Header: #include <universal/utility/instrumented.hpp>

#include <universal/utility/instrumented.hpp>
using namespace sw::universal;
// Reset counters before profiling
instrumented_stats::reset();
// Use instrumented type instead of the base type
using Real = instrumented<cfloat<32,8>>;
// Or: using Real = instrumented<posit<32,2>>;
// Or: using Real = instrumented<float>;
// Run your algorithm normally
Real a = 1.5, b = 2.5;
Real c = a + b; // add counted
Real d = a * b; // mul counted
Real e = c / d; // div counted
// Get operation counts
instrumented_stats::report(std::cout);
// Or get programmatic access
uint64_t total_ops = instrumented_stats::totalArithmeticOps();
auto snapshot = instrumented_stats::snapshot<Real::value_type>();

Tracked operations:

  • Memory: loads, stores
  • Arithmetic: adds, subs, muls, divs, rems, sqrts
  • Other: comparisons, conversions

Key features:

  • Thread-safe using atomic counters
  • Zero overhead when not using instrumented types
  • Works with any Universal number type
  • Transparent drop-in replacement

Use case: Profile your algorithm to count exact operation counts, then combine with energy models for accurate energy estimation.


Use instrumented<T> to count operations and range_analyzer to track value distributions:

// Profile with instrumented types
instrumented_stats::reset();
using Real = instrumented<double>;
run_algorithm<Real>(data, result);
auto ops = instrumented_stats::snapshot<double>();
// Analyze value ranges
range_analyzer<double> analyzer;
for (auto& v : result) analyzer.observe(v);
analyzer.report(std::cout);

Use TypeAdvisor to get recommendations:

TypeAdvisor advisor;
AccuracyRequirement acc(1e-4);
auto recommendations = advisor.recommend(analyzer, acc);
for (const auto& rec : recommendations) {
std::cout << rec.type.name << ": " << rec.suitability_score << "%\n";
}

Use ParetoExplorer to understand the design space:

ParetoExplorer explorer;
auto result = explorer.computeFrontier3D();
// Find best for your constraints
auto best = result.bestForConstraints(
1e-4, // accuracy requirement
0.5, // max 50% energy vs FP32
0.5 // max 50% bandwidth vs FP32
);

Use PrecisionConfigGenerator to create headers:

PrecisionConfigGenerator gen;
gen.setAlgorithm("MyAlgorithm");
gen.setAccuracyRequirement(1e-4);
gen.setEnergyBudget(0.5);
std::ofstream out("my_algorithm_config.hpp");
out << gen.generateConfigHeader();

Step 5: Implement Mixed-Precision Algorithm

Section titled “Step 5: Implement Mixed-Precision Algorithm”

Use the generated configuration:

#include "my_algorithm_config.hpp"
using namespace my_algorithm_config;
void my_algorithm(const std::vector<InputType>& input,
std::vector<OutputType>& output) {
// Load data at input precision
std::vector<ComputeType> work(input.begin(), input.end());
// Compute with accumulator for reductions
AccumulatorType sum = 0;
for (const auto& v : work) {
sum += AccumulatorType(v);
}
// Store at output precision
output.push_back(OutputType(sum));
}

complete_mixed_precision_workflow.cpp
#include <universal/number/cfloat/cfloat.hpp>
#include <universal/number/posit/posit.hpp>
#include <universal/utility/instrumented.hpp>
#include <universal/utility/range_analyzer.hpp>
#include <universal/utility/type_advisor.hpp>
#include <universal/utility/algorithm_profiler.hpp>
#include <universal/utility/pareto_explorer.hpp>
#include <universal/utility/precision_config_generator.hpp>
#include <vector>
#include <iostream>
using namespace sw::universal;
// Your algorithm template
template<typename T>
T dot_product(const std::vector<T>& x, const std::vector<T>& y) {
T result = 0;
for (size_t i = 0; i < x.size(); ++i) {
result += x[i] * y[i];
}
return result;
}
int main() {
const size_t N = 10000;
// Step 1: Profile with double precision
std::vector<double> x(N), y(N);
for (size_t i = 0; i < N; ++i) {
x[i] = sin(i * 0.01);
y[i] = cos(i * 0.01);
}
// Count operations
instrumented_stats::reset();
using InstrumentedReal = instrumented<double>;
std::vector<InstrumentedReal> ix(x.begin(), x.end());
std::vector<InstrumentedReal> iy(y.begin(), y.end());
auto result = dot_product(ix, iy);
std::cout << "=== Step 1: Operation Profile ===\n";
instrumented_stats::report(std::cout);
// Step 2: Analyze value ranges
range_analyzer<double> analyzer;
for (const auto& v : x) analyzer.observe(v);
for (const auto& v : y) analyzer.observe(v);
std::cout << "\n=== Step 2: Range Analysis ===\n";
analyzer.report(std::cout);
// Step 3: Get type recommendations
TypeAdvisor advisor;
AccuracyRequirement acc(1e-6);
std::cout << "\n=== Step 3: Type Recommendations ===\n";
advisor.report(std::cout, analyzer, acc);
// Step 4: Explore trade-offs
ParetoExplorer explorer;
std::cout << "\n=== Step 4: Pareto Analysis ===\n";
explorer.report(std::cout);
// Step 5: Generate configuration
PrecisionConfigGenerator gen;
gen.setAlgorithm("DotProduct");
gen.setAccuracyRequirement(1e-6);
gen.setEnergyBudget(0.3);
std::cout << "\n=== Step 5: Generated Configuration ===\n";
std::cout << gen.generateConfigHeader();
return 0;
}

UtilityPurposeWhen to Use
occurrenceCount operationsBasic operation profiling
scale_trackerTrack exponent distributionUnderstand dynamic range needs
range_analyzerComprehensive value analysisProfile algorithm value ranges
type_advisorRecommend number typesChoose types for accuracy needs
memory_profilerModel memory costsAnalyze memory-bound algorithms
algorithm_profilerUnified profilingComplete algorithm analysis
pareto_explorerTrade-off explorationFind optimal precision configs
precision_config_generatorGenerate codeCreate mixed-precision headers
instrumentedOperation countingAccurate operation profiling

These utilities work together to provide a systematic approach to mixed-precision algorithm design, replacing guesswork with data-driven precision selection.