Mixed-Precision Algorithm Design Utilities
This document describes the utilities in include/sw/universal/utility/ that support mixed-precision algorithm design and optimization. These tools help you analyze value ranges, estimate energy consumption, explore accuracy/energy trade-offs, and generate optimized precision configurations.
Table of Contents
Section titled “Table of Contents”Overview
Section titled “Overview”Why Mixed-Precision?
Section titled “Why Mixed-Precision?”Mixed-precision computing uses different numerical precisions for different parts of an algorithm to optimize:
- Energy efficiency: Lower-precision operations consume less power
- Memory bandwidth: Smaller data types reduce memory traffic
- Performance: More values fit in cache; SIMD lanes can process more elements
The challenge is selecting the right precision for each variable without sacrificing accuracy.
The Mixed-Precision SDK
Section titled “The Mixed-Precision SDK”The Universal library provides a complete SDK for mixed-precision algorithm design:
┌─────────────────────────────────────────────────────────────────┐│ Mixed-Precision Workflow │├─────────────────────────────────────────────────────────────────┤│ 1. Profile → 2. Analyze → 3. Explore → 4. Generate ││ ─────────── ────────── ────────── ─────────── ││ occurrence range_analyzer pareto_explorer config_gen ││ scale_tracker type_advisor algorithm_prof ││ instrumented memory_profiler │└─────────────────────────────────────────────────────────────────┘Core Utilities
Section titled “Core Utilities”occurrence.hpp - Operation Counting
Section titled “occurrence.hpp - Operation Counting”Purpose: Count arithmetic operations (add, sub, mul, div, sqrt) during algorithm execution to estimate compute costs.
What it does: A simple struct that tracks operation counts for a specific number type. Used as a building block for energy estimation.
Header: #include <universal/utility/occurrence.hpp>
#include <universal/utility/occurrence.hpp>using namespace sw::universal;
// Create occurrence trackeroccurrence<float> ops;
// Update counts (typically done by instrumented types)ops.add = 1000;ops.mul = 2000;ops.div = 100;
// Report countsops.report(std::cout);Key fields:
load,store: Memory operationsadd,sub,mul,div,rem: Arithmetic operationssqrt: Square root operations
Use case: Foundation for energy estimation when combined with energy cost models.
scale_tracker.hpp - Exponent Distribution Tracking
Section titled “scale_tracker.hpp - Exponent Distribution Tracking”Purpose: Track the distribution of exponents (scales) in computed values to understand dynamic range requirements.
What it does: Maintains a histogram of exponent values, counting how many values fall into each exponent bucket. Also tracks underflows and overflows.
Header: #include <universal/utility/scale_tracker.hpp>
#include <universal/utility/scale_tracker.hpp>using namespace sw::universal;
// Track scales from 2^-10 to 2^10scaleTracker tracker(-10, 10);
// During computation, record scalesfor (double value : computed_values) { int scale = ilogb(value); // Extract exponent tracker.incr(scale);}
// Report distributiontracker.report(std::cout);// Output shows histogram + underflow/overflow countsKey methods:
incr(scale): Increment count for given exponentclear(): Reset countsreport(ostream): Print histogram
Use case: Determine the required exponent range for your algorithm. If most values cluster in a narrow range, you can use a type with fewer exponent bits.
range_analyzer.hpp - Value Range Analysis
Section titled “range_analyzer.hpp - Value Range Analysis”Purpose: Comprehensive analysis of value distributions to recommend appropriate precision.
What it does: Observes values during computation and collects statistics including min/max values, scale range, subnormal counts, and special values (inf, nan). Generates precision recommendations.
Header: #include <universal/utility/range_analyzer.hpp>
#include <universal/utility/range_analyzer.hpp>using namespace sw::universal;
// Create analyzer for your computation typerange_analyzer<double> analyzer;
// Observe values during computationfor (const auto& result : computation_results) { analyzer.observe(result);}
// Get comprehensive reportanalyzer.report(std::cout);
// Get precision recommendationPrecisionRecommendation rec = analyzer.recommendPrecision();std::cout << "Suggested type: " << rec.type_suggestion << "\n";std::cout << "Min exponent bits: " << rec.min_exponent_bits << "\n";std::cout << "Min fraction bits: " << rec.min_fraction_bits << "\n";Key methods:
observe(value): Record a single valueobserve(begin, end): Record a range of valuesstatistics(): GetRangeStatisticsstructminScale(),maxScale(): Get exponent rangedynamicRangeUtilization(): Fraction of type’s range actually usedrecommendPrecision(): GetPrecisionRecommendation
Output includes:
- Value classification (zeros, normals, denormals, inf, nan)
- Signed value distribution
- Value range (min, max, min absolute, max absolute)
- Scale span (number of decades)
- Type recommendation with rationale
Use case: Profile your algorithm’s intermediate values to understand what precision is actually needed vs. what you’re using.
type_advisor.hpp - Type Recommendation
Section titled “type_advisor.hpp - Type Recommendation”Purpose: Recommend optimal Universal number types based on range analysis and accuracy requirements.
What it does: Takes range analysis results and accuracy requirements, then scores and ranks all available Universal types (cfloat, posit, fixpnt, lns) to recommend the best match.
Header: #include <universal/utility/type_advisor.hpp>
#include <universal/utility/type_advisor.hpp>using namespace sw::universal;
// First analyze your datarange_analyzer<double> analyzer;for (auto v : data) analyzer.observe(v);
// Create type advisorTypeAdvisor advisor;
// Specify accuracy requirementsAccuracyRequirement accuracy;accuracy.relative_error = 1e-4; // 0.01% toleranceaccuracy.require_inf = false; // Don't need infinityaccuracy.require_nan = false; // Don't need NaN
// Get recommendationsauto recommendations = advisor.recommend(analyzer, accuracy);
// Print reportadvisor.report(std::cout, analyzer, accuracy);
// Get best recommendationTypeRecommendation best = advisor.bestType(analyzer, accuracy);std::cout << "Best type: " << best.type.name << "\n";std::cout << "Score: " << best.suitability_score << "%\n";std::cout << "Energy factor: " << best.estimated_energy << "x\n";Key structures:
TypeCharacteristics: Properties of each number type
name,family,total_bits,exponent_bits,fraction_bitsmax_value,min_positive,epsilonenergy_per_fma(picojoules estimate)has_subnormals,has_inf,has_nan
TypeRecommendation: Scored recommendation
type: TheTypeCharacteristicssuitability_score: 0-100 (higher = better match)meets_accuracy,meets_range: Boolean checksestimated_energy: Relative to FP32rationale: Why this type was recommended
Known types in database:
- IEEE: half, bfloat16, float, double
- Posit: posit<8,0>, posit<16,1>, posit<32,2>, posit<64,3>
- Fixed-point: fixpnt<16,8>, fixpnt<32,16>
- LNS: lns<16,8>, lns<32,16>
Use case: When you know your accuracy requirements, get an objective ranking of which Universal types best fit your needs.
memory_profiler.hpp - Memory Access Profiling
Section titled “memory_profiler.hpp - Memory Access Profiling”Purpose: Model memory access patterns and estimate memory-related energy costs.
What it does: Tracks memory reads and writes, estimates cache behavior based on working set size, and calculates energy costs across the memory hierarchy (L1, L2, L3, DRAM).
Header: #include <universal/utility/memory_profiler.hpp>
#include <universal/utility/memory_profiler.hpp>using namespace sw::universal;
// Create profiler with cache configurationMemoryProfiler profiler(CacheConfig::intel_skylake());
// Record memory accessessize_t matrix_bytes = M * N * sizeof(float);profiler.recordRead(matrix_bytes, AccessPattern::Sequential);profiler.recordWrite(result_bytes, AccessPattern::Sequential);profiler.setWorkingSetSize(total_working_set);
// Get reportprofiler.report(std::cout);
// Get energy estimatedouble energy_pj = profiler.estimateEnergyPJ();double energy_uj = profiler.estimateEnergyUJ();
// Check cache tierMemoryTier tier = profiler.estimatePrimaryTier();// Returns: L1_Cache, L2_Cache, L3_Cache, or DRAMPre-built BLAS profiles:
// Profile common linear algebra operationsauto gemm_profile = profileGEMM(M, N, K, sizeof(float));auto dot_profile = profileDotProduct(N, sizeof(float));auto gemv_profile = profileGEMV(M, N, sizeof(float));Access patterns:
AccessPattern::Sequential: Linear traversal (best cache utilization)AccessPattern::Strided: Regular stride (e.g., column access)AccessPattern::Random: Irregular access (worst case)AccessPattern::Reuse: Repeated access to same data
Cache configurations:
CacheConfig::intel_skylake(): 32KB L1, 256KB L2, 8MB L3CacheConfig::arm_cortex_a76(): 64KB L1, 512KB L2, 4MB L3CacheConfig::arm_cortex_a55(): 32KB L1, 128KB L2, 2MB L3
Use case: Estimate memory energy costs for your algorithm and understand if you’re memory-bound or compute-bound.
algorithm_profiler.hpp - Unified Algorithm Profiling
Section titled “algorithm_profiler.hpp - Unified Algorithm Profiling”Purpose: Combine operation counting, memory profiling, and energy estimation into a single analysis framework.
What it does: Provides AlgorithmProfile structs that capture all dimensions of algorithm cost: operation counts, memory access, value ranges, and energy estimates.
Header: #include <universal/utility/algorithm_profiler.hpp>
#include <universal/utility/algorithm_profiler.hpp>using namespace sw::universal;
// Profile a GEMM operationAlgorithmProfile profile = AlgorithmProfiler::profileGEMM( 1024, 1024, 1024, // M, N, K dimensions "float", 32, // precision name, bit width CacheConfig::intel_skylake());
// Print comprehensive reportprofile.report(std::cout);
// Access individual metricsstd::cout << "FMAs: " << profile.fmas << "\n";std::cout << "Working set: " << profile.working_set_bytes << " bytes\n";std::cout << "Arithmetic intensity: " << profile.ops_per_byte << " ops/byte\n";std::cout << "Compute energy: " << profile.compute_energy_pj << " pJ\n";std::cout << "Memory energy: " << profile.memory_energy_pj << " pJ\n";std::cout << "Total energy: " << profile.total_energy_pj << " pJ\n";Pre-built profilers:
// Dense linear algebraauto gemm = AlgorithmProfiler::profileGEMM(M, N, K, "float", 32);auto gemv = AlgorithmProfiler::profileGEMV(M, N, "float", 32);auto dot = AlgorithmProfiler::profileDotProduct(N, "float", 32);
// Deep learningauto conv = AlgorithmProfiler::profileConv2D( 224, 224, // H, W 3, 64, // C_in, C_out 3, // kernel size "float", 32);Comparing precisions:
auto fp32 = AlgorithmProfiler::profileGEMM(1024, 1024, 1024, "float", 32);auto fp16 = AlgorithmProfiler::profileGEMM(1024, 1024, 1024, "half", 16);
PrecisionComparison cmp = AlgorithmProfiler::compare(fp32, fp16);cmp.report(std::cout);// Shows: ops ratio, memory ratio, energy savings percentage
// Compare multiple precisionsstd::vector<AlgorithmProfile> profiles = {fp32, fp16, bf16, int8};AlgorithmProfiler::compareMultiple(std::cout, profiles);Use case: Get a complete picture of your algorithm’s resource requirements across multiple precisions.
pareto_explorer.hpp - Accuracy/Energy Trade-off Analysis
Section titled “pareto_explorer.hpp - Accuracy/Energy Trade-off Analysis”Purpose: Find Pareto-optimal precision configurations that balance accuracy, energy, and memory bandwidth.
What it does: Given a set of precision configurations with their accuracy/energy/bandwidth characteristics, computes the Pareto frontier—the set of configurations where no other configuration is better in all dimensions.
Header: #include <universal/utility/pareto_explorer.hpp>
#include <universal/utility/pareto_explorer.hpp>using namespace sw::universal;
// Create explorer (comes pre-loaded with standard types)ParetoExplorer explorer;
// Add custom configuration if neededexplorer.addConfiguration("my_custom_type", 24, 1e-5, 0.8, 0.75);// Parameters: name, bits, accuracy, energy_factor, bandwidth_factor
// Compute 2D Pareto frontier (accuracy vs energy)ParetoResult result = explorer.computeFrontier();
// Compute 3D Pareto frontier (accuracy vs energy vs bandwidth)ParetoResult result3d = explorer.computeFrontier3D();
// Query recommendationsPrecisionConfig best_for_accuracy = result.bestForAccuracy(1e-4);PrecisionConfig best_for_energy = result.bestForEnergy(0.5); // 50% of FP32 energy
// For specific algorithm characteristicsAlgorithmCharacteristics gemm = ParetoExplorer::profileGEMM(1024, 1024, 1024);PrecisionConfig best = result.bestForAlgorithm(1e-4, gemm);
// Generate comprehensive reportexplorer.report(std::cout);
// Visualize frontier (ASCII art)explorer.plotFrontier(std::cout);
// Roofline-style analysisexplorer.rooflineAnalysis(std::cout, 100.0); // 100 GB/s bandwidthPre-loaded configurations:
| Type | Bits | Accuracy | Energy | Bandwidth |
|---|---|---|---|---|
| FP64 | 64 | 2.2e-16 | 3.53x | 2.0x |
| FP32 | 32 | 1.2e-7 | 1.0x | 1.0x |
| FP16 | 16 | 9.8e-4 | 0.31x | 0.5x |
| BF16 | 16 | 7.8e-3 | 0.31x | 0.5x |
| posit<32,2> | 32 | 7.5e-9 | 0.5x | 1.0x |
| posit<16,1> | 16 | 2.4e-4 | 0.15x | 0.5x |
| INT8 | 8 | 3.9e-3 | 0.13x | 0.25x |
Algorithm profiles:
// Common profiles for different algorithm typesauto dot_product = ParetoExplorer::profileDotProduct(1000000);auto gemm_large = ParetoExplorer::profileGEMM(1024, 1024, 1024);auto conv2d = ParetoExplorer::profileConv2D(224, 224, 3, 64, 3);Use case: Systematically explore the precision design space and identify optimal configurations for your accuracy and resource requirements.
precision_config_generator.hpp - Code Generation
Section titled “precision_config_generator.hpp - Code Generation”Purpose: Generate C++ header files with type aliases for mixed-precision algorithm implementations.
What it does: Based on Pareto analysis results, generates a ready-to-use header file defining InputType, ComputeType, AccumulatorType, and OutputType for your algorithm.
Header: #include <universal/utility/precision_config_generator.hpp>
#include <universal/utility/precision_config_generator.hpp>using namespace sw::universal;
// Configure the generatorPrecisionConfigGenerator gen;gen.setAlgorithm("GEMM");gen.setAccuracyRequirement(1e-4);gen.setEnergyBudget(0.5); // 50% of FP32 energygen.setProblemSize("1024x1024");
// Generate C++ headerstd::string header = gen.generateConfigHeader();std::cout << header;
// Generate example usage codestd::string example = gen.generateExampleCode();
// Generate comparison reportstd::string report = gen.generateComparisonReport();
// Print full analysisgen.printAnalysis(std::cout);Generated header example:
// Auto-generated mixed-precision configuration// Algorithm: GEMM// Generated: 2024-02-07 15:30:00//// Requirements:// Accuracy: 1.0e-04// Energy budget: 50% of FP32//// Estimated energy: 35% of all-FP32//#pragma once
#include <universal/number/cfloat/cfloat.hpp>#include <universal/number/posit/posit.hpp>
namespace gemm_config {
// Input precision - for loading datausing InputType = sw::universal::half;
// Compute precision - for arithmetic operationsusing ComputeType = sw::universal::posit<16,1>;
// Accumulator precision - for reductions and dot productsusing AccumulatorType = float;
// Output precision - for storing resultsusing OutputType = float;
// Configuration metadataconstexpr double target_accuracy = 1.0e-04;constexpr double estimated_energy_factor = 0.35;
} // namespace gemm_configUse case: Automate the generation of precision configuration headers for your mixed-precision implementations.
instrumented.hpp - Operation Instrumentation
Section titled “instrumented.hpp - Operation Instrumentation”Purpose: Transparently wrap any number type to count all operations performed on it.
What it does: The instrumented<T> wrapper behaves exactly like type T but increments global atomic counters for every arithmetic operation. Thread-safe for parallel algorithms.
Header: #include <universal/utility/instrumented.hpp>
#include <universal/utility/instrumented.hpp>using namespace sw::universal;
// Reset counters before profilinginstrumented_stats::reset();
// Use instrumented type instead of the base typeusing Real = instrumented<cfloat<32,8>>;// Or: using Real = instrumented<posit<32,2>>;// Or: using Real = instrumented<float>;
// Run your algorithm normallyReal a = 1.5, b = 2.5;Real c = a + b; // add countedReal d = a * b; // mul countedReal e = c / d; // div counted
// Get operation countsinstrumented_stats::report(std::cout);
// Or get programmatic accessuint64_t total_ops = instrumented_stats::totalArithmeticOps();auto snapshot = instrumented_stats::snapshot<Real::value_type>();Tracked operations:
- Memory:
loads,stores - Arithmetic:
adds,subs,muls,divs,rems,sqrts - Other:
comparisons,conversions
Key features:
- Thread-safe using atomic counters
- Zero overhead when not using instrumented types
- Works with any Universal number type
- Transparent drop-in replacement
Use case: Profile your algorithm to count exact operation counts, then combine with energy models for accurate energy estimation.
Workflow Guide
Section titled “Workflow Guide”Step 1: Profile Your Algorithm
Section titled “Step 1: Profile Your Algorithm”Use instrumented<T> to count operations and range_analyzer to track value distributions:
// Profile with instrumented typesinstrumented_stats::reset();using Real = instrumented<double>;run_algorithm<Real>(data, result);auto ops = instrumented_stats::snapshot<double>();
// Analyze value rangesrange_analyzer<double> analyzer;for (auto& v : result) analyzer.observe(v);analyzer.report(std::cout);Step 2: Get Type Recommendations
Section titled “Step 2: Get Type Recommendations”Use TypeAdvisor to get recommendations:
TypeAdvisor advisor;AccuracyRequirement acc(1e-4);auto recommendations = advisor.recommend(analyzer, acc);
for (const auto& rec : recommendations) { std::cout << rec.type.name << ": " << rec.suitability_score << "%\n";}Step 3: Explore Trade-offs
Section titled “Step 3: Explore Trade-offs”Use ParetoExplorer to understand the design space:
ParetoExplorer explorer;auto result = explorer.computeFrontier3D();
// Find best for your constraintsauto best = result.bestForConstraints( 1e-4, // accuracy requirement 0.5, // max 50% energy vs FP32 0.5 // max 50% bandwidth vs FP32);Step 4: Generate Configuration
Section titled “Step 4: Generate Configuration”Use PrecisionConfigGenerator to create headers:
PrecisionConfigGenerator gen;gen.setAlgorithm("MyAlgorithm");gen.setAccuracyRequirement(1e-4);gen.setEnergyBudget(0.5);
std::ofstream out("my_algorithm_config.hpp");out << gen.generateConfigHeader();Step 5: Implement Mixed-Precision Algorithm
Section titled “Step 5: Implement Mixed-Precision Algorithm”Use the generated configuration:
#include "my_algorithm_config.hpp"
using namespace my_algorithm_config;
void my_algorithm(const std::vector<InputType>& input, std::vector<OutputType>& output) { // Load data at input precision std::vector<ComputeType> work(input.begin(), input.end());
// Compute with accumulator for reductions AccumulatorType sum = 0; for (const auto& v : work) { sum += AccumulatorType(v); }
// Store at output precision output.push_back(OutputType(sum));}Complete Example
Section titled “Complete Example”#include <universal/number/cfloat/cfloat.hpp>#include <universal/number/posit/posit.hpp>#include <universal/utility/instrumented.hpp>#include <universal/utility/range_analyzer.hpp>#include <universal/utility/type_advisor.hpp>#include <universal/utility/algorithm_profiler.hpp>#include <universal/utility/pareto_explorer.hpp>#include <universal/utility/precision_config_generator.hpp>#include <vector>#include <iostream>
using namespace sw::universal;
// Your algorithm templatetemplate<typename T>T dot_product(const std::vector<T>& x, const std::vector<T>& y) { T result = 0; for (size_t i = 0; i < x.size(); ++i) { result += x[i] * y[i]; } return result;}
int main() { const size_t N = 10000;
// Step 1: Profile with double precision std::vector<double> x(N), y(N); for (size_t i = 0; i < N; ++i) { x[i] = sin(i * 0.01); y[i] = cos(i * 0.01); }
// Count operations instrumented_stats::reset(); using InstrumentedReal = instrumented<double>; std::vector<InstrumentedReal> ix(x.begin(), x.end()); std::vector<InstrumentedReal> iy(y.begin(), y.end()); auto result = dot_product(ix, iy);
std::cout << "=== Step 1: Operation Profile ===\n"; instrumented_stats::report(std::cout);
// Step 2: Analyze value ranges range_analyzer<double> analyzer; for (const auto& v : x) analyzer.observe(v); for (const auto& v : y) analyzer.observe(v);
std::cout << "\n=== Step 2: Range Analysis ===\n"; analyzer.report(std::cout);
// Step 3: Get type recommendations TypeAdvisor advisor; AccuracyRequirement acc(1e-6);
std::cout << "\n=== Step 3: Type Recommendations ===\n"; advisor.report(std::cout, analyzer, acc);
// Step 4: Explore trade-offs ParetoExplorer explorer;
std::cout << "\n=== Step 4: Pareto Analysis ===\n"; explorer.report(std::cout);
// Step 5: Generate configuration PrecisionConfigGenerator gen; gen.setAlgorithm("DotProduct"); gen.setAccuracyRequirement(1e-6); gen.setEnergyBudget(0.3);
std::cout << "\n=== Step 5: Generated Configuration ===\n"; std::cout << gen.generateConfigHeader();
return 0;}Summary
Section titled “Summary”| Utility | Purpose | When to Use |
|---|---|---|
occurrence | Count operations | Basic operation profiling |
scale_tracker | Track exponent distribution | Understand dynamic range needs |
range_analyzer | Comprehensive value analysis | Profile algorithm value ranges |
type_advisor | Recommend number types | Choose types for accuracy needs |
memory_profiler | Model memory costs | Analyze memory-bound algorithms |
algorithm_profiler | Unified profiling | Complete algorithm analysis |
pareto_explorer | Trade-off exploration | Find optimal precision configs |
precision_config_generator | Generate code | Create mixed-precision headers |
instrumented | Operation counting | Accurate operation profiling |
These utilities work together to provide a systematic approach to mixed-precision algorithm design, replacing guesswork with data-driven precision selection.