Energy: the "how efficient"

Table 1 shows switching energy estimates for key computational events by process node. Data movement operations (reads and writes) have come to dominate energy consumption in modern processors, which makes a Stored Program Machine (SPM) less and less efficient. To counter this, CPUs, GPUs, and DSPs have all added instructions that amortize instruction-processing overhead over more computation per instruction: they have all become SIMD machines.
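To make the amortization concrete, here is a minimal sketch in C++ that models why one 8-wide instruction beats eight scalar ones. The ALU figure is taken from Table 1's 2nm column; the per-instruction overhead constant is an assumption for illustration, not a value from the table.

```cpp
// Minimal model: SIMD amortizes the fixed per-instruction overhead
// (fetch, decode, issue) over more computational work per instruction.
#include <cstdio>

int main() {
    const double alu_op_pj   = 0.015;  // 32-bit ALU op, Table 1, 2nm column
    const double overhead_pj = 0.100;  // assumed fetch/decode/issue cost per instruction
    const int n = 1024;                // elements to process

    // Scalar: one instruction issued per element.
    double scalar = n * (overhead_pj + alu_op_pj);

    // 8-wide SIMD: one instruction per 8 elements; the data path still
    // performs 8 ALU events, but the instruction overhead is paid once.
    double simd = (n / 8) * (overhead_pj + 8 * alu_op_pj);

    std::printf("scalar: %.1f pJ, 8-wide SIMD: %.1f pJ, ratio: %.2fx\n",
                scalar, simd, scalar / simd);
    return 0;
}
```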

Fundamentally, the SPM relies on a request/reply protocol to get information from memory. Otherwise stated, the resource contention mechanism deployed by an SPM uses a random access memory to store input, intermediate, and output values, and all of this memory management runs through the request/reply cycle, which we now know is becoming less and less energy efficient compared to the actual computational event the algorithm requires. The sequential processing model is steadily losing energy efficiency.
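As a back-of-envelope check against Table 1's 2nm column, the sketch below compares the request/reply cost of delivering one FMA's operands to the cost of the FMA itself. The assumption that a write costs roughly as much as a read at the same level is ours, not from the table.

```cpp
// Back-of-envelope: data movement energy vs. the compute event it feeds,
// using Table 1's 2nm column. One FMA consumes three operands; we assume
// one result write at roughly the cost of a read at the same level.
#include <cstdio>

int main() {
    const double fma_pj  = 0.150;    // 32-bit FPU FMA
    const double l1_pj   = 0.09375;  // 32-bit word read, L1
    const double l2_pj   = 0.28125;  // 32-bit word read, L2
    const double ddr5_pj = 2.8125;   // 32-bit word read, DDR5

    // Three operand reads plus one result write per FMA.
    auto ratio = [&](double word_pj) { return 4 * word_pj / fma_pj; };

    std::printf("data movement / compute:  L1 %.1fx  L2 %.1fx  DDR5 %.0fx\n",
                ratio(l1_pj), ratio(l2_pj), ratio(ddr5_pj));
    return 0;
}
```

At DDR5 distance the request/reply traffic costs roughly 75x the arithmetic it feeds; even an L1 hit costs more than the FMA itself.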

We see that the further the memory is from the ALUs, the worse the energy imbalance becomes. This has spawned Processor-In-Memory (PIM) and In-Memory-Compute (IMC) structures, in which processing elements are multiplied and pushed into the memory. This improves the energy efficiency of the request/reply cycle, but it complicates the data distribution problem.

Fine-grained data paths, so common in real-time designs, are more energy efficient than their SPM counterparts because they do not rely on the request/reply cycle. Instead, they operate in pipeline fashion, with parallel operational units writing results directly to the next stage, removing the reliance on a random access memory to orchestrate computational schedules.
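A minimal sketch of that idea, with illustrative stage functions of our choosing: each stage latches its result directly into the next stage's register, so no shared random access memory sits in the loop.

```cpp
// A fine-grained pipelined data path: results forward stage-to-stage
// through dedicated registers, with no memory request/reply per step.
#include <array>
#include <cstdio>

int main() {
    std::array<int, 8> input = {1, 2, 3, 4, 5, 6, 7, 8};

    int r1 = 0, r2 = 0, r3 = 0;  // pipeline registers between stages

    // Feed one operand per cycle; after the 3-cycle fill, one result
    // emerges per cycle. Updates run back-to-front to model the latch.
    for (size_t cycle = 0; cycle < input.size() + 3; ++cycle) {
        if (cycle >= 3) std::printf("cycle %zu: result %d\n", cycle, r3);
        r3 = r2 * 2;                                     // stage 3: scale
        r2 = r1 + 1;                                     // stage 2: bias
        r1 = (cycle < input.size()) ? input[cycle] : 0;  // stage 1: ingest
    }
    return 0;
}
```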

We have seen the Data Flow Machine (DFM) maintain fine-grained parallelism using a finite number of processing elements. Unfortunately, the basic operation of the DFM is less efficient than the basic operation of the SPM. Furthermore, the DFM has no mechanism to cater to the spatial relationships among a collection of operations: structured parallelism is treated the same as unstructured parallelism, and incurs an unnecessary penalty. But the DFM does provide a hint of how to maintain fine-grained parallelism: its pipeline is a ring, an infinite but bounded structure.
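A minimal sketch of the ring idea, with an assumed slot count and work function: modular indexing lets an unbounded operation stream circulate through a fixed set of slots, so the structure is bounded yet presents an effectively infinite pipeline to the stream.

```cpp
// The ring as an infinite-but-bounded structure: an arbitrarily long
// operation stream maps onto finitely many slots via modular indexing.
#include <cstdio>

int main() {
    const int slots = 4;              // finite processing elements on the ring
    int ring[slots] = {0, 0, 0, 0};   // in-flight values, one per slot

    for (int op = 0; op < 12; ++op) { // an arbitrarily long operation stream
        int slot = op % slots;        // the ring wraps around
        ring[slot] = op * op;         // each slot processes whichever op lands on it
        std::printf("op %2d -> slot %d holds %d\n", op, slot, ring[slot]);
    }
    return 0;
}
```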

The Domain Flow Architecture (DFA) builds on this observation: it maintains a local, fine-grained spatial structure while offering an infinite computational fabric realized with finite resources. DFA is to DFM as PIM is to SPM.

Values in picojoules (pJ) per operation

| Operation Type           | 28/22nm | 16/14/12nm | 7/6/5nm | 3nm    | 2nm     |
|--------------------------|---------|------------|---------|--------|---------|
| 32-bit Register Read     | 0.040   | 0.025      | 0.012   | 0.008  | 0.006   |
| 32-bit Register Write    | 0.045   | 0.028      | 0.014   | 0.009  | 0.007   |
| 32-bit ALU Operation     | 0.100   | 0.060      | 0.030   | 0.020  | 0.015   |
| 32-bit FPU Add           | 0.400   | 0.250      | 0.120   | 0.080  | 0.060   |
| 32-bit FPU Multiply      | 0.800   | 0.500      | 0.250   | 0.170  | 0.130   |
| 32-bit FPU FMA           | 1.000   | 0.600      | 0.300   | 0.200  | 0.150   |
| 32-bit Word Read L1      | 0.625   | 0.375      | 0.1875  | 0.125  | 0.09375 |
| 32-bit Word Read L2      | 1.875   | 1.125      | 0.5625  | 0.375  | 0.28125 |
| 32-bit Word Read DDR5    | 6.250   | 5.000      | 3.750   | 3.125  | 2.8125  |
| 64-byte L1 Cache Read    | 10.000  | 6.000      | 3.000   | 2.000  | 1.500   |
| 64-byte L2 Cache Read    | 30.000  | 18.000     | 9.000   | 6.000  | 4.500   |
| 64-byte DDR5 Memory Read | 100.000 | 80.000     | 60.000  | 50.000 | 45.000  |

Table 1: Switching Energy Estimate by Process Node

Note:

  1. 32-bit cache and memory operations are derived from the 64-byte read energy (one 32-bit word is 1/16 of a 64-byte line)
  2. Smaller process nodes generally reduce switching energy by roughly 40-50% per major node transition