Energy: how efficient?
Table 1 shows switching energy estimates for key computational events by process node. Data movement operations (reads and writes) have started to dominate energy consumption in modern processors, which makes a Stored Program Machine (SPM) less and less efficient. To counter this, CPUs, GPUs, and DSPs have all added instructions that amortize instruction processing over more computation per instruction: they have all become SIMD machines.
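To illustrate that amortization concretely, here is a minimal sketch of ours (assuming an x86 target with AVX, compiled with `-mavx`; the function names `add_scalar` and `add_avx` are hypothetical). The vector loop retires eight additions per instruction fetched and decoded, instead of one:

```c
#include <immintrin.h>

// Scalar version: one instruction fetch/decode per element produced.
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// SIMD version: one fetch/decode drives eight lanes, amortizing the
// instruction-processing energy roughly 8x over the scalar loop.
void add_avx(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)       // scalar tail for the remainder
        c[i] = a[i] + b[i];
}
```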
Fundamentally, the SPM relies on a request/reply protocol to get information from memory. Otherwise stated, the resource contention mechanism deployed by an SPM uses a random access memory to store inputs, intermediate values, and outputs, and all of that memory management runs through the request/reply cycle. That cycle, as we now know, is becoming less and less energy efficient relative to the actual computational event the algorithm requires: the sequential processing model itself is becoming less and less energy efficient.
We see that the further the memory is from the ALUs, the worse the energy imbalance becomes. This has spawned Processor-In-Memory (PIM) and In-Memory-Compute (IMC) structures, where processing elements are multiplied and pushed into the memory. This improves the energy efficiency of the request/reply cycle, but it complicates the data distribution problem.
The fine-grained data paths so common in real-time designs are more energy efficient than their SPM counterparts because they do not rely on the request/reply cycle. Instead, they operate in pipeline fashion, with parallel operational units writing results directly to the next stage, removing the reliance on a random access memory to orchestrate computational schedules.
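To make the contrast concrete, here is a toy C model of such a pipeline (our sketch, not an actual real-time design): each stage holds one value in a local register and forwards its result directly to its successor every cycle, so no operand ever makes a request/reply round trip to a random access memory.

```c
#include <stdio.h>

#define STAGES 4   // stage 0 latches input; stages 1..3 each add 1

int main(void) {
    double input[] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n = sizeof input / sizeof input[0];
    double reg[STAGES] = {0};   // one local register per stage

    for (int cycle = 0; cycle < n + STAGES - 1; ++cycle) {
        // Advance the pipeline: each stage consumes the value its
        // predecessor produced last cycle -- a direct write to the
        // next stage, not a request/reply to a shared memory.
        for (int s = STAGES - 1; s > 0; --s)
            reg[s] = reg[s - 1] + 1.0;              // stand-in computation
        reg[0] = (cycle < n) ? input[cycle] : 0.0;  // latch next input
        if (cycle >= STAGES - 1)                    // pipeline is full
            printf("cycle %2d: output %.1f\n", cycle, reg[STAGES - 1]);
    }
    return 0;
}
```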
We have seen the Data Flow Machine (DFM) maintain fine-grain parallelism using a finite number of processing elements. Unfortunately, the basic operation of the DFM is less efficient than the basic operation of the SPM. Furthermore, the DFM has no mechanism to exploit the spatial relationships among a collection of operations: structured parallelism is treated the same as unstructured parallelism, and incurs an unnecessary penalty. But the DFM does provide a hint of how to maintain fine-grain parallelism: its pipeline is a ring, a bounded structure that can be traversed indefinitely, as the sketch below illustrates.
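This "infinite, but bounded" property is easy to model (a toy of ours, not the DFM's actual token machinery): a bounded ring of slots services an unbounded stream of tokens by wrapping around.

```c
#include <stdio.h>

#define RING 4   // the finite physical resource

int main(void) {
    double slot[RING];
    for (long token = 0; token < 10; ++token) {  // stream could run forever
        int s = token % RING;                    // reuse the finite slots
        slot[s] = (double)(token * token);       // stand-in for a PE firing
        printf("token %2ld -> slot %d -> %.0f\n", token, s, slot[s]);
    }
    return 0;
}
```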
The Domain Flow Architecture (DFA) builds on this observation: it maintains a local, fine-grain spatial structure while offering an infinite computational fabric realized with finite resources. DFA is to DFM as PIM is to SPM.
Values in picojoules (pJ) per operation
Operation Type | 28/22nm | 16/14/12nm | 7/6/5nm | 3nm | 2nm |
---|---|---|---|---|---|
32-bit Register Read | 0.040 | 0.025 | 0.012 | 0.008 | 0.006 |
32-bit Register Write | 0.045 | 0.028 | 0.014 | 0.009 | 0.007 |
32-bit ALU Operation | 0.100 | 0.060 | 0.030 | 0.020 | 0.015 |
32-bit FPU Add | 0.400 | 0.250 | 0.120 | 0.080 | 0.060 |
32-bit FPU Multiply | 0.800 | 0.500 | 0.250 | 0.170 | 0.130 |
32-bit FPU FMA | 1.000 | 0.600 | 0.300 | 0.200 | 0.150 |
32-bit Word Read L1 | 0.625 | 0.375 | 0.1875 | 0.125 | 0.09375 |
32-bit Word Read L2 | 1.875 | 1.125 | 0.5625 | 0.375 | 0.28125 |
32-bit Word Read DDR5 | 6.25 | 5.000 | 3.750 | 3.125 | 2.8125 |
64-byte L1 Cache Read | 10.000 | 6.000 | 3.000 | 2.000 | 1.500 |
64-byte L2 Cache Read | 30.000 | 18.000 | 9.000 | 6.000 | 4.500 |
64-byte DDR5 Memory Read | 100.000 | 80.000 | 60.000 | 50.000 | 45.000 |
Table 1: Switching Energy Estimate by Process Node
Note:
- 32-bit cache and memory read energies are derived from the corresponding 64-byte read energies (one 64-byte line holds 16 words).
- Each major node transition generally reduces switching energy by roughly 40-50%.
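The imbalance is easy to quantify directly from Table 1. The short program below (ours) divides the 32-bit DDR5 word-read energy by the 32-bit FMA energy at each node; the ratio grows from roughly 6x at 28/22nm to almost 19x at 2nm, confirming that compute energy scales down faster than off-chip data movement energy.

```c
#include <stdio.h>

// Energies in pJ per operation, copied from Table 1.
int main(void) {
    const char  *node[] = {"28/22nm", "16/14/12nm", "7/6/5nm", "3nm", "2nm"};
    const double fma[]  = {1.000, 0.600, 0.300, 0.200, 0.150};   // 32-bit FPU FMA
    const double ddr[]  = {6.25,  5.000, 3.750, 3.125, 2.8125};  // 32-bit word read, DDR5

    for (int i = 0; i < 5; ++i)
        printf("%-10s  DDR5 word read / FMA = %5.2fx\n",
               node[i], ddr[i] / fma[i]);
    return 0;
}
```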