Results and Future Implications
Intel’s developers ported several high performance computing applications to this architecture, in what can only be described as a heroic effort. The applications were a heat equation solver using the stencil method, SGEMM (single-precision matrix multiply), financial modeling for spreadsheets, and a 2D fast Fourier transform. The heat equation solver achieved 1TFLOPS at 80 degrees C, 1.07V and 4.27GHz, with a measured power draw of 97W. Efficiency varied strongly between applications, depending on the communication required, as shown in Table 1 below.
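The figures quoted for the heat equation run can be turned into a rough energy-efficiency estimate. The sketch below uses only the 1TFLOPS, 97W and 4.27GHz numbers from the text. The peak-throughput part rests on assumptions that are not stated in this section: 80 tiles (taken from the title of the Vangal et al. reference), two floating-point multiply-accumulate units per tile, and two FLOPs per unit per cycle. Treat the resulting efficiency as illustrative rather than as a reported result.

```python
# Figures quoted in the text for the heat equation (stencil) run.
SUSTAINED_FLOPS = 1.0e12   # 1 TFLOPS
POWER_WATTS = 97.0         # measured power draw
CLOCK_HZ = 4.27e9          # 4.27 GHz

# Assumptions, NOT stated in this section of the article:
TILES = 80                 # from the title of the Vangal et al. reference
FPMACS_PER_TILE = 2        # assumed multiply-accumulate units per tile
FLOPS_PER_FPMAC = 2        # one multiply plus one add per cycle


def gflops_per_watt(flops: float, watts: float) -> float:
    """Energy efficiency in GFLOPS per watt."""
    return flops / watts / 1e9


def peak_flops(tiles: int, units: int, flops_per_unit: int, clock_hz: float) -> float:
    """Theoretical peak throughput if every unit is busy every cycle."""
    return tiles * units * flops_per_unit * clock_hz


def algorithmic_efficiency(sustained: float, peak: float) -> float:
    """Fraction of the theoretical peak that the application sustains."""
    return sustained / peak


if __name__ == "__main__":
    peak = peak_flops(TILES, FPMACS_PER_TILE, FLOPS_PER_FPMAC, CLOCK_HZ)
    print(f"GFLOPS/W:   {gflops_per_watt(SUSTAINED_FLOPS, POWER_WATTS):.2f}")
    print(f"Peak:       {peak / 1e12:.3f} TFLOPS (under the assumptions above)")
    print(f"Efficiency: {algorithmic_efficiency(SUSTAINED_FLOPS, peak):.1%}")
```

The quoted numbers work out to a little over 10 GFLOPS per watt. Under the stated assumptions the stencil code would be sustaining roughly three quarters of the theoretical peak.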
Table 1 – Algorithmic Efficiency for Polaris
These results show that the system design works well for algorithms with relatively little inter-node communication. The stencil algorithm uses mathematical properties of the heat equation to reduce communication, so its high efficiency isn’t very surprising. However, for communication-heavy workloads such as the 2D FFT, it is clear that the network could use some improvements, which points the way to further research.
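To see why a stencil code needs so little communication, consider a generic five-point stencil for the 2D heat equation. This is a minimal textbook sketch, not the code Intel ran on Polaris; the function names, the `alpha` parameter and the square-block decomposition are illustrative assumptions. Each cell is updated from its four nearest neighbors only, so a tile that owns a block of the grid needs just the one-cell-wide border of the adjacent blocks at each step.

```python
def heat_step(grid, alpha=0.1):
    """Advance a 2D temperature grid by one explicit time step.

    Generic five-point stencil; boundary cells are held fixed.
    alpha is the diffusion coefficient times dt / dx^2 and should
    stay at or below 0.25 for this explicit scheme to be stable.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so boundary values carry over
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbors = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1])
            new[i][j] = grid[i][j] + alpha * (neighbors - 4.0 * grid[i][j])
    return new


def halo_cells(block_rows, block_cols):
    """Values an interior block must receive from its four neighbors per step."""
    return 2 * (block_rows + block_cols)


def comm_to_compute_ratio(block_rows, block_cols):
    """Halo values received per cell updated, for one block and one step."""
    return halo_cells(block_rows, block_cols) / (block_rows * block_cols)


if __name__ == "__main__":
    grid = [[0.0] * 5 for _ in range(5)]
    grid[2][2] = 1.0  # single hot spot in the middle
    grid = heat_step(grid)
    print(grid[2][2], grid[1][2])
    print(comm_to_compute_ratio(16, 16))
```

The work per block grows with its area while the data exchanged grows only with its perimeter, so larger blocks per tile mean less communication per update. A 2D FFT has no such locality, since every output depends on every input along a row or column, which is consistent with its lower efficiency on this network.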
The mesochronous clocking and network are very interesting, since they offer the opportunity to reduce a significant source of power consumption. However, they are not quite ready for implementation in mainstream x86 MPUs. Regular MPUs have caches, instead of non-coherent memory blocks, and usually the last level of cache is shared between multiple cores. The design of the cache architecture will likely have implications for clocking, since cores sharing a cache should probably operate synchronously; alternatively, the cache itself could be designed to be somewhat asynchronous, similar to the 12MB L3 cache in Montecito. There will also be interaction effects between the asynchronous clocking and other portions of the chip, but it seems likely that clever design should be able to accommodate these issues.
 Anderson, F. et al. The Core Clock System for the Next Generation Itanium Microprocessor. International Solid-State Circuits Conference Technical Digest, February 2002.
 Mahoney, P. et al. Clock Distribution on a Dual-Core Multi-Threaded Itanium Family Processor. International Solid-State Circuits Conference Technical Digest, February 2005.
 Jacobson, H. et al. Stretching the Limits of Clock-Gating Efficiency in Server-Class Processors. International Symposium on High Performance Computer Architecture, February 2005.
 Naffziger, S. et al. The Implementation of a 2-core, Multithreaded Itanium Family Processor. International Solid-State Circuits Conference Technical Digest, February 2005.
 Vangal, S. et al. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. International Solid-State Circuits Conference Technical Digest, February 2007.