The second day begins with a session on co-processors and accelerators. The central questions surrounding these papers are first and foremost the software model and programmability – history is littered with companies that pushed peculiar, inconvenient and proprietary software models to increase performance without fully comprehending the huge value of existing binaries and legacy code. For those with good hearing, a slogan of ‘our customers will rewrite/recompile their code’ can sound an awful lot like ‘we don’t intend to have many customers’. GPU vendors are acutely aware of these issues, which is why OpenCL is an open standard, similar to OpenGL. Performance of mainstream hardware is also a consideration – to be attractive relative to general purpose processors (be it for PCs, networking, etc.), specialized hardware needs a sustained advantage, ideally 10X or better, over a 5 year time horizon.
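To see why the advantage must be so large, consider a rough sketch of how an initial speedup erodes as mainstream hardware catches up. The two-year doubling cadence below is an illustrative assumption, not a figure from the conference:

```python
def remaining_advantage(initial_speedup, years, doubling_period_years=2.0):
    """Speedup left over mainstream hardware after `years` of catch-up,
    assuming general-purpose performance doubles every doubling period."""
    return initial_speedup / (2.0 ** (years / doubling_period_years))

# How much of an initial 5X/10X/20X lead survives a 5 year horizon?
for speedup in (5, 10, 20):
    print(speedup, "->", round(remaining_advantage(speedup, 5), 2))
# 5 -> 0.88, 10 -> 1.77, 20 -> 3.54
```

Under these assumptions, a 5X part is actually slower than mainstream hardware after five years, and even a 10X part retains less than a 2X edge – which is why 10X is a floor, not a target.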
The first paper is from Fujitsu on the latest SPARC64 microprocessor, which includes vector instructions and hardware for HPC applications – although likely not full gather/scatter functionality. The software model seems to be relatively clear, relying on Fujitsu and Sun’s vertical control of the platform (OS, libraries, compilers). Fujitsu will probably follow in the footsteps of AltiVec, SSE and other vector extensions. For real world applicability and performance, flexible vector permutation is essential and a feature to ask about.
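For readers unfamiliar with the operation, a vector permute gathers lanes of a source vector according to an index vector in a single instruction (AltiVec’s vperm and SSE’s pshufb are the classic examples). A toy model in plain Python, with illustrative names:

```python
def vperm(src, idx):
    """Toy model of a one-instruction vector permute: pick lanes of
    `src` in the order given by `idx` (any order, duplicates allowed)."""
    return [src[i] for i in idx]

# Reverse the lanes of a 4-element vector in one "instruction"
print(vperm([10, 20, 30, 40], [3, 1, 2, 0]))  # [40, 20, 30, 10]
```

Without this capability, rearranging data for alignment, transposes or byte swaps costs many scalar operations, which is why its presence or absence matters so much for real workloads.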
Convey is presenting an HPC oriented system that uses a Virtex-5 FPGA as a coprocessor for Intel’s existing front-side bus systems. This is not a novel concept; it was first touted by AMD. However, the programming model and ecosystem for such co-processors are at a nascent stage and hardly standardized. Verilog is definitely not a high productivity language, and runs counter to the trend of software developers favoring high level languages (e.g. Python) for productivity. Hopefully, Convey will have done extensive work on a compiler, libraries, debuggers and all the assorted tools needed for developers. Another outstanding question is the upgrade path for future systems to use CSI and Nehalem processors.
Nallatech is also presenting an FPGA accelerator, but as a socket compatible solution for Intel servers rather than a bespoke system. This is a nice business model that avoids competing with the HPC powerhouses of the world – HP, IBM, Sun and others – and presumably co-opts them as partners and resellers. Nallatech works with Mitrionics, and by virtue of its age hopefully has a somewhat robust software stack. A product that is socket compatible with Intel’s Nehalem would be a welcome roadmap entry.
Last is a presentation from Sun on next generation cryptography hardware in SPARC processors – inviting comparison with both prior generations and Intel’s SSE 4.2. Since the entire stack is controlled by Sun, the software model is relatively obvious – enable the compilers, OS, libraries, etc.
The second day’s keynote comes from Rich Hilleman of EA and is followed by a session with three embedded SOC presentations and one on MEMS-based oscillators.
A late afternoon session discusses new FPGAs from the two powerhouses: Xilinx and Altera. More intriguing is a presentation from SiliconBlue, a start-up focusing on low-power FPGAs. FPGAs and low power don’t typically go together, with ASICs often having an order of magnitude advantage or more. From a technical angle, SiliconBlue bears watching to understand how they improve power efficiency relative to existing FPGAs and ASICs. One complication is that quantitative FPGA vs. ASIC comparisons are almost inevitably skewed and unfair (e.g. a 40nm FPGA is probably comparable to a 130nm or 90nm ASIC, not a 40nm ASIC).
The last session is the one most likely to live up to the name of the conference – high-end server processors. Sun will start with Rainbow Falls, the third generation of the Niagara line, presumably manufactured in a 45nm process at TI. It is unclear whether the microarchitecture of the individual processor cores has changed substantially from Niagara II; a little innovation there would be welcome – and Oracle customers would certainly appreciate more ILP. More likely the focus is on improving the system architecture to be more scalable and better suited to enterprise workloads – existing multi-socket Niagara-based servers from Sun do not increase memory bandwidth or capacity when adding more sockets. This is problematic as the memory capacity per core and bandwidth per core decrease – additionally, workloads with large memory footprints cannot be addressed simply by adding more processors.
New Information on POWER7
Last of all, IBM will make the first public disclosures on the POWER7 in two separate presentations. Presumably the first focuses on the processor microarchitecture, while the second addresses the system interfaces and architecture. We have already sussed out several of the features that IBM will announce, as discussed below.
The POWER7 will be a substantial departure from the POWER6, focusing less on frequency and more on power efficiency through multiple cores. In many respects, this change parallels Intel’s “right hand turn” in 2005 that doomed the P4. Both IBM and Intel claimed at ISCA 29 that peak performance (ignoring power) was achieved by high frequency designs with 10-20 FO4 delays per pipe stage. While the P4 and POWER6 are not direct results of this research, the two chips were unarguably influenced by this line of thinking. Both microarchitectures were built from the ground up for remarkable frequencies, which they achieved at the cost of power efficiency. Hopefully IBM’s return to a more balanced design will be accompanied by disclosure of power draw and thermal dissipation numbers for the POWER7, and a comparison to the prior generation.
Since early 2007, there have been consistent and reliable rumors that the POWER7 will use on-die eDRAM for the last level cache. Based on presentations at ISSCC, eDRAM should have roughly twice the density of IBM’s SRAMs. It seems likely that the POWER7’s L3 cache will be around 16MB of eDRAM. This will hopefully reduce the need for external bandwidth, as the POWER6 systems will be very hard to improve upon; 300GB/s is just a tremendous amount of I/O, period.
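The density argument can be reduced to back-of-envelope arithmetic. Assuming, for illustration only, that the same die area would hold an 8MB SRAM cache (the baseline is our assumption, not a disclosed figure):

```python
def edram_capacity_mb(sram_capacity_mb, density_ratio=2.0):
    """Capacity of an eDRAM cache in the die area of a given SRAM cache,
    given eDRAM's bits-per-area advantage (roughly 2X per ISSCC talks)."""
    return sram_capacity_mb * density_ratio

# Same area that fits 8MB of SRAM fits roughly this much eDRAM:
print(edram_capacity_mb(8))  # 16.0
```

This is how a rumored ~16MB L3 follows naturally from the ~2X density figure, without requiring a disproportionate slice of the die.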
Given the jump in core and thread counts, the microarchitecture will probably improve on-chip synchronization latency (e.g. locks). Last, it appears that there are specific features in the processor to enable a cluster of POWER7 systems to appear as a single global shared memory system.
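Why locks in particular? Every lock creates a serialization point that all threads must pass through one at a time, so as core and thread counts climb, the hardware cost of handing a lock from core to core dominates. A minimal sketch of that serialization point (a toy counter, not IBM’s mechanism):

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    """Increment the shared counter n times; every increment must
    pass through the lock, serializing all threads at that point."""
    global counter
    for _ in range(n):
        with lock:  # the serialization point locks create
            counter += 1

# 4 threads x 1000 increments each; the lock keeps the total exact
threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000
```

With dozens of hardware threads contending, lowering the latency of bouncing that lock line between cores directly improves throughput, which is presumably what POWER7 targets.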
The Conference Awaits
That concludes our preview of Hot Chips this year. Naturally, more details will appear at the conference itself and detailed reports will trickle out afterwards. Until then, we’ve provided a few juicy details to contemplate and many questions to ponder.