IBM Previews the POWER6
At the Microprocessor Forum, Dr. Brad McCredie of IBM continued to tease out particulars regarding the POWER6. The presentation covered many general microarchitecture features, but did not reveal many specific details; a full disclosure of the microarchitecture will likely have to wait until ISSCC next February. However, from the details that were revealed, it is clear that the POWER6 inherits many characteristics from its predecessors, yet makes substantial improvements in other areas.
The POWER6 is targeted to run at 4-5GHz and is fabricated on IBM’s 65nm SOI process with 10 layers of metal. Compared to the 90nm process, there is a 30% performance increase at a given power level, largely due to the use of dual stress liner technology. IBM’s 65nm process offers a 0.65um2 high performance SRAM cell, and a 0.45um2 cell for density. The array cells use a lower supply voltage than the logic to reduce power consumption. By all accounts, IBM heavily emphasized circuit design in the POWER6 as the means to increase frequency, while prior designs relied extensively on automated tools and logic design. This helps to explain how IBM was able to increase the frequency so dramatically, but it is still hard to believe that such optimizations were never made previously. Leaving a 2x performance boost on the table seems unconscionable from a competitive positioning point of view.
Like the previous two generations, the POWER6 targets large system environments, where system architecture makes a substantial difference. Each POWER6 MPU is implemented as a two-way CMP design, integrating two simultaneous multithreaded processor cores along with private per-core L2 caches on a 340mm2 die. For high-end models, four POWER6 MPUs will be packaged in a single multi-chip module (MCM), along with four L3 victim caches of 32MB each. Figure 1 below shows a high level comparison of the POWER5+ and POWER6 MPUs.
Figure 1: POWER5+ and POWER6 MPU Comparison
As the diagram indicates, the POWER6 has incredible bandwidth to feed the processors. At 5GHz, each MPU has roughly 300GB/s of bandwidth: about 80GB/s from the L3 cache, 75GB/s from the memory, 80GB/s across the intra-MCM busses, 50GB/s from remote processors, and 20GB/s from local I/O. Generally, the POWER6 doubles the bandwidth of POWER5+ systems, thanks to higher frequencies and some new interfaces. The non-core functions in the POWER6 all run at one half of the core frequency, in the 2-2.5GHz range, compared to roughly 0.8-1.15GHz for various POWER5+ processors. The POWER6 also hosts an additional memory controller and intra-MCM fabric link, and increases the I/O frequency from one third to one half of the CPU frequency. Each memory controller connects to memory using the third generation of IBM’s synchronous memory interface (SMI). Like Fully Buffered DIMMs, these SMI chips enable larger memory configurations and different memory types (typically older DDR variants for capacity or newer DDR2/3 for bandwidth). The memory controllers and L3 cache all have separate address and data busses (address busses are not shown in the above image), while the interconnect fabric and GX+ I/O bus multiplex addresses and data.
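As a quick sanity check on the figures above, the short Python sketch below adds up the quoted per-interface bandwidths and derives the non-core clock from the core clock. The function names are purely illustrative, and the last helper rests on an assumption the article does not confirm for each individual interface: that the interface is clocked at the non-core frequency.

```python
# Per-MPU bandwidth figures quoted above for a 5GHz POWER6, in GB/s.
BANDWIDTH_GBPS = {
    "L3 cache": 80,
    "memory": 75,
    "intra-MCM busses": 80,
    "remote processors": 50,
    "local I/O": 20,
}


def total_bandwidth_gbps(components):
    """Sum the per-interface bandwidth figures."""
    return sum(components.values())


def noncore_frequency_ghz(core_ghz):
    """Non-core logic runs at one half of the core frequency."""
    return core_ghz / 2


def bytes_per_noncore_cycle(bandwidth_gbps, core_ghz):
    """Implied bytes moved per non-core cycle.

    Assumption (not stated in the article): the interface in question
    is clocked at the non-core frequency.
    """
    return bandwidth_gbps / noncore_frequency_ghz(core_ghz)


print("total GB/s:", total_bandwidth_gbps(BANDWIDTH_GBPS))
print("non-core GHz at 5GHz core:", noncore_frequency_ghz(5.0))
print("implied L3 bytes/cycle:",
      bytes_per_noncore_cycle(BANDWIDTH_GBPS["L3 cache"], 5.0))
```

The individual figures sum to 305GB/s, which is consistent with the rounded 300GB/s total quoted for each MPU.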
The system architecture for the POWER6 has been entirely redesigned and is far more elegant than its predecessor. For larger systems, the POWER5 used a pair of uni-directional rings for intra-MCM traffic, while inter-MCM traffic was routed over a 2D mesh. As Figure 2 below indicates, the POWER6 uses a two tier architecture and a new coherency protocol to match. Each POWER6 MCM forms a single ‘cell’, and up to eight cells are arranged in a fully connected network. The new system architecture has lower and more consistent latencies. Low latency is essential for performance, and consistent latency is substantially easier for operating systems to manage, especially Linux. For POWER6 systems, there are three levels of latency: MPU local, MCM local and remote. In comparison, in large POWER5+ systems, remote accesses could be anywhere from 1-4 inter-MCM hops and 0-2 intra-MCM hops away. Figure 2 shows this, using different colors for different levels of latency: blue indicates the local MPU, lavender indicates MPUs on the same MCM, and green, tan, orange and red indicate 1-4 MCM hops. Another advantage of a cell based architecture is that each individual cell can be taken offline without impacting the others, which improves the availability and serviceability of the system.
Figure 2: POWER5+ and POWER6 System Architecture and Latency
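The three latency levels of the POWER6 topology can be illustrated with a small Python model. This is only a sketch under the configuration described in the text (four MPUs per MCM, up to eight fully connected MCMs); the `(mcm, mpu)` addressing scheme and the function names are hypothetical and do not reflect any IBM interface.

```python
from itertools import combinations, product

MPUS_PER_MCM = 4  # high-end packaging described in the article
MAX_MCMS = 8      # up to eight cells in a fully connected network


def latency_level(src, dst):
    """Classify an access between two MPUs, each given as (mcm, mpu)."""
    src_mcm, src_mpu = src
    dst_mcm, dst_mpu = dst
    if src_mcm != dst_mcm:
        # Fully connected cells: any other MCM is exactly one hop away.
        return "remote"
    if src_mpu != dst_mpu:
        return "MCM local"
    return "MPU local"


def fully_connected_links(num_cells):
    """Number of point-to-point links needed to connect every cell pair."""
    return len(list(combinations(range(num_cells), 2)))


# Enumerate every (source, destination) pair in a maximum configuration.
mpus = list(product(range(MAX_MCMS), range(MPUS_PER_MCM)))
levels = {latency_level(s, d) for s in mpus for d in mpus}

print(sorted(levels))
print(fully_connected_links(MAX_MCMS))
```

However large the system, the set of latency levels never grows beyond three, which is the property that makes placement decisions simpler than on a mesh where the hop count varies.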
From the start, IBM has designed the POWER6 systems to be extremely configurable. The intra-node busses, which normally operate at 8 bytes/cycle, can be cut down to 2 bytes/cycle for low-end systems, and the inter-node busses can also operate at 4 bytes/cycle. Similarly, the two integrated memory controllers can both operate at half-width, and one of them can be removed entirely. The external L3 caches are optional, and are available either in the MCM or in an external configuration. IBM claims all these options exist to provide different price/performance models, in order to better serve customers. Obviously, some workloads may not cache well at all, and customers could order the stripped down parts to save money. Another factor may be that IBM is attempting to increase its yields by re-using devices that encountered manufacturing errors. For instance, if an L3 cache is incorrectly bonded into the MCM, the module could then be repackaged as a ‘value’ product.
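To give a feel for what the narrower intra-node bus option costs, the Python sketch below compares the 8 bytes/cycle and 2 bytes/cycle widths. It is a simplification: the helper names are illustrative, and the peak figure assumes one transfer per bus cycle on a bus clocked at 2.5GHz (half of a 5GHz core clock), which the presentation did not confirm for these links.

```python
FULL_INTRA_NODE_WIDTH = 8     # bytes/cycle, normal configuration
LOW_END_INTRA_NODE_WIDTH = 2  # bytes/cycle, low-end configuration


def relative_bandwidth(width, full_width=FULL_INTRA_NODE_WIDTH):
    """Fraction of full-width bandwidth left at a reduced bus width."""
    return width / full_width


def peak_link_bandwidth_gbps(width_bytes, bus_ghz):
    """Peak per-link bandwidth in GB/s.

    Assumption: one transfer of `width_bytes` per bus cycle.
    """
    return width_bytes * bus_ghz


ASSUMED_BUS_GHZ = 2.5  # assumption: half of a 5GHz core clock

for width in (FULL_INTRA_NODE_WIDTH, LOW_END_INTRA_NODE_WIDTH):
    print(width, "bytes/cycle:",
          peak_link_bandwidth_gbps(width, ASSUMED_BUS_GHZ), "GB/s peak,",
          relative_bandwidth(width), "of full width")
```

Under these assumptions the low-end option keeps one quarter of the per-link bandwidth, a reasonable trade for systems with few sockets and little cross-chip traffic.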