Changes in the Core
While the microarchitecture of the POWER6 differs from previous generations, there is no question that it descends from the POWER4 core first announced in 2000. IBM is expected to discuss the microarchitecture much more thoroughly at ISSCC and/or Hot Chips next year, but in the meantime, some details came out in this presentation. IBM claims that the POWER6 will approximately double the POWER5’s performance. This was accomplished by doubling the frequency and bandwidth while maintaining the same pipeline depth, along with a host of incremental microarchitectural tweaks.
The basic pipeline for the POWER6 has the same number of stages as the POWER5, but the stages have been rebalanced across the different phases. Most significantly, dependent ALU operations can now execute back to back, eliminating a vexing kludge in the original POWER4/5 architecture. This makes out-of-order scheduling easier, and is probably the reason that the instruction issue/dispatch phase takes 2 cycles in the POWER6 (compared to 4 in the POWER5). The presentation hinted at other changes, but did not elaborate further.
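To put a rough number on why back-to-back execution of dependent ALU operations matters, here is a toy model of a chain of dependent single-cycle operations. The function name and the `issue_gap` parameter are hypothetical, and the 2-cycle gap used for comparison is an illustrative assumption rather than a figure from the presentation.

```python
def chain_cycles(num_ops: int, issue_gap: int) -> int:
    """Cycles needed to finish a chain of dependent single-cycle ALU ops.

    issue_gap is the minimum number of cycles between a producer issuing
    and its dependent consumer issuing (1 means back-to-back execution).
    This is a toy model, not a description of the real POWER6 pipeline.
    """
    if num_ops < 0 or issue_gap < 1:
        raise ValueError("num_ops must be >= 0 and issue_gap >= 1")
    if num_ops == 0:
        return 0
    # The first op issues at cycle 0, each later op waits issue_gap cycles
    # after its producer, and the last op needs one cycle to execute.
    return (num_ops - 1) * issue_gap + 1


if __name__ == "__main__":
    for gap in (1, 2):  # gap=2 is an assumed value for comparison only
        print(f"issue_gap={gap}: {chain_cycles(8, gap)} cycles for 8 dependent ops")
```

In this model, a long dependency chain takes nearly twice as many cycles once a single bubble separates each producer from its consumer, which is why removing that bubble helps serial integer code in particular.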
The POWER6 L1 data cache has been doubled to 64KB, and its associativity increased to 8 ways, as was disclosed at ISSCC earlier this year. As a result, the L1D latency has increased to 4 cycles, compared to 3 cycles for the POWER5 and most other high performance MPUs. As we speculated earlier, the POWER6 includes two 4MB private L2 caches. While the caches are private, there is a cast-out buffer, which facilitates rapid communication between the two without touching the L3 cache or main memory. It is widely acknowledged that shared caches offer higher performance, all things being equal. However, in the case of the POWER6, all things were not equal. In particular, physical design considerations trumped microarchitectural elegance. A single 8MB L2 cache would be too large to probe in the target access time at the desired bandwidth; hence the cache was split in half. The L3 cache was also improved by eliminating the sectoring technique used in the POWER5+; this increased the effective size of the L3 cache at minimal cost in actual die area. Many of these slight tweaks to the caches, especially greater associativity, are extremely beneficial for multithreaded execution, and helped IBM achieve an even greater speedup for SMT in the POWER6 than in its predecessors.
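The L1D figures above can be turned into cache geometry with a little arithmetic. The sketch below assumes a 128-byte cache line and a 4-way, 32KB POWER5 L1D for comparison; neither number is stated in the presentation, and the helper name is hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive sets, bytes per way and index bits for a set-associative cache."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("size must be a multiple of ways * line size")
    sets = size_bytes // (ways * line_bytes)
    if sets & (sets - 1) != 0:
        raise ValueError("number of sets must be a power of two")
    return {
        "sets": sets,
        "way_bytes": size_bytes // ways,
        "index_bits": sets.bit_length() - 1,  # log2(sets)
    }


LINE_BYTES = 128  # assumed line size, not taken from the presentation

if __name__ == "__main__":
    power6_l1d = cache_geometry(64 * 1024, ways=8, line_bytes=LINE_BYTES)
    power5_l1d = cache_geometry(32 * 1024, ways=4, line_bytes=LINE_BYTES)  # assumed 4-way
    print("POWER6 L1D:", power6_l1d)
    print("POWER5 L1D:", power5_l1d)
```

Under these assumptions both caches have the same number of sets and the same 8KB per way, so the added capacity comes entirely from the extra ways rather than from more index bits.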
The POWER6 appears to preserve all the functional units from the previous generation, but also adds hardware support for binary-coded decimal (BCD) and the AltiVec extensions. According to IBM, slightly over half of all their user data is in BCD form, which seems quite reasonable given common workloads for IBM’s System i and p users. Most RISC architectures pushed BCD support into software libraries provided by the system vendor, which is consistent with the philosophy espoused by early projects such as the 801, MIPS, and RISC-I and RISC-II. However, IBM’s POWER architecture has always been more CISC-like than competing architectures. To support the upcoming IEEE 754R standard, which governs decimal floating point, IBM added around 50 new instructions and a decimal FPU. All the basic instructions (add, multiply and divide) are represented, along with scaling, conversion and other key functions. The new decimal functional unit shares the FP registers, as well as the FP status and control registers. The unit is effectively quad precision, offering up to 36 digits of accuracy in 144 bits, although results are compressed to 128 bits to fit in two floating point registers and then decompressed before consumption. Basic operations are somewhat slower than ALU operations, with single cycle throughput but 2 cycle latency. While IBM did not provide any measured performance benefits, it estimated for a telecommunications billing benchmark that BCD support could improve performance by 7x, 4x or 2x, compared against Java, C/C# or assembly libraries respectively. Similarly, an AltiVec execution unit has been added to the POWER6, although that particular bit of microarchitecture is relatively well documented in the PPC970 and other processors.
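The 36-digit, 144-bit figure is consistent with one decimal digit per 4-bit nibble, as in packed BCD. The sketch below illustrates that encoding, along with the kind of rounding problem that pushes billing code toward decimal arithmetic. The function name is hypothetical, and nothing here models the POWER6 decimal unit's actual internal format or its 128-bit compressed register form.

```python
from decimal import Decimal


def to_packed_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits into bytes, two digits per byte."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of ASCII digits")
    if len(digits) % 2:
        digits = "0" + digits  # pad to a whole number of bytes
    # High nibble holds the first digit of each pair, low nibble the second.
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


if __name__ == "__main__":
    packed = to_packed_bcd("9" * 36)
    print(len(packed) * 8, "bits for 36 digits")

    # Binary floating point cannot represent most decimal fractions exactly,
    # which is why currency code often relies on decimal libraries instead.
    print(0.10 + 0.20 == 0.30)
    print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))
```

Software decimal libraries perform this kind of digit-by-digit work in many instructions, which is the overhead a dedicated decimal unit is meant to remove.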