POWER4: Big Blue CPU Gets a “Mini-me” Sidekick
The conservative, highly automated design approach IBM chose for its POWER4 high-end server processor has now paid off twice. The POWER4 product line was rolled out on schedule more than a year ago, and now Big Blue is the first 64-bit MPU vendor to deliver a 0.13 µm product, POWER4+. The downside of IBM’s microprocessor design methodology is that, compared to a more traditional full-custom approach, some of the potential performance of the process is left on the table in a trade-off for faster and cheaper development. For example, the top clock rate of POWER4 is only 50 MHz higher than that of the Alpha EV68, despite having a basic execution pipeline roughly twice as long and intrinsically faster logic from its SOI process.
The POWER4 packs two CPUs along with 1.4 MB of L2 cache (1.5 MB in POWER4+), as well as an L3 cache controller, a memory controller, and interprocessor communications functionality, onto a 415 mm² die. To accomplish this, something had to give, and that something was processor efficiency. Unlike other out-of-order RISC processors such as the PA-8x00, EV6x, and MIPS R1xk, the POWER4 doesn’t track individual instructions. Instead it collects up to 5 instructions at a time and then dispatches, tracks, and retires them as a group. The processor only preserves machine state at group boundaries, so an exception causes the machine to be backed up to the oldest group prior to the exception. This is less cumbersome and complex than tracking individual instructions, but it comes at a significant cost. This cost can be divided into three main components: group formation overhead, grouping restrictions, and lost opportunities for parallelism.
The group formation overhead is quite easily observed. The POWER4 uses 6 pipeline stages to decode instructions, crack them (break down complex PowerPC instructions such as load with update into separate, simpler primitive load and add instructions), and form them into groups. In comparison, the EV6x performs decoding in a single pipe stage. The extra pipe stages increase the branch misprediction penalty and reduce performance.
A POWER4 CPU can issue and/or retire one group per clock cycle. A group may have up to 5 instructions, but in practice restrictions on the way a group can be assembled mean that groups will on average carry fewer than 5 instructions. For example, the fifth slot in a group is reserved for branches. If a sequence of instructions is being packed into a group and the third instruction is a branch, then the group is padded with NOPs in the third and fourth slots and the branch is inserted into the fifth. If the primitive instructions from a cracked native instruction can’t all fit in a group, then the group is padded out with NOPs and a new group is started. If an instruction sets a non-renamed architectural register, then the group is padded out with NOPs. Instructions that cannot be executed speculatively are executed serially as single-instruction groups. Before a group can be dispatched, all the resources needed to support all the instructions in that group must be available. This is more restrictive than in other out-of-order RISC processors, in which resource dependencies are resolved instruction by instruction.
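The padding rules described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not IBM’s actual dispatch logic: the `Op` class, `form_groups`, and `slot_utilization` are hypothetical names invented for illustration, and the model covers only the branch-slot rule and the rule that a cracked instruction must fit in one group. It ignores non-renamed registers, non-speculative instructions, and resource availability.

```python
from dataclasses import dataclass

GROUP_SIZE = 5    # slots per dispatch group
BRANCH_SLOT = 4   # zero-based index of the fifth slot, reserved for branches


@dataclass(frozen=True)
class Op:
    """Hypothetical instruction record used only by this sketch."""
    name: str
    is_branch: bool = False
    parts: int = 1  # primitives produced when the instruction is cracked


NOP = Op("nop")


def form_groups(ops):
    """Pack a sequence of Ops into padded five-slot groups (simplified model)."""
    groups, current = [], []

    def close():
        # Pad the open group out with NOPs and start a fresh one.
        if current:
            groups.append(current + [NOP] * (GROUP_SIZE - len(current)))
            current.clear()

    for op in ops:
        if op.is_branch:
            # NOP-fill up to the fifth slot; the branch ends the group.
            current.extend([NOP] * (BRANCH_SLOT - len(current)))
            current.append(op)
            close()
        else:
            if op.parts > BRANCH_SLOT:
                raise ValueError("model assumes at most four primitives per instruction")
            # All primitives of a cracked instruction must share one group.
            if len(current) + op.parts > BRANCH_SLOT:
                close()
            current.extend([op] * op.parts)
    close()
    return groups


def slot_utilization(groups):
    """Fraction of slots holding real work rather than NOP padding."""
    used = sum(op != NOP for group in groups for op in group)
    return used / (GROUP_SIZE * len(groups))


if __name__ == "__main__":
    example = [Op("add"), Op("lwz"), Op("bc", is_branch=True), Op("add")]
    for group in form_groups(example):
        print([op.name for op in group])
    print(slot_utilization(form_groups(example)))
```

Feeding the model two simple instructions followed by a branch reproduces the case in the text: the branch lands in the fifth slot with NOPs in the third and fourth. Even this toy version shows how quickly padding erodes the nominal five-wide dispatch rate.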
To get an indication of how much performance the POWER4 loses to its distended pipeline and restrictive instruction dispatch and tracking system, its integer performance is compared to that of the Alpha EV68 in Table 2. Despite out-fetching, out-dispatching, and out-issuing the EV68, and despite having a 12:1 advantage in low-latency on-chip cache, a 2.5:1 advantage in the size of its out-of-order execution window, and a 4% clock rate advantage, the POWER4 can’t outperform it on SPECint2k, base or peak.
                                      POWER4               Alpha EV68
Fetch width (instructions)
Dispatch width (instructions)
Non-FP issue width (instructions)
L1 cache (I/D)                        64 KB / 32 KB        64 KB / 64 KB
L2 cache                              1.4 MB (on-chip)     16 MB (off-chip)
L3 cache                              32 MB (off-chip)
Max non-FP instructions in flight
Clock frequency (MHz)                 1300                 1250
SPECint2k (base / peak)               804 / 839            845 / 928
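As a rough cross-check of the comparison, the SPECint2k scores can be normalized by clock rate and the quoted 12:1 on-chip cache ratio can be recomputed. The sketch below rests on several assumptions that should be read as such: the clock rates of 1300 MHz and 1250 MHz are inferred from the article’s “50 MHz higher” and “4%” figures rather than stated outright, the scores 804 / 839 are attributed to POWER4 and 845 / 928 to the EV68 because the text says the POWER4 cannot outperform it, and the L1 sizes are attributed as 64 KB / 32 KB for POWER4 and 64 KB / 64 KB for the EV68. All variable and function names are invented for illustration.

```python
# Assumed clock rates: x + 50 = 1.04 * x  ->  x = 50 / 0.04 = 1250 MHz.
EV68_MHZ = 50 * 100 // 4          # 1250
POWER4_MHZ = EV68_MHZ + 50        # 1300

# (base, peak) SPECint2k scores, attributed as described in the lead-in.
SCORES = {
    "POWER4": (POWER4_MHZ, 804, 839),
    "EV68": (EV68_MHZ, 845, 928),
}


def per_mhz(name):
    """Return (base, peak) SPECint2k per MHz for the named processor."""
    mhz, base, peak = SCORES[name]
    return base / mhz, peak / mhz


def relative_efficiency():
    """POWER4 per-clock integer performance as a fraction of the EV68's."""
    p_base, p_peak = per_mhz("POWER4")
    e_base, e_peak = per_mhz("EV68")
    return p_base / e_base, p_peak / e_peak


# On-chip cache in KB: L1 I + L1 D (+ 1.4 MB of L2 on the POWER4 die).
POWER4_ONCHIP_KB = 64 + 32 + 1.4 * 1024
EV68_ONCHIP_KB = 64 + 64
CACHE_RATIO = POWER4_ONCHIP_KB / EV68_ONCHIP_KB

if __name__ == "__main__":
    base_rel, peak_rel = relative_efficiency()
    print(f"per-clock efficiency vs EV68: base {base_rel:.2f}, peak {peak_rel:.2f}")
    print(f"on-chip cache ratio: {CACHE_RATIO:.1f}:1")
```

Under these assumptions the POWER4 delivers roughly 87% to 91% of the EV68’s integer performance per clock, and the cache ratio comes out close to the 12:1 quoted in the text. Note that the POWER4’s L2 is shared by both CPUs on the die, so the per-CPU picture is less lopsided than the raw ratio suggests.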
Despite its architectural inefficiencies, the POWER4 is one of the most competitive MPUs to come out of IBM labs in a long time, thanks to the use of 2-way CMP and mainframe-class high-bandwidth data paths and system packaging technology. If IBM can continue to scale up clock frequency with each process shrink as fast as its full-custom competitors, it should be in good competitive shape for its intended market.
However, the same cannot be said for all users of the PowerPC instruction set. The G4 and G4+ processors Apple currently uses in its Macintosh line of desktop and laptop computers are hopelessly out-muscled by the latest x86 processors from Intel and AMD. Worse yet, the growing use of SSE in multimedia and content creation software has put a slow leak in Apple’s competitive life preserver, the Altivec SIMD PowerPC instruction set extension. By some strange coincidence, IBM has announced that it is developing the PowerPC 970, a desktop-class processor based on the POWER4 microarchitecture and extended with Altivec support. The relative die size and basic floorplans of the POWER4, POWER4+, and PowerPC 970 are shown in Figure 3.
Figure 4 Relative Size of POWER4, POWER4+, and PowerPC 970
The PowerPC 970 is a 0.13 µm SOI single-CPU device with 512 KB of on-chip L2 cache. IBM estimates its performance at 937 SPECint2k and 1051 SPECfp2k at 1.8 GHz. This performance level is a bit shy of the fastest desktop Pentium 4 processors shipping today, but it is nevertheless quite remarkable when you consider that it would be achieved by a 118 mm² device with 42 W typical power dissipation. Given the way IBM designs its MPUs, the 970 is remarkably compact and power efficient compared to heavily engineered products like the Pentium 4 Northwood. Semi-custom MPU design methodologies have obviously come a long way since Sun and TI rolled out the bloated, power-hungry, and slow MicroSPARC a decade ago. Even more intriguing for Apple is that the 970’s typical power consumption drops to 19 W at 1.2 GHz, which makes it a natural competitor to Intel’s Banias processor for high-end mobile applications and very small form factor and/or silent desktop PCs. Given the reduced design margin and greater market emphasis on clock frequency for desktop processors, it is also conceivable that IBM could turn out limited numbers of 970 MPUs clocked at 2 GHz or higher for high-end desktop Macs, an important psychological milestone in Apple’s struggle for survival in an increasingly x86-dominated PC world.
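The two power figures quoted for the 970 invite a back-of-the-envelope check. The sketch below assumes the textbook dynamic power relation, P proportional to f × V², and ignores leakage and any fixed power overhead, so the result is only indicative. Nothing here comes from IBM; the function name `implied_voltage_ratio` is invented for illustration.

```python
import math


def implied_voltage_ratio(p_low_w, f_low_ghz, p_high_w, f_high_ghz):
    """Voltage ratio V_low / V_high implied by P ~ f * V^2 (leakage ignored)."""
    power_ratio = p_low_w / p_high_w
    freq_ratio = f_low_ghz / f_high_ghz
    # P_low / P_high = (f_low / f_high) * (V_low / V_high)^2
    return math.sqrt(power_ratio / freq_ratio)


# Figures quoted in the text: 19 W at 1.2 GHz and 42 W at 1.8 GHz.
VOLTAGE_RATIO = implied_voltage_ratio(19, 1.2, 42, 1.8)

# Estimated integer performance per watt at 1.8 GHz (937 SPECint2k, 42 W).
SPECINT_PER_WATT = 937 / 42

if __name__ == "__main__":
    print(f"implied supply voltage ratio: {VOLTAGE_RATIO:.2f}")
    print(f"SPECint2k per watt at 1.8 GHz: {SPECINT_PER_WATT:.1f}")
```

Power falls to about 45% while frequency falls only to 67%, so under this simple model the 1.2 GHz operating point would run at roughly 82% of the 1.8 GHz supply voltage. That is consistent with a part designed to trade clock rate for voltage in mobile use, though the actual voltages are not given in the text.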