A Core Apple Would Die For
It is now quite apparent that the succession of glorified embedded control processors that powers Apple’s Macintosh line is the result of IBM’s unwillingness to focus serious resources on that market, rather than a lack of first-class MPU architects and circuit designers. The POWER4 processor core is a deeply pipelined, out-of-order execution superscalar RISC design that implements the 64-bit PowerPC instruction set architecture (ISA), but will also run POWER binaries. The more problematic POWER instructions that were dropped in the move to PowerPC are handled by a variety of means up to, and including, traps to software emulation routines.
Each POWER4 CPU can fetch and issue 8 instructions per clock cycle, although sustained throughput is limited by the ability to retire a maximum of 5 instructions per cycle. Up to 200 instructions can be in flight simultaneously. This compares favorably even to the venerable Alpha EV6x/EV7 core, which can fetch 4, issue 6, and retire 11 instructions per cycle and maintain up to 80 instructions in flight. The POWER4 core can issue up to 8 instructions per cycle to 2 fixed point (integer) units, 2 load/store units, 2 double precision FP multiply-add units, a branch resolution unit, and a condition code register (CR) execution unit. Following common practice, it is likely that the 2 load/store units are also capable of executing simple integer instructions, such as add, subtract, and bit-wise logical operations. The organization of the POWER4 processor core is shown in Figure 2 along with the Alpha EV6x/EV7 core for purposes of comparison. The blocks labeled ‘GCT’ and ‘Decode, Crack, and Group Formation’ seem to be related to POWER4’s strategy of grouping together up to 5 PowerPC instructions in a VLIW-like bundle for the purposes of tracking them through the out-of-order execution engine.
Figure 2 Comparison of POWER4 and EV6x Processor Organizations
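To make the width comparison concrete, here is a minimal Python sketch that treats the quoted fetch, issue, and retire widths as a simple bottleneck model and shows how bundling instructions into groups of up to 5 reduces the number of items to track. The class and function names are hypothetical, and the fixed-size slicing is only an illustration; the real group formation rules are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreWidths:
    """Per-cycle widths quoted in the text (illustrative container)."""
    name: str
    fetch: int
    issue: int
    retire: int
    in_flight: int

    def sustained_ipc_bound(self) -> int:
        # Sustained throughput cannot exceed the narrowest pipeline stage.
        return min(self.fetch, self.issue, self.retire)


POWER4 = CoreWidths("POWER4", fetch=8, issue=8, retire=5, in_flight=200)
EV6 = CoreWidths("Alpha EV6x/EV7", fetch=4, issue=6, retire=11, in_flight=80)


def form_groups(instructions, group_size=5):
    """Pack an instruction stream into bundles of up to `group_size`.

    Assumption: plain in-order slicing. Real hardware likely applies
    extra constraints that this sketch ignores.
    """
    return [instructions[i:i + group_size]
            for i in range(0, len(instructions), group_size)]


if __name__ == "__main__":
    for core in (POWER4, EV6):
        print(core.name, "sustained IPC bound:", core.sustained_ipc_bound())
    stream = [f"insn{i}" for i in range(12)]
    print("tracking entries needed:", len(form_groups(stream)))
```

Under this model the upper bound is an architectural ceiling only; real IPC depends on cache behavior, branch prediction, and the instruction mix.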
Each POWER4 CPU incorporates a 32 KB L1 data cache and a 64 KB L1 instruction cache. The data cache is triple ported with the ability to perform two load operations and one store operation every clock cycle. The L1 control logic supports hardware initiated prefetch for both the instruction cache and data cache, and permits up to 11 outstanding cache misses (8 to the data cache, 3 to the instruction cache). The path from the L1 to L2 caches provides in excess of 100 GB/s bandwidth and the ability to perform out-of-order loads. The CPU pair in the POWER4 device shares three independent L2 caches. Each L2 cache is 8-way set associative and ‘approximately’ 512 KB in size, and has its own controller that supports 4 outstanding L2 misses. The POWER4 device also includes an L3 cache controller and memory controller. The L3 cache is implemented as embedded DRAM (eDRAM) in a custom VLSI device. The total size of L3 is 128 MB, and IBM has not disclosed whether the L3 devices are located internal or external to the MCM.
It is difficult to predict with certainty how the IPC performance of the POWER4 will stack up to the EV6x/EV7. With highly complex and finely tuned computational racing engines like these, even minor details in the microarchitecture can have a seemingly disproportionate effect on performance in some or many applications. What we do know for certain is that the Alpha design was publicly disclosed more than 4 years ago, and has been shipping in systems for nearly two years. That certainly would have given the POWER4 architects ample opportunity to incorporate newer ideas, and derive lessons from the strengths and weaknesses of its older competitor. Following the same trend towards using deeper pipelining to achieve higher clock rates seen in the Pentium 4, the POWER4 core uses 14 stages in its basic (integer) pipeline. This compares to the 7 stages used in the Alpha EV6x/EV7. The minimum branch misprediction penalty is approximately 10 clock cycles versus 7 for the Alpha. Diagrams of the basic pipeline of the POWER4 and Alpha EV6x/EV7 are provided in Figure 3.
Figure 3 Comparison of POWER4 and EV6x Basic Execution Pipeline
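The cost of the longer misprediction penalty can be illustrated with a first-order cycles-per-instruction model. The Python sketch below is only that: the base IPC, branch fraction, and misprediction rate are assumed values chosen for illustration, not measurements of either processor. Only the penalties of 10 and 7 cycles come from the figures quoted above.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order model: add the average misprediction stall to base CPI."""
    cycles_per_instr = (1.0 / base_ipc
                        + branch_fraction * mispredict_rate * penalty_cycles)
    return 1.0 / cycles_per_instr


def break_even_mispredict_rate(ref_rate, ref_penalty, new_penalty):
    """Misprediction rate at which the deeper pipeline loses the same
    number of cycles per branch as the reference pipeline."""
    return ref_rate * ref_penalty / new_penalty


if __name__ == "__main__":
    # Assumed workload parameters (hypothetical, for illustration only).
    BASE_IPC = 2.0
    BRANCH_FRACTION = 0.20
    MISPREDICT_RATE = 0.05

    power4 = effective_ipc(BASE_IPC, BRANCH_FRACTION, MISPREDICT_RATE, 10)
    alpha = effective_ipc(BASE_IPC, BRANCH_FRACTION, MISPREDICT_RATE, 7)
    print(f"POWER4-like penalty (10 cycles): {power4:.3f} IPC")
    print(f"Alpha-like penalty (7 cycles):   {alpha:.3f} IPC")
    print("break-even mispredict rate:",
          break_even_mispredict_rate(MISPREDICT_RATE, 7, 10))
```

The break-even helper expresses the design pressure directly: with equal branch behavior, a 10-cycle penalty must be offset by a misprediction rate about 30% lower than that of a 7-cycle design.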
POWER and PowerPC proponents sometimes point out that the architecture’s autonomous branch unit front end and multiple condition code fields often allow branch misprediction to be avoided by calculating the branch condition ahead of time. That might have been true in the days of 4 and 5 stage pipelines and limited issue width, but in the case of the POWER4 it is difficult to conceive that this can be accomplished by PowerPC compilers, except in rare cases. Combined with the large mispredict penalty, this could have seriously compromised IPC (instructions per clock) performance relative to previous POWER and PowerPC processors.
Unsurprisingly, IBM outfitted the POWER4 with massively resourced branch prediction hardware. It uses three 16K entry tables: one for local history, one for global history, and one apparently to select between local and global prediction strategies. This appears similar to the ‘tournament’ predictor in the Alpha EV6x/EV7 core, although with much larger tables (the Alpha scheme uses two 1K tables and two 4K tables). It will be interesting to see what twist IBM has put on its predictor to avoid infringing on Compaq intellectual property (IP).