It’s the Memory, Stupid
But higher clock frequency and SMT are not the primary performance differentiators between the POWER4 and POWER5 architectures. Nearly ten years ago, when DEC MPU designers were considering their third generation RISC design, chief Alpha architect Dick Sites reportedly summarized his analysis of the biggest shortcoming of the second generation EV5 chip with the concise phrase “It’s the memory, stupid”. It is clear now that IBM’s server MPU architects consider Sites’s advice as relevant today as it was a decade ago. The most distinguishing characteristic of the POWER5 is the extent to which IBM improved upon the POWER4+’s cache and memory system hierarchy. These changes are listed below in Table 1.
| POWER4 | POWER5 |
L1 ICache | direct mapped | 2-way associative |
L1 DCache | 2-way associative | 4-way associative |
L2 Cache | 1.44MB, 8-way associative | 1.92MB, 10-way associative |
L3 Cache | 32MB, 8-way associative, 123 cycle latency, 1/3 CPU speed | 36MB, 12-way associative, 87 cycle latency, 1/2 CPU speed |
Memory | ~4 GB/s per device, 351 cycle latency | ~16 GB/s per device, 220 cycle latency |
Table 1 – Improvements to POWER5 Cache and Memory Hierarchy
The three levels of cache in the POWER5 all have higher degrees of associativity than their counterparts in the POWER4 to reduce miss rates. In addition, the on-chip L2 and external L3 are slightly larger. The biggest improvement to cache in the POWER5 is in the latency and bandwidth of the L3. Although it is still external to the processor, it now operates at 1/2 the processor clock rate in the POWER5, compared to 1/3 the processor clock rate in the POWER4. This change increases bandwidth by 50% and reduces latency by roughly a third.
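The 50% and "roughly a third" figures can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes L3 bandwidth scales linearly with the L3 clock at an equal core clock and unchanged bus width, which the article does not state explicitly, and all names in it are made up for the example.

```python
from fractions import Fraction

# L3 clock as a fraction of the core clock, from Table 1.
POWER4_L3_RATIO = Fraction(1, 3)
POWER5_L3_RATIO = Fraction(1, 2)

# L3 latency in processor cycles, from Table 1.
POWER4_L3_CYCLES = 123
POWER5_L3_CYCLES = 87


def percent_change(old, new):
    """Return the signed percentage change from old to new."""
    return float((new - old) / old * 100)


# Assumption: bandwidth is proportional to the L3 clock ratio.
l3_bandwidth_gain = percent_change(POWER4_L3_RATIO, POWER5_L3_RATIO)
l3_latency_change = percent_change(POWER4_L3_CYCLES, POWER5_L3_CYCLES)

print(f"L3 bandwidth at equal core clock: {l3_bandwidth_gain:+.0f}%")
print(f"L3 latency in cycles: {l3_latency_change:+.0f}%")
```

Measured in cycles the latency drop is a little under 30%. The reduction looks closer to a third or more once the higher POWER5 clock is factored in and the latency is expressed in nanoseconds.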
The memory system in the POWER5 has been greatly improved by integrating the memory controller into the processor. In the POWER4 architecture, the memory was controlled by a separate ASIC that was accessed by the processor through the external L3 device. This change provides separate and independent paths between the MPU and L3 cache and between the MPU and memory. It has the double benefit of increasing potential operational parallelism and bandwidth as well as significantly reducing latency. Instead of requiring six chip-to-chip transfers and four extraneous ASIC crossings to perform a memory read, the POWER5 device directly controls up to four independent channels of DDR or DDR2, albeit through intermediary buffer chips. In a sense, IBM is following the same path as the Alpha EV7 and AMD Opteron architects in extracting performance gains outside of the CPU core at the system level by bringing memory closer to the processor.
The results of these optimizations are astounding: memory bandwidth per device is quadrupled even as memory latency in terms of processor clocks falls by nearly 40%. The cache and memory performance of the Madison 6M, POWER4+, and POWER5 is provided in Table 2, along with speed and throughput performance as measured by Linpack, SPEC, SPEC_rate2k, and TpmC.
| Madison 6M | POWER4+ | POWER5 |
Frequency (GHz) | 1.5 | 1.7 | 1.9 |
L2 Latency | 5 cycles, 3.3 ns | 12 cycles, 7.1 ns | 13 cycles, 6.8 ns |
L3 Latency | 14 cycles, 9.3 ns | 123 cycles, 72.3 ns | 87 cycles, 45.8 ns |
Memory Latency | ~224 cycles, ~149 ns | 351 cycles, 206 ns | 220 cycles, 116 ns |
STREAM Copy BW, 4P | 5.07 GB/s | 8.37 GB/s | 17.9 GB/s |
Linpack GFLOP/s, N=1000, 1P | 5.43 | 3.88 | 5.6 est |
Linpack GFLOP/s, N=1000, 4P | 18.2 | 14.7 | 21 est |
SPECint_base2k | 1408 | 1077 | 1398 |
SPECfp_base2k | 2161 | 1598 | 2576 |
SPECint_rate_base2k, 4P | 63.4 | 48.4 | 74.4 |
SPECfp_rate_base2k, 4P | 82.2 | 66.5 | 125 |
TPC-C (TpmC), 4P | 136k | – | 194k |
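The nanosecond figures in the table follow from the cycle counts and the clock frequency of each column. As a rough sketch (the dictionary layout and function name are made up for illustration, and the columns are keyed by clock frequency rather than by chip name), the conversion looks like this:

```python
# Latencies in processor cycles, keyed by the clock frequency (GHz) of each
# column in the table above. The ~224-cycle figure is approximate in the source.
LATENCY_CYCLES = {
    1.5: {"L2": 5, "L3": 14, "Memory": 224},
    1.7: {"L2": 12, "L3": 123, "Memory": 351},
    1.9: {"L2": 13, "L3": 87, "Memory": 220},
}


def cycles_to_ns(cycles, freq_ghz):
    """At f GHz one cycle lasts 1/f ns, so latency_ns = cycles / f."""
    return cycles / freq_ghz


latency_ns = {
    freq: {level: cycles_to_ns(cycles, freq) for level, cycles in levels.items()}
    for freq, levels in LATENCY_CYCLES.items()
}

for freq, levels in latency_ns.items():
    row = ", ".join(f"{level} {ns:.1f} ns" for level, ns in levels.items())
    print(f"{freq} GHz: {row}")

# Memory latency change between the 1.7 GHz and 1.9 GHz columns.
cycles_drop = 1 - LATENCY_CYCLES[1.9]["Memory"] / LATENCY_CYCLES[1.7]["Memory"]
ns_drop = 1 - latency_ns[1.9]["Memory"] / latency_ns[1.7]["Memory"]
print(f"Memory latency: {cycles_drop:.0%} fewer cycles, {ns_drop:.0%} fewer ns")
```

The computed values agree with the published ones to within rounding. The drop in memory latency is larger in nanoseconds than in cycles because the clock frequency also rose between the two columns.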
The effect of lower latency and higher bandwidth is quite evident. Compared to the 1.7 GHz POWER4+, the 1.9 GHz POWER5 achieves a 30% higher SPECint_base2k score and a 61% higher SPECfp_base2k score with only a 12% higher clock frequency. The difference in throughput is even more dramatic because SPECrate2k testing on the POWER5 was run with 2N copies of the SPEC CPU suite on N processors to take advantage of its SMT capability. The unexpected magnitude of the absolute improvement in cache and memory latency and bandwidth is the primary reason why the POWER5 blew past most of my performance estimates in The Battle in 64 bit Land: Merchant Chips on the Rise.
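For readers who want to reproduce these percentages, here is a minimal sketch using the 1.7 GHz POWER4+ and 1.9 GHz POWER5 figures from the table. The variable names and the "per-clock" normalization are illustrative choices made for this example, not part of the original analysis.

```python
# Scores for the 1.7 GHz POWER4+ and 1.9 GHz POWER5, copied from the table.
POWER4_PLUS = {
    "clock_ghz": 1.7,
    "SPECint_base2k": 1077,
    "SPECfp_base2k": 1598,
    "SPECint_rate_base2k": 48.4,
    "SPECfp_rate_base2k": 66.5,
}
POWER5 = {
    "clock_ghz": 1.9,
    "SPECint_base2k": 1398,
    "SPECfp_base2k": 2576,
    "SPECint_rate_base2k": 74.4,
    "SPECfp_rate_base2k": 125,
}


def gain_pct(old, new):
    """Percentage by which new exceeds old."""
    return (new / old - 1) * 100


gains = {key: gain_pct(POWER4_PLUS[key], POWER5[key]) for key in POWER4_PLUS}
clock_ratio = POWER5["clock_ghz"] / POWER4_PLUS["clock_ghz"]

for key, gain in gains.items():
    if key == "clock_ghz":
        print(f"{key}: {gain:+.0f}%")
        continue
    # Illustrative normalization: the gain left after dividing out the clock increase.
    per_clock = ((1 + gain / 100) / clock_ratio - 1) * 100
    print(f"{key}: {gain:+.0f}% overall, {per_clock:+.0f}% per clock")
```

Dividing out the clock increase leaves a substantial gain in every score, which is consistent with the argument that the cache and memory changes, rather than frequency, account for most of the improvement. The rate scores also reflect the 2N-copy SMT runs described above.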