Sizing up the Super Heavyweights


It’s the Memory, Stupid

But higher clock frequency and SMT are not the primary performance differentiators between the POWER4 and POWER5 architectures. Nearly ten years ago, when DEC MPU designers were considering their third generation RISC design, chief Alpha architect Dick Sites reportedly summarized his analysis of the biggest shortcoming of the second generation EV5 chip with the concise phrase “It’s the memory, stupid”. It is clear now that IBM’s server MPU architects consider Sites’s advice as relevant today as it was a decade ago. The most distinguishing characteristic of the POWER5 is the extent to which IBM improved upon the POWER4+’s cache and memory system hierarchy. These changes are listed below in Table 1.

 

| | POWER4 | POWER5 |
|---|---|---|
| L1 ICache | direct mapped | 2-way associative |
| L1 DCache | 2-way associative | 4-way associative |
| L2 Cache | 1.44MB, 8-way associative | 1.92MB, 10-way associative |
| L3 Cache | 32MB, 8-way associative, 123 cycle latency, 1/3 CPU speed | 36MB, 12-way associative, 87 cycle latency, 1/2 CPU speed |
| Memory | ~4 GB/s per device, 351 cycle latency | ~16 GB/s per device, 220 cycle latency |

Table 1 – Improvements to POWER5 Cache and Memory Hierarchy

The three levels of cache in the POWER5 all have higher degrees of associativity than their counterparts in the POWER4 to reduce miss rates. In addition, the on-chip L2 and external L3 are slightly larger. The biggest improvement to cache in the POWER5 is in the latency and bandwidth of the L3. Although it is still external to the processor, it now operates at 1/2 the processor clock rate in the POWER5 compared to 1/3 the processor clock rate in the POWER4. This change increases bandwidth by 50% and reduces latency by nearly 30%, from 123 to 87 processor cycles.
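The L3 arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: it assumes the width of the L3 interface is unchanged between the two generations (the text does not say), so that bandwidth scales directly with the L3 clock, and the variable names are purely illustrative.

```python
from fractions import Fraction

# L3 clock as a fraction of the processor clock, taken from Table 1.
power4_l3_ratio = Fraction(1, 3)
power5_l3_ratio = Fraction(1, 2)

# Assumption: same interface width, so bandwidth scales with the L3 clock.
bandwidth_gain = power5_l3_ratio / power4_l3_ratio - 1

# L3 latency in processor cycles, taken from Table 1.
power4_l3_cycles = 123
power5_l3_cycles = 87
latency_reduction = 1 - power5_l3_cycles / power4_l3_cycles

print(f"L3 bandwidth gain:    {float(bandwidth_gain):.0%}")
print(f"L3 latency reduction: {latency_reduction:.0%}")
```

The latency figure is expressed in processor cycles; because the POWER5 also runs at a higher clock, the reduction measured in nanoseconds is larger still.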

The memory system in the POWER5 has been greatly improved by integrating the memory controller into the processor. In the POWER4 architecture, the memory was controlled by a separate ASIC that was accessed by the processor through the external L3 device. This change provides separate and independent paths between the MPU and L3 cache and between the MPU and memory. It has the double benefit of increasing potential operational parallelism and bandwidth as well as significantly reducing latency. Instead of requiring six chip-to-chip transfers and four extraneous ASIC crossings to perform a memory read, the POWER5 device directly controls up to four independent channels of DDR or DDR2, albeit through intermediary buffer chips. In a sense, IBM is following the same path as the Alpha EV7 and AMD Opteron architects in extracting performance gains outside of the CPU core at the system level by bringing memory closer to the processor.

The results of these optimizations are astounding – memory bandwidth per device is quadrupled even as memory latency in terms of processor clocks falls by nearly 40%. The cache and memory performance of the Madison 6M, POWER4+, and POWER5 are provided in Table 2 along with speed and throughput performance as measured by Linpack, SPEC, SPEC_rate2k, and TpmC.

 

| | Madison 6M | POWER4+ | POWER5 |
|---|---|---|---|
| Frequency (GHz) | 1.5 | 1.7 | 1.9 |
| L2 Latency | 5 cycles (3.3 ns) | 12 cycles (7.1 ns) | 13 cycles (6.8 ns) |
| L3 Latency | 14 cycles (9.3 ns) | 123 cycles (72.3 ns) | 87 cycles (45.8 ns) |
| Memory Latency | ~224 cycles (~149 ns) | 351 cycles (206 ns) | 220 cycles (116 ns) |
| STREAM Copy BW, 4P | 5.07 GB/s | 8.37 GB/s | 17.9 GB/s |
| Linpack GFLOP/s, N=1000, 1P | 5.43 | 3.88 | 5.6 est |
| Linpack GFLOP/s, N=1000, 4P | 18.2 | 14.7 | 21 est |
| SPECint_base2k | 1408 | 1077 | 1398 |
| SPECfp_base2k | 2161 | 1598 | 2576 |
| SPECint_rate_base2k, 4P | 63.4 | 48.4 | 74.4 |
| SPECfp_rate_base2k, 4P | 82.2 | 66.5 | 125 |
| TPC-C (TpmC), 4P | 136k | | 194k |

Table 2 – Comparison of Madison 6M, POWER4+, POWER5
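The nanosecond figures in Table 2 follow from the cycle counts and clock frequencies, since one cycle lasts 1/f ns when f is in GHz. The sketch below redoes that conversion; the dictionary layout and the helper name `cycles_to_ns` are illustrative choices, not anything from IBM or Intel documentation, and the results should match the table only to within rounding.

```python
# Clock frequency (GHz) and latencies (processor cycles) copied from Table 2.
chips = {
    "Madison 6M": {"ghz": 1.5, "L2": 5, "L3": 14, "memory": 224},
    "POWER4+": {"ghz": 1.7, "L2": 12, "L3": 123, "memory": 351},
    "POWER5": {"ghz": 1.9, "L2": 13, "L3": 87, "memory": 220},
}

def cycles_to_ns(cycles, ghz):
    """Convert a cycle count to nanoseconds: one cycle lasts 1/GHz ns."""
    return cycles / ghz

for name, chip in chips.items():
    times = {level: round(cycles_to_ns(chip[level], chip["ghz"]), 1)
             for level in ("L2", "L3", "memory")}
    print(name, times)

# Memory latency improvement from POWER4+ to POWER5, in cycles and in time.
p4, p5 = chips["POWER4+"], chips["POWER5"]
drop_cycles = 1 - p5["memory"] / p4["memory"]
drop_ns = 1 - (cycles_to_ns(p5["memory"], p5["ghz"])
               / cycles_to_ns(p4["memory"], p4["ghz"]))
print(f"memory latency drop: {drop_cycles:.0%} in cycles, {drop_ns:.0%} in ns")
```

Measured in cycles the memory latency drop is a little under 40%, as stated earlier; measured in absolute time it is larger, because the POWER5 cycles are also shorter.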

The effect of lower latency and higher bandwidth is quite evident. Compared to the 1.7 GHz POWER4+, the 1.9 GHz POWER5 achieves a 30% higher SPECint_base2k score and a 61% higher SPECfp_base2k score with only a 12% higher clock frequency. The difference in throughput is even more dramatic because SPECrate2k testing on the POWER5 was run with 2N copies of the SPEC CPU suite on N processors to take advantage of its SMT capability. The unexpected magnitude of the absolute improvement in cache and memory latency and bandwidth is the primary reason why the POWER5 blew past most of my performance estimates in The Battle in 64 bit Land: Merchant Chips on the Rise.
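The percentages quoted here can be recomputed from the Table 2 figures. The following is a small sketch with an illustrative helper, `percent_gain`; the numbers are copied from the POWER4+ and POWER5 columns of the table and nothing else is assumed.

```python
def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1) * 100

# Values copied from the POWER4+ and POWER5 columns of Table 2.
power4_plus = {
    "Frequency (GHz)": 1.7,
    "SPECint_base2k": 1077,
    "SPECfp_base2k": 1598,
    "SPECint_rate_base2k, 4P": 48.4,
    "SPECfp_rate_base2k, 4P": 66.5,
}
power5 = {
    "Frequency (GHz)": 1.9,
    "SPECint_base2k": 1398,
    "SPECfp_base2k": 2576,
    "SPECint_rate_base2k, 4P": 74.4,
    "SPECfp_rate_base2k, 4P": 125,
}

gains = {metric: percent_gain(power5[metric], power4_plus[metric])
         for metric in power5}

for metric, gain in gains.items():
    print(f"{metric:26s} +{gain:.0f}%")
```

The two rate metrics show the larger throughput gains, consistent with the point about SMT and 2N copies.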

