Pentium 4 Floats Away From Pentium III, but Latency Keeps It Earthbound
On floating point intensive code the situation is far more dramatic. As shown in Table 1, the P4 is pulling down SPECfp2k scores that are exceeded only by the fastest speed grade of the RISC performance king, the Alpha EV67. This is even more remarkable when you consider that Intel plans to transition its desktop MPU shipments to a majority of P4s by the end of next year, so we are talking about a processor that will be in the PC mainstream in rather short order. (In fairness it should be mentioned that the EV67 is a 0.25 um device, and much higher performance will be achieved when the 0.18 um Alpha devices, the EV68 and EV7, start shipping.)
As was the case with SPECint2000, it is not valid to assess the relative efficiency of the P4 and PIII cores on floating point intensive code at different frequencies by using simple clock normalization. Employing the same methodology as previously described, it is possible to extrapolate PIII SPECfp2000 performance to the clock frequency of the P4. The PIII SPECfp2000 performance/frequency scaling factor from 1000 MHz to 1133 MHz on the 820 platform using IRC 5.0 compilers is about 41%. That would put the SPECfp2000 score of a hypothetical 1.4 GHz PIII at no more than 362. In comparison, the 1.4 GHz P4 yields 538. That gives the P4 microarchitecture a relative IPC advantage of at least 49% (538 / 362 is about 1.49) on FP intensive code at 1.4 GHz. The SPECfp2000 performance and scaling of the PIII and P4 with frequency is shown in Figure 2.
Figure 2 SPECfp2000 Performance of PIII and P4 as a Function of Clock Rate
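The arithmetic behind the 1.4 GHz comparison can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes that the "scaling factor" means the fractional score gain divided by the fractional clock gain, applied linearly. Only the 538 and 362 figures come from the text above.

```python
def extrapolate_score(base_score, base_mhz, target_mhz, scaling_factor):
    """Linearly extrapolate a benchmark score to a higher clock.

    Assumption: each 1% of extra clock yields `scaling_factor` percent
    of extra score (e.g. 0.41 for the PIII figure quoted in the text).
    """
    clock_gain = target_mhz / base_mhz - 1.0
    return base_score * (1.0 + scaling_factor * clock_gain)


def relative_advantage(score_a, score_b):
    """Fractional advantage of score_a over score_b at the same clock."""
    return score_a / score_b - 1.0


# Figures quoted in the article for 1.4 GHz.
P4_SCORE_1400 = 538            # measured P4 SPECfp2000
PIII_EXTRAPOLATED_1400 = 362   # upper bound for a hypothetical PIII

advantage = relative_advantage(P4_SCORE_1400, PIII_EXTRAPOLATED_1400)
print(f"P4 advantage at 1.4 GHz: {advantage:.1%}")  # about 49% after rounding
```

Because both scores refer to the same clock rate, the ratio can be read as a per-clock (IPC) comparison, which is the point of extrapolating rather than normalizing by clock.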
The P4’s SPECfp2000 scores show that the double precision FP SSE2 SIMD instructions added to the x86 instruction set architecture have helped alleviate the large FP performance disadvantage associated with the x87 stack-based architecture. They also seem to vindicate AMD’s recent decision to forgo its TFP floating point architecture extensions in future products in favor of SSE2. Unfortunately for Intel, a large portion of PC floating point intensive applications have not been compiled to take advantage of SSE or SSE2. Ironically, the P4 will take its fair share of lumps for relatively mediocre performance on x87-based FP intensive code while laying the groundwork for SSE2-based software optimization that will eventually benefit AMD.
Another open question is, to what extent is the P4 pumped up by the 850 chipset? A comparison of the PIII operating on the dual Rambus channel 840 chipset with the PIII operating with the single Rambus channel 820 chipset gives some clue. At 1000 MHz the PIII gets a benefit of 3% on peak SPECint2k and 7% on peak SPECfp2k from the use of the 840 chipset instead of the 820. Unfortunately, that doesn’t tell the whole story because both chipsets have more potential memory bandwidth than the PIII system bus is capable of utilizing.
The primary difference between the 840 and 820, as far as performance goes, is the slightly lower latency of the 840: a 32-byte PIII cache line can be fetched with two 16-byte read operations issued in parallel on the 840, while the 820 requires two sequential 16-byte read operations. However, while the P4 system bus offers about three times the bandwidth of the PIII system bus, it likely also has greater latency. That’s because the bus uses 100 MHz bus cycles with four data beats per cycle to achieve its 400 MHz peak data transfer rate. So the P4 bus interface appears to the P4 processor core to operate with a 10ns timing granularity, compared to 7.5ns for the PIII. As a result, two-way communication between the P4 and the 850 suffers an average of 5ns more latency on every bus transaction (2 * 10ns versus 2 * 7.5ns).
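The bus timing figures above follow from simple arithmetic, sketched here in Python. The function names are illustrative only, and the model assumes exactly one bus cycle of granularity in each direction, which is the simplification the text itself uses.

```python
def cycle_time_ns(bus_clock_mhz):
    """Length of one bus cycle in nanoseconds."""
    return 1000.0 / bus_clock_mhz


def peak_transfer_rate_mhz(bus_clock_mhz, beats_per_cycle):
    """Effective data transfer rate for a multi-pumped bus."""
    return bus_clock_mhz * beats_per_cycle


def round_trip_penalty_ns(cycle_a_ns, cycle_b_ns):
    """Extra latency of bus A over bus B for a two-way exchange,
    assuming one cycle of timing granularity per direction."""
    return 2 * cycle_a_ns - 2 * cycle_b_ns


p4_cycle = cycle_time_ns(100.0)   # 100 MHz P4 bus clock -> 10 ns
piii_cycle = 7.5                  # PIII figure quoted in the text

print(peak_transfer_rate_mhz(100, 4))               # 400 (quad-pumped)
print(round_trip_penalty_ns(p4_cycle, piii_cycle))  # 5.0 ns
```

The sketch makes the trade-off explicit: quadrupling the data beats raises bandwidth without shortening the bus cycle, so the latency granularity is set by the 100 MHz clock rather than the 400 MHz transfer rate.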