Who Should Buy Pentium 4 in the Near Future?
The birth of a new core microarchitecture is always an uncertain time, because the design it replaces is very mature, both in its MPU implementation and in the optimization of the applications that run on it. Historically, new cores have often been criticized for providing little clear-cut improvement over the designs they replace, especially since the new chip on the block is invariably more expensive, has a larger die, and draws more power. Legacy applications often run sub-optimally on the new design because the new processor needs program code to be organized in a specific fashion, or certain code sequences to be chosen over others, for best performance. This was true when the first 5 Volt Pentiums appeared on the scene and programs compiled for the 486 were often a poor match for the new dual-issue superscalar design. And it was true when the Pentium Pro first appeared and mainstream PC operating systems and applications still contained substantial portions of 16-bit code, which caused performance problems for a design optimized for 32-bit operation.
It appears that the P4 is no different from Intel’s previous two new core introductions in that respect. Preliminary benchmarks show that on many existing applications the P4 either has little clear advantage over Pentium III or K7 systems, or even falls behind despite its nominally large clock rate advantage. It is clear from the Pentium 4 software optimization guide that to get the best performance out of the new design on integer- and FP-intensive applications, it is necessary to recompile them using the appropriate coding strategies. These include favoring certain integer instructions (add, subtract, bit-wise logical) over certain others (shift and rotate), and employing SIMD SSE and SSE2 FP instructions as much as possible. A 256 KB on-chip L2 cache is not particularly generous these days, and the 0.18 µm P4 design exacerbates this by quadrupling the cache line size to 128 bytes. This is particularly painful for integer code: at the same capacity, quadrupling the line size cuts the number of cache lines by 75% while retaining the same 8-way associativity as the PIII’s L2 cache, which increases the potential for cache thrashing in the face of relatively random memory accesses. For the P4, the 128 byte L2 cache line size is a case of short-term pain for long-term gain, in the form of improved performance in streaming and FP-intensive applications. The P4 also includes a hardware prefetch mechanism, although it is unclear to what extent it is being used and what effect it has on application performance.
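The cache line arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 256 KB capacity, 128 byte line size, and 8-way associativity come from the text, while the 32 byte PIII line size and the equal PIII capacity are assumptions inferred from "quadrupling" and the 75% figure. The helper name `l2_geometry` is made up for illustration.

```python
def l2_geometry(cache_bytes, line_bytes, ways):
    """Return (total_lines, sets) for a set-associative cache."""
    total_lines = cache_bytes // line_bytes
    sets = total_lines // ways
    return total_lines, sets


CACHE_BYTES = 256 * 1024  # 256 KB L2, as stated in the text
WAYS = 8                  # 8-way associativity for both designs

# Assumption: the PIII L2 has the same capacity and 32 byte lines
# (inferred from "quadrupling cache line size to 128 bytes").
p3_lines, p3_sets = l2_geometry(CACHE_BYTES, 32, WAYS)
p4_lines, p4_sets = l2_geometry(CACHE_BYTES, 128, WAYS)

# Fraction of cache lines lost by moving from 32 to 128 byte lines.
reduction = 1 - p4_lines / p3_lines

print(f"PIII-style L2: {p3_lines} lines in {p3_sets} sets")
print(f"P4-style L2:   {p4_lines} lines in {p4_sets} sets")
print(f"Reduction in line count: {reduction:.0%}")
```

With associativity held at 8 ways, the number of sets falls in the same proportion as the number of lines, so scattered addresses compete for a quarter as many places to land. That is the thrashing risk the paragraph describes for integer code.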
The initial Willamette implementation of Intel’s new x86 microarchitecture is a transition product in the same way that the 5 Volt Pentium and the Pentium Pro were. As such, it appeals mainly to power users who have the flexibility to recompile their applications, or who can at least be assured that their critical application runs well on the P4. For them, even the initial P4 provides a remarkable opportunity to achieve integer, floating point, and memory bandwidth performance similar to that of contemporary high-end Alpha RISC-based systems at a fraction of the cost. For the majority of PC buyers, however, existing PIII and K7 systems are nearly as good (and often better) and are available at a lower price. There are also system upgradability issues related to the short-lived 423 pin socket interface, and the high cost and limited availability of Direct Rambus (DRDRAM) memory modules (RIMMs). These will inspire many to wait for the P4, along with its infrastructure and software, to mature before adopting it for mainstream use.