IA64: The Phantom Menace Materializes
Intel’s first MPU implementing the complex and arcane IA64 instruction set finally shipped last month, after hanging over the heads of high-end RISC processor vendors for seven years like a sword of Damocles. Sold under the name Itanium, it was widely known as Merced during a painfully drawn-out and problem-filled development cycle that inspired countless jokes and endless speculation about what went wrong with the project. Sometimes called Unobtainium, the name of a mythical element that aerospace engineers have for decades cynically invoked as the solution to any intractably difficult problem, the Itanium seemed doomed never to get any respect. But Itanium has turned out to be a partial technical success. In its most powerful configuration, with an 800 MHz clock rate and 4 MB of integrated L3 cache, it turns in a mediocre 370 SPECint_base2k but an industry-leading 711 SPECfp_base2k.
It is no accident that the integer performance of both the Itanium and the UltraSPARC-III trails that of their competitors, RISC and CISC alike. These are the only in-order execution processors in a field of out-of-order execution processors. Nevertheless, it is surprising that Itanium shows no apparent benefit from the plethora of architectural enhancements that differentiate EPIC from classical VLIW and that were promoted as providing much of the benefit of out-of-order execution without the implementation costs. These additions include explicit speculative execution, deferred exception processing, advanced loads, register rotation, full predication, and explicit software control of caching and branch prediction strategy on an instruction-by-instruction basis. That the Itanium needs a 50 MHz clock rate advantage to equal the US-III/750 in SPECint_base2k suggests the EPIC features in IA64 currently do little to enhance integer performance. It follows that these architectural enhancements are 1) poorly implemented in Itanium, 2) ineffectively used by current compilers, or 3) inherently ineffective in real programs. It is likely that all three factors are in play, but without more examples of IA64 processors and compilers to go by it is difficult to assess their relative contributions to Itanium’s poor integer performance.
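The clock rate argument can be made concrete with a little arithmetic. The following is a minimal sketch, not a benchmark: the Itanium figures are the ones quoted above, while the US-III/750 score is simply assumed to equal Itanium's, as the text states, and is not an independently sourced result. The function name spec_per_mhz is made up for illustration.

```python
def spec_per_mhz(score, clock_mhz):
    """Return a benchmark score normalized by clock rate (score per MHz)."""
    return score / clock_mhz

# Figures quoted in the article.
ITANIUM_SPECINT = 370   # SPECint_base2k at 800 MHz
ITANIUM_MHZ = 800
US3_MHZ = 750           # UltraSPARC-III/750

# Assumption: the article says the two parts are equal on SPECint_base2k,
# so the US-III score is set to the Itanium score rather than looked up.
US3_SPECINT = ITANIUM_SPECINT

itanium_eff = spec_per_mhz(ITANIUM_SPECINT, ITANIUM_MHZ)
us3_eff = spec_per_mhz(US3_SPECINT, US3_MHZ)

# Clock rate advantage Itanium needs to reach parity, in MHz.
clock_advantage_mhz = ITANIUM_MHZ - US3_MHZ

# Fraction by which Itanium's per-clock integer throughput trails the US-III.
relative_deficit = 1 - itanium_eff / us3_eff

print(f"Itanium: {itanium_eff:.4f} SPECint per MHz")
print(f"US-III:  {us3_eff:.4f} SPECint per MHz")
print(f"Per-clock deficit: {relative_deficit:.2%}")
```

Under that assumption Itanium delivers 6.25% less integer work per clock than the US-III, even though both are in-order designs and only Itanium carries the EPIC additions. Score per MHz is a crude measure that ignores memory system and compiler differences, so it should be read as a rough indicator only.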
Despite its pronounced deficiencies with integer workloads, Itanium is a powerful MPU for running the floating point intensive technical and scientific applications that are a mainstay for customers of high-end RISC processors like Alpha, MIPS, PA-RISC and POWER. Besides excellent FP capabilities, the Itanium also delivers 64-bit flat logical addressing, a large physical address space, and high performance on software-based cryptographic kernels. Most important of all, Intel can deliver these features, which cover most of the important attributes of 64-bit RISC processors, using the same high-volume, low-cost merchant chip business model that has proved so successful for Xeon. The primary disadvantages of Itanium are its limited system bandwidth (2.1 GB/s peak, which is a third less than that offered by the Pentium 4), very high power dissipation (up to 130 Watts), an expensive chipset with a high package count (460GX), and mediocre integer performance. Although the Itanium can directly execute x86 code using a hardware-based compatibility mode, no relevant benchmarks have yet been disclosed. The Itanium’s x86 performance level is rumored to be rather modest, perhaps on the order of a 500 or 600 MHz Pentium II. That’s a level modern RISC processors can easily achieve using proven binary recompilation techniques.
Although the Itanium isn’t as weak for its intended markets as some predicted, any discussion of its shortcomings automatically leads to the subject of McKinley, Intel’s second generation IA64 processor. Its design is said to have been substantially influenced by HP’s experienced and highly respected veteran PA-RISC designers. It is known that McKinley was designed by a team of Intel and HP engineers based in HP’s Fort Collins facilities. As for its reputation within the industry, it should be noted that no matter how much mirth Merced/Itanium provoked from competing MPU design teams, the laughter stops immediately when the subject changes to McKinley.
Hard facts about McKinley in the public domain are few and far between. It offers three times the peak system bandwidth of the Itanium (6.4 GB/s), and according to Intel it will have more than twice Itanium’s performance. Although the McKinley paper intended for ISSCC 2001 was withdrawn without explanation, some tantalizing details about the new MPU can be gleaned from the abstract in the paper version of the advance program (it was deleted from the on-line version shortly after the paper was pulled):
15.7 The Implementation of a 1.2 GHz IA64 Microprocessor
S. Naffziger (Intel), G. Hammond (HP, Fort Collins)
This second implementation of the IA64 architecture incorporates 214M transistors in a 6 Al metal, 0.18 um process and operates at over 1.2 GHz. The chip has a 7-stage pipeline and a 3 level cache hierarchy implemented in 4 separate arrays. The caches are separately optimized for latency, bandwidth, and density.
It is striking that the 0.18 um McKinley (manufactured in the same basic process technology as the Itanium) can operate at clock rates up to 50% faster than Itanium despite a 30% shorter basic execution pipeline (7 stages versus 10). Clock rate and pipeline length are usually closely related design parameters, and it is practically unheard of to reduce pipeline length while dramatically increasing clock rate in the same process. This not only implies that McKinley is a very good design, but also that Itanium is remarkably inefficient and/or unbalanced. McKinley is said to have more execution units than Itanium but the same peak capability of fetching and executing two instruction bundles per clock. That suggests that the Itanium averages far less than its peak two bundles (6 instructions) per clock on most codes, and that structural hazards and execution unit oversubscription are an important cause of the shortfall.
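To put rough numbers on that comparison, the sketch below works out the clock gain, the pipeline shortening, and the time an instruction would take to traverse each basic pipeline with no stalls. It uses only figures from the text (800 MHz and 10 stages for Itanium, 1.2 GHz and 7 stages for McKinley) and assumes, purely for illustration, that the two stage counts are measured the same way. The helper name traversal_ns is hypothetical.

```python
def traversal_ns(stages, clock_ghz):
    """Time in ns for one instruction to pass through every pipeline stage,
    assuming one stage per clock and no stalls (an idealization)."""
    return stages / clock_ghz

# Figures quoted in the article.
ITANIUM_GHZ, ITANIUM_STAGES = 0.8, 10
MCKINLEY_GHZ, MCKINLEY_STAGES = 1.2, 7   # 1.2 GHz is from the withdrawn abstract

clock_gain = MCKINLEY_GHZ / ITANIUM_GHZ - 1        # fractional clock increase
stage_cut = 1 - MCKINLEY_STAGES / ITANIUM_STAGES   # fractional pipeline shortening

itanium_ns = traversal_ns(ITANIUM_STAGES, ITANIUM_GHZ)
mckinley_ns = traversal_ns(MCKINLEY_STAGES, MCKINLEY_GHZ)

print(f"Clock rate gain:      {clock_gain:.0%}")
print(f"Pipeline shortening:  {stage_cut:.0%}")
print(f"Itanium traversal:    {itanium_ns:.2f} ns")
print(f"McKinley traversal:   {mckinley_ns:.2f} ns")
print(f"Ratio:                {itanium_ns / mckinley_ns:.2f}x")
```

On these assumptions an instruction spends 12.5 ns in Itanium's basic pipeline against roughly 5.8 ns in McKinley's, a better than twofold reduction in the same process generation. The idealized model says nothing about where the time goes, but it shows why the pairing of a shorter pipeline with a faster clock reflects so poorly on the first design.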
Engineering samples of McKinley were reportedly distributed to IA64 development partners starting in February. Intel’s latest public road map for IA64 suggests that it will enter limited (“pilot”) production late this year and reach general commercial availability early next year. An early press report suggested McKinley would ship at an initial clock rate of 1.4 GHz, but this was quickly denied by Intel. The only clock frequency that has been publicly associated with McKinley is the 1.2 GHz from the ISSCC paper abstract, but the withdrawal of the paper casts some doubt even on that. It isn’t known what name McKinley will be sold under, but Intel has invested a lot of effort in establishing the Itanium brand name, so it wouldn’t be surprising to see it marketed as “Itanium 2”.