Itanium: The Juggernaut Picks up Speed
Hardly a week goes by without a major computer OEM announcing a shiny new product line based on the Itanium 2: first HP, then NEC, followed by Unisys and SGI. IBM is close to announcing an IA64-based mid-range system, while Dell seems to be held back by delays in Intel’s 8870 chipset rather than a lack of intention. The Itanium 2 is an impressive product, with leading-edge FP performance as well as respectable integer performance for a server-class processor. Early indications are that it also does very well on commercial workloads. This is not surprising given the huge capacity, bandwidth, and support for parallelism within its three-level on-chip cache hierarchy, as well as the fact that Itanium 2’s non-FP functionality was optimized for commercial workloads rather than SPECint2k.
Although the selection of hardware and OEMs to choose from is impressive for a new instruction set architecture, the speed at which IA64 can penetrate the workstation and server markets is limited primarily by the availability of application software. Although the Itanium 2 offers compatibility for 32-bit x86 applications (in hardware, not by emulation, as is often erroneously reported), the performance penalty incurred is substantial enough to render the feature useless in terms of driving hardware sales ahead of the availability of native 64-bit software. Especially valuable would be a native version of the Windows operating system. Although this is in the works, Microsoft has never been known for timely support of a new architecture. However, the rapidly growing presence of Linux across the breadth of the server market and the serious effort by Intel and its hardware partners to support Linux on the full range of IA64 systems will be a strong impetus to Microsoft not to waste any time in joining the party.
The near-term road map of the Itanium family seems quite clear. Like the 0.5 µm P6, the 0.35 µm EV6, and the 0.18 µm Willamette, the Itanium 2’s McKinley core has multiple process shrinks ahead of it. The results of the first shrink, to 0.13 µm, are Madison and Deerfield, which will be introduced later this year. Both are based on the same core and are primarily differentiated by the size of the L3 cache. Madison is the high-end part and will double the Itanium 2’s 3 MB L3 cache size to a massive 6 MB and increase associativity from 12-way to 24-way. Despite the extra cache the Madison will be somewhat smaller than the Itanium 2, reportedly 374 mm² compared to 421 mm². In contrast, the Deerfield will keep the 3 MB L3 size of the Itanium 2 and ride the process shrink to a much smaller die size, ~266 mm² or about the size of the POWER4+.
The clock speed of the McKinley core in 0.13 µm has been disclosed as 1.5 GHz in the abstract of an associated paper in the preliminary program for ISSCC 2003. Given the difficulty with interconnect delay and signal integrity in the 0.18 µm aluminum Itanium 2 device, and the fact that the Pentium 4 core has already seen over 50% higher clock rates in the 0.18 to 0.13 µm shrink, one might expect that the Madison would clock faster than 1.5 GHz in a 0.13 µm copper process. Intel is likely once bitten, twice shy over disclosing aggressive clock rate targets for unreleased IA64 products in technical papers. The abstract of the McKinley paper famously withdrawn from ISSCC 2001 described it as a 1.2 GHz device. Many of the papers presented at ISSCC 2002 described critical blocks of McKinley and indicated they were fully qualified up to 1.2 GHz. Seeing as there is no Itanium 2 faster than 1.0 GHz, one could theorize that the top clock rate was cut by 200 MHz to hit a thermal design target that OEMs refused to budge on. There is a strong possibility that Madison will be available at clock rates higher than the 1.5 GHz disclosed once the 0.13 µm device is fully characterized.
Whether Madison and Deerfield clock 50% or 60% faster than McKinley or more, one thing seems clear: performance as measured by SPEC CPU 2000 will scale upwards very closely with any clock frequency increase. Detailed processor CPI component breakdown and memory access latency contribution breakdown data for Itanium 2 was presented in . Analysis of the data presented suggests performance scaling factors of 0.95 and 0.89 for SPECintbase2k and SPECfpbase2k respectively for Itanium 2. Increasing the size of the L3 cache to 6 MB, while keeping latency constant, is estimated to raise the SPEC scaling factors to about 0.96 and 0.92 respectively. Another obvious lever for increasing performance is the system interface. The Itanium 2 has a 128-bit wide data bus effectively clocked at 400 MHz. It uses similar signaling technology to the Pentium 4 front side bus, which is targeted to hit 800 MHz later this year. Even a conservative increase to 533 MHz would increase the memory and I/O bandwidth available to the McKinley core by a third. Inside the processor core, a number of minor changes could have been made during the port to 0.13 µm to enhance performance. One relatively simple change would be to increase the number of stacked general purpose registers from 96 to 128. Although the change would be transparent to software, the data presented in  suggests this would cut the number of register stack engine spills and fills associated with procedure calls and returns by more than half on average.
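To put rough numbers on the scaling factors and bus speeds quoted above, here is a back-of-the-envelope sketch. It assumes the simplest reading of a scaling factor, namely that it is the fraction of a clock rate gain that shows up as a performance gain; the analysis behind the quoted factors may define it differently. The function names are hypothetical, the clock ratio compares the disclosed 1.5 GHz against today's 1.0 GHz Itanium 2, and bandwidth is peak data bus bandwidth with 1 GB taken as 10^9 bytes.

```python
# Illustrative sketch only. The linear scaling model is an assumption made
# for this example, not necessarily the model behind the quoted factors.

def projected_speedup(clock_ratio: float, scaling_factor: float) -> float:
    """Speedup if `scaling_factor` of the clock gain is realized as performance."""
    return 1.0 + scaling_factor * (clock_ratio - 1.0)


def bus_bandwidth_gb_s(width_bits: int, effective_mhz: float) -> float:
    """Peak data bus bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * effective_mhz * 1e6 / 1e9


if __name__ == "__main__":
    clock_ratio = 1.5 / 1.0  # disclosed 1.5 GHz vs. 1.0 GHz Itanium 2

    # Scaling factors as quoted in the text for 3 MB and 6 MB L3 caches.
    cases = [
        ("SPECint, 3 MB L3", 0.95),
        ("SPECfp,  3 MB L3", 0.89),
        ("SPECint, 6 MB L3", 0.96),
        ("SPECfp,  6 MB L3", 0.92),
    ]
    for label, factor in cases:
        speedup = projected_speedup(clock_ratio, factor)
        print(f"{label}: {speedup:.3f}x from clock scaling alone")

    # 128-bit data bus at an effective 400 MHz versus a 533 MHz bump.
    for mhz in (400, 533):
        print(f"{mhz} MHz bus: {bus_bandwidth_gb_s(128, mhz):.2f} GB/s peak")
```

Under this linear reading, a 50% clock increase yields a little under 1.5x from clock scaling alone in every case, and the extra L3 capacity moves the estimate by only a few percent. The bus arithmetic is independent of that assumption: 16 bytes per transfer at 400 MHz is 6.4 GB/s, and 533 MHz is about a third more.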
Beyond Madison and Deerfield, further shrinks of the McKinley core are planned. The next process node is 90 nm. One rumor that keeps recurring is that Intel will incorporate two McKinley cores in a high-end device in the 90 nm generation. This 2-way chip-level multiprocessor (CMP) is similar in concept to the existing IBM POWER4 design and HP’s future PA-8800. The ultimate in down-the-road speculation for the Itanium family is what the former EV8 design team is cooking up in Intel’s Shrewsbury design center. Sources within Intel indicate excitement about the new design concepts incorporated in this brand new IA64 processor core, which will likely target 65 nm. The possibilities include out-of-order execution, hardware multithreading, and vector-style computing extensions like the “Tarantula” extension to EV8. Whatever path was chosen, vendors of 64-bit MPUs that compete with IA64 are faced with the scary prospect that the former Alpha design team, which created the highest performance processors at virtually every process technology node it tackled, is largely intact and hard at work with Intel’s process development and manufacturing capabilities at its disposal.