The Mountain Comes into Focus
As mentioned before, ISSCC 2002 attendees were treated to no fewer than six papers on various aspects of the Intel/HP McKinley. These included presentations on the overall design, the integer register file and functional units, the clock design, and each of the three levels of cache. Like the Itanium/Merced, the McKinley is a dual-bundle (six-instruction) wide implementation of the IA64 instruction set architecture. But the similarities pretty much end there. While the initial Itanium is a bafflingly inefficient, awkward, and nauseatingly bloated design, the McKinley clearly shows that a great deal of careful thought and occasional flashes of inspiration went into the crafting of virtually every aspect of this second generation IA64 processor. The basic organization of McKinley is shown in Figure 3.
Figure 3 Block Diagram of the IA64 McKinley
The McKinley is an in-order processor that allows out-of-order instruction completion. Unlike dynamically scheduled processors such as the Pentium 4 and EV6, it is fully interlocked with no cache access way-prediction or flush/replay mechanisms. Like Itanium/Merced, the second generation IA64 processor retains IA32 compatibility functionality in hardware. McKinley has the same number of branch (“B”) units, floating point (“F”) units, and complex integer (“I”) units as Merced, but doubles the number of simple integer/memory (“M”) units from 2 to 4. As I described in detail in a previous article, the extra 2 “M-units” are critical to increasing the fraction of useful instruction bundle pair combinations that a two-bundle-wide IA64 processor can actually dual issue. This is important because every split issue lowers the IPC of up to six instructions by 50% or more. In addition, the McKinley uses an 8-stage basic pipeline, 20% shorter than the 10-stage pipeline in Merced. Each additional stage in an IA64 processor was described as contributing a 2% to 5% performance loss due to branch misprediction penalties alone. The basic execution pipelines of the Merced and McKinley processors are compared in Figure 4. The McKinley saves two pipe stages by accessing its instruction cache and its register file in one clock cycle each instead of two. Despite this repartitioning, the McKinley clocks 25% to 50% faster than Merced although both are manufactured in the same process.
Figure 4 Merced and McKinley Basic Execution Pipeline
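The cost of split issue mentioned in the discussion of the extra M-units can be put in rough numbers. The sketch below is a deliberately simple model and not something presented at ISSCC: it assumes a bundle pair that dual issues takes one cycle, a pair that splits takes two, and it ignores stalls and nop slots. The function name peak_ipc and its parameter are hypothetical.

```python
def peak_ipc(dual_issue_fraction):
    """Peak instructions per cycle for a two-bundle-wide IA64 core.

    Assumed model (not from the ISSCC papers): each pair of bundles
    holds 6 instructions; a pair that dual issues takes 1 cycle and a
    pair that split issues takes 2 cycles. Stalls and nops are ignored.
    """
    if not 0.0 <= dual_issue_fraction <= 1.0:
        raise ValueError("dual_issue_fraction must be between 0 and 1")
    instructions_per_pair = 6  # two bundles of three instructions each
    average_cycles_per_pair = (
        dual_issue_fraction * 1 + (1.0 - dual_issue_fraction) * 2
    )
    return instructions_per_pair / average_cycles_per_pair


# Illustrative fractions only; the article gives no measured values.
for fraction in (1.0, 0.75, 0.5, 0.0):
    print(f"dual-issue fraction {fraction:.2f}: peak IPC {peak_ipc(fraction):.2f}")
```

Under this model a pair that splits drops from 6 to 3 instructions per cycle, which is the 50% loss cited above, so every bundle pair combination that the extra M-units allow to dual issue moves the average back toward 6.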
While the extra integer execution resources and shorter pipeline are important, the Intel and HP designers directed their biggest attack against IPC-sapping inefficiency on the memory hierarchy. The load-use penalties for memory accesses resolved in the first, second, and third levels of cache are 0, 5, and 12 cycles, respectively. This greatly eases pressure on IA64 compilers to re-organize code to try to hide load latency, since an integer load that hits in the first level cache can be treated the same as any other single cycle integer operation. Improvements in cache latency alone were said to increase performance by 15% to 20% over Merced, while enhanced bypassing capabilities within the L2 cache were said to improve performance by a further 8% or 9%.
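To see how the 0, 5, and 12 cycle figures combine, here is a minimal sketch of an average load-use penalty calculation. Only the three cache penalties come from the presentations; the fraction of loads resolved at each level and the main memory penalty are hypothetical inputs chosen for illustration, as are the function and variable names.

```python
# Load-use penalties in cycles for each cache level, as quoted above.
MCKINLEY_PENALTY = {"L1": 0, "L2": 5, "L3": 12}


def average_load_use_penalty(resolved_fraction, memory_penalty):
    """Average load-use penalty in cycles per load.

    resolved_fraction maps "L1", "L2", "L3" to the fraction of loads
    resolved at that level; whatever is left over is assumed to go to
    main memory and pay memory_penalty cycles (a hypothetical figure,
    not given in the article).
    """
    unknown = set(resolved_fraction) - set(MCKINLEY_PENALTY)
    if unknown:
        raise ValueError(f"unknown cache levels: {sorted(unknown)}")
    to_memory = 1.0 - sum(resolved_fraction.values())
    if to_memory < -1e-9:
        raise ValueError("fractions add up to more than 1")
    cache_cycles = sum(
        fraction * MCKINLEY_PENALTY[level]
        for level, fraction in resolved_fraction.items()
    )
    return cache_cycles + max(to_memory, 0.0) * memory_penalty


# Hypothetical workload: 90% of loads hit L1, 7% hit L2, 2% hit L3,
# and the remaining 1% go to memory at an assumed 150 cycles.
example = {"L1": 0.90, "L2": 0.07, "L3": 0.02}
print(f"{average_load_use_penalty(example, memory_penalty=150):.2f} cycles per load")
```

With the zero cycle first level penalty, loads that hit L1 contribute nothing to the average, which is why the compiler can schedule them like ordinary single cycle integer operations.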
Another big improvement in McKinley is the handling of branches. Unconditional branches and predicted taken conditional branches are processed with zero overhead due to a low latency instruction cache that also stores branch target address and history information in each line in a manner reminiscent of the EV6. This is a big improvement on Merced, which incurs a 1, 2, or 3 cycle taken branch penalty depending on the type of branch and how it is handled. The McKinley’s branch prediction logic is quite aggressive, with the instruction cache based 1K set of branch histories backed up by a 12K entry branch history cache and a 16K entry 2-bit pattern history table. This scheme was said to offer 95% prediction accuracy on SPECint2000 and 93% accuracy on TPC-C (the optimization target application). The Merced’s branch prediction logic is primitive in comparison. Based on a 512 entry 2-bit pattern history table, its misprediction rate is likely two to three times worse, a fact made even more painful by the Merced’s 9 cycle misprediction penalty, 2 cycles more than for McKinley.
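For readers unfamiliar with the term, a 2-bit pattern history table is an array of small saturating counters, one of which is consulted for each branch. The sketch below shows the generic textbook mechanism, not the actual McKinley or Merced logic: the indexing by low-order address bits, the initial counter value, and all names are assumptions made for illustration. A second helper turns the accuracy and penalty figures quoted above into an average cost per branch.

```python
class PatternHistoryTable:
    """Table of 2-bit saturating counters.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        # Start every counter at "weakly not taken" (an arbitrary choice).
        self.counters = [1] * entries

    def _index(self, branch_address):
        # Hypothetical indexing scheme: low-order address bits only.
        return branch_address & self.mask

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def misprediction_cycles_per_branch(accuracy, penalty_cycles):
    """Average cycles lost per branch to mispredictions."""
    return (1.0 - accuracy) * penalty_cycles


# A loop branch taken nine times and then falling through, run twice,
# against a 512 entry table (the Merced table size quoted above).
table = PatternHistoryTable(512)
loop_branch = 0x4000  # hypothetical branch address
mispredictions = 0
for outcome in ([True] * 9 + [False]) * 2:
    if table.predict(loop_branch) != outcome:
        mispredictions += 1
    table.update(loop_branch, outcome)
print("mispredictions:", mispredictions)

# 95% accuracy and a 7 cycle penalty (9 for Merced, minus 2).
print(f"{misprediction_cycles_per_branch(0.95, 7):.2f} cycles per branch")
```

Using the quoted numbers, McKinley loses about 0.35 cycles per branch on SPECint2000. If Merced's misprediction rate really is two to three times worse (10% to 15%), its 9 cycle penalty would cost roughly 0.9 to 1.35 cycles per branch, though that range rests on the estimate above rather than on measured data.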