McKinley: Little more Logic, Lots more Cache
The most striking aspect of McKinley is its size and transistor count. Weighing in at a hefty 220 million transistors, this 0.18 um device occupies a substantial 465 mm² of die area. The majority of McKinley’s transistor count is tied up in its cache hierarchy. It is the first microprocessor to include three levels of cache on chip. The first level consists of separate 16 KB instruction and data caches, the second level is unified and 256 KB in size, and the third level is an astounding 3 MB in size. The die area consumed by the final level of on-chip cache can be seen in Figure 1, which shows the floorplan of McKinley alongside some representative server and PC class MPUs.
Figure 1 Floorplan of McKinley and Select Server and PC MPUs.
The Itanium (Merced) floorplan is shown as blank because, although its chip floorplan has been previously disclosed, its die size is still considered sensitive information by Intel and has not been released. The outlines shown indicate the range of likely sizes of the Itanium die based on estimates from a number of industry sources.
Both the first and second generation IA64 designs, Itanium/Merced and McKinley, are six issue wide in-order execution processors. In-order execution processors cannot execute past stalled instructions, so low average memory latency is important for achieving high performance. This focus on the memory hierarchy can be clearly seen in McKinley. It is not surprising that the on-chip level 3 cache in McKinley is much faster than the external custom L3 SRAMs used in the Itanium CPU module. It is more interesting to see how much faster, in terms of processor cycles, the McKinley level 1 and 2 caches are, despite McKinley’s 25 to 50 percent faster clock rate in the same 0.18 um aluminum bulk CMOS process.
The improvement in average memory latency between Itanium and McKinley can be approximated using the comparative access latencies presented by Intel at their last developers conference, combined with representative hit rates based on the size of each cache in the two designs and an assumed average memory access time of 160 ns. This data is shown in Table 1.
Table 1 Itanium and McKinley memory latency comparison (global miss rate, average latency in cycles, average latency in ns).
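The kind of estimate described above can be sketched in a few lines of Python. This is only an illustration of the method, not a reproduction of Table 1: the function name, the cache latencies, the miss rates, and the 1000 MHz clock are all hypothetical placeholders, and only the 160 ns memory access time comes from the text. The sketch also assumes that each quoted latency is the total load-to-use time for a hit at that level, and that a global miss rate is the fraction of all loads that miss at that level and every level above it.

```python
def average_load_latency(hit_latencies, global_miss_rates, memory_ns, clock_mhz):
    """Return (cycles, ns) average load latency for a multi-level cache.

    hit_latencies[i]     -- total load-to-use latency in cycles for a hit in level i
    global_miss_rates[i] -- fraction of ALL loads that miss in levels 0..i
    memory_ns            -- main memory access time in nanoseconds
    clock_mhz            -- processor clock in MHz
    """
    cycle_ns = 1000.0 / clock_mhz
    memory_cycles = memory_ns / cycle_ns

    reach = 1.0   # fraction of all loads that get as far as the current level
    total = 0.0
    for latency, miss in zip(hit_latencies, global_miss_rates):
        total += (reach - miss) * latency   # loads that hit at this level
        reach = miss                        # the rest go on to the next level
    total += reach * memory_cycles          # loads that miss every cache
    return total, total * cycle_ns


# Hypothetical inputs for illustration only (NOT the values from Table 1).
cycles, ns = average_load_latency(
    hit_latencies=[2, 6, 20],              # placeholder L1/L2/L3 latencies in cycles
    global_miss_rates=[0.10, 0.04, 0.01],  # placeholder global miss rates
    memory_ns=160,                         # memory access time assumed in the text
    clock_mhz=1000,                        # placeholder clock rate
)
print(f"{cycles:.2f} cycles, {ns:.2f} ns")
```

Running the same function with each processor's own latencies, miss rates, and clock rate gives the two averages being compared. Note that the memory term is fixed in nanoseconds, so it costs more cycles as the clock rate rises.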
The back-of-the-envelope calculations in Table 1 suggest that McKinley will execute a load instruction with about half the average latency, in absolute time, that Itanium would. No doubt this is a major contributor to the much higher performance of the second generation IA64 processor. Although the large die area of McKinley suggests a substantial cost premium compared to typical desktop MPUs, for large scale server applications the extra silicon cost is insignificant compared to the overall system cost budget. In fact, from the system design perspective, the ability to reasonably forgo board level cache probably more than pays for the extra silicon cost of McKinley through reduction of board/module area, power, and cooling requirements per CPU. Large scale systems based on the EV7 will also eschew board level cache(s), although with the Alpha it is the greater latency tolerance of the out-of-order execution CPU core plus the integration of high performance memory controllers that permit this, rather than gargantuan amounts of on-chip cache.
Besides the greatly enhanced cache hierarchy, McKinley will boast two more “M-units” than Itanium. These are functional units that perform memory operations as well as most types of integer operations. In a recent article I speculated about the nature of McKinley design improvements. I suggested that it would contain 2 more I-units and 2 more M-units than Itanium in order to simplify instruction dispatch and reduce the frequency of split issue due to resource oversubscription. In IA64 parlance, both I-units and M-units can execute simple ALU based integer instructions like add, subtract, compare, bitwise logical, simple shift and add, and some integer SIMD operations. I-units also execute integer instructions that occur relatively infrequently in most programs but require substantial and area intensive functional units. These include general shift, bit field insertion and extraction, and population count.
Because the integer instructions that cannot be executed by an M-unit are relatively rare, the McKinley designers saved significant silicon area with little performance loss by only adding two M-units (for a total of four) and staying with the two I-units of Itanium. Data on the relative frequency of different integer operations suggest that the vast majority of integer operations that occur in typical programs, about 90%, are of the type that can be executed by either an M-unit or an I-unit. If we consider a random selection of six integer operations, each with a 90% chance of being executable by an M-unit, then the odds are better than 98% that no more than two of the six need an I-unit, in which case they are compatible with the MMI + MMI bundle pair combination and can be dual issued by McKinley. Thus there is practically no incentive to add two extra I-units to McKinley to permit the dual issue of the MII + MII bundle pair combination.
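The 98% figure can be checked with a short binomial calculation. The sketch below is a simplification under stated assumptions: the six operations are treated as independent, the ordering of slots within each bundle is ignored, and a group is counted as issuable whenever no more than two of its operations need an I-unit. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
from math import comb


def prob_dual_issue(n_ops=6, p_m_capable=0.90, i_units=2):
    """Probability that at most `i_units` of `n_ops` integer ops need an I-unit.

    Assumes each op independently has probability `p_m_capable` of being
    executable by an M-unit, and ignores slot ordering within bundles.
    """
    p_i_only = 1.0 - p_m_capable
    return sum(
        comb(n_ops, k) * p_i_only**k * p_m_capable**(n_ops - k)
        for k in range(i_units + 1)   # k = number of ops that need an I-unit
    )


print(f"{prob_dual_issue():.4f}")
```

Under these assumptions the result is about 0.984, in line with the figure quoted above. Calling `prob_dual_issue(i_units=4)` models the four I-units that MII + MII dual issue would require and raises the result to just under 1.0, a gain of less than two percentage points.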
One curiosity in the McKinley disclosure was the fact that the basic execution pipeline was revealed to be 8 stages long. Although this is still 2 stages shorter than the pipeline in the slower clocked Itanium, it is one more stage than the 7 stages previously attributed to McKinley. Whether this represents a slightly different way of counting the pipe stages or an actual design change isn’t clear. Ironically, it has long been rumored that the Itanium pipeline was stretched by at least one stage quite late in development. It will be interesting to see if the new IA64 core under development by the former Alpha EV8 design team (now at Intel) also suffers this strange pipeline growth affliction.