Montecito – Make it a Double and Supersize It
At the beginning of the 1990s Intel broke the million-transistor threshold with two 32-bit MPUs, the 486 and the 860. When Montecito ships next year, it will have taken Intel only 15 years to cross three orders of magnitude and cruise past the billion-transistor mark. The upcoming dual core 90nm IPF MPU packs 1.72 billion transistors on a single monster die, largely because of its 26.5MB of integrated L2 and L3 cache. Based on analysis of public relations photos of 300mm Montecito wafers and recently disclosed die microphotographs, Montecito appears to be roughly 20 x 29mm, or 580mm2, in size. The relative die size and floorplan of Montecito are shown in Figure 3, along with those of its 130nm single core IPF predecessors, Madison 6M and Madison 9M, as well as fellow 90nm Intel chips Prescott and Dothan.
Figure 3 – Floorplan and Relative Die Size of Montecito and other Intel MPUs
Despite its size, Montecito will likely cost about the same as or less than the estimated ~$125 of the existing Madison 6M to manufacture. Silicon cost is driven primarily by the number of wafers processed, and processing cost varies only weakly with wafer size. Montecito is manufactured on 300mm wafers with nearly 100 candidate dice per wafer, more than the ~63 Madison 6M dice on a 200mm wafer. Yield is not a particular concern for large, memory intensive chips like Montecito: it is over 70% L2 and L3 cache by area, and regular memory structures are protected against the majority of random point defects by redundant circuit and array elements. The Montecito CPU cores are only about 60mm2 each. The two CPU cores, along with about 60mm2 of shared logic, total about 180mm2 of non-cache region vulnerable to defects, about the same critical area as a Willamette based P4 or Celeron.
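The dies-per-wafer figures above can be sanity checked with the standard gross-die approximation (wafer area divided by die area, less an edge-loss term). The Madison 6M dimensions below are rough back-calculations from its ~374mm2 die area, not disclosed numbers:

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm, die_w_mm, die_h_mm):
    """Common gross-die approximation: usable dice = wafer area / die area,
    minus an edge-loss term proportional to the wafer circumference."""
    die_area = die_w_mm * die_h_mm
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area)
    return int(wafer_area / die_area - edge_loss)

# Montecito: ~20 x 29mm (580mm2) die on a 300mm wafer
montecito = gross_dies_per_wafer(300, 20, 29)   # -> 94, "nearly 100"
# Madison 6M: ~374mm2 die (assumed ~17 x 22mm) on a 200mm wafer
madison = gross_dies_per_wafer(200, 17, 22)     # -> 61, close to the ~63 cited
print(montecito, madison)
```

The approximation ignores scribe lines and reticle constraints, so actual candidate counts differ by a few dice, but it reproduces the article's figures closely.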
Montecito is more than simply dual Itanium 2 CPUs with more cache. Each CPU also incorporates coarse grained multithreading (CMT), in which hardware provides architected processor state for two threads along with logic to automatically switch execution from one thread to the other when a thread relinquishes the CPU under software control or experiences a high latency event, like an L3 miss. The thread switch time is reportedly 15 cycles, which suggests a full pipeline flush is performed when switching threads. Although this sounds like a significant latency, keep in mind that for a 2+ GHz processor like Montecito, an L3 miss could otherwise stall a CPU for 20 times longer or more. In addition to CMT, each Montecito CPU implements new IA-64 instructions, has extra functional units for shifting and population count, more efficient speculation recovery, and features for processor virtualization and enhanced reliability, availability, and serviceability (RAS).
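The payoff of switching threads on an L3 miss follows from simple arithmetic. The ~300-cycle memory latency below is an illustrative assumption for a 2+ GHz part, not a disclosed figure; only the 15-cycle switch time is from the reported numbers:

```python
# Back-of-envelope payoff of coarse-grained thread switching on an L3 miss.
switch_penalty = 15    # reported thread switch time, in cycles
memory_latency = 300   # assumed main memory latency at 2+ GHz (illustrative)

# The stall is ~20x the switch cost -- the "20 times longer or more" above.
stall_to_switch_ratio = memory_latency / switch_penalty   # -> 20.0

# Cycles handed to the other thread per miss, after paying for a switch
# away from the stalled thread and a switch back when the miss resolves.
cycles_recovered = memory_latency - 2 * switch_penalty    # -> 270
print(stall_to_switch_ratio, cycles_recovered)
```

Even charging both switch penalties against a single miss, roughly 90% of the would-be stall becomes useful execution time for the second thread.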
Although the cache hierarchy of previous IPF MPUs is arguably the most advanced of any processor family in terms of latency, bandwidth, and capacity, it was nevertheless an area of major improvement in Montecito. The changes are shown in Figure 4.
Figure 4 – Improvements in Montecito Cache Hierarchy
The biggest change was to split the unified 256KB L2 cache of McKinley and Madison into a separate 1MB L2 instruction cache and a 256KB L2 data cache. This was done primarily to eliminate instruction stream competition for the bandwidth and capacity of the L2 data cache. The 16KB instruction caches of Madison and Montecito hold only 1024 instruction bundles, which represents about 2.4k useful instructions after taking into account the ~20% structural NOP content of a typical IA-64 executable. To put that into perspective, that is only about 1/7th the instruction capacity of the POWER5’s 64KB instruction cache. Obviously, for many classes of programs, instruction stream fetching will represent a significant portion of the processor requests on the unified 256KB L2, as well as a large portion of its contents.
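The bundle arithmetic works out as follows. The 20% structural NOP fraction is the article's estimate, and the POWER5 comparison assumes fixed 4-byte PowerPC instructions:

```python
l1i_bytes = 16 * 1024                 # 16KB L1 instruction cache
bundle_bytes = 16                     # one IA-64 bundle: 128 bits, 3 instruction slots

bundles = l1i_bytes // bundle_bytes   # -> 1024 bundles
slots = bundles * 3                   # -> 3072 instruction slots
useful = slots * (1 - 0.20)           # ~20% structural NOPs -> ~2458 useful

power5_insts = 64 * 1024 // 4         # POWER5 64KB I-cache, 4-byte instructions
ratio = power5_insts / useful         # -> ~6.7, i.e. roughly "1/7th"
print(bundles, int(useful), round(ratio, 1))
```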
Splitting the L2 caches in Montecito yields several benefits. From the data stream perspective, the 256KB L2 suddenly has one less port to contend for, and its entire 256KB capacity is available for data. This means fewer contention stalls and fewer capacity and conflict misses, which adds up to more predictable memory hierarchy behavior, a very important feature for an architecture that relies heavily on static instruction scheduling. From the instruction stream perspective, the L2 I-cache can be located physically close to the L1 I-cache and its design optimized for the task; it doesn’t need to be multi-ported or support sub-word access. As a result, the 1MB L2 I-cache in Montecito likely has little or no latency penalty over the 256KB L2 D-cache, despite having four times its capacity. The combination of a very fast (1 cycle) L1 I-cache and a large, fast L2 I-cache has operational characteristics that are impossible to duplicate in a single level cache. For example, if 90% of the instruction stream accesses that hit in the L1 or L2 hit in the L1, and the L2 has a latency of 6 cycles, then the L1/L2 combination performs like a single level 1MB instruction cache with an average latency of 0.9*1 + 0.1*6 = 1.5 cycles. This is half the latency of the 64KB instruction cache in the Alpha EV6/7 and AMD K7/8.
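The effective-latency arithmetic generalizes to any L1 hit split; the cycle counts below are the ones from the example in the text:

```python
def effective_latency(l1_hit_fraction, l1_cycles, l2_cycles):
    """Average latency over instruction accesses that hit in L1 or L2,
    weighting each level's latency by its share of those hits."""
    return l1_hit_fraction * l1_cycles + (1 - l1_hit_fraction) * l2_cycles

# 90% of L1-or-L2 hits land in the 1-cycle L1, the rest in the 6-cycle L2.
print(round(effective_latency(0.90, 1, 6), 2))  # -> 1.5 cycles
```

Pushing the L1 hit share higher drops the average toward 1 cycle, which is why a tiny but single-cycle L1 in front of a large L2 beats one monolithic multi-cycle cache of the same total capacity.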
Information about the L3 caches in Montecito is encouraging but ambiguous. L3 capacity is doubled per CPU (12MB) and quadrupled per device (24MB) compared to Madison 6M, and L3 latency is said to be the same as in Madison 6M and 9M. The ambiguous part is that Madison 6M’s latency can be described either in absolute terms (9.3 ns) or in processor clock cycles (14). Given the size of Montecito, and the fact that it uses only seven layers of interconnect, just maintaining 9.3ns latency while doubling cache size would be a good accomplishment. Keeping L3 latency at 14 cycles while clocking 50+% faster than Madison 6M (i.e. ~6 ns) would be an astounding accomplishment. Rounding out the improvements in Montecito’s cache hierarchy are more efficient L2 and L3 queuing logic as well as an increase in the number of L2 and L3 cache line victim buffers.
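The two readings of "same latency" diverge as the clock scales. The Montecito clock below is simply Madison 6M's 1.5 GHz scaled by the 50% mentioned above, an illustrative assumption rather than a disclosed frequency:

```python
madison_ghz = 1.5                                   # Madison 6M clock
madison_l3_cycles = 14
madison_l3_ns = madison_l3_cycles / madison_ghz     # -> ~9.3 ns

montecito_ghz = madison_ghz * 1.5                   # assumed 50% faster (illustrative)

# Reading 1: same absolute latency (9.3 ns) -> more cycles at the higher clock
cycles_if_same_ns = madison_l3_ns * montecito_ghz   # -> ~21 cycles

# Reading 2: same cycle count (14) -> lower absolute latency
ns_if_same_cycles = madison_l3_cycles / montecito_ghz  # -> ~6.2 ns

print(round(madison_l3_ns, 1), round(cycles_if_same_ns), round(ns_if_same_cycles, 1))
```

The gap between ~21 cycles and 14 cycles is exactly why the "good" and "astounding" interpretations in the text are so different.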