The Caches in Intel and HP's McKinley Chip
The 16 KB, 4-way set associative first-level data cache in the McKinley is noteworthy for being both four ported (two read and two write ports) and achieving a zero cycle load-use penalty. It accomplishes the latter through careful and innovative design, and through the fact that IA64 supports only one basic memory addressing mode, register indirect. This removes the effective address calculation stage found in nearly every other modern MPU. The design innovation is a new fast access scheme referred to as a “prevalidated tag” cache design. It exploits the fact that the way size (4 KB) is less than or equal to the virtual memory page size defined by the architecture. This means the data and tag arrays of the four cache ways can be accessed using the logical address without the need to deal with virtual address aliasing. The key is that, instead of storing physical addresses in the cache tag array as is normally done, the McKinley L1 cache tag stores a one-hot encoded match vector that is effectively an index to a TLB entry in the address translation CAM. This allows hit detection and way selection to be faster because the CAM match lines can be compared directly with the match vectors from the four tag arrays. Thus, physical address lookup and tag value comparison can be replaced with faster hit logic based on AND-OR circuitry. The new scheme is compared to a traditional cache design in Figures 5 and 6.
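The prevalidated tag idea can be sketched in a few lines of Python. This is an illustrative model only (the TLB size, entry format, and function names are assumptions, not the actual McKinley circuit): the CAM produces a one-hot vector of matching TLB entries, and each way hits if the AND of that vector with the way's stored match vector is nonzero, i.e. a pure AND-OR structure with no physical address comparison.

```python
# Illustrative model of prevalidated-tag hit detection (assumed sizes
# and names; not the actual McKinley implementation).

def cam_match(tlb, vpn):
    """Return a bit vector with bit i set if TLB entry i maps this
    virtual page number -- the CAM match lines."""
    lines = 0
    for i, entry_vpn in enumerate(tlb):
        if entry_vpn == vpn:
            lines |= 1 << i
    return lines

def way_hits(match_lines, tag_vectors):
    """AND each way's stored one-hot match vector with the CAM match
    lines, then OR-reduce: nonzero means that way holds the line."""
    return [bool(match_lines & vec) for vec in tag_vectors]

# Example: the accessed page hits TLB entry 3; way 1's prevalidated
# tag vector points at TLB entry 3, so way 1 is selected.
tlb = [0x10, 0x20, 0x30, 0x40]          # VPNs held in the CAM
tag_vectors = [1 << 0, 1 << 3, 0, 1 << 2]  # per-way match vectors
lines = cam_match(tlb, 0x40)
print(way_hits(lines, tag_vectors))     # [False, True, False, False]
```

Note that the stored vectors are "prevalidated" at fill time: a way's vector names the TLB entry that translated the line's address when it was brought in, so no translation result needs to be compared at access time.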
Figure 5 Traditional Cache Design
Figure 6 Prevalidated Tag Cache Design
The multiplexor that performs way selection in the McKinley L1 cache is also used for data bypass, byte alignment, and endian format selection. It comprises four stages of carefully designed logic, because in the worst case bit position data can come from 48 possible sources (four ways plus two bypass paths, times eight possible byte rotations). The data array is composed of 6T SRAM cells with dual word lines for independent control of the two access devices. This allows true dual ported reads through the use of single ended read operations, at a cost of only 20% extra area over a standard single ported SRAM cell. Access time is maintained by organizing bit lines hierarchically with only 8 memory cells per local bit line. The entire L1 data cache consumes 3 to 5 watts during normal operation.
While McKinley’s low latency data cache is important to maximize integer performance, its relatively high miss rate (about 10%) and write-through design mean the 256 KB L2 cache will also be kept busy. To help minimize overall integer load latency, all L1 read accesses are also simultaneously issued to the L2. The McKinley L2 is an out-of-order, non-blocking design that supports early speculative load accesses. Up to 32 accesses can be queued for issue to the L2, and each clock cycle queue entries arbitrate for access to one of four issue ports. As in Merced, floating-point loads and stores bypass the L1 cache and communicate directly with the L2. The L2 is pseudo 16-ported and can support a 128 byte wide L1 cache fill as well as up to four 16 byte wide loads and four 16 byte wide stores in a single cycle if no bank conflicts occur. At any given time up to 54 accesses can be active without stalling the main L2 pipeline. Because of the speculative nature of the L2 design, access latencies range from 5 to 9 cycles for integer loads, 6 to 10 cycles for floating point loads, and 7 to 11 cycles for instruction accesses.
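Pseudo-multiporting of this kind is generally achieved by banking the array: multiple accesses complete in one cycle only when they fall in different banks. The sketch below assumes a 16-bank interleave on 16 byte boundaries and a simple greedy arbiter; both are illustrative guesses, as the article does not give the McKinley L2's actual bank count or arbitration policy.

```python
# Illustrative banked pseudo-multiporting model (assumed 16 banks
# interleaved on 16 byte boundaries; not the documented L2 scheme).

NUM_BANKS = 16
CHUNK_BYTES = 16  # assumed interleave granularity

def bank_of(addr):
    """Map an address to a bank: consecutive 16 byte chunks land in
    consecutive banks."""
    return (addr // CHUNK_BYTES) % NUM_BANKS

def issuable(addresses):
    """Greedily select accesses touching distinct banks this cycle;
    conflicting accesses are deferred to a later cycle."""
    used, this_cycle, deferred = set(), [], []
    for addr in addresses:
        b = bank_of(addr)
        if b in used:
            deferred.append(addr)
        else:
            used.add(b)
            this_cycle.append(addr)
    return this_cycle, deferred

# 0x000 -> bank 0, 0x010 -> bank 1, 0x100 -> bank 0 (conflict),
# 0x104 -> bank 0 (conflict)
go, wait = issuable([0x000, 0x010, 0x100, 0x104])
print(go, wait)
```

This also illustrates why the queue and out-of-order issue logic matter: deferred accesses can be reordered behind younger, conflict-free ones instead of stalling the pipeline.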
The McKinley’s L3 cache is unusual in several regards. First of all, it’s there! No previous processor has sported three levels of on-chip cache. Secondly, it is unusually large at 3.0 MB. This is significantly larger than the 1.75 MB L2 in the EV7, the 1.5 MB shared L2 in the POWER4, and even the 2.25 MB of L1 cache in the PA-8700. A third unusual aspect of the McKinley’s L3 cache is its physical design. Most large on-chip caches in microprocessors tend to be constructed of a small number of very large rectangular memory arrays for maximum cell array efficiency. The McKinley’s L3 is instead composed of 135 identical 24 KB sub-blocks. Of these, 128 are used to store data, 5 are used to hold EDC check bits, and 2 are used for redundancy. Sub-blocks are wholly self-contained and can be tiled in an arbitrary fashion. This capability is important because the McKinley has an irregularly shaped processor core, as can be seen in Figure 7.
Figure 7 McKinley Floorplan
The sub-block architecture of the L3 cache allows conformal placement around the irregular CPU core and makes possible the McKinley’s surprisingly compact 421 mm2 die area. This is less than earlier reported die size figures for this chip (463 or 465 mm2, depending on the source). Even more remarkable are reports that it is close to the die size of the Merced, a processor sporting only 128 KB of total cache. The L3 cache is 12 way set associative and has a load-use penalty of 12 cycles. It can sustain one 128 byte load operation and one 128 byte store operation every 4 clock cycles for a peak bandwidth of 64 Gbyte/s at 1 GHz.
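The quoted peak bandwidth follows directly from the sustained transfer rate, as this quick check shows:

```python
# Peak L3 bandwidth: one 128 byte load plus one 128 byte store
# every 4 clock cycles, at a 1 GHz clock.
bytes_per_burst = 128 + 128          # load + store
cycles_per_burst = 4
clock_hz = 1e9                       # 1 GHz

bandwidth = bytes_per_burst / cycles_per_burst * clock_hz
print(bandwidth / 1e9)               # 64.0 Gbyte/s
```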
The L3 cache occupies 175 mm2, or about 42% of the total die area. Through the use of dense custom sub-block layout the L3 achieves an array efficiency (i.e. the fraction of overall area occupied by memory cells) of 85%. In comparison, most commercial commodity SRAM devices reportedly average around 70% array efficiency. The dense layout is helped by the fact that the sub-block elements in the L3 cache do not contain any redundant row or column elements. All repairs must be accomplished through sub-block substitution. The 128 data sub-blocks are divided into two groups of 64, and exactly one of the 64 sub-blocks in each group can be replaced by the redundant sub-block dedicated to that group. This means that if two sub-blocks within one 64 sub-block group are defective while the other group is 100% functional, it is not possible to fully repair the device even though there are two redundant sub-blocks.
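The repair constraint just described can be stated as a simple check. The sketch below is illustrative only (the function and parameter names are invented, and this is obviously not Intel's actual repair flow): each group of 64 data sub-blocks carries exactly one dedicated spare, and spares cannot migrate between groups.

```python
# Sketch of the L3 sub-block repair constraint: two groups of 64 data
# sub-blocks, one dedicated spare per group, no sharing across groups.

def repairable(defects_group_a, defects_group_b, spares_per_group=1):
    """A die is fully repairable only if neither group has more
    defective sub-blocks than its own spare count."""
    return (defects_group_a <= spares_per_group and
            defects_group_b <= spares_per_group)

print(repairable(1, 1))  # True: one defect per group, one spare each
print(repairable(2, 0))  # False: spares cannot cross group boundaries
```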
This degree of redundancy seems low at first glance. However, there are rumors that Intel will attempt to cover a wider range of price vs. performance points by offering McKinley processors with 1.5 MB of L3 cache at a reduced price. This would allow devices with two or more defective sub-blocks within either 64 sub-block group to be sold at a discount instead of being scrapped. If market demand for “IA64 Celerons” proves greater than the natural supply of partially defective devices, Intel has the option of deliberately disabling half the L3 cache of a fully functional device using a laser fuse or bond-out option.