ISSCC 2006: Intel Tulsa


Caching the Train to Tulsa

Naturally, the highlight of Tulsa is the L3 cache, which consists of 256 64KB sub-arrays and 32 68KB arrays, with extra column redundancy. The SRAM cells in the L3 cache measure 0.624um2, slightly larger than the minimum size for Intel’s process. The L3 tags and data have ECC, while the L2 tags are parity protected. Each 36b L3 tag line has 6b of ECC and a redundancy bit, while the 290b data blocks have 32b ECC and 2 redundancy bits. The redundancy bits are used to repair random defects while the chip is at an Intel testing facility, and are independent of the ability to map out bad cache lines in the field (i.e. Pellston). All of Tulsa’s caches use bit interleaving across adjacent cache lines. As a result, when multiple physically adjacent bits are disturbed by a soft error, several single-bit errors occur in different lines, rather than a single multi-bit error in one line; single-bit errors are significantly easier to detect and correct.
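To see why interleaving turns a multi-bit upset into several correctable errors, consider the minimal sketch below. The interleave factor and word width are illustrative assumptions, not Tulsa’s actual parameters.

    # Illustrative sketch only: 4-way column interleaving of ECC words in one
    # physical SRAM row (interleave factor and word width are assumptions).
    INTERLEAVE = 4          # hypothetical number of logical words per physical row
    WORD_BITS  = 8          # hypothetical word width, kept tiny for readability

    # Physical column c holds bit (c // INTERLEAVE) of word (c % INTERLEAVE).
    def word_of(column):
        return column % INTERLEAVE

    def upset(columns_hit):
        """Count how many bit errors each logical word sees for a given upset."""
        errors = {}
        for c in columns_hit:
            w = word_of(c)
            errors[w] = errors.get(w, 0) + 1
        return errors

    # A soft error flipping 3 physically adjacent columns...
    print(upset([10, 11, 12]))   # -> {2: 1, 3: 1, 0: 1}
    # ...lands as one single-bit error in each of three different words, which
    # per-word ECC can correct; without interleaving the same event would be
    # an uncorrectable 3-bit error concentrated in a single word.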


Figure 4 – Sleep Transistors in Tulsa’s L3 cache, courtesy of Intel

To conserve power, cache accesses only activate 0.8% of all blocks. Furthermore, three-level sleep transistors are used to reduce leakage and to create different SKUs. Figure 4 illustrates the basic circuits that control the L3 sub-arrays. In addition to enabling a sub-array, the control circuit can raise Vss, the cells’ ground rail, for inactive sub-arrays. Raising Vss lowers the voltage across the SRAM cells, which reduces their leakage power consumption without affecting their ability to retain data. Thanks to these optimizations, along with long-Le (long effective channel length) transistors, the L3 cache is extremely efficient; the average power is 12W, or 0.75W/MB.
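As a rough back-of-envelope check of those figures (treating “blocks” as sub-arrays is an inference from the quoted 0.8% activation figure, not a disclosed number):

    # Back-of-envelope check of the quoted L3 power figures (illustrative only).
    L3_SIZE_MB     = 16
    L3_AVG_POWER_W = 12.0
    print(L3_AVG_POWER_W / L3_SIZE_MB)    # 0.75 W/MB, matching the article

    # If roughly 0.8% of blocks are active per access, and a "block" is a
    # sub-array (an assumption), then out of 256 + 32 = 288 sub-arrays only
    # a handful are powered up at a time.
    SUB_ARRAYS      = 256 + 32
    ACTIVE_FRACTION = 0.008
    print(SUB_ARRAYS * ACTIVE_FRACTION)   # ~2.3 sub-arrays per access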

Tulsa’s immense shared L3 cache is well suited to Intel’s system architecture in three ways. First, it decreases the data bandwidth needs of the cores, which is essential, since Twin Castle uses a slower bus than the Blackford chipset (667 or 800MT/s versus 1066MT/s). Second, a shared cache decreases the number of coherency snoops in a system. A fully loaded 4S/8P Tulsa system has 4 caches that need to be kept coherent, while a 4S/8P Paxville MP or K8 system needs to maintain 8 caches. While this may not sound like a large difference, coherency traffic is roughly proportional to the square of the number of caches, which makes non-shared caches much less desirable from a system architecture perspective. Lastly, since the cache is inclusive, coherency snoops will not disturb the L2 caches or below. The shared cache is estimated to increase performance by 10% over a hypothetical design with two private 8MB L3 caches and no sharing.
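To put the quadratic scaling in concrete terms, the sketch below compares snoop traffic for 4 versus 8 caches under the simple assumption that every miss must be snooped by every other cache; this is a rough model, not a measurement.

    # Rough model of snoop traffic scaling (illustrative, not measured data):
    # assume each miss from one cache is snooped by every other cache, so
    # per-miss snoop messages scale as N * (N - 1) for N caches.
    def snoop_messages(num_caches):
        return num_caches * (num_caches - 1)

    tulsa    = snoop_messages(4)   # 4 shared L3 caches in a 4S/8P system -> 12
    paxville = snoop_messages(8)   # 8 private caches in a 4S/8P system   -> 56
    print(tulsa, paxville, paxville / tulsa)   # ~4.7x more snoop traffic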

