Chip Multi-Processing: A Method to the Madness

Shared Cache CMP (SC-CMP)

The SC-CMP architecture is shown in Figure 2. The red path shows how communication between the cores is routed.


Figure 2 – SC-CMP Architecture

Shared Cache Advantages

The shared cache approach has the advantage of offering the lowest latency communication between the two cores. As soon as CPU 0 writes to a cache line, CPU 1 can access it without needing to snoop a remote cache. This keeps data traffic between the cores away from the I/O fabric, thereby maximizing bandwidth to other devices; the interface between the cache and the I/O is used only for off-chip communication. Another advantage is that a shared cache can be dynamically allocated between both cores. Suppose CPU 0 is heavily utilized and is using all of its allocated cache, while CPU 1 is lightly loaded and requires only 20% of its available cache. In this case, the cache controller can give CPU 0 some of the cache that CPU 1 is not using. Lastly, the shared cache requires only a single bus or network drop, making the interconnect network between sockets simpler.
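
To make the low-latency hand-off concrete, here is a minimal C sketch (ours, not drawn from any vendor's documentation) of a producer/consumer exchange between two threads, one imagined on each core. On an SC-CMP, the write to payload and the release of ready can be serviced entirely out of the shared cache; a split-cache design would have to snoop the other core's cache or cross an external bus for the same hand-off. Thread-to-core pinning is omitted for brevity.

/* Minimal sketch (assumed scenario, not from the article): on an
 * SC-CMP, the producer's write to `payload` and its release of
 * `ready` land in the shared cache, where the consumer core can
 * read them directly, with no remote snoop or off-chip traffic. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;             /* data handed between the cores   */
static atomic_int ready = 0;    /* release/acquire hand-off flag   */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;               /* write lands in the shared cache */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* spin until the producer publishes the payload */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    printf("consumer read %d\n", payload);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The code itself is architecture-neutral; the point is where the hardware services it. On a shared cache the acquire/release pair resolves on-die, while the split-cache designs discussed elsewhere in this article would pay a snoop or bus crossing for the same exchange.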

Shared Cache Disadvantages

While the benefits come in the form of performance, the shared cache approach is also the most complex. The cache controller needs to manage the sharing policy for the cache and handle the dynamic allocation. This latter function requires that the cache controller balance the needs of each CPU and decide how to allocate resources. If this allocation is done incorrectly, corner cases where one CPU hogs all of the resources can cause performance problems. Moreover, the cache will need much higher bandwidth, since it now serves two CPUs rather than one. Relatedly, there is a trade-off between a multiported cache that can serve both CPUs at once and a cache that simply queues requests from the different CPUs; the former might be slower but avoids contention, while the latter has a lower access time but allows only a single access at a time. Naturally, all of these decisions and features require more design and validation time. It seems unlikely that a CMP design could implement a shared cache unless that feature was architected in from the beginning of the project, which typically takes three years from inception to tape-out for a moderately evolutionary design.

Lastly, the shared cache approach comes with manufacturing disadvantages. For one, the cores cannot be cut or separated to form single core products (as we will see is possible with the shared package approach); since both cores are always present on the die, the TDP for the entire MPU will be very close to twice the TDP for a single core. Secondly, fatal flaws may disable one of the CPUs, which will lead to low-end single core products; while this gives product managers yet another way to differentiate products, it does require yet another SKU.
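
As one illustration of the allocation problem, below is a toy C sketch of a way-partitioning heuristic for a shared set-associative cache. The policy, thresholds, and names (rebalance, MIN_WAYS, the miss-rate counters) are hypothetical, not taken from any shipping design: ways migrate toward the core with the higher miss rate, and a per-core floor guards against the hogging corner case described above.

/* Toy sketch (hypothetical policy, not any vendor's algorithm):
 * partition the ways of a shared cache between two cores.  Each
 * interval, one way migrates toward the core with the higher miss
 * rate, but a floor of MIN_WAYS per core prevents one core from
 * hogging the entire cache. */
#include <stdio.h>

#define TOTAL_WAYS 16   /* e.g. a 16-way set-associative shared L2 */
#define MIN_WAYS    4   /* guaranteed floor per core               */

typedef struct {
    int ways;           /* ways currently assigned to this core    */
    int miss_rate;      /* observed misses per 1000 accesses       */
} core_alloc;

/* Shift one way per interval toward the needier core, never
 * dropping either core below MIN_WAYS. */
static void rebalance(core_alloc *a, core_alloc *b)
{
    if (a->miss_rate > b->miss_rate && b->ways > MIN_WAYS) {
        a->ways++; b->ways--;
    } else if (b->miss_rate > a->miss_rate && a->ways > MIN_WAYS) {
        b->ways++; a->ways--;
    }
}

int main(void)
{
    core_alloc cpu0 = { TOTAL_WAYS / 2, 120 };  /* heavily loaded */
    core_alloc cpu1 = { TOTAL_WAYS / 2,  15 };  /* lightly loaded */

    for (int interval = 0; interval < 8; interval++) {
        rebalance(&cpu0, &cpu1);
        printf("interval %d: CPU0=%2d ways, CPU1=%2d ways\n",
               interval, cpu0.ways, cpu1.ways);
    }
    return 0;
}

A real controller would weigh far more inputs (priorities, bandwidth, prefetch behavior), but even this caricature shows why the policy needs careful validation: set MIN_WAYS to zero and the lightly loaded core can be starved of cache entirely.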

Shared Cache MPUs

Currently, IBM’s POWER4/5 family and Sun’s UltraSPARC-IV+ use the shared cache approach. Fujitsu’s SPARC64-VI, Sun’s Niagara, and Intel’s Yonah and Merom families will also implement a shared cache when they ship.

