High Level Choices
The Sandy Bridge architecture was disclosed last year at Intel’s Developer Forum, and our report thoroughly describes the microarchitecture. However, Intel’s announcement carefully avoided mentioning the details of the physical design and implementation. Some of these details came out at launch (such as the overall die size), but most were still held back. At ISSCC 2011, there were well over half a dozen papers on server and mobile microprocessors, spanning two different sessions. A first paper described the client versions of Sandy Bridge, while a second paper focused on the server variant (Sandy Bridge EP). The paper on Sandy Bridge contained particular insight into the design philosophy and focus at Intel.
Intel’s presentation made it clear that one of the most critical constraints facing Sandy Bridge was configurability. A single design team needed to create several different products spanning notebooks, desktops and single socket servers. From a macro physical standpoint, each product varies the number of cores (2, 4), the size of the L3 cache (3MB, 4MB, 8MB) and the number of shaders in the GPU (6, 12). Products are further differentiated by simultaneous multi-threading and peak frequency in turbo mode, although these do not impact the overall design substantially.
This configurability dictated the 32B ring interconnect topology (with 4-6 stops) used in Sandy Bridge. The system agent, the GPU, and each core together with its slice of the L3 cache each occupy a single ring stop. It is very simple to add or remove stops on a ring. A ring with 10 stops looks very similar to one with 4 stops from a functionality standpoint, although performance is noticeably different. This means that substantially less validation is needed for each different product.
The downside is that moving cache lines across a ring can be expensive. The farther a cache line travels, the greater the latency and the lower the overall bandwidth available. For example, a 64B cache line that crosses 4 hops would consume a total of 256B of interconnect bandwidth and take 5 cycles; a 2 hop trip would consume only 128B and take 3 cycles. In contrast, all trips on a 32B crossbar are a single hop, so sending a cache line would consume 64B and take 2 cycles. However, crossbars use substantially more area because every agent on a crossbar must be fully wired to all other agents. In Intel’s case, the big problem is that a crossbar for 6 agents is a very different design from a crossbar for 4 agents and would require entirely separate design and validation. This is an example of Intel specifically choosing a simpler and less expensive option, rather than aiming for the highest possible performance.
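The hop arithmetic above can be reproduced with a very small cost model. The sketch below is only an illustration: it assumes one cycle per hop and that the two 32B halves of a cache line follow each other back to back, which is enough to match the figures quoted here but is not a description of Intel's actual ring timing. The function name `transfer_cost` is hypothetical.

```python
def transfer_cost(line_bytes, link_bytes, hops):
    """Return (link bandwidth consumed in bytes, cycles) for one cache line.

    Assumptions (not from Intel): each hop takes one cycle, and the pieces
    of the line travel back to back, so the last piece arrives
    (pieces - 1) cycles after the first.
    """
    pieces = -(-line_bytes // link_bytes)  # ceiling division
    bandwidth = line_bytes * hops          # every hop carries the whole line
    cycles = hops + pieces - 1
    return bandwidth, cycles


ring_4_hops = transfer_cost(64, 32, hops=4)  # (256, 5)
ring_2_hops = transfer_cost(64, 32, hops=2)  # (128, 3)
crossbar = transfer_cost(64, 32, hops=1)     # (64, 2)

for name, (bw, cyc) in [("ring, 4 hops", ring_4_hops),
                        ("ring, 2 hops", ring_2_hops),
                        ("crossbar", crossbar)]:
    print(f"{name}: {bw}B of link bandwidth, {cyc} cycles")
```

Under these assumptions the model treats a crossbar as a ring trip of exactly one hop, which is why its cost never grows with the number of agents.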
Figure 1 – Sandy Bridge Die Photo and Chop Options
There are three versions of Sandy Bridge: 4 cores, 8MB L3 and 12 EUs; 2 cores, 4MB L3 and 12 EUs; 2 cores, 3MB L3 and 6 EUs. Respectively, these three versions are 216mm², 149mm² and 131mm². Figure 1 is a die photo of the quad-core with the various chop axes labeled. It is interesting to note that the 4C version actually wastes around 8mm² in the upper right corner, while there is barely any wasted area in the dual-core versions. Evidently, Intel is far more concerned about the die size of the high volume dual-core products than that of the quad-core version.
Looking at the key macroblocks shows an interesting breakdown of the area in the quad-core Sandy Bridge. The 4 CPU cores and their private L2 caches occupy 74mm². The L3 cache and the bulk of the ring interconnect (which is on the same power plane as the cores) occupy 43mm². The fully featured 12 shader GPU (called GT2) takes up 38mm². The defeatured version (GT1) is about 55% of the size of the GT2 block. Some of the logic in the GPU, such as the ring stop (nearly 2mm²) and fixed function hardware, cannot be removed in tandem with the shaders. The 18mm² system agent includes the rest of the digital logic, including the DDR3 memory controller, a 20-lane PCI-E gen2 controller, the power management unit and display engines. Lastly, the physical layer interfaces for DDR and I/O comprise 12mm² and 31mm², respectively. The I/O block on the right side also includes a fair bit of on-die debug logic, but for the most part the physical layers are analog and mixed signal circuits used for PCI-E and Display Port.
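As a quick sanity check, the block areas quoted above can be tallied and turned into shares of the die. The sketch below uses only the rounded figures from this article; the dictionary keys and the function name `area_shares` are made up for illustration, and the GT1 figure is simply 55% of the GT2 area rather than a measured value.

```python
# Block areas in mm^2 for the quad-core die, as quoted in the text (rounded).
BLOCK_AREAS_MM2 = {
    "cpu_cores_l2": 74,
    "l3_ring": 43,
    "gpu_gt2": 38,
    "system_agent": 18,
    "ddr_phy": 12,
    "io_phy": 31,
}


def area_shares(areas):
    """Return (total area, {block: fraction of total})."""
    total = sum(areas.values())
    return total, {name: area / total for name, area in areas.items()}


total, shares = area_shares(BLOCK_AREAS_MM2)
cpu_to_gpu = BLOCK_AREAS_MM2["cpu_cores_l2"] / BLOCK_AREAS_MM2["gpu_gt2"]
gt1_estimate = 0.55 * BLOCK_AREAS_MM2["gpu_gt2"]  # estimate, not a measurement

print(f"total: {total} mm^2")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"CPU-to-GPU area ratio: {cpu_to_gpu:.2f}")
print(f"estimated GT1 area: {gt1_estimate:.1f} mm^2")
```

With these rounded inputs the blocks sum to 216mm², the GPU comes to a little under 18% of the quad-core die, and the CPU cores take slightly less than twice the GPU's area.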
These statistics show that Sandy Bridge spends roughly twice as much area on the CPU cores as on the GPU. In contrast, AMD seems to spend as much or more area on its integrated GPUs. Roughly 30-40% of the die in AMD’s Bobcat and Llano is taken up by the GPU, while Intel has only allocated around 15-25%. This goes a long way to highlight where Intel and AMD have respectively placed emphasis in their designs and why AMD will undoubtedly offer better graphics performance. In future generations though, we expect to see Intel spending relatively more die area on its GPUs, perhaps closer to 25-35%.