Last year at ISSCC, Intel described several details of Sandy Bridge server SoCs, with enough information for informed speculation about actual product details. Our earlier article captured the overall picture well, although some of the more interesting details were unavailable. The actual Xeon E5 products experienced a peculiar delay until March 2012, rather than launching in late 2011 as some initial roadmaps suggested. The peculiar aspect is that quite a few systems using these processors showed up on the Top 500 list of HPC systems in November 2011. So products were available, but only to a restricted set of customers: either those with very specific needs or those with strict milestones in their sales contracts.
Intel’s Sandy Bridge core debuted in consumer-oriented SoCs in early 2011. The new microarchitecture sported a micro-op cache, re-designed out-of-order execution with 256-bit wide AVX units and, most importantly, an L3 cache and system agent that run at core frequency. For consumer variants, the robust Sandy Bridge GPU was integrated into the SoC, sharing the L3 cache and system agent, and an aggressive power management unit had unified control over the CPU cores and GPU.
Sandy Bridge-EP continues Intel’s strategy of extensive design re-use between consumer and server products. The actual SoC re-uses the existing core design and pairs up to 8 cores with an entirely new system architecture. The system architecture is fundamentally what differentiates consumer and server SoCs, and it encompasses a number of areas such as the last level cache (LLC), memory controller, QPI 1.1, integrated I/O and power management.
The actual Xeon E5 products based on Sandy Bridge-EP will eventually span the 1-4 socket server market. The E5 essentially replaces the previous generation Westmere-EP for 1-2 socket servers. The unusual part is that it also overlaps with Westmere-EX for 2-4 socket servers. As a result, the E5 is a bit of a hybrid, with more scalability than typical 2-socket server products. This makes the comparison between Sandy Bridge-EP and previous generation products much more interesting. In this article, we will give an overview of the key features of Sandy Bridge-EP and a subsequent piece will present our benchmark results.
Cache, Memory and Coherency
As we speculated, the LLC is 20MB, with 20-way associativity. However, we significantly under-estimated the ring bus and LLC frequency, which we had put at 1-1.5GHz. It turns out that the cache and the two 32B data rings in the system fabric run at core frequency, providing up to 844GB/s of fabric bandwidth, which is amazing given the power constraints. Running at core frequency reduces the cache and memory latency and also simplifies validation. More importantly, it is a huge improvement in bandwidth over Westmere-EP, where the LLC was implemented as a single partition. In that respect, the Sandy Bridge-EP cache is much more similar to the highly scalable Westmere-EX.
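As a rough sanity check on the 844GB/s figure, the arithmetic can be sketched in Python. Only the 32B ring width comes from the text above. The count of 8 ring stops (one per core and LLC slice) and the 3.3GHz fabric clock are assumptions, chosen because they happen to reproduce the quoted number; this is not a derivation Intel has confirmed.

```python
# Back-of-the-envelope fabric bandwidth estimate.
# ASSUMPTIONS (not from Intel): 8 ring stops, each moving one 32B
# transfer per clock, and a 3.3GHz fabric clock. Only the 32B ring
# width is stated in the article.

RING_WIDTH_BYTES = 32     # data ring width, from the article
RING_STOPS = 8            # assumed: one stop per core / LLC slice
FABRIC_CLOCK_GHZ = 3.3    # assumed clock that matches the quoted peak


def fabric_bandwidth_gbs(width_bytes, stops, clock_ghz):
    """Aggregate peak in GB/s if every stop moves one full-width transfer per clock."""
    return width_bytes * stops * clock_ghz


# Peak under the assumed 3.3GHz clock.
print(f"{fabric_bandwidth_gbs(RING_WIDTH_BYTES, RING_STOPS, FABRIC_CLOCK_GHZ):.1f} GB/s")

# The same model at the 1-1.5GHz range we originally guessed.
for guess_ghz in (1.0, 1.5):
    print(f"{guess_ghz} GHz -> {fabric_bandwidth_gbs(RING_WIDTH_BYTES, RING_STOPS, guess_ghz):.0f} GB/s")
```

Under these assumptions the model lands at roughly 845GB/s, while a 1-1.5GHz fabric would top out at 256-384GB/s, which illustrates how much the core-frequency fabric matters for bandwidth.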
The Sandy Bridge-EP memory controller has been modestly enhanced with 4 channels of DDR3 at up to 1.6GT/s, whereas the 3-channel Westmere-EP was limited to 1.33GT/s. Like Westmere-EP, each channel can run 1-2 DIMMs at full speed, but with 3 DIMMs per channel the memory speed drops by a single speed grade. The number of memory transactions in flight has grown from 96 to 128 to match the increased bandwidth and additional cores. Additionally, there is support for higher capacity LR-DIMMs.
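To put the channel counts and transfer rates in perspective, here is a minimal Python sketch of the peak bandwidth arithmetic. It assumes each DDR3 channel is 64 bits (8 bytes) wide and that each in-flight transaction is one 64B cache line; neither figure is stated above, and the helper names are hypothetical.

```python
# Peak DRAM bandwidth = channels x transfer rate x bytes per transfer.
# ASSUMPTIONS: 8-byte (64-bit) DDR3 channels and 64B transactions.

BYTES_PER_TRANSFER = 8    # assumed 64-bit DDR3 channel
CACHE_LINE_BYTES = 64     # assumed size of one memory transaction


def peak_bandwidth_gbs(channels, rate_gts):
    """Theoretical peak in GB/s; sustained bandwidth will be lower."""
    return channels * rate_gts * BYTES_PER_TRANSFER


def latency_covered_ns(in_flight, bandwidth_gbs):
    """Little's law: latency that `in_flight` outstanding lines can hide.

    1 GB/s equals 1 byte per nanosecond, so bytes / (GB/s) gives ns.
    """
    return in_flight * CACHE_LINE_BYTES / bandwidth_gbs


westmere_ep = peak_bandwidth_gbs(channels=3, rate_gts=1.33)
sandy_bridge_ep = peak_bandwidth_gbs(channels=4, rate_gts=1.6)

print(f"Westmere-EP:     {westmere_ep:.1f} GB/s, "
      f"{latency_covered_ns(96, westmere_ep):.0f} ns covered by 96 in flight")
print(f"Sandy Bridge-EP: {sandy_bridge_ep:.1f} GB/s, "
      f"{latency_covered_ns(128, sandy_bridge_ep):.0f} ns covered by 128 in flight")
```

Under these assumptions the peak rises from about 32GB/s to 51.2GB/s per socket, roughly a 60% increase, while the in-flight transaction count grows by a third.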
More impressive is that the additional memory bandwidth has not increased latency. According to Intel’s measurements, the idle latency is nearly the same as Westmere-EP. This is a pleasant surprise, given the extra latency of Westmere-EX’s ring. There were several architectural contributors to this accomplishment. First, coherency snoops are sent from the cache to the other socket concurrently with the memory request to the home agent, as opposed to having the home agent send out the snoop request. Second, data responses can be sent from the memory controller or QPI directly to the requesting core, rather than waiting to obtain the data from the LLC. Last, there is a new LLC prefetcher that is better at avoiding contention with regular accesses.
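The benefit of the first optimization can be illustrated with a toy critical-path model in Python. This is a simplified sketch, not Intel's actual protocol flow: the function names are hypothetical, and the latencies in the example are placeholder values in arbitrary units rather than measurements.

```python
# Toy critical-path model for an LLC miss in a 2-socket system.
# All latency values used below are hypothetical placeholders.

def home_issued_snoop(t_to_home, t_memory, t_snoop):
    """Alternative flow: the home agent receives the request first,
    then starts the memory access and the remote snoop together."""
    return t_to_home + max(t_memory, t_snoop)


def cache_issued_snoop(t_to_home, t_memory, t_snoop):
    """Flow described above: the snoop leaves the cache at the same
    time as the request travelling to the home agent."""
    return max(t_to_home + t_memory, t_snoop)


# Case 1: the remote snoop is the slowest leg, so overlapping it helps.
print(home_issued_snoop(10, 50, 70), cache_issued_snoop(10, 50, 70))

# Case 2: memory is the slowest leg, so both flows take the same time.
print(home_issued_snoop(10, 50, 30), cache_issued_snoop(10, 50, 30))
```

In this model the cache-issued snoop is never slower than the home-issued one, and it saves the hop to the home agent whenever the remote snoop sits on the critical path.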
Sandy Bridge-EP is the first processor with QPI 1.1 for coherent communication. QPI 1.1 has several key architectural changes, primarily shifting from a source snooping to a home snooping coherency protocol. Home snooping was already used by Intel’s Itanium and 4-socket x86 servers, so the entire product portfolio now uses the same basic techniques. The physical address space has expanded to 46 bits as well. At the physical level, the transfer rate has gone up to 8GT/s. For 4-socket servers using Sandy Bridge-EP, there will also be snoop filtering to minimize the coherency traffic and reduce memory latency.
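For a sense of scale, the address width and transfer rate can be converted into more familiar units with a short Python sketch. The 2 bytes of payload per QPI transfer in each direction is the commonly cited figure and is an assumption here, as it is not stated above; the helper names are hypothetical.

```python
# ASSUMPTION: one QPI link carries 2 bytes of payload per transfer in
# each direction (commonly cited figure, not stated in the article).
QPI_PAYLOAD_BYTES = 2


def addressable_tib(address_bits):
    """Physical address space in TiB (2**40 bytes) for a given width."""
    return 2 ** address_bits // 2 ** 40


def qpi_link_gbs(rate_gts, payload_bytes=QPI_PAYLOAD_BYTES):
    """Peak bandwidth of one QPI link in one direction, in GB/s."""
    return rate_gts * payload_bytes


print(f"46-bit physical addressing: {addressable_tib(46)} TiB")
print(f"QPI at 8GT/s: {qpi_link_gbs(8)} GB/s per link, per direction")
```

Under that assumption, 46 bits of physical address covers 64TiB, and each 8GT/s link peaks at 16GB/s in each direction.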