Sandy Bridge-EP is peculiar because it is clearly intended for 2-socket systems, but can be scaled up to larger 4-socket systems. It is the obvious replacement for Westmere-EP, but it will co-exist in Intel’s product portfolio with the scalable Westmere-EX, which is really meant for 4-socket and larger systems. This co-existence is primarily due to the transition from PCI-E 2.1 to 3.0 and the economics of larger servers.
Since the volume of larger servers is relatively low, the platform refresh rate is slower so that Intel and their partners can justify the development and validation costs. Intel’s high-end server platforms typically have a lifespan of 2 processor generations, or roughly 4 years. The Boxboro platform debuted with Nehalem-EX in 2010. It is fully compatible with Westmere-EX and will continue until replaced by Ivy Bridge-EX. This makes it impossible for Intel to integrate any I/O or move to QPI 1.1 with Westmere-EX, since that would change the socket and chipset. Upgrading the chipset to PCI-E 3.0 and doubling the I/O bandwidth with the same QPI interfaces is not viable either – the system would be very unbalanced.
Table 1 shows socket level specifications for relevant x86 server microprocessors. The cache column shows the total cacheable data for a given chip; in the case of Intel, this corresponds to the L3 cache. For Interlagos, it is approximately twice the sum of the L2 and L3 caches in each Orochi die. The memory bandwidth is listed in terms of actual data throughput, while the I/O and coherency bandwidths are raw throughput. Both QPI and HT have about 20% overhead for error detection using CRC.
Table 1 – Comparison of Sandy Bridge-EP and x86 server Microprocessors
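The roughly 20% CRC overhead on QPI and HT can be illustrated with a quick conversion from raw to usable data throughput. This is a hedged sketch: the 80-bit flit layout and the link rates below are commonly published figures, not values taken from the table.

```python
# Hedged sketch: converting raw QPI link throughput into data bandwidth.
# The flit layout and link rates are assumptions for illustration.

def qpi_data_bandwidth(gt_per_s, width_bytes=2, payload_bits=64, flit_bits=80):
    """Raw GB/s per direction, scaled by the flit payload fraction.
    An 80-bit QPI flit carries 64 bits of data; the rest is CRC and
    header, giving the ~20% overhead cited in the text."""
    raw = gt_per_s * width_bytes           # GB/s per direction, raw
    return raw * payload_bits / flit_bits  # usable data bandwidth

# QPI at 8.0 GT/s (Sandy Bridge-EP) vs 6.4 GT/s (Westmere-EP):
print(qpi_data_bandwidth(8.0))   # ~12.8 GB/s of data per direction from 16.0 raw
print(qpi_data_bandwidth(6.4))   # ~10.24 GB/s of data per direction from 12.8 raw
```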
As the table shows, Sandy Bridge-EP is a tremendous advance over the current Westmere-EP. The last level cache is 66% larger and is partitioned for vastly higher performance using the scalable ring interconnect. At the socket level, the memory bandwidth is 60% higher with 3 more DIMMs to boot. More importantly, the coherency and I/O bandwidth improved by a factor of 2.5X – from 64GB/s to 160GB/s.
The comparison to Westmere-EX is less clear, as it has substantially more QPI links and can be configured in a fully connected, 4-socket topology with a link for I/O. In that scenario, Westmere-EX has about 20% more coherency bandwidth and lower latencies, while Sandy Bridge-EP will have 2.5X more I/O. The memory bandwidth is the same, but Sandy Bridge-EP will have lower unloaded latencies (and power consumption) from directly connected DDR3. Westmere-EX’s buffered memory has higher capacity (16 DIMMs/socket versus 12) and more concurrent memory accesses, since it uses 8 channels of DDR3-1066 with 4 reads/cycle and 4 writes every other cycle. Sandy Bridge-EP should be the clear choice for workloads that are I/O-heavy, while applications with large memory or stringent reliability requirements are better suited for Westmere-EX. For other workloads, the choice is unclear at this point.
The real competition for Sandy Bridge-EP, though, is AMD’s Interlagos, which is based on the Bulldozer architecture and implemented in Global Foundries’ new 32nm process. It is hard to make a precise comparison, since there are so many unknowns regarding Interlagos and Sandy Bridge-EP. The core frequencies have not been disclosed, nor have the TDP ratings or the frequency and bandwidth of the on-chip fabrics (i.e. the ring for Sandy Bridge-EP and a crossbar for Orochi/Interlagos). However, Interlagos is socket compatible with AMD’s existing systems, so some observations are feasible.
Interlagos uses two chips (Orochi) in a package, connected with one and a half coherent HyperTransport links (3 bytes in each direction). This means that Interlagos systems have twice as many ‘logical sockets’ as ‘physical sockets’. The MCM socket has pins for 4 full HT links, so the first chip uses 5 half-links, while the second has 3 half-links. AMD’s suggested configuration has the first chip dedicating a full link for I/O.
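The link budget above can be checked with a little arithmetic. A minimal sketch, counting in HT half-links (each 8 bits per direction) and assuming the 4-link socket and 1.5-link die-to-die connection described above:

```python
# Sketch of the Interlagos MCM link budget, in HT half-links
# (each half-link is 8 bits per direction).

SOCKET_HALF_LINKS = 4 * 2   # the socket has pins for 4 full HT links
DIE_TO_DIE = 3              # 1.5 coherent links join the two Orochi dies

# AMD's suggested split of the socket pins between the two dies;
# the first die's share includes a full link (2 half-links) for I/O.
first_die_external = 5
second_die_external = SOCKET_HALF_LINKS - first_die_external

print(first_die_external, second_die_external)  # 5 and 3, as described
```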
A full Interlagos will sport 2×8 cores, 8×2MB L2 caches and 2×8MB L3 victim caches (i.e. mostly exclusive). At least 2MB of the L3 cache will be used for snoop filters, and there will also be considerable replication between the caches within a single die and within the same package. Accounting for this, Interlagos still has 30-50% more cache per socket than Sandy Bridge-EP. However, the latency might be better for Intel because of the substantially faster L2 caches and simpler inclusive LLC architecture.
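The 30-50% figure can be reproduced with back-of-the-envelope arithmetic. This is a hedged sketch: the 20MB Sandy Bridge-EP LLC and the 2-6MB range lost to snoop filters and replication are assumptions layered on the capacities above.

```python
# Hedged back-of-the-envelope for the per-socket cache comparison.
# Capacities follow the text; the replication loss is an assumption.

# Sandy Bridge-EP: inclusive 20MB LLC (the cacheable total for Intel).
snb_ep_cache = 20

# Interlagos: two Orochi dies, each with 4x2MB of L2 plus an 8MB L3.
orochi_l2 = 4 * 2
orochi_l3 = 8
interlagos_raw = 2 * (orochi_l2 + orochi_l3)   # 32MB per socket

# Subtract the snoop-filter portion (at least 2MB of L3) and assume a
# few more MB lost to replication; the exact figure is unknown.
for lost in (2, 6):  # optimistic vs pessimistic MB lost per socket
    effective = interlagos_raw - lost
    print(f"lost {lost}MB -> {effective}MB, "
          f"{100 * (effective - snb_ep_cache) / snb_ep_cache:.0f}% more than SNB-EP")
```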
The memory bandwidth of the two designs is similar, with a slight advantage for Intel due to locality within a socket versus AMD’s MCM. The coherency bandwidth is also fairly comparable, since Interlagos will use an HT link to connect to an I/O hub. However, with the integrated PCI-E 3.0 lanes, Sandy Bridge-EP will have 3-4X more I/O bandwidth – creating a clear competitive edge until AMD’s 2012 platform update.
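The 3-4X I/O claim follows from simple per-direction arithmetic. A hedged sketch, assuming 40 integrated PCI-E 3.0 lanes at 8 GT/s with 128b/130b encoding for Sandy Bridge-EP, and a single 16-bit HT link at 6.4 GT/s dedicated to I/O on Interlagos (neither figure is stated in the article's table):

```python
# Hedged arithmetic behind the "3-4X more I/O bandwidth" comparison.
# Lane counts and link rates are assumptions, not article data.

# Sandy Bridge-EP: 40 PCI-E 3.0 lanes, 8 GT/s per lane, 128b/130b
# encoding -> just under 1 GB/s per lane per direction.
snb_io = 40 * 8.0 * (128 / 130) / 8   # ~39.4 GB/s per direction

# Interlagos: one 16-bit HyperTransport link at 6.4 GT/s for I/O.
amd_io = 6.4 * 2                      # 12.8 GB/s per direction

print(round(snb_io / amd_io, 1))      # ~3.1x
```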
AMD’s snoop filter is a bit more comprehensive than Intel’s inclusive LLC, as it reduces coherency traffic and latency for local (i.e. on the same Orochi die) memory accesses. This is unlikely to be significant for 2-socket systems, but neither Sandy Bridge-EP nor Interlagos can scale to fully connected 4-socket systems, and in that scenario AMD’s snoop filter may prove to be a compelling advantage – depending on whether or not Intel implemented any directories.
Based on the available information, Sandy Bridge-EP will have a clear advantage in terms of I/O capabilities and simpler board design over AMD’s competing Interlagos. In other respects though, the server architectures seem evenly matched and will depend on the efficacy of any snoop filtering or directories. When considering the actual microprocessors, a good default expectation is that Sandy Bridge-EP will have an edge in single threaded performance, but Interlagos will achieve higher throughput for sufficiently parallel workloads. Ultimately though, the balance of single threaded and multi-threaded performance and power efficiency is unknown. It depends on how the on-chip fabrics compare between the two (i.e. ring versus crossbar), the clock frequencies, power management and the actual performance of the Bulldozer microarchitecture. However, most of these uncertainties should be cleared up in the next few months, creating a clearer picture of the server landscape for late 2011 and 2012.