Sandy Bridge for Servers


Introduction

Intel released a new generation of consumer CPUs based on the Sandy Bridge microarchitecture earlier this year. Some variants have the integrated graphics disabled and are marketed for single-socket servers. The fastest consumer parts have a base/peak frequency of 3.4GHz/3.8GHz with a 95W TDP. The server products can actually reach 3.6GHz/4GHz within the same power envelope, thanks to the savings from permanently disabling the GPU. However, these are still largely the same as the desktop parts, albeit with ECC support for the DDR3 memory. They rely on PCI-Express and DMI for I/O to the outside world and have no support for the Quick Path Interconnect.

The real server version, Sandy Bridge-EP, goes by the codename Jaketown. It was described earlier at ISSCC 2011 and is an entirely different chip that will debut in Q4 of 2011. As with all of Intel’s recent server products, Sandy Bridge-EP re-uses the core microarchitecture, but pairs it with a specially designed system architecture (or uncore). Sandy Bridge-EP is fabricated on Intel’s 32nm process and is described as a 400mm2 chip with up to 8 cores, a shared last level cache (LLC), integrated DDR3 memory controllers, PCI-Express 3.0 and Quick Path Interconnect 1.1 for multi-socket scalability.

Sandy Bridge-EP is designed to be fairly configurable to target different markets. There are two sockets, the high-end socket R (LGA2011) and the more cost-optimized socket B2 (LGA1356). Compared to the desktop Sandy Bridge (LGA1155), there are additional pins for I/O, power, and ground. Unlike previous generations, there will be no separate EX version of Sandy Bridge – that will wait until the 22nm Ivy Bridge. The TDP was not disclosed, but is probably as high as 150W given past server products and the newly integrated I/O.

The cores share the same Sandy Bridge microarchitecture as the consumer parts, but are designed for 0.85V-1.1V operation (versus 0.65V-1.05V) to ensure stability. Based on the latest leaks, the cores are rumored to run at up to 3GHz; an older presentation and a competitive comparison from Tilera indicate 2.66GHz. The most likely scenario is a 2.66GHz base frequency for a 130W part, while 150W products can reach 3GHz. In either case, the peak frequency is likely to be at least 400MHz higher, given that Westmere-EX runs at 2.4GHz/2.8GHz. That puts the peak frequency for Sandy Bridge-EP at roughly 3.4GHz for the highest bin. The cores are largely unchanged from the consumer version. However, the system architecture is substantially different and bears a strong resemblance to Nehalem-EX and Westmere-EX.

Last Level Cache and Ring

The last level cache (LLC, or L3 cache) in Sandy Bridge-EP is an inclusive, distributed design that focuses on bandwidth and latency for a multi-processor system. The size of the cache has not been disclosed, but based on our measurements, it occupies roughly 116mm2; by comparison, the Westmere-EX LLC is 30MB and 199mm2. Assuming the same density, that suggests a ~17.5MB cache for Sandy Bridge-EP. Given these facts, the most likely scenario is that the density has been modestly improved and each slice of the cache in Sandy Bridge-EP is 2.5MB, for a total of 20MB across the die. Based on Intel’s presentation, there will also be a 6-core variant, perhaps for high-end desktops and some entry-level servers.
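The estimate above is simple proportional scaling from die area; a quick sketch of the arithmetic (the figures come from the text, the same-density assumption is ours):

```python
# Back-of-the-envelope LLC size estimate from die areas.
westmere_ex_llc_mb = 30.0    # Westmere-EX LLC capacity (MB)
westmere_ex_llc_mm2 = 199.0  # Westmere-EX LLC area (mm2)
snb_ep_llc_mm2 = 116.0       # measured Sandy Bridge-EP LLC area (mm2)

# Assuming identical SRAM density:
same_density_mb = westmere_ex_llc_mb * snb_ep_llc_mm2 / westmere_ex_llc_mm2
print(round(same_density_mb, 1))  # ~17.5 MB

# A modest density improvement lands on 8 slices of 2.5MB each:
slices, mb_per_slice = 8, 2.5
print(slices * mb_per_slice)      # 20.0 MB total
```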

As with prior Intel designs, cache lines are hashed (by physical address) across the different slices to prevent contention and improve bandwidth. Each cache slice contains its own controller, which is responsible for servicing accesses and coherency traffic. The LLC design is similar to Nehalem-EX, so 4 accesses can occur in parallel each cycle. The inclusive LLC has core valid bits that indicate which core (if any) might have a copy of a cache line, so that it acts as a snoop filter for all coherency traffic – intra-chip and inter-chip.
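Intel has not disclosed the actual hash function, but the idea can be sketched with a toy one; the XOR-fold below is purely illustrative, not Intel's scheme:

```python
# Illustrative only: the real slice-hash function is not public.
# Lines are spread across slices by hashing the physical address,
# so streams of accesses don't pile onto a single slice controller.
LINE_BYTES = 64
NUM_SLICES = 8  # one slice per core on the 8-core die

def slice_of(phys_addr: int) -> int:
    line = phys_addr // LINE_BYTES            # cache-line index
    # Toy hash: XOR-fold some upper bits into the low bits.
    h = line ^ (line >> 7) ^ (line >> 13)
    return h % NUM_SLICES

# Eight sequential lines map to eight distinct slices here:
print([slice_of(a) for a in range(0, 8 * LINE_BYTES, LINE_BYTES)])
```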

Sandy Bridge-EP continues to use a ring interconnect to tie the chip together. There are 11 different agents on the ring: 8 slices (core + LLC), 1 for QPI, 1 for memory and 1 for I/O. Additionally, there are 3 stops on the ring that were inserted purely for timing purposes. Unlike the consumer version, though, the data bus is bi-directional, with one 32B ring running in each direction. Each agent interleaves access to the two rings on a cycle-by-cycle basis to prevent collisions and avoid starvation. A full 64B payload takes 2 cycles to deliver (with a cycle of latency for the interleave). The advantage of this approach is that all routing decisions are made at the source of a message and the ring is unbuffered, simplifying the overall design.

Using counter-rotating rings cuts the worst-case number of hops between any two agents roughly in half. This is critical for a server, since the available bandwidth from the LLC is reduced by the number of hops each transaction must travel. Additionally, each hop on the ring adds a bit of latency to an LLC access. The cores, ring and LLC share a power domain, but only the cores are power gated. Even if several cores are inactive, the LLC must stay awake to respond to any coherency traffic.
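To see why the counter-rotating rings matter, compare worst-case hop counts on a 14-stop ring (11 agents plus the 3 timing stops). The helper below is a sketch of the geometry, not Intel's routing logic:

```python
# 14 ring stops: 8 core/LLC slices, QPI, memory, I/O, plus 3 timing stops.
STOPS = 14

def hops_one_way(src: int, dst: int) -> int:
    """Hops on a single unidirectional ring."""
    return (dst - src) % STOPS

def hops_two_way(src: int, dst: int) -> int:
    """Hops when the source can pick the shorter of the two rings."""
    d = (dst - src) % STOPS
    return min(d, STOPS - d)

worst_one = max(hops_one_way(0, d) for d in range(STOPS))  # 13 hops
worst_two = max(hops_two_way(0, d) for d in range(STOPS))  # 7 hops
print(worst_one, worst_two)
```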

The clock design for the ring and LLC is dramatically different compared to the consumer variants. On the consumer Sandy Bridge, a single 100MHz PLL feeds a reference clock to all four slices (cores and the LLC). Each slice has another PLL that multiplies the frequency up to 3-4GHz, so the LLC and ring interconnect run at core frequency with 1 cycle of latency for each hop on the ring. The actual frequency of the Sandy Bridge-EP ring and LLC is unknown, but should be roughly 1-1.5GHz to achieve sufficient bandwidth for the memory controllers and I/O.


Figure 1 – Sandy Bridge-EP Clock Domains

Sandy Bridge-EP is much larger, and there was no way to maintain 1 cycle of latency for each ring stop because of clock skew and jitter. This is a very real problem for Sandy Bridge-EP, adding several cycles of latency to every LLC hit. Additionally, when a core misses in the LLC, it must snoop the other sockets to check their caches. If it takes two clocks to move across one ring stop, the latency for a memory access goes up by roughly 2(N+4) cycles (where N is the number of cores). For an 8-core die, that translates into a real penalty of 24 cycles, or nearly 10ns.
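Plugging in the numbers from above (N = 8 cores, and the ~2.66GHz base clock assumed earlier in the article):

```python
# Worked example of the 2(N+4)-cycle penalty quoted in the text.
N = 8                        # cores on the die
extra_cycles = 2 * (N + 4)   # 24 cycles
freq_ghz = 2.66              # assumed base clock from the text
extra_ns = extra_cycles / freq_ghz
print(extra_cycles, round(extra_ns, 1))  # 24 cycles, ~9.0 ns
```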

For these reasons, Sandy Bridge-EP has a clock domain and PLL for the ring interconnect and LLC that is separate from the cores. When data crosses from the core clock domain to the ring, there is a one cycle penalty. However, there are no latency penalties within the ring and LLC – each hop takes a single cycle – and this approach achieves better overall performance. Figure 1 shows Sandy Bridge-EP with a different color for each clock domain; note that the diagram appears to omit 2 of the ring stops that were inserted for timing.


