Memory and I/O Integration
Sandy Bridge-EP has two home agents, which contain all the logic for the memory controllers and are responsible for ensuring system-wide cache coherency. Sandy Bridge-EP has 4 channels of DDR3, but unlike Westmere-EX, it does not use any buffering techniques. Using regular memory has a slight latency and power benefit, but it also means that adding more DIMMs to the system will reduce the frequency and memory bandwidth. The DDR3 memory controllers operate at 1.6GT/s and can drive 3 DIMMs/channel (1.33GT/s with all DIMMs populated). The socket B2 versions will have only 3 channels of memory to reduce die area and save pins.
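The bandwidth cost of dropping from 1.6GT/s to 1.33GT/s can be estimated with simple arithmetic. The sketch below assumes the standard 64-bit (8-byte) DDR3 data bus per channel, which the text above does not state, and it assumes socket B2 parts run at the same 1.6GT/s. The function name is purely illustrative.

```python
# Peak DDR3 bandwidth sketch.
# Assumption (not from the article): each DDR3 channel has a 64-bit data bus,
# so every transfer moves 8 bytes.
BYTES_PER_TRANSFER = 8


def peak_memory_bandwidth_gbs(channels: int, transfer_rate_gts: float) -> float:
    """Return peak bandwidth in GB/s for the given channel count and GT/s rate."""
    return channels * transfer_rate_gts * BYTES_PER_TRANSFER


if __name__ == "__main__":
    configs = [
        ("4 channels at 1.6GT/s", 4, 1.6),
        ("4 channels at 1.33GT/s (3 DIMMs/channel)", 4, 1.33),
        # Assumption: socket B2 runs its 3 channels at the same 1.6GT/s.
        ("3 channels at 1.6GT/s (socket B2)", 3, 1.6),
    ]
    for label, channels, rate in configs:
        print(f"{label}: {peak_memory_bandwidth_gbs(channels, rate):.1f} GB/s")
```

Under these assumptions, fully populating the channels trades about 51.2GB/s of peak bandwidth for roughly 42.6GB/s in exchange for the extra capacity.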
Additionally, Sandy Bridge-EP integrates PCI-Express 3.0, which delivers twice the bandwidth per pin and is backwards compatible with version 2.0. PCI-E 2.0 has a 5GT/s signaling rate, but encodes every byte of data as 10 bits for better electrical characteristics. While the 25% overhead (10 bits on the wire for every 8 bits of data) is undesirable, it is necessary to achieve such high transmission rates. The result is 500MB/s per lane in each direction, or 1GB/s of bandwidth counting both transmit and receive.
The physical layer for PCI-E 3.0 runs at 8GT/s, which is only a 60% gain in signaling rate, but it also uses a more efficient encoding to reach the full doubling. Every 128 bits (16 bytes) of data is transmitted as 130 bits over the links. This has a disadvantage in that the minimum packet size is longer, but it cuts the encoding overhead from 25% in earlier versions to under 2%. The net result is that each PCI-E 3.0 lane can achieve nearly 2GB/s counting both directions, or roughly 32GB/s for x16 interfaces. This is particularly helpful for high performance I/O such as GPUs, SSD-based storage and Infiniband.
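The per-lane figures for both generations follow directly from the signaling rate and the encoding ratio. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative and not taken from any specification.

```python
def lane_bandwidth_gbs(rate_gts: float, payload_bits: int, encoded_bits: int) -> float:
    """Usable bandwidth of one lane in one direction, in GB/s.

    rate_gts is the raw signaling rate; payload_bits / encoded_bits is the
    fraction of transmitted bits that carry data.
    """
    return rate_gts * payload_bits / encoded_bits / 8  # 8 bits per byte


# PCI-E 2.0: 5GT/s with 8b/10b encoding.
PCIE2_LANE = lane_bandwidth_gbs(5.0, 8, 10)
# PCI-E 3.0: 8GT/s with 128b/130b encoding.
PCIE3_LANE = lane_bandwidth_gbs(8.0, 128, 130)

if __name__ == "__main__":
    for name, lane in [("PCI-E 2.0", PCIE2_LANE), ("PCI-E 3.0", PCIE3_LANE)]:
        both_directions = 2 * lane
        x16 = 16 * both_directions
        print(f"{name}: {lane:.3f} GB/s per lane per direction, "
              f"{both_directions:.2f} GB/s both directions, x16 = {x16:.1f} GB/s")
    print(f"Speedup per lane: {PCIE3_LANE / PCIE2_LANE:.2f}x")
```

The 60% faster signaling combined with the leaner encoding is what gets PCI-E 3.0 to within about 1.5% of a true doubling per lane.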
PCI-E 3.0 also includes a number of protocol improvements for overall system performance. Transaction hints are introduced so that PCI-E devices can read and write in caches (either I/O or CPU caches) without forcing data to be copied back to memory. There are optimizations so that a single PCI-E device can be natively shared by multiple virtual machines without relying on a hypervisor, which will reduce overhead for some workloads. There are also two new features intended for heterogeneous computing: atomic operations on memory for synchronization, and I/O page faulting to let PCI-E devices use swappable (rather than locked) system memory.
The Sandy Bridge-EP I/O hub (IOH) is now on-die, which will further boost performance and efficiency. High-end models have 40 lanes of integrated PCI-E 3.0 in socket R, while the cost-optimized socket B2 versions will have 24 lanes. All models have 4 lanes of DMI2, which is essentially PCI-E 2.0 with certain proprietary extensions. One chip in every system must connect to a southbridge over DMI2, but the others can re-use the interface as a 4-lane PCI-E 2.0 port.
Previously, all I/O traffic was sent from the processor over QPI to the I/O hub and then to the actual PCI-E devices. Removing the external QPI link to the I/O hub reduces power consumption and latency and enables other optimizations.
QuickPath 1.1
The first generation of processors using the QuickPath Interconnect included the Nehalem, Westmere and Tukwila families and debuted in 2008. To adjust to changes in the industry and system architecture, Intel has announced a second generation QPI 1.1 with numerous improvements at the electrical, logical and protocol levels. Sandy Bridge-EP and the Romley platform will be the first products to use the updated version.
Sandy Bridge-EP has 2 full-width QPI 1.1 links that operate at 8GT/s, or 16GB/s in each direction. This modest 25% boost over existing 32nm microprocessors is largely due to the electrical changes in QPI 1.1, such as receiver equalization. Future iterations may ramp the transfer rate up to 9.6GT/s, but that probably will not come until the 22nm generation at the earliest. With only two links per socket, 4-socket systems will not be fully connected: one socket will always be two hops away and suffer from worse latency and bandwidth. It is theoretically possible that the two QPI links can be split into four half-width ones. However, that is unlikely since there are only 3 other sockets in a system, and wasting a quarter of the bandwidth seems like a bad idea.
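Both the link bandwidth and the two-hop penalty can be checked with a short sketch. It assumes the four sockets are wired as a ring (the article does not name the topology, but it is the natural arrangement with two links per socket) and that a full-width link carries 2 bytes of payload per transfer, which is what the 8GT/s and 16GB/s figures imply. All names are illustrative.

```python
from collections import deque

# Assumption derived from the figures above: 16GB/s at 8GT/s means
# 2 bytes of payload per transfer on a full-width link.
QPI_BYTES_PER_TRANSFER = 2


def qpi_bandwidth_gbs(transfer_rate_gts: float) -> float:
    """Bandwidth of one full-width QPI link in one direction, in GB/s."""
    return transfer_rate_gts * QPI_BYTES_PER_TRANSFER


def hop_counts(links: dict, start: int) -> dict:
    """Breadth-first search returning the hop count from start to every socket."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


# Hypothetical 4-socket ring: each socket uses its 2 links for 2 neighbors.
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

if __name__ == "__main__":
    print(f"QPI 1.1 link: {qpi_bandwidth_gbs(8.0):.1f} GB/s per direction")
    for socket in RING:
        print(f"hops from socket {socket}: {hop_counts(RING, socket)}")
```

In this ring, every socket reaches two neighbors in one hop and the remaining socket in two, which matches the observation that one socket is always two hops away.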
When comparing to other systems it is critical to note that I/O traffic is no longer routed over QPI to reach an external I/O hub. In the current generation of servers, the I/O hubs (IOH) all have 36 lanes of PCI-E 2.0, which translates into 18GB/s of bandwidth in each direction. Most of the traffic across QPI should be coherency: snoops, acknowledgments, invalidations and data responses. But in those systems, any data sent to or from I/O devices also travels over QPI, at the expense of coherency bandwidth. To put that in context, a saturated 10GbE interface is roughly 10% of a QPI link and a modest RAID array with 4 HDDs could reach 5%. Removing most of the I/O traffic from QPI could yield an effective bandwidth increase of 10-15%, beyond the straight increase in transfer rates.
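The percentages above can be reproduced with a rough calculation. The sketch assumes the previous-generation QPI link runs at 12.8GB/s per direction, a figure derived from the 16GB/s and 25% boost quoted earlier rather than stated outright, and it treats 10GbE as 1.25GB/s of raw payload. The names are illustrative.

```python
PCIE2_LANE_GBS = 0.5          # PCI-E 2.0, per lane per direction
QPI_PREV_GBS = 16.0 / 1.25    # assumption: 16GB/s is a 25% boost, so 12.8GB/s before
TEN_GBE_GBS = 10 / 8          # 10 gigabits per second expressed in GB/s


def ioh_bandwidth_gbs(lanes: int) -> float:
    """Aggregate PCI-E 2.0 bandwidth of an I/O hub in one direction."""
    return lanes * PCIE2_LANE_GBS


def share_of_link(traffic_gbs: float, link_gbs: float) -> float:
    """Fraction of a link's one-direction bandwidth consumed by the traffic."""
    return traffic_gbs / link_gbs


if __name__ == "__main__":
    print(f"36-lane IOH: {ioh_bandwidth_gbs(36):.1f} GB/s per direction")
    print(f"10GbE share of a QPI link: {share_of_link(TEN_GBE_GBS, QPI_PREV_GBS):.1%}")
    # Working backwards: a 5% share across 4 drives implies this per-drive rate.
    per_drive_mbs = 0.05 * QPI_PREV_GBS / 4 * 1000
    print(f"Per-HDD throughput implied by a 5% share: {per_drive_mbs:.0f} MB/s")
```

Under these assumptions a saturated 10GbE port consumes just under 10% of a link, and the 5% RAID figure corresponds to about 160MB/s of sequential throughput per drive.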
Sandy Bridge-EP also features the new L0p power state, which lets a QPI link gear down to half-width. According to a discussion at IDF, doing so saves about 1W. Since the QPI links probably cannot be split in half, there is no real need to have quarter-width operation. Sandy Bridge-EP is not intended for the highest reliability systems and does not need double failover for QPI links, and the incremental power savings are fairly small. Altogether, the benefits of quarter-width operation do not appear to be particularly compelling and there is little reason to believe that Intel would expend the effort.
While Sandy Bridge-EP must use the new home snooping coherency protocol in QPI 1.1, there are still some questions outstanding. In particular, the capabilities of any coherency directories are unknown. Intel’s high-end servers currently use I/O directories to reduce the coherency traffic to any IOHs in the system. Given that Sandy Bridge-EP essentially has an IOH in every socket and is supposed to scale to 4 sockets, similar techniques are quite logical. In fact, extending the directories to track caching agents (i.e. other processors and not just I/O devices) would be great for 4-socket servers, since they are not fully connected.
AMD’s servers based on Magny-Cours and Bulldozer already use snoop filters and have shown some impressive benefits in terms of memory bandwidth. A full directory for Sandy Bridge-EP would eliminate that as a competitive weakness.