Without a doubt, the biggest change in Sandy Bridge-EP is the 40 lanes of integrated PCI-Express 3.0. In all previous server platforms, the discrete I/O Hub was connected to the processor through QPI, wasting coherency bandwidth. This also meant that I/O performance did not necessarily scale up or down with the number of processor sockets. In aggregate, Sandy Bridge-EP has 80GB/s for I/O and 80GB/s for inter-processor communication over QPI 1.1; in contrast, Westmere-EP shares 64GB/s of QPI bandwidth between I/O and coherency. The tighter physical integration in Sandy Bridge-EP reduces I/O latency by around 15-30% compared to the prior generation. The I/O controller also has a more advanced APIC and can configure one of the 16-lane PCI-E ports for non-transparent bridging.
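The 80GB/s aggregate I/O figure follows directly from the PCI-Express 3.0 link parameters; a quick sketch of the arithmetic (lane rate and encoding overhead are from the PCI-E 3.0 specification):

```python
# Back-of-the-envelope check of the ~80GB/s aggregate I/O bandwidth.
lanes = 40                # integrated PCI-E 3.0 lanes in Sandy Bridge-EP
gt_per_s = 8e9            # PCI-E 3.0 transfer rate: 8GT/s per lane
encoding = 128 / 130      # 128b/130b encoding overhead

per_lane = gt_per_s * encoding / 8     # usable bytes/s per lane, per direction
per_direction = per_lane * lanes       # all 40 lanes, one direction

print(per_direction / 1e9)             # ~39.4 GB/s per direction
print(2 * per_direction / 1e9)         # ~78.8 GB/s bidirectional, i.e. ~80GB/s
```

The bidirectional total lands just under 80GB/s, matching the aggregate figure quoted above.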
Eliminating the discrete I/O Hub substantially reduces overall system power; each IOH is rated at around 20W. In addition, the I/O power management is considerably more sophisticated. The 6.4-8GT/s QPI lanes can drop down to the half-width L0p state (to save power, but not for reliability) in lightly loaded scenarios. Although this was described in the original QPI development, it did not appear in the first generation. The PCI-E lanes have the L1 power saving state, which dynamically shuts down the link to save the majority of the power, and any unpopulated lanes are power gated off.
Integrating the I/O into the same die as the LLC paved the way for more intelligent techniques. Remote prefetching of I/O data into caches was an integral part of the original design of QPI and improved performance for networking. Sandy Bridge-EP moves even further in this direction. The I/O controller acts as a coherency agent and can allocate up to 10% of the LLC.
The I/O controller can substantially reduce the latency and improve performance and power efficiency by taking advantage of the cache. In previous Intel systems, I/O data was always sent through memory, which involves quite a bit of coherency overhead. Outbound data is assembled by a core in the cache, but must be evicted to memory and then read by the I/O device (and DMA reads must snoop both memory and cache). Similarly, inbound data is written by the I/O device into memory, which the core must fetch and place into the cache before consuming.
Figure 1. Outbound Data Flow
Intel’s Data Direct I/O (DDIO) bypasses memory entirely and essentially lets DMA transactions use the cache. It is entirely transparent to both software and I/O devices, and is thus supported on any system. Outbound data is simply placed in the cache, and then directly read by the I/O device. As shown in Figure 1, this eliminates the eviction to memory and the DMA snoops, and the DMA read accesses the cache instead of memory. For inbound data, the flow is equally simple. The I/O device writes to the cache, and the core then reads the data from the I/O allocated cache region. This replaces a DMA write to memory with a DMA write to the cache, and the core also avoids a memory read when it hits in the cache. The main application is networking, and this aligns very well with both InfiniBand and 10Gbit Ethernet, which will be a key element of the Sandy Bridge-EP platform.
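The difference between the two flows can be summarized by counting DRAM accesses per transferred buffer. The toy accounting below is a sketch; the event names are illustrative, not Intel's terminology:

```python
# Toy accounting of DRAM accesses per transferred buffer, contrasting the
# classic I/O flow with DDIO. Event names are illustrative only.

def classic_flow(direction):
    if direction == "outbound":
        # Core assembles data in cache -> eviction to memory -> device DMA-reads memory.
        return ["cache_evict_to_dram", "dma_read_dram"]
    else:  # inbound
        # Device DMA-writes memory -> core reads it back into the cache.
        return ["dma_write_dram", "core_read_dram"]

def ddio_flow(direction):
    # DMA targets the LLC directly in both directions: no DRAM traffic,
    # as long as the data stays resident in the cache.
    return []

for d in ("outbound", "inbound"):
    print(d, "classic DRAM accesses:", len(classic_flow(d)),
          "| DDIO DRAM accesses:", len(ddio_flow(d)))
```

In this simplified model, DDIO removes two DRAM accesses per buffer in each direction, which is where the memory traffic and power savings in the measurements below come from.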
A simple 64B L2 forwarding test showed that a single 10GBE port generates around 600-700MB/s of memory traffic, but this is entirely eliminated when using DDIO. The L2 forwarding performance improved by around 20% on lightly loaded systems and nearly 80% when using 8x10GBE. Other tests indicated that DDIO can reduce system power by around 8W per 10GBE port for L3 forwarding and 2W per port for virtualized workloads. Generally the benefits are largest for 64-128B packets, and decrease for larger data payloads.
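For context, the raw packet bandwidth of a 10GBE port saturated with 64B frames can be estimated as follows (a sketch; the 20B of per-frame overhead is the standard Ethernet preamble plus inter-frame gap, and the measured 600-700MB/s of memory traffic is plausibly below this theoretical payload rate):

```python
# Rough line-rate math behind the 64B forwarding numbers.
line_rate = 10e9    # 10GBE, bits per second
frame = 64          # minimum Ethernet frame, bytes
overhead = 20       # 8B preamble + 12B inter-frame gap, per frame on the wire

pps = line_rate / ((frame + overhead) * 8)   # ~14.88M packets/s at line rate
payload_bw = pps * frame                     # ~952MB/s of packet data

print(pps / 1e6, "Mpps")
print(payload_bw / 1e6, "MB/s")
```

So a single saturated port moves on the order of 1GB/s of packet data, which is the traffic DDIO keeps in the LLC instead of sending through DRAM.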