Intel’s Quick Path Evolved

Coherency Protocol

The biggest change in QPI 1.1 is in the cache coherency protocol itself. QPI 1.0 was designed to support two different snooping protocols. The most common is source snooping with the MESIF protocol, as described in our previous Quick Path Interconnect article and used in most x86 systems. Less common is a home snooping protocol that is found in all Itanium systems and some x86 servers that use third-party node controllers (e.g. IBM’s X5). Incidentally, AMD’s MOESI protocol also uses home snooping.

In source snooping, the requesting processor (that missed in the L3 cache) broadcasts a snoop request to the entire system. Other caching agents (i.e. anything with a cache, such as another processor) may fulfill the snoop request if they hold a cached copy of the data in certain states (e.g. M, E or F). The home agent (i.e. the memory controller that owns the data) will respond to the snoop with a clean copy of the cache line if necessary. The home agent still receives all of the acknowledgements from the caching agents and, if a conflict occurs, will resolve the transactions in the correct order.

Home snooping has three phases instead of two, and moves the snooping to the home agent instead of the requester. First, the requesting processor sends a request to the home agent. Second, the home agent will send a snoop broadcast to the caching agents in the system and possibly begin reading the cache line from memory. Lastly, the home agent and/or any caching agents will send data to the original requester. In some ways this is simpler, since the coherency management clearly resides with the home agent.
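The difference in phases can be made concrete with a toy message trace. The sketch below is only an illustration of the two flows as described above, not a model of the actual QPI packet formats: the `Message` record, the agent names and the `supplier` parameter (whichever agent ends up returning the data) are all hypothetical, and acknowledgements and conflict resolution are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    phase: int  # position in the transaction: 1, 2 or 3
    kind: str   # "request", "snoop" or "data"
    src: str
    dst: str


def source_snoop(requester, home, caching_agents, supplier):
    """Two phases: the requester broadcasts snoops, then data comes back."""
    targets = [a for a in caching_agents if a != requester] + [home]
    msgs = [Message(1, "snoop", requester, t) for t in targets]
    msgs.append(Message(2, "data", supplier, requester))
    return msgs


def home_snoop(requester, home, caching_agents, supplier):
    """Three phases: request to home, home broadcasts snoops, data comes back."""
    msgs = [Message(1, "request", requester, home)]
    msgs += [Message(2, "snoop", home, a) for a in caching_agents if a != requester]
    msgs.append(Message(3, "data", supplier, requester))
    return msgs


def phases(msgs):
    """Number of sequential phases before the requester has its data."""
    return max(m.phase for m in msgs)


# Hypothetical 4-socket system; cpu2 happens to hold the line.
agents = ["cpu0", "cpu1", "cpu2", "cpu3"]
src = source_snoop("cpu0", "home1", agents, supplier="cpu2")
hom = home_snoop("cpu0", "home1", agents, supplier="cpu2")
print(phases(src), phases(hom))
```

In this simplified trace both flows send the same number of messages; what changes is who issues the snoops and how many sequential steps sit between the miss and the data.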

Source snooping has lower latency, especially when the requested cache line is held in remote memory and a remote cache. This is most common for workloads that have no NUMA awareness. The benefits are greater if accessing data in a cache is substantially faster than memory. However, home snooping is a more natural fit for inter-socket snoop filtering and directories. After receiving the request, the home agent will probe the directory (or snoop filter) and only send snoop requests to the caching agents that have a copy of the data. Home snoop protocols using directories tend to scale better, because snoops are only sent to caching agents that hold the requested data and thus consume less bandwidth across QPI.
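To see why a directory reduces snoop traffic, consider a minimal sketch of a home agent that either broadcasts or consults a directory before snooping. The `HomeAgent` class, its method names and the agent names are assumptions made for illustration; the model only tracks read sharers and ignores evictions, invalidations and the directory's own storage cost.

```python
class HomeAgent:
    """Toy home agent: broadcast snoops, or filter them through a directory."""

    def __init__(self, caching_agents, use_directory):
        self.caching_agents = list(caching_agents)
        self.use_directory = use_directory
        self.directory = {}  # line address -> set of agents holding a copy

    def snoop_targets(self, requester, addr):
        if self.use_directory:
            # Only agents recorded as holding the line need to be snooped.
            candidates = self.directory.get(addr, set())
        else:
            # Without a directory, every other caching agent is snooped.
            candidates = self.caching_agents
        return sorted(a for a in candidates if a != requester)

    def handle_read(self, requester, addr):
        targets = self.snoop_targets(requester, addr)
        # Record the requester as a holder of the line for later requests.
        self.directory.setdefault(addr, set()).add(requester)
        return targets


# Hypothetical 8-socket system reading the same line twice.
agents = [f"cpu{i}" for i in range(8)]

broadcast = HomeAgent(agents, use_directory=False)
filtered = HomeAgent(agents, use_directory=True)

print(len(broadcast.handle_read("cpu0", 0x40)))  # snoops every other agent
print(filtered.handle_read("cpu0", 0x40))        # nobody holds the line yet
print(filtered.handle_read("cpu1", 0x40))        # only the previous reader
```

The broadcast case costs one snoop per remote caching agent on every miss, so it grows with socket count, while the directory case grows only with the number of agents that actually share the line.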

Intel’s studies for the first generation of QPI showed that source snooping was generally faster for 1-2 socket systems, equally fast for 4-socket systems, while home snooping was better for anything larger.

However, those studies were done back in the 2004-2006 timeframe. Changes in the industry such as the trend towards multi-core, greater integration and better NUMA support in mainstream operating systems have altered the playing field. With each additional core in a system, the amount of snoop traffic grows, since each core will have its own steady stream of cache misses. This places a greater burden on the QPI links and the last level caches, which act as snoop filters. Directories are also becoming more relevant, since they are used to avoid probing I/O devices, and future server products will have integrated I/O.

Based on these changes, subsequent studies showed that there was no longer an advantage for source snooping in 2-socket systems. As a result, QPI 1.1 is solely a home snooped protocol and there is no longer any support for source snooping. An additional benefit is that Itanium will fully leverage the x86 ecosystem since the two product lines now share the same home-snooped coherency protocol.

Conclusions

The first generation of Intel’s Quick Path Interconnect dates back as early as 2002, and the majority of the work was done starting in 2004. QPI 1.0 was a massive step forward over the ancient front-side bus architecture that Intel platforms used from 1995-2008, and finally caught up with and exceeded AMD’s HyperTransport. The next generation Quick Path Interconnect 1.1 is largely an incremental improvement at the physical and logical layer, but a substantial change in terms of coherency protocol.

The physical layer has been tuned with receiver equalization to achieve higher frequencies than the current 6.4GT/s, and future generations will adopt adaptive equalization for even higher performance. There is a new L0p power state that operates a QPI link at reduced bandwidth and is vastly more efficient for a low utilization workload than existing alternatives. For instance, with QPI 1.0, a socket with only 3 of its 8 cores active could save link power only by shutting the links down entirely for a short period of time, which would impose a significant latency penalty on the remaining traffic stream. The new L0p state can save power without that latency penalty and is a much better fit for light workloads that are continuously active.

The major change is shifting exclusively to a home-snooped coherency protocol for QPI 1.1, whereas previously x86 systems were largely source snooped. This unifies the coherency approach for x86 and IPF, achieving better re-use of design and validation. More importantly though, QPI 1.1 sets the stage for x86 systems that can efficiently scale to larger systems, using a directory based protocol.

It is quite likely that Intel or third-party server chipset companies will adopt more sophisticated snoop filtering or directory techniques to scale up future systems using QPI 1.1. This would be particularly advantageous for servers with 4 sockets or more, where coherency bandwidth is one of the biggest factors in scaling. For example, SGI’s Altix has its own custom designed node controllers with a directory-based coherency protocol that extends QPI over 256 sockets. Looking at AMD, they also use a snoop filter to reduce coherency traffic and improve latency on their Magny-Cours systems. In part, this is because Magny-Cours is two dice in a package, so a mainstream 4-socket system looks like an 8-socket server from a coherency standpoint.

The first proof of the benefits of QPI 1.1 will come with Sandy Bridge-EP, which is primarily designed for 2-socket servers, but can scale further. The 4-socket models will demonstrate how the coherency protocol improves scaling and give a preview of what to expect in larger systems based on Ivy Bridge-EX and other future generations.

