In the competitive x86 microprocessor market, there are always swings and shifts based on the latest introductions from the two main protagonists: Intel and AMD. The next anticipated shift is coming in 2008-9, when Intel will finally replace their front-side bus architecture. This report details Intel’s next generation system interconnect and the associated cache coherency protocol, the likely deployment plans across the desktop, notebook and server markets, and the economic implications.
Intel’s front-side bus has a long history that dates back to 1995 with the release of the Pentium Pro (P6). The P6 was the first processor to offer cheap and effective multiprocessing support; up to four CPUs could be connected to a single shared bus with very little additional effort for an OEM. The importance of cheap and effective cannot be overstated. Before the P6, multiprocessor systems used special chipsets and usually a proprietary variant of UNIX; consequently they were quite expensive. Initially, Intel’s P6 could not always match the performance of these high-end systems from the likes of IBM, DEC or Sun, but the price was so much lower that the performance gap became a secondary consideration. The workstation and low-end server markets embraced the P6 precisely because the front-side bus enabled inexpensive multiprocessors.
Ironically, the P6 bus was the subject of considerable controversy at Intel. It was originally based on the bus used in the i960 project, and the designers came under pressure from various corporate factions to re-use the bus from the original Pentium so that OEMs would not have to redesign and validate new motherboards, and so end users could easily upgrade. However, the Pentium bus was strictly in-order and could only have a single memory access in flight at a time, making it entirely inadequate for an out-of-order microprocessor like the P6 that would have many simultaneous memory accesses. Ultimately a compromise was reached that preserved most of the original P6 bus design, and the split-transaction P6 bus is still being used in new products 10 years after the design was started. The next step for Intel’s front-side bus was the shift to the P4 bus, which was electrically similar to the P6 bus and issued commands at roughly the same rate, but clocked the data bus four times faster to provide fairly impressive throughput.
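The throughput argument here can be illustrated with a toy model. The sketch below contrasts a strictly in-order bus that allows only one outstanding memory access (the Pentium style) with a split-transaction bus that decouples requests from responses so several accesses overlap in flight (the P6 style). The cycle counts, issue interval and queue depth are purely illustrative assumptions, not real figures for either bus.

```python
# Toy model: in-order single-outstanding bus vs. split-transaction bus.
# All latencies below are made-up illustrative numbers.

def in_order_bus(n_requests, mem_latency):
    """Each access must fully complete before the next can even start,
    so total time is simply the serialized sum."""
    return n_requests * mem_latency

def split_transaction_bus(n_requests, mem_latency, issue_interval, max_in_flight):
    """Requests can issue every `issue_interval` cycles as long as fewer
    than `max_in_flight` transactions are outstanding; each response
    returns `mem_latency` cycles after its request issues."""
    finish_times = []
    for i in range(n_requests):
        issue = i * issue_interval
        if i >= max_in_flight:
            # Must wait until an older transaction retires to free a slot.
            issue = max(issue, finish_times[i - max_in_flight])
        finish_times.append(issue + mem_latency)
    return finish_times[-1]

# Eight cache-miss-like accesses, 100 cycles to memory each:
serial = in_order_bus(8, mem_latency=100)                       # 800 cycles
overlapped = split_transaction_bus(8, mem_latency=100,
                                   issue_interval=10,
                                   max_in_flight=8)             # 170 cycles
```

Even in this crude model, overlapping the accesses hides most of the memory latency, which is why an out-of-order core like the P6 would have been crippled by a one-outstanding-access bus.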
While the inexpensive P4 bus is still in use for Intel’s x86 processors, the rest of the world moved on to newer point-to-point interconnects rather than shared buses. Compared to systems based on HP’s EV7 and, more importantly, AMD’s Opteron, Intel’s front-side bus shows its age; it simply does not scale as well. Intel’s own Xeon and Xeon MP chipsets illustrate the point quite well – both use two separate front-side bus segments in order to provide enough bandwidth to feed all the processors. Similarly, Intel designed all of their MPUs with relatively large caches to reduce the pressure on the front-side bus and memory systems, exemplified by the Xeon MP and Itanium 2, sporting 16MB and 24MB of L3 cache respectively. While some critics claim that Intel is pushing an archaic solution and patchwork fixes on the industry, the truth is that this is simply a replay of the issues surrounding the Pentium versus P6 bus debate, writ large. The P4 bus is vastly simpler and less expensive than a higher performance, point-to-point interconnect such as HyperTransport or CSI. After 10 years of shipping products, there is a massive amount of knowledge and infrastructure invested in the front-side bus architecture, both at Intel and at strategic partners. Tossing out the front-side bus will force everyone back to square one. Intel opted to defer this transition by increasing cache sizes, adding more bus segments and including snoop filters to create competitive products.
While Intel’s platform engineers devised more and more creative ways to improve multiprocessor performance using the front-side bus, a highly scalable next generation interconnect was being jointly designed by engineers from teams across Intel and some of the former Alpha designers acquired from Compaq. This new interconnect, known internally as the Common System Interface (CSI), is explicitly designed to accommodate integrated memory controllers and distributed shared memory. CSI will be used as the internal fabric for almost all future Intel systems, starting with Tukwila, an Itanium processor, and Nehalem, an enhanced derivative of the Core microarchitecture, both slated for 2008. Not only will CSI be the cache coherent fabric between processors, but versions will be used to connect I/O chips and other devices in Intel based systems.
The design goals for CSI are rather intimidating. Roughly 90% of all servers sold use four or fewer sockets, and that is where Intel faces the greatest competition. For these systems, CSI must provide low latency and high bandwidth while keeping costs attractive. On the other end of the spectrum, high-end systems using Xeon MP and Itanium processors are intended for mission critical deployment and require extreme scalability and reliability for configurations as large as 2048 processors, but customers are willing to pay extra for those benefits. Many of the techniques that larger systems use to ensure reliability and scalability are more complicated than necessary for smaller servers (let alone notebooks or desktops), producing somewhat contradictory objectives. Consequently, it should be no surprise that CSI is not a single implementation, but rather a closely related family of implementations that will serve as the backbone of Intel’s architectures for the coming years.