Poulson: The Future of Itanium Servers


System Architecture

From its conception, the goal of Itanium was to address the entire server and workstation market – from HPC to mainframes. In contrast, notebooks and desktops make up the overwhelming majority of x86 microprocessors from AMD and Intel. While x86 designs have grown up and can tackle most of the workloads meant for Itanium, they have stayed true to their roots. There is a limit to how much additional hardware Intel and AMD can put into mainstream x86 designs without compromising the volume economics. The system architecture for Itanium has a much greater focus on system scalability and reliability. As Figure 7 shows, both Tukwila and Poulson have more QPI links than Westmere-EX for scalability.

Poulson is socket compatible with Tukwila and relies on a similar system architecture. Both processors use a variant of the QuickPath Interconnect found in x86 designs, which is tuned for scalability and reliability. All x86 microprocessors rely on snoop-based cache coherency; whenever a core misses in the last level cache and reads from memory, it must also send a request to the caches in all other sockets to check for copies of the cache line. Snooping is very low latency for 1-4 sockets, but is inefficient for larger systems.


Figure 7 – Poulson System Architecture and Comparison

In contrast, Tukwila and Poulson have a directory-based coherency protocol that scales much better. For every cache line, the directory lists which cores have a copy. When a memory access misses in the L3, it first checks the directory to determine which other cores have the cache line and whether it should get the data from memory or another cache. Either way, only a single request and response are sent, compared to N requests and N responses in a snooping system. Checking the directory adds a bit of latency, but for 4 or 16 socket systems, the bandwidth savings are huge. To accelerate the whole process, Tukwila and Poulson also include specialized caches for the directory.
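To make the difference concrete, here is a simplified Python sketch that counts the coherency messages generated by a last level cache miss under the two schemes and models a directory entry as a bitmask of sharers. The message counts and the DirectoryEntry class are illustrative abstractions, not the actual QPI protocol flows.

```python
# Simplified model of coherency traffic per last-level cache miss.
# Message counts are illustrative; the real QPI protocol has more flows.

def snoop_messages(num_sockets):
    """Broadcast snooping: probe every other socket and collect the replies."""
    return 2 * (num_sockets - 1)      # (N-1) snoop requests + (N-1) responses

def directory_messages():
    """Directory lookup: one request to the home agent, one response
    (from memory or forwarded by the single cache that holds the line)."""
    return 2

class DirectoryEntry:
    """Hypothetical per-line directory state: a bitmask of sharers."""
    def __init__(self):
        self.sharers = 0
    def add_sharer(self, core_id):
        self.sharers |= 1 << core_id
    def sharer_list(self):
        return [i for i in range(self.sharers.bit_length()) if self.sharers >> i & 1]

for sockets in (2, 4, 8, 16):
    print(f"{sockets} sockets: {snoop_messages(sockets)} snoop messages vs "
          f"{directory_messages()} directory messages per miss")
```

The broadcast traffic grows linearly with the socket count, while the directory traffic stays constant – which is exactly why snooping is fine for small systems but becomes the limiter at 8 or 16 sockets.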

While Poulson and Tukwila are highly similar from a system perspective, there are several changes that substantially improve overall performance. The most important enhancement concerns Poulson’s L3 cache, but there are a number of tweaks to other areas.

In Tukwila, each of the four cores has a private L3 cache that services all misses from the L2D and L2I. Each L3 is 6MB and 12-way associative, with 128B lines and a minimum latency of 15 cycles. The L3 generally gives priority to instruction requests over data requests, since instruction misses will stall the core pipeline and prevent forward progress. The L3 is a write-back design with ECC protection and is neither inclusive nor exclusive with respect to the lower level caches.
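As a quick check on the geometry, the sketch below derives the set count and address bit split implied by those parameters. This is standard set-associative arithmetic applied to the quoted numbers, not Intel-disclosed figures.

```python
# Derive the set count and address split for a set-associative cache.
def cache_geometry(size_bytes, ways, line_bytes):
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the set count
    return sets, offset_bits, index_bits

# Tukwila private L3: 6MB, 12-way associative, 128B lines.
sets, offset_bits, index_bits = cache_geometry(6 * 1024 * 1024, 12, 128)
print(sets, offset_bits, index_bits)   # 4096 sets, 7 offset bits, 12 index bits
```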

Tukwila’s system interfaces include 4 full-width QPI links for coherency, 2 half-width links for I/O and 2 home agents, which are responsible for maintaining a directory-based cache coherency protocol. Compared to Intel’s x86 designs, such as Westmere-EX, Itanium has an extra QPI link for coherency. The QPI links all operate at 4.8GT/s, running slightly slower than Nehalem/Westmere due to the older process and larger safety margins. The home agents each have a memory controller and 4 high-frequency SMI links, which use a similar physical layer to QPI (4.8GT/s, differential). Each SMI link runs at 6X the memory frequency and connects to a DDR3 buffer chip that drives multiple DIMMs, so memory capacity can be added while keeping the same bandwidth. SMI is bidirectional and the buffer chip can simultaneously read and write to different DIMMs. While the memory interfaces are similar for x86 and Itanium, one key difference is that Tukwila’s home agents also include a 1.1MB directory cache to reduce memory latency and improve scalability. The 4 cores and 8 system interfaces are tied together with a 12-port, duplexed crossbar switch with 8B (or 19.2GB/s) per port.
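As a sanity check on the crossbar figures, the short sketch below backs out the port clock implied by an 8B port delivering 19.2GB/s. The resulting ~2.4GHz (half the 4.8GT/s QPI rate) and the per-direction aggregate are inferences from the quoted numbers, not disclosed specifications.

```python
# Back out the port clock implied by the quoted crossbar numbers.
port_width_bytes = 8            # 8B per crossbar port (per the article)
port_bandwidth_gbs = 19.2       # GB/s per port (per the article)

implied_clock_ghz = port_bandwidth_gbs / port_width_bytes
print(f"implied port clock: {implied_clock_ghz:.1f} GHz")     # 2.4 GHz

# Aggregate across the 12 crossbar ports; since the crossbar is duplexed,
# this is presumably per direction (an assumption, not a quoted figure).
print(f"aggregate: {12 * port_bandwidth_gbs:.1f} GB/s")       # 230.4 GB/s
```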

Poulson appears to re-use the L3 cache design from Intel’s existing x86 designs, such as Sandy Bridge, Nehalem-EX and Westmere-EX, while modifying the system interface architecture inherited from Tukwila. The L3 cache is now shared by all cores, instead of being private, which lowers latency for sharing data and on-die communication. The L3 cache is 32MB and 32-way associative, with smaller 64B lines. Unfortunately, several details are missing, including the minimum latency and whether the L3 is inclusive of the other levels in the cache hierarchy. Collectively the L1 and L2 D-caches are 2.1MB, so it is conceivable that the L3 cache is inclusive. But with directory-based coherency, inclusion is less beneficial.
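One way to see why inclusion is plausible is to look at the capacity that inclusive copies would consume. The sketch below is simple arithmetic on the quoted sizes; the ~6.6% overhead figure is derived here, not taken from Intel.

```python
# How much of Poulson's 32MB L3 would be spent holding inclusive copies
# of the ~2.1MB held collectively in the L1 and L2 data caches?
l3_bytes = 32 * 1024 * 1024
lower_level_bytes = int(2.1 * 1024 * 1024)

overhead = lower_level_bytes / l3_bytes
print(f"inclusive duplication: {overhead:.1%} of L3 capacity")   # ~6.6%

# Geometry of the shared L3: 32MB, 32-way associative, 64B lines.
sets = l3_bytes // (32 * 64)
print(f"{sets} sets")                                            # 16384 sets
```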

The shared L3 cache is implemented similarly to existing x86 designs, but with a greater emphasis on performance and scalability. The L3 cache is partitioned into 8 slices of 4MB, one per core – compared to 3MB per core for Nehalem-EX. The slices all sit on a ring interconnect that services all L3 requests. Poulson’s ring interconnect has two 32B wide data rings that transmit in opposite directions. Using the shortest path on either ring cuts the average latency and improves the usable bandwidth, compared to simpler uni-directional designs like Sandy Bridge. The cache and ring interconnect have a reported theoretical bandwidth of 700GB/s from the L3 cache, roughly 3X the 250GB/s simulated for Nehalem-EX.
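The latency benefit of the counter-rotating rings can be seen from hop counts alone. The sketch below uses an abstract ring model; the stop counts are illustrative (10 stops would correspond to the 8 slices plus the two crossbar ring ports described below, but the actual number of ring stops on Poulson has not been disclosed).

```python
# Average hop count between two distinct stops on a ring of n stops.
def avg_hops_unidirectional(n):
    # Only one direction available: distances 1 .. n-1 are equally likely.
    return sum(range(1, n)) / (n - 1)

def avg_hops_bidirectional(n):
    # Shortest path in either direction: distance is min(d, n - d).
    return sum(min(d, n - d) for d in range(1, n)) / (n - 1)

for stops in (8, 10, 16):
    uni = avg_hops_unidirectional(stops)
    bi = avg_hops_bidirectional(stops)
    print(f"{stops} stops: {uni:.2f} hops one-way vs {bi:.2f} hops shortest-path")
```

Roughly speaking, shortest-path routing halves the average distance, and having two rings doubles the number of transfers in flight, which is where the bandwidth claim comes from.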

Poulson’s system interface crossbar is a separate block from the L3 cache and the ring interconnect. The crossbar has shrunk down to only 10 ports, compared to 12 for Tukwila. While Tukwila required a port for each core and L3 cache, Poulson has two ports that connect to the ring interconnect – one for each direction. Poulson’s external QPI and SMI interfaces are clocked 33% faster (up to 6.4GT/s), a nice boost with the new design, and it is likely that the crossbar bandwidth scaled up similarly. The directory caches expanded slightly from a total of 1.9MB to 2.2MB. The SMI links provide a peak usable memory bandwidth (i.e. after ECC overhead) of 51.2GB/s and support 512GB of memory capacity per socket. The communication bandwidth over QPI is an impressive 160GB/s – 128GB/s for coherency traffic to other processors and 32GB/s for I/O devices.
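Dividing the quoted QPI aggregates by the per-socket link counts gives the per-link figures, as in the sketch below. The assumption that Poulson keeps Tukwila’s configuration of 4 full-width coherency links and 2 half-width I/O links carries over from the socket-compatible design.

```python
# Split the quoted QPI aggregates across the assumed per-socket link counts
# (4 full-width coherency links + 2 half-width I/O links, as on Tukwila).
coherency_total_gbs = 128.0
io_total_gbs = 32.0

per_full_link = coherency_total_gbs / 4    # 32 GB/s per full-width link
per_half_link = io_total_gbs / 2           # 16 GB/s per half-width link

print(per_full_link, per_half_link)
# The half-width links come out at exactly half the full-width figure,
# consistent with the article's accounting.
```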


