The socket infrastructure for Magny-Cours was explicitly designed to accommodate future Bulldozer-based CPUs. This includes both the 1944-pin CPU socket itself and the approach of combining two CPUs (referred to as nodes) in a single package. Inheriting the system architecture from a previous generation is a wholly logical and wise decision on AMD’s part. First of all, AMD’s server volumes are relatively low. The longer a platform lasts, the greater its cumulative volume, and thus partners (and AMD itself) can justify a greater investment in the ecosystem. AMD has historically tried to keep its platforms as stable as possible to encourage partner adoption, and even Intel is similarly constrained (primarily at the high end of the market). Second, Bulldozer is already a very risky project, combining a new microarchitecture, a new process technology and a new manufacturing arrangement. Re-using parts of the system architecture avoids adding further risk and complexity.
As AMD indicated in their presentation at the last Hot Chips, the packaging arrangement for Magny-Cours is intended to achieve the majority of the performance and benefits of a 2N-socket system, but in the footprint of an N-socket system. The drawback is that no more than 4 Magny-Cours packages can be connected without a node controller. In contrast, Intel’s Nehalem-EX can gluelessly scale up to 8 sockets.
Interlagos, the first Bulldozer instantiation, is an 8-core device. However, since it is intended to fully inherit the socket infrastructure of the previous generation, the first products will offer up to 16 cores using an MCM. Figure 1 below shows the expected system architecture for a single Interlagos node (i.e. one die in an MCM) and the actual system architecture for a Magny-Cours MCM, Westmere-EP and Nehalem-EX. Figure 1 readily demonstrates the hierarchy in Interlagos (e.g. shared L2 caches) compared to the relatively flat topology of Intel’s offerings and the previous generation Magny-Cours. In AMD’s terminology, each pair of cores (which share a front-end, FPU and L2 cache) is referred to as a module, or a compute unit. Note that the cache hierarchies in Intel and AMD designs are quite different. Intel’s CPUs use an inclusive last level cache to act as a snoop filter, so Westmere can only cache 12MB of data total. AMD favors victim caches, which are mostly exclusive, so that separate data may be held in the L2 and L3 caches. The theoretical cache capacity of an Interlagos node is estimated to be 16MB, while each node in a Magny-Cours MCM can hold up to ~9MB; in reality data will be replicated between caches within a node, and also within a single MCM, reducing the effective capacity. The overall benefits of Bulldozer’s throughput focus seem clear: Intel’s roadmaps for 2011 mostly call for 8 and 10 core products, possibly leaving an opportunity for AMD.
Figure 1 – Bulldozer System Architecture
Each Interlagos die contains two 64-bit DDR3 memory controllers. Magny-Cours was limited to operation at 1.33GT/s, or 21.3GB/s of memory bandwidth per die, which is the minimum expected for Interlagos. It is quite possible, even likely, that AMD will support faster operation, as there are JEDEC compliant DDR3 variants that operate at 1.86GT/s, although the fastest 2.13GT/s may be out of reach for servers. The four 16-bit wide coherent HyperTransport (HT, or ccHT) interfaces run at 6.4GT/s, for 102GB/s of raw coherency traffic. However, ccHT uses a variable length CRC algorithm for packets, so the usable bandwidth is lower (for comparison, Intel’s fixed length CRC uses 20% of the raw bandwidth).
In AMD’s MCM, two dice are paired together and connected using a full-width ccHT link and a second half-width link. Only one of the two dice can connect to external non-coherent I/O, but both dice will connect to other CPUs in the system. In aggregate, the MCM has an impressive 42.6GB/s of memory bandwidth and 102GB/s of raw HT bandwidth, both divided equally between the two dice. It is likely that Interlagos will have even more memory bandwidth, although how much more is hard to say.
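The bandwidth figures above follow from simple arithmetic, sketched below. The function names are illustrative, and two assumptions are made that the text does not state explicitly: 1.33GT/s is treated as 4/3 GT/s (DDR3-1333), and the raw HT figure counts both directions of each link. The 1.866GT/s case is a hypothetical upgrade, not a confirmed Interlagos feature.

```python
# Back-of-the-envelope peak bandwidth figures; inputs come from the text.

def ddr3_bandwidth_gbs(channels, width_bits, rate_gts):
    """Peak memory bandwidth in GB/s: channels * bytes per transfer * GT/s."""
    return channels * (width_bits / 8) * rate_gts

def ht_raw_bandwidth_gbs(links, width_bits, rate_gts):
    """Raw HT bandwidth in GB/s, counting both directions (assumption)."""
    per_direction = (width_bits / 8) * rate_gts
    return links * per_direction * 2

# Two 64-bit DDR3 channels per die at 1.33GT/s (taken as 4/3 GT/s).
die_mem = ddr3_bandwidth_gbs(channels=2, width_bits=64, rate_gts=4 / 3)

# Two dice per MCM.
mcm_mem = 2 * die_mem

# Four 16-bit links at 6.4GT/s.
ht_raw = ht_raw_bandwidth_gbs(links=4, width_bits=16, rate_gts=6.4)

# Hypothetical: the same two channels running at DDR3-1866.
faster_die_mem = ddr3_bandwidth_gbs(channels=2, width_bits=64, rate_gts=1.866)

print(f"per-die memory:  {die_mem:.2f} GB/s")
print(f"per-MCM memory:  {mcm_mem:.2f} GB/s")
print(f"raw HT:          {ht_raw:.1f} GB/s")
print(f"per-die at 1866: {faster_die_mem:.2f} GB/s")
```

Before rounding, the per-MCM memory figure comes out at roughly 42.67GB/s; doubling the already-rounded 21.3GB/s gives the 42.6GB/s quoted above.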
Even though the first Bulldozer implementations are somewhat constrained by socket compatibility, there are still a number of opportunities for improvement. For instance, the coherency protocol is still undisclosed and has plenty of room for improvement. As a more concrete example, consider the snoop filter. For Magny-Cours, 1MB out of the 6MB L3 cache was repurposed for a snoop filter that tracks data from local memory that is cached elsewhere in the system. Each 64B cache line holds 16 snoop filter entries in a 4-way set associative arrangement, for a total of 256K entries that index 16MB of cache. With the snoop filter enabled, each Magny-Cours die contains 8MB (or 128K lines) of usable cache (3MB L2 + 5MB L3), thus a 4-socket system (8 dice) could in theory have up to 64MB of cached data, although in most situations there is substantial replication between caches.
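The snoop filter numbers can be checked with a short calculation. This is only a sketch of the arithmetic in the paragraph above; the function and constant names are illustrative and do not correspond to anything in AMD’s documentation.

```python
KB = 1024
MB = 1024 * KB

LINE_BYTES = 64        # cache line size, from the text
ENTRIES_PER_LINE = 16  # snoop filter entries packed into one 64B line

def snoop_filter_coverage(filter_bytes):
    """Return (entry count, bytes of cache tracked) for a given filter size."""
    lines = filter_bytes // LINE_BYTES
    entries = lines * ENTRIES_PER_LINE
    # Each entry tracks one 64B cache line.
    return entries, entries * LINE_BYTES

# Magny-Cours: 1MB of the 6MB L3 is used as the snoop filter.
entries, covered = snoop_filter_coverage(1 * MB)

# Usable cache per die with the filter enabled: 3MB L2 + 5MB L3.
usable_per_die = 3 * MB + 5 * MB
lines_per_die = usable_per_die // LINE_BYTES

# A 4-socket system has 2 dice per MCM, so 8 dice in total.
system_cache = 4 * 2 * usable_per_die

print(f"entries:        {entries // KB}K")
print(f"cache tracked:  {covered // MB}MB")
print(f"lines per die:  {lines_per_die // KB}K")
print(f"4-socket cache: {system_cache // MB}MB")
```

One consequence of this layout is that the filter tracks twice as much cache (16MB) as a single die can hold (8MB), which leaves headroom for local data cached by the other dice in the system.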
Since Interlagos is on 32nm rather than 45nm, it is natural to expect more cache. Interlagos probably has 16MB of cache per die, roughly double Magny-Cours. Assuming this is true, a 2MB snoop filter would be a reasonable starting point. Alternatively, AMD may have taken an entirely different approach, such as a full directory-based cache coherency protocol (similar to Itanium).
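To see why 2MB is a plausible starting point, the sketch below compares how much cache each filter tracks relative to the cache on one die. It assumes, purely for illustration, that Interlagos would keep the Magny-Cours entry format (16 entries per 64B line, one entry per cache line) and that the estimated 16MB is all usable cache; neither is confirmed.

```python
MB = 1024 * 1024
LINE_BYTES = 64        # cache line size
ENTRIES_PER_LINE = 16  # Magny-Cours entry density, assumed unchanged

def tracked_bytes(filter_bytes):
    """Bytes of cache that a snoop filter of this size can track."""
    entries = (filter_bytes // LINE_BYTES) * ENTRIES_PER_LINE
    return entries * LINE_BYTES

def coverage_ratio(filter_bytes, die_cache_bytes):
    """Tracked cache divided by the cache capacity of a single die."""
    return tracked_bytes(filter_bytes) / die_cache_bytes

# Magny-Cours: 1MB filter, 8MB usable cache per die.
magny_cours = coverage_ratio(1 * MB, 8 * MB)

# Hypothetical Interlagos: 2MB filter, estimated 16MB cache per die.
interlagos = coverage_ratio(2 * MB, 16 * MB)

print(f"Magny-Cours coverage ratio: {magny_cours:.1f}")
print(f"Interlagos coverage ratio:  {interlagos:.1f}")
```

Under these assumptions both designs track twice the cache of a single die, so doubling the filter alongside the cache simply preserves the Magny-Cours ratio. If the filter were instead carved out of the 16MB, the usable cache would shrink and the ratio would be slightly higher.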