The Philosophy of Bulldozer
AMD first managed to break into the server market in 2003 with the K8, thanks to the 64-bit extended instruction set and system architecture choices such as integrated memory controllers and on-die interconnects. Intel certainly helped out as well, since the Pentium 4 was a distinctly unfriendly product for servers. AMD eventually hit an impressive 25% market share, and over 50% share in the lucrative 4-socket server market, by 2005/6. The key to AMD’s success was that they provided exactly what customers wanted (x86-64 and good server performance), while Intel was distracted with Itanium and the P4. In essence, they found an area where Intel could not (or would not) focus, put all their efforts into addressing customer needs, and were able to change the rules of the game.
By mid to late 2006, AMD’s server fortunes were on the decline as Intel launched the 65nm Core 2 Duo (a dual core), and a multi-chip module (MCM) that paired together two chips to provide 4 cores in total. AMD’s 65nm, 4-core Barcelona (and later derivatives) improved the competitive situation somewhat, but was not enough to really change the overall direction of the market. This reversal accelerated in 2008, when Intel launched the 45nm, 4-core Nehalem, their first server microprocessor to use the so-called QuickPath Interconnect – really an integrated memory controller and an on-die coherent interconnect. With that change in system architecture, AMD’s last major advantage was gone and Intel had much higher performance at comparable or better power efficiency. AMD’s response in mid-2009 was the 45nm, 6-core Istanbul, which did not quite manage to achieve parity. In early 2010, Intel completed their return to server ascendancy with the 32nm, 6-core Westmere and the 45nm, 8-core Nehalem-EX, which targeted 4-socket designs and was the first high-end server part with the QuickPath Interconnect. AMD’s failure here was largely a result of trying to match Intel’s superior manufacturing and resources head-on.
In early 2010, AMD launched Magny-Cours, which pairs two existing 45nm, 6-core CPUs in a single MCM. For highly threaded and parallel applications, Magny-Cours is efficient and affordable; for instance, it is very well suited to certain portions of the HPC market. However, applications which favor single threaded performance will inevitably do best running on an Intel-based solution. Unfortunately, differentiating solely based on the number of cores has a cost – die area. From a manufacturing standpoint, each Magny-Cours uses a pair of 346mm² chips, compared to a single 246mm² die for Westmere and 684mm² for Nehalem-EX.
Philosophically, Bulldozer seems to learn from the lessons of the past decade. AMD is stepping back from the pursuit of single threaded performance to emphasize throughput. The cores are not lightweight, as with a GPU or with Niagara; the single threaded performance should actually be higher than the previous generation Magny-Cours and comparable to current Intel designs. However, in determining project goals for Bulldozer, single threaded performance was consciously sacrificed to meet what the team determined was a more optimal overall design point. This stands in contrast to Intel, where single threaded performance is still the first and foremost design target for designs like Sandy Bridge. This is an acknowledgement that AMD cannot beat Intel on single threaded performance, and that attempting such an endeavor would be a repetition of the last 3 years. Instead, they are trying to change the rules of the game by focusing on core count and highly parallel server workloads.
Bulldozer is the first x86 design to share substantial hardware between multiple cores, in some cases blurring the traditional notion of a core. Current x86 designs share the last level cache, power management and external interfaces such as the memory controllers, coherent interconnects and other I/O between 2-8 cores. Bulldozer is a hierarchical design with sharing at nearly every level. Each module or compute unit (i.e. a pair of cores) shares an L1I cache, floating point unit (FPU) and L2 cache, saving area and power to pack in more cores and attain higher throughput – albeit with a slight cost in terms of per-core performance. All modules in a chip share the L3 cache, HyperTransport links and other system components.
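The sharing hierarchy described above can be summarized in a short sketch. The module-level and chip-level groupings come directly from the text; the per-core entries (integer execution resources and the L1D cache) are the parts of a Bulldozer core that remain private, and the code itself is purely illustrative.

```python
# Illustrative summary of Bulldozer's resource-sharing hierarchy.
# The PER_MODULE and PER_CHIP groupings are as described in the text;
# PER_CORE lists the resources that remain private to each core.

PER_CORE   = ["integer execution units", "L1D cache"]
PER_MODULE = ["L1I cache", "floating point unit (FPU)", "L2 cache"]  # shared by the 2 cores in a module
PER_CHIP   = ["L3 cache", "HyperTransport links", "other system components"]

for resource in PER_MODULE:
    print(f"shared within a module: {resource}")
for resource in PER_CHIP:
    print(f"shared across the chip: {resource}")
```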
While sharing the front-end and FPU seems radical compared to today’s x86 designs, it is a natural evolution in the multi-core era. The front-end has to deal with a lot of the complexity of the x86 instruction set, which leads to power hungry decoders and large structures like microcode. Floating point hardware also consumes a great deal of area and power and is rarely utilized over 40% – so sharing between two cores is an excellent way to gain back area and power with a minor performance loss. In many ways, AMD is positioning this high degree of sharing as an alternative to multi-threading, which is used by almost every other high performance CPU, and there is some truth to this claim.
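A rough back-of-envelope calculation shows why the performance loss from sharing the FPU should be minor. This sketch assumes each core’s FPU demand is independent and uniform at the ~40% utilization figure cited above – a simplification, since real workloads are bursty and correlated.

```python
# Illustrative only: if each core independently keeps the FPU busy ~40%
# of cycles, how often do the two cores sharing one FPU actually collide?
# Assumes independent, uniform utilization (a simplification).

def contention_probability(utilization):
    """Probability that both cores want the shared FPU in the same cycle."""
    return utilization * utilization

def approx_per_core_loss(utilization):
    """Rough per-core FP throughput loss: in a colliding cycle one core
    waits, so each core loses about half the colliding cycles."""
    return contention_probability(utilization) / 2

u = 0.40
print(f"collision probability:      {contention_probability(u):.2f}")  # 0.16
print(f"approx. per-core FP loss:   {approx_per_core_loss(u):.2%}")    # 8.00%
```

Under these (optimistic) assumptions, the two cores contend for the FPU only about 16% of the time, costing each core on the order of 8% of its floating point throughput – consistent with the article’s characterization of a minor loss.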
Bulldozer itself is somewhat of a contradiction – it is a substantial departure from the previous generation Istanbul, yet in most parts of the design, it is also clearly descended from AMD’s previous work.
Bulldozer is a high frequency optimized CPU, a so-called speed demon. This approach has fallen out of popularity in the x86 world, due to Intel’s misadventures with the Pentium 4. In all fairness though, many of the Pentium 4’s problems were unrelated to high clock speed and more closely tied to the actual microarchitecture. In the high-end server world, IBM has successfully pursued high clock speeds with the POWER7, so a speed demon approach can work out successfully. Bulldozer has a fairly lengthy pipeline, to minimize the gate delays per stage. AMD was unwilling to share any specifics on gate delays, although some discussions on comp.arch suggest a target of ~17 gate delays per stage, vs. ~23 for Istanbul. To tolerate the increased latencies necessary for a high frequency target, and to efficiently share resources between cores, Bulldozer introduces decoupling queues between most major stages in the pipeline.
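The gate-delay estimates above imply a rough upper bound on the frequency benefit. This back-of-envelope sketch treats cycle time as proportional to logic gate delays per stage; real designs also pay fixed latch and clock-skew overhead per stage, so it overstates the achievable gain, and both figures are unofficial estimates from the comp.arch discussion.

```python
# Back-of-envelope only: if cycle time scaled purely with logic gate
# delays per pipeline stage, the rumored estimates below would bound the
# frequency headroom. Latch/flop and clock-skew overhead per stage are
# ignored, so this overstates the real gain.

ISTANBUL_GATE_DELAYS  = 23  # unofficial estimate from comp.arch
BULLDOZER_GATE_DELAYS = 17  # unofficial estimate from comp.arch

headroom = ISTANBUL_GATE_DELAYS / BULLDOZER_GATE_DELAYS
print(f"idealized frequency headroom: {headroom:.2f}x")  # 1.35x
```

In other words, at the same process node, shortening each stage from ~23 to ~17 gate delays could buy at most ~35% more frequency – which is exactly why the pipeline must grow longer and decoupling queues become necessary.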
This article describes in detail the architecture and pipeline of the Bulldozer core: a 64-bit, 4-issue, super-scalar, out-of-order MPU with 48-bit virtual and physical addressing. The first Bulldozer based product, Interlagos, is an 8-core (4 module) design implemented in GlobalFoundries’ high performance 32nm SOI process, using high-K gate dielectrics and metal gates with a gate-first approach. Two chips may be packaged together in a single MCM to achieve as many as 16 cores in a single socket. Bulldozer is compatible with all the latest Intel x86 extensions (SSE4, AES-NI and AVX), as well as AMD’s XOP and FMA4 extensions and the Light Weight Profiling specification.
Since Bulldozer is not expected until 2011 (most likely in the second half), AMD was reluctant to share product level details. Instead they focused on describing the core at Hot Chips, avoiding many details pertaining to other portions of the chip. Even for some features within the core, they were unwilling to disclose substantial details – especially in the front-end.