At ISSCC 2010, AMD disclosed considerable information about Llano – AMD’s other CPU manufactured on GlobalFoundries’ 32nm SOI process. Llano will have a dynamic voltage and frequency scaling (DVFS) system and core-level power gating. Thus it should come as no surprise that these features have also made their way into Bulldozer implementations. The actual techniques for Bulldozer will probably be described at next year’s ISSCC, but for now, we can intelligently speculate that any technique used in Llano may show up in Bulldozer, possibly with improvements.
Llano’s DVFS is described as relying upon a digital activity measurement approach. This family of techniques was pioneered by Intel’s Fort Collins circuit design team for Tukwila, several members of which subsequently left Intel to join AMD. The approach in Llano uses scan chain-like hardware to sample 95 performance counter signals that are closely correlated with switching capacitance. These samples are used to estimate dynamic power to within 2%. The power estimates are in turn used to dynamically increase frequency when power headroom is available. It is expected that Bulldozer will use similar techniques, although perhaps enhanced to take advantage of certain microarchitectural differences and further refined with additional design effort.
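To make the idea concrete, here is a minimal sketch of activity-based power estimation feeding a frequency choice. It is not AMD's implementation: the linear weighting, the weight values, the operating points and the function names are all assumptions made for illustration. The only details taken from the description above are the 95 sampled signals and the use of a power estimate to pick a higher frequency when headroom allows.

```python
from typing import Sequence, Tuple

NUM_SIGNALS = 95  # number of sampled activity signals, per the Llano description


def estimate_dynamic_power(activity: Sequence[float],
                           weights_nf: Sequence[float],
                           voltage: float,
                           freq_ghz: float) -> float:
    """Estimate dynamic power in watts as C_eff * V^2 * f.

    Assumption: effective switched capacitance is a weighted sum of the
    activity samples, with hypothetical per-signal weights in nanofarads.
    nF * V^2 * GHz conveniently yields watts.
    """
    if len(activity) != len(weights_nf):
        raise ValueError("activity and weights must have the same length")
    c_eff_nf = sum(w * a for w, a in zip(weights_nf, activity))
    return c_eff_nf * voltage ** 2 * freq_ghz


def pick_operating_point(activity: Sequence[float],
                         weights_nf: Sequence[float],
                         points: Sequence[Tuple[float, float]],
                         budget_watts: float) -> Tuple[float, float]:
    """Return the fastest (freq_ghz, voltage) point whose estimate fits the budget.

    Falls back to the slowest point if nothing fits.
    """
    ordered = sorted(points)  # ascending by frequency
    for freq, volt in reversed(ordered):
        if estimate_dynamic_power(activity, weights_nf, volt, freq) <= budget_watts:
            return (freq, volt)
    return ordered[0]


# Hypothetical inputs: uniform activity and weights, three made-up operating points.
activity = [0.5] * NUM_SIGNALS
weights_nf = [0.1] * NUM_SIGNALS
points = [(2.0, 1.0), (3.0, 1.1), (3.5, 1.2)]
chosen = pick_operating_point(activity, weights_nf, points, budget_watts=20.0)
```

A lighter workload (lower activity values) lowers the estimate for every operating point, which is what lets the controller move to a faster point within the same power budget.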
Llano is also the first announced CPU from AMD to use power gating. The power gates are implemented as a footer ring of NFETs around the periphery of a core and L2 cache, using the package plane as a virtual ground. While Bulldozer’s hierarchical and shared microarchitecture improves area efficiency and throughput, it does complicate power gating. Conceptually there are five major circuit regions of each Bulldozer module – the shared front-end, the two integer cores, the floating point cluster and the L2 cache. Unfortunately, if a single core is active, all of these regions (perhaps save the other integer core) must be active. The benefit of power gating a lone integer core is not worth the implementation complexity, especially since the operating system scheduler should be power-aware. As a result, Bulldozer’s granularity of power gating is at the module level. Each Interlagos die incorporates at least 4 power gates, one for each Bulldozer module. In a server, all the memory controllers must stay active to service any cache misses from other chips. Thus it is an open question whether there are substantial benefits to power gating the entire northbridge – i.e. the L3 cache and other shared components such as interconnects and memory controllers. The server-oriented Nehalem only power gated the cores, while Westmere (intended for desktops and notebooks) had a separate power gate for the uncore/northbridge.
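The module-level granularity can be illustrated with a small sketch. The class and function names below are hypothetical and the model is deliberately simplified: a module is eligible for gating only when both of its integer cores are idle, since the shared front-end, floating point cluster and L2 cache must stay powered if either core is running.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Module:
    """Simplified Bulldozer module: two integer cores sharing everything else."""
    core0_active: bool
    core1_active: bool

    def can_power_gate(self) -> bool:
        # The shared regions serve both cores, so a single active core
        # keeps the whole module powered.
        return not (self.core0_active or self.core1_active)


def gated_modules(die: List[Module]) -> List[int]:
    """Return the indices of modules that could be power gated."""
    return [i for i, m in enumerate(die) if m.can_power_gate()]


# Two threads on a four-module die, placed two different ways.
packed = [Module(True, True), Module(False, False),
          Module(False, False), Module(False, False)]
spread = [Module(True, False), Module(True, False),
          Module(False, False), Module(False, False)]

packed_gated = gated_modules(packed)  # three idle modules
spread_gated = gated_modules(spread)  # only two idle modules
```

The two placements show why a power-aware scheduler matters at this granularity: packing both threads onto one module leaves three modules eligible for gating, while spreading them across two modules leaves only two, at the cost of the threads sharing one module's front-end and floating point resources.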
Figure 7 – Bulldozer and Westmere Microarchitectures
For AMD, Interlagos – the first Bulldozer implementation – is their next chance to regain parity with Intel or retake the lead in the server market. The philosophy behind Bulldozer is to step back from trying to go toe-to-toe with Intel, which is good, since that approach has never worked for AMD. Instead, Bulldozer focuses on a more server-centric set of goals and tries to re-evaluate various trade-offs in that context. Figure 7 above compares Bulldozer’s microarchitecture to the more conventional Westmere and shows some of the differences between the two.
The Bulldozer architecture is fairly innovative, especially by the standards of x86 CPU designs. The hierarchical sharing and high frequency design should help AMD achieve higher performance per mm² of silicon – this is necessary for success, since Intel is roughly 12-18 months ahead of AMD’s manufacturing partner. The ultimate question is how the novel architecture in Bulldozer translates into performance (both single threaded and multi-threaded), power and die area. Most of the physical characteristics of Interlagos, such as frequency and die area, are unknown. This leaves a great deal of uncertainty, as performance is highly dependent upon frequency as well as many of the details of the architecture which have been withheld for competitive reasons. The biggest questions about Bulldozer are the frequency, branch prediction, various queues and buffers in the front-end, handling of 256-bit AVX instructions, coherency protocol and northbridge microarchitecture. Over the next year, AMD will incrementally release more information – at their analyst day, ISSCC and other venues. These details should give a fuller picture of Bulldozer and what to expect from productizations such as Interlagos. In the meantime, the team behind Bulldozer deserves congratulations for nearing the finish line on the most novel and complicated design in AMD’s history. We look forward to seeing the results next year.
 Butler, Mike. “Bulldozer” A new approach to multithreaded compute performance. Hot Chips XXII, August 2010.
 GCC Mailing list discussion. http://gcc.gnu.org/ml/gcc/2010-06/msg00402.html
 Open64 config_cache_targ.cxx http://svn.open64.net/filedetails.php?repname=Open64&path=/trunk/osprey/common/com/x8664/config_cache_targ.cxx
 Comp.arch discussion. http://groups.google.com/group/comp.arch/browse_frm/thread/45018bf3214f6049?hl=en#
 Interview with Mike Butler, Chuck Moore, Gary Silcott.
 Jotwani, R. et al. “An x86-64 Core Implemented in 32nm SOI CMOS,” Proceedings of International Solid State Circuits Conference, pp 106-107, February 2010.
 Conway, P. et al. Blade Computing with the AMD Opteron Processor (“Magny-Cours”). Hot Chips XXI, August 2009.