Benchmark Caveats
With AMD’s fortunes on the wane and their executive suite in turmoil, many hope that a new generation of products will help stabilize the company. Certainly, AMD’s server products are in a dire competitive position, continuing to lose market share to Intel. The upcoming server products in the Orochi family are based on the novel Bulldozer microarchitecture, built on a 32nm SOI process. As described in an earlier article, Bulldozer is a fairly radical departure from the conventional x86 world. Each module includes a pair of integer cores which share a front-end, a floating point unit and a 2MB L2 cache. Orochi (the actual silicon die) includes 4 such modules, 8MB of shared L3 cache, four Hypertransport 3.1 links and two channels of DDR3 memory. Interlagos is an MCM variant that packages two Orochi dies together; it will double the core count, L3 cache and memory controllers, but substantially reduce the frequency to stay within thermal and power limits. Given the unique nature of this microarchitecture, there are considerable questions about performance and whether Bulldozer will be able to compete with Intel’s 32nm Sandy Bridge family.
Last week, some benchmarks of an early Interlagos system using engineering samples emerged online. There has been quite a bit of discussion, as consumers and competitors alike are eager to get a good understanding of Bulldozer’s performance. It is only natural that people have attempted to divine future performance from these initial results. Several people have asked for my opinion on the data and what it means. The short answer is that it is very difficult to conclude anything about Bulldozer, because the data is simply not useful for most people. We will explain how the benchmarks, the software, the system under test and the CPU itself all make the data fairly unreliable and limit the conclusions that can be drawn about Bulldozer.
The first problem is that the initial Bulldozer products are aimed at the server and high-end desktop markets. Most desktops use Windows, as do many servers, so using Linux makes it difficult to draw conclusions about desktop performance. Moreover, few of the benchmarks were actually server workloads. One or two might have been relevant for high performance computing, but they were a far cry from typical HPC workloads. The most prominent server benchmarks (SPECint, SPECjbb/SPECpower_ssj, TPC-C/TPC-E and SAP) are totally different from the tests that were run on the Interlagos system, so the results offer very little insight into the mainstream server world.
The software is also problematic because it’s unclear what was being executed. Benchmark groups such as SPEC have long realized that compiler settings and run time options must be disclosed in order to create a full picture of performance. The Interlagos benchmarks were compiled with GCC 4.5.2, which is generally an inferior compiler to the commercial products from Intel, PathScale or the Portland Group. Moreover, the compiler options are unknown, and there is a big difference in performance between -O1 and -O3 optimization. It is also questionable whether the benchmarks were appropriately vectorized and took advantage of AVX as they should. In total, the software configuration is largely unknown, but it will probably underestimate Bulldozer’s performance.
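None of this is visible in the published numbers, but a hypothetical kernel makes the sensitivity concrete. The loop below is purely illustrative (the actual benchmark sources and build lines were never disclosed): at -O1 GCC leaves it as scalar code, while -O3 enables the auto-vectorizer and -mavx permits AVX encodings, and the gap between the two builds can be large for cache-resident data. Whether GCC 4.5.2 actually makes good use of AVX on the real benchmarks is exactly the kind of detail we do not know.

```c
/* Hypothetical example: a simple SAXPY-style loop whose performance depends
 * heavily on how it is compiled.  The real benchmark sources and flags were
 * not disclosed; these build lines are only for illustration:
 *
 *   gcc -std=c99 -O1 -c saxpy.c          scalar code, no auto-vectorization
 *   gcc -std=c99 -O3 -mavx -c saxpy.c    -O3 enables the vectorizer,
 *                                        -mavx allows AVX encodings
 */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    /* restrict tells the compiler that x and y do not alias, which is often
     * a prerequisite for vectorizing the loop at all */
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```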
Turning to the system under test reveals an even greater number of complicating factors. The microprocessor used for these tests was clocked at 1.8GHz. AMD has yet to disclose the frequency targets for Interlagos, but it is safe to assume that it will match or exceed the 45nm 12-core Magny-Cours products, which hit 2.3GHz at launch. On top of the frequency differences, it is unknown whether the dynamic voltage and frequency scaling (DVFS) in Bulldozer was enabled. Power management features such as DVFS are very complicated and may not be turned on in early engineering samples. The frequency gains from DVFS are entirely workload dependent (and may vary by product SKU). AMD has stated that with 16 cores active, Interlagos can increase frequency by 500MHz, at least for some workloads. Without knowing more, it is hard to get a precise number for the frequency gain, but an average of 200-300MHz for the benchmarks in question sounds plausible.
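Putting rough numbers on those assumptions: if production Interlagos parts launch at 2.3GHz or above and average another 200-300MHz of DVFS gain on these workloads, the effective clock would be around 2.5-2.6GHz, roughly 40% higher than the 1.8GHz engineering sample, before accounting for any other differences.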
The system architecture also plays a huge role in performance, especially for programs that operate out of the caches and have a significant component of coherency and memory traffic. The STREAM triad benchmark showed particularly poor memory system performance for Interlagos: ~6GB/s per socket, compared to ~27GB/s for Magny-Cours. This strongly suggests that the northbridge and memory controllers were poorly configured, running at fairly low frequencies, or both. A more than 4X drop in bandwidth (versus an expected ~20% gain) immediately casts doubt on the validity of any benchmark with substantial memory traffic. Moreover, if the 2x8MB L3 caches (which are contained in the northbridges) are operating at fairly low frequencies, that will reduce performance even for benchmarks that barely touch memory.
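For context, the STREAM triad is not a complicated test; stripped of its timing and verification code, the kernel is just a long streaming loop over arrays sized to overwhelm the caches, which is why it is such a direct measure of sustained memory bandwidth. A minimal sketch (not the official STREAM harness, and with an arbitrarily chosen array size) looks like this:

```c
#include <stdlib.h>

/* Minimal sketch of the STREAM "triad" kernel, not the official benchmark,
 * which adds timing, repeated iterations and result verification.  The arrays
 * are sized far beyond the caches, so nearly every access goes to DRAM and
 * the achieved GB/s is essentially the sustained memory bandwidth. */
#define N (32 * 1024 * 1024)   /* arbitrary: three 256MB arrays dwarf 2x8MB of L3 */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c)
        return 1;

    for (long i = 0; i < N; i++) {   /* touch the pages and set the inputs */
        b[i] = 1.0;
        c[i] = 2.0;
    }

    double scalar = 3.0;
    for (long i = 0; i < N; i++)     /* triad: two streams in, one stream out */
        a[i] = b[i] + scalar * c[i];

    double checksum = a[N - 1];      /* keep the result live so the loop is not optimized away */
    free(a); free(b); free(c);
    return checksum == 7.0 ? 0 : 1;  /* 1.0 + 3.0 * 2.0 */
}
```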
The poor memory performance suggests that AMD’s probe filter was probably disabled, which implies substantially more coherency traffic for each memory request and much higher latencies. The probe filter is also one of those features that is lower down on the priority list and may not be enabled in early revisions. Furthermore, the external interfaces in the Interlagos system were not described in the testing. The actual DDR3 memory used in the system is unknown. Interlagos products will definitely support 1.6GT/s, but the tested system could be using slower memory at 1.33GT/s or as low as 0.8GT/s. Similarly, the Hypertransport 3.1 links in Interlagos can run at 6.4GT/s. However, the Supermicro H8DGU motherboard has HT 3.0 links, which max out at 5.2GT/s and could have been operating even slower. The net result is that the questions regarding the system architecture and configuration create considerable uncertainty. The only obvious conclusion is that the problems with memory performance will skew the benchmark results in a negative fashion.
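A bit of arithmetic (assuming standard 64-bit DDR3 channels and ignoring controller efficiency) puts the measured bandwidth in perspective. With two Orochi dies per package, an Interlagos socket has four memory channels, so even 0.8GT/s DIMMs would give a theoretical peak of 4 x 8B x 0.8GT/s = 25.6GB/s per socket, and 1.6GT/s memory roughly 51GB/s. The ~6GB/s measured by STREAM is a small fraction of even the slowest plausible configuration, which again points to a misconfigured or under-clocked northbridge rather than merely slow DIMMs.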
The last set of issues pertains to the fact that the benchmarks were run on an Interlagos engineering sample, rather than production silicon. Benchmarks on engineering samples are not necessarily inaccurate, but it depends on the relationship between the sample and the final silicon. In some cases they may be very similar; but often, samples differ from production. Any number of performance issues could have been present in the sample, but fixed in later steppings and final products. As a simple example, the workaround for the infamous Barcelona TLB bug cost 10-20% in performance, but the underlying bug was easily fixed with a silicon respin. Conversely, some particularly egregious problems are only discovered late in validation and require fixes, so that samples end up outperforming actual products. For instance, an electrical problem found in the 180nm Itanium 2 after a year in production required reducing the frequency from 1GHz to 800MHz. Since Bulldozer is an x86 design, it also has microcode, which introduces yet another variable. Microcode is incredibly powerful and can override the default behavior of a CPU, changing how instructions, various conditions and bugs are handled, with a commensurate impact on performance and correctness. Since microcode is essentially software, changes can occur very late in the design cycle for a given chip, even after the chip has taped out and entered manufacturing. Without knowing the age of the engineering sample, it is impossible to say how much additional time AMD has to refine Interlagos, but it should be at least 3-6 months and possibly longer.