Interpreting the Data
The bottom line is that the benchmarks for Interlagos are missing vital context, which makes any sort of precise inference difficult. Critical information is absent about the benchmarks used, the system configuration and even the Interlagos engineering samples. Given all the circumstances and considerations mentioned previously, these benchmark scores are best read as a lower bound on performance. Real systems are practically guaranteed to achieve better results. Despite these challenges, the data is worth investigating further – both to understand what it says about Bulldozer and to grasp any problems in the data itself.
To start with, the obvious problems in the memory subsystem mean that any workloads with memory traffic will yield spurious data. So the only tests that are likely to retain any value are those that are primarily cache resident – preferably in the four 2MB L2 caches. In general, the higher the percentage of memory accesses that hit in the cache, the more meaningful the results should be. Of the tests published, C-Ray should fit this profile. There are two data sets for the tests in SciMark2, large and small – the latter will definitely be cache resident, while the large set may not. It is unknown which data set was employed for the test in question. The size of Himeno’s data set is unclear, and the parallel bzip2 benchmark is for a 256MB file, but still may satisfy most memory accesses from the caches due to blocking.
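As a rough illustration of the cache residency argument, the sketch below estimates the footprint of a dense matrix of doubles and compares it against a single 2MB L2. The matrix dimensions (100 and 1000) are assumptions chosen to stand in for a small and a large data set, not confirmed details of the published runs, and the helper names are hypothetical.

```python
# Back-of-the-envelope check: does a dense N x N matrix of doubles fit in one L2?
# The 2MB L2 size comes from the article; the matrix sizes are illustrative assumptions.

L2_BYTES = 2 * 1024 * 1024  # one 2MB L2 cache


def dense_matrix_bytes(n: int, elem_bytes: int = 8) -> int:
    """Footprint of an n x n matrix of elem_bytes-sized elements (8 = double)."""
    return n * n * elem_bytes


def fits_in_cache(footprint: int, cache_bytes: int = L2_BYTES) -> bool:
    """True if the footprint is no larger than the given cache capacity."""
    return footprint <= cache_bytes


if __name__ == "__main__":
    for n in (100, 1000):  # assumed "small" and "large" dimensions
        size = dense_matrix_bytes(n)
        print(f"{n}x{n} doubles: {size} bytes, fits in one L2: {fits_in_cache(size)}")
```

Under these assumptions the small case occupies 80,000 bytes and fits comfortably, while the large case needs 8,000,000 bytes and spills out of a single L2. This ignores code, auxiliary buffers and any blocking the benchmark performs, so it is only a first-order estimate.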
With those caveats, we can proceed to examining the data, shown in Table 1. The first set of columns shows the measured performance for the 1.8GHz, 32-core Interlagos system and the closest comparable system. For most of the benchmarks, we used a 1.9GHz 24-core Magny-Cours system as a reference, but Himeno and pbzip2 (in italics) use a 1.9GHz 48-core system as a baseline. The second set of columns shows performance per core (we count each Bulldozer module as a pair of cores). For benchmarks where performance is measured in run time, this is the run time multiplied by the number of cores – i.e. the total compute time. For benchmarks where performance is expressed as a rate (e.g. MFLOP/s), this is simply the rate divided by the number of cores. The third set of columns is performance per clock cycle – basically the second set of columns divided by the clock speed. If the two systems were using the same binaries (which they are not), then the third set of columns would be equivalent to instructions per cycle (IPC). The last column is the relative performance per cycle for Bulldozer, compared to Magny-Cours. A value over 1 indicates that Bulldozer is faster per cycle, while anything below 1 shows the reverse.
Table 1 – Bulldozer and Magny-Cours Performance
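The normalization behind the table can be sketched in a few lines. This is a minimal illustration, not the exact spreadsheet used for Table 1. The `Result` type and function names are hypothetical, and the scores in the usage example are placeholders rather than measured values. For run-time benchmarks, the sketch assumes the per-cycle quantity is core-seconds multiplied by clock speed (total core-cycles), which is then inverted so that higher is always better.

```python
from dataclasses import dataclass


@dataclass
class Result:
    score: float       # reported score: seconds if is_runtime, otherwise a rate
    cores: int         # core count (each Bulldozer module counted as two cores)
    clock_ghz: float   # clock speed in GHz
    is_runtime: bool   # True when lower scores are better (run time)


def per_core(r: Result) -> float:
    """Second column set: core-seconds for run times, rate per core for rates."""
    return r.score * r.cores if r.is_runtime else r.score / r.cores


def work_per_core_cycle(r: Result) -> float:
    """Higher-is-better throughput per core per clock (assumed treatment of run times)."""
    if r.is_runtime:
        # core-seconds * GHz is proportional to total core-cycles; invert it.
        return 1.0 / (r.score * r.cores * r.clock_ghz)
    return r.score / (r.cores * r.clock_ghz)


def relative_per_cycle(new: Result, baseline: Result) -> float:
    """Last column: above 1 means `new` does more work per cycle than `baseline`."""
    return work_per_core_cycle(new) / work_per_core_cycle(baseline)


if __name__ == "__main__":
    # Placeholder scores only: identical rates on both systems.
    interlagos = Result(score=100.0, cores=32, clock_ghz=1.8, is_runtime=False)
    magny_cours = Result(score=100.0, cores=24, clock_ghz=1.9, is_runtime=False)
    print(round(relative_per_cycle(interlagos, magny_cours), 3))
```

With identical placeholder scores, the ratio reduces to (24 × 1.9) / (32 × 1.8), or about 0.79. In other words, a 32-core 1.8GHz system that merely ties a 24-core 1.9GHz system is doing roughly 21% less work per core per cycle.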
The most striking thing about the results is the sheer variation. A Bulldozer core is anywhere from 0.6X to 1.3X the performance of an Istanbul core (the core design used in Magny-Cours). In some ways, this is to be expected. Bulldozer’s novel shared FPU has half the execution resources per core of Istanbul for traditional x86 code compiled without the new fused multiply-add (FMA) instructions. Workloads with a high percentage of floating point operations may cause contention and reduce performance relative to the previous generation. However, the cache bandwidth per core in Bulldozer is substantially better, so it should do well on workloads with many loads and stores. Bulldozer will also be better for applications that have an uneven mix of addition and multiplication, since the pipelines can execute either type of floating point instruction. What is surprising, though, is the degree of the variation – nearly a 2X difference between the best and worst cases.
Some of the individual results are even more startling. Given the microarchitectural characteristics discussed above, Bulldozer should fare worst on workloads with a high ratio of floating point to memory operations – yet the data indicates the opposite. Bulldozer does particularly poorly on the FFT and sparse matrix multiplication tests, which are fairly irregular and tend to favor memory accesses over raw computation. In contrast, Bulldozer is essentially the same speed for the dense matrix factorization, which relies heavily on arithmetic. Bulldozer’s dense matrix factorization performance suggests that some of the code may have actually been written to use FMA, although there is no way to tell without inspecting the binaries or the compiler flags.
One potential explanation for the poor FFT performance is that the highly irregular data access patterns may rely on shuffling. Each Istanbul core has two shuffle pipelines, while Bulldozer shares a single shuffle pipeline between two cores. Even if this is the case though, it still does not explain the discrepancy between the sparse and dense matrix performance.
Altogether, the results present a far from clear picture. Some of the tests show a significant performance advantage for Bulldozer – which is impressive, given that the focus was on increasing frequency more than IPC. Yet others show dire regressions reminiscent of the P4. Worse still, the benchmark data seems inconsistent at times. Perhaps most frustrating is that the sheer quantity of missing information makes it very difficult to assess the significance of the results. Compiler optimizations alone could easily change performance by 30% in some cases, and libraries can have a bigger impact still.
Given all the uncertainty surrounding the benchmark data, there are no hard and fast conclusions. The quality of the data only lends itself to vague and general impressions. Overall, the data suggests that for a number of applications, Bulldozer will have comparable IPC to its predecessor; sometimes better, sometimes slightly worse. Yet at the same time, the data also implies a very real risk that some workloads may hit particular bottlenecks in the architecture and suffer greatly. On balance, the data seems to weakly support AMD’s claims that Bulldozer will not substantially reduce IPC in favor of frequency. For more concrete answers about Bulldozer performance, we will have to wait for more information, preferably in the form of real benchmarks or real products, or ideally both.