Performance Estimates and Analysis
Table 1 shows performance for the parallel SPECint_rate and SPECfp_rate tests, which focus on throughput and typically run one copy per hardware thread. Unfortunately, performance estimates for the regular test results were not available. The AMD SPECfp_rate results used auto-parallelization, which is odd for an inherently parallel test. In addition to raw performance, we also calculated the performance normalized to the number of cores and the base frequency. Normalizing to clock frequency is a little inaccurate, since several of the CPUs (e.g. from Intel and IBM) dynamically adjust the frequency upwards. However, with all threads active, Intel CPUs will only operate slightly above their specified base frequency, so the error is likely to be fairly small (under 5%). In our discussion, we also assume that Intel’s compiler changes for Sandy Bridge have not ‘cracked’ or ‘broken’ any more of the benchmark tests. This seems fair, but we will have to wait for official results to see.
The results for the top-of-the-line Sandy Bridge are impressive. Of the desktop-class chips, it outperforms all but the 6-core Westmere, which is undoubtedly why Sandy Bridge has yet to take over the “Extreme Edition” slot in the product lineup. The raw performance is 14% (integer) and 10% (FP) higher than the 4-core Westmere, which is good considering that the latter has 3 memory channels rather than just 2. But raw performance is perhaps the least interesting of the data in Table 1.
Table 1 – Performance Results and Estimates for SPECcpu2006
Sandy Bridge’s performance per core is nothing short of amazing, exceeding all but the POWER7 by a fair margin. Given that the 4GHz POWER7 runs close to 250W, even using IBM’s exotic packaging, and is equipped with a 32MB L3 cache and an incredible amount of memory bandwidth, that is no small accomplishment. The two aren’t even in the same league in terms of cost and power constraints.
The performance normalized to the number of cores and frequency is roughly similar to instructions per cycle (IPC). Since frequencies have stayed similar, this is perhaps the most direct measurement of the changes in the Sandy Bridge microarchitecture. The myriad improvements – the uop cache, larger re-order buffers, improved load/store units, 256-bit AVX, L3 cache – all should show up here. Compared to the high-end Westmere, Sandy Bridge is 29% faster per cycle for SPECint_rate and 43% for SPECfp_rate. For an improvement in microarchitecture alone, this is dramatic.
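As a rough illustration of the normalization described above, the sketch below divides a rate score by core count and base frequency and then compares two chips. The `RateResult` class, the helper names, and the two sample chips are hypothetical placeholders invented for this example; the numbers are not taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class RateResult:
    """One SPEC*_rate result. All sample values below are hypothetical."""
    name: str
    score: float      # raw SPECint_rate or SPECfp_rate score
    cores: int        # number of physical cores
    base_ghz: float   # base (not turbo) clock frequency in GHz

    def per_core(self) -> float:
        # Throughput attributed to each core.
        return self.score / self.cores

    def per_core_per_ghz(self) -> float:
        # Rough IPC proxy: score normalized to cores and base frequency.
        return self.score / (self.cores * self.base_ghz)


def relative_gain(new: float, old: float) -> float:
    """Fractional improvement of `new` over `old` (0.25 means 25% higher)."""
    return new / old - 1.0


# Hypothetical chips, chosen only to make the arithmetic easy to follow.
chip_a = RateResult("chip_a", score=120.0, cores=4, base_ghz=3.0)
chip_b = RateResult("chip_b", score=80.0, cores=4, base_ghz=2.5)

gain = relative_gain(chip_a.per_core_per_ghz(), chip_b.per_core_per_ghz())
print(f"{chip_a.name} is {gain:.0%} faster per core per GHz than {chip_b.name}")
```

Because the base frequency is used rather than the actual turbo frequency, this proxy slightly overstates per-cycle performance for chips that boost their clocks under load.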
It is tempting to compare many of these CPUs, but it is important to be careful when the clock frequencies or core counts differ substantially. Memory bandwidth and latency are fixed for a given platform, so designs with more cores or higher clocks will tend to show lower IPC. To illustrate this, consider the IPC for the 2-core version of Sandy Bridge, which is the best listed in our table, exceeding even its 4-core cousin. Memory bandwidth per cycle is critical, especially for SPECfp_rate. Given 2X higher bandwidth per core, and similar frequencies, the lower-end design has a 30% higher IPC for floating point.
While the performance of the top bin Sandy Bridge is the most exciting, Intel has given us several data points to work with. It would be a shame to neglect this information, especially since we can use it to examine various aspects of the architecture.
First, consider the impact of simultaneous multi-threading on Sandy Bridge. The differences between the 2500 model and the 2600K are an extra 2MB of L3 cache, Hyperthreading and roughly 100MHz. The IPC for the latter is 16% higher for SPECint_rate and 5% higher for SPECfp_rate. Realistically the impact of an extra 2MB of cache is fairly small, probably under 2%. So a reasonable assessment is that Hyperthreading on Sandy Bridge has a net gain of roughly 12% for SPECint_rate and 3% for SPECfp_rate. Of course, the impact will be larger for server workloads. But this does highlight that Hyperthreading has limited utility for bandwidth-limited FP applications on Sandy Bridge (e.g. many SPECfp subtests). A design with more generous memory bandwidth would probably see larger gains.
The second area to investigate is Sandy Bridge’s frequency scaling. The 2500 and the 2400 differ only in frequency. The base clock for the 2500 model is 200MHz higher (6.5%), as is the frequency when all 4 cores are active. This modest frequency rise translates into a 5.2% improvement in SPECint_rate and a 2.6% gain for SPECfp_rate. While these changes are fairly small, they indicate how Sandy Bridge’s performance scales as a function of frequency. SPECint_rate is very responsive to frequency; performance increases by ~80% of the change in base frequency (i.e. increasing base frequency by 10% should improve performance by 8%). On the other hand, SPECfp_rate is far more limited by bandwidth and only has a ~40% response to frequency.
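The frequency response figures can be reproduced with a few lines of arithmetic. This is a minimal sketch using the percentages quoted above; the function and variable names are our own, and the linear extrapolation in `projected_gain` is a simplifying assumption that is only reasonable for small frequency changes.

```python
def frequency_response(perf_gain: float, freq_gain: float) -> float:
    """Fraction of a frequency increase that shows up as performance."""
    if freq_gain == 0:
        raise ValueError("freq_gain must be non-zero")
    return perf_gain / freq_gain


def projected_gain(response: float, freq_gain: float) -> float:
    """Linear extrapolation (an assumption): expected performance gain."""
    return response * freq_gain


# Figures quoted in the text for the 2500 versus the 2400.
FREQ_GAIN = 0.065   # 200MHz higher base clock
INT_GAIN = 0.052    # SPECint_rate improvement
FP_GAIN = 0.026     # SPECfp_rate improvement

int_response = frequency_response(INT_GAIN, FREQ_GAIN)
fp_response = frequency_response(FP_GAIN, FREQ_GAIN)

print(f"SPECint_rate response: {int_response:.0%}")
print(f"SPECfp_rate response:  {fp_response:.0%}")
print(f"Projected SPECint_rate gain for +10% clock: "
      f"{projected_gain(int_response, 0.10):.1%}")
```

The two ratios come out at about 80% and 40%, matching the figures in the text. With only two data points per benchmark, these should be read as rough indications rather than fitted scaling curves.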
Sandy Bridge is clearly a substantial improvement in CPU performance over the previous generation. The SPECcpu2006 estimates show a preliminary picture of a design that achieves 30-40% higher per-core performance than its predecessor, a considerable feat on the same process node. Unfortunately, actual SPECcpu results, particularly for each of the sub-tests, have yet to appear. Together with the SPECcpu speed results (which measure single-threaded performance), the sub-test numbers would create a much more comprehensive profile of Sandy Bridge’s microarchitecture.
Even with the preliminary estimates though, there is a lot to say about Sandy Bridge. The IPC has increased noticeably, and we estimated that Hyperthreading is worth about 12% performance for integer workloads and relatively little for floating point. Our analysis also demonstrated excellent frequency scalability on the integer side, but a bit of a memory bottleneck for floating point. In terms of the future though, the performance numbers show a clear disparity between Intel’s and AMD’s approaches. Intel has 50% greater performance per cycle, so AMD must make up for this through a combination of more cores and higher frequency, all without slamming into the power wall. For throughput-oriented workloads, this seems feasible, but for single-threaded performance, this may be impossible. The only lever there for AMD is higher frequency, and Bulldozer would need to run at nearly 4.4GHz to match Sandy Bridge. But AMD has yet to release specifications, and as we noted last week, the only Bulldozer benchmarks available are no indication of expected performance, so we will have to wait to see how everything plays out.