Core Microarchitecture Performance: Woodcrest Preview


Cinebench 9.5 (64 bit)

Cinebench is one of the more popular software rendering tests. It is free and runs on multiple platforms: 32-bit Windows, 64-bit Windows and OS X. We will be using the CPU test, which renders a scene using a ray-tracing engine. The ray-tracer operates in both a single-threaded mode and an SMP-aware mode, but lacks fine-grained control over threading and process count. The Dempsey system runs using 8 threads, and Woodcrest uses 4 threads. The benchmark is designed to run entirely from memory, so there should be no waiting for disk I/O.

While the documentation indicates an 80-90% speedup for the second CPU, and around 10-20% for multithreading on a Pentium 4 Xeon, we have found in the past that the benchmark does not scale quite as well from two to four processors.

The benchmark is normalized to the performance of a 1GHz Pentium 4 system, which has a score of 100. So the single-threaded Woodcrest is approximately 5x faster than that 1GHz Pentium 4 reference, while Woodcrest is roughly 16x faster in SMP operation.

Figure 4 – Cinebench CPU Performance

Unsurprisingly, Woodcrest leads Dempsey in single-threaded performance by about 60%; the 128b wide SSE units and 128b loads and stores are probably responsible for most of this lead. However, Woodcrest does not scale quite as well as Dempsey: it improves by a factor of 3.07 going from 1 to 4 threads, while Dempsey improves by a factor of 3.41 going from 1 to 8 threads. Woodcrest still leads by a substantial margin (40%) in SMP mode, but the weaker scaling is somewhat puzzling.
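As a rough sanity check, the SMP lead can be derived from the single-threaded lead and the two scaling factors quoted above. The sketch below is only illustrative: the function name is hypothetical, and the inputs are the rounded figures from the text rather than the raw scores, so the result is approximate.

```python
def implied_smp_lead(single_thread_lead, scale_a, scale_b):
    """Lead of system A over system B in SMP mode.

    single_thread_lead: A's single-threaded lead over B as a fraction (0.60 = 60%).
    scale_a, scale_b:   each system's speedup from single-threaded to SMP mode.
    """
    return (1.0 + single_thread_lead) * scale_a / scale_b - 1.0


# Rounded figures quoted in the text: ~60% lead, 3.07x and 3.41x scaling.
woodcrest_smp_lead = implied_smp_lead(0.60, 3.07, 3.41)
print(f"Implied Woodcrest SMP lead: {woodcrest_smp_lead:.0%}")
```

With these rounded inputs the implied lead comes out at roughly 44%, in the same range as the measured SMP lead, so the quoted numbers are mutually consistent.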

This discrepancy is most likely due to Dempsey’s multithreading capabilities and Woodcrest’s bandwidth demands. Perfect scaling for Woodcrest would be a fourfold improvement, while Dempsey could theoretically quadruple performance (due to having 4 cores) and gain another 10-20% from multithreading. So perfect scaling for Dempsey could be anywhere from 4.4x to as high as 4.8x. The other issue is that, in theory, Woodcrest has 2x the SSE throughput of Dempsey. This is partly why it does so much better in single-threaded mode. However, the extra throughput is wasted unless data is delivered to the core fast enough. Woodcrest therefore requires substantially more memory bandwidth than Dempsey does, and perhaps that is holding back multiprocessor scaling.
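To put the scaling figures in context, each measured speedup can be divided by its theoretical ceiling. This is a minimal sketch using only the numbers quoted in this article; the function and variable names are hypothetical, and the 10-20% multithreading gain is the documentation's estimate, not a measurement.

```python
def scaling_efficiency(observed_speedup, ideal_speedup):
    """Fraction of the theoretical speedup that was actually achieved."""
    return observed_speedup / ideal_speedup


CORES = 4  # both systems have 4 cores in total

# Woodcrest: 4 cores, no multithreading, so the ceiling is 4x.
woodcrest_eff = scaling_efficiency(3.07, CORES)

# Dempsey: 4 cores plus an assumed 10-20% gain from multithreading,
# giving a ceiling between 4.4x and 4.8x.
dempsey_eff_best = scaling_efficiency(3.41, CORES * 1.10)
dempsey_eff_worst = scaling_efficiency(3.41, CORES * 1.20)

print(f"Woodcrest: {woodcrest_eff:.1%} of ideal")
print(f"Dempsey:   {dempsey_eff_worst:.1%} to {dempsey_eff_best:.1%} of ideal")
```

Measured this way, Woodcrest reaches about 77% of its ceiling, while Dempsey lands between roughly 71% and 78% of its own, depending on which multithreading gain is assumed. Relative to what each system could theoretically achieve, the two scale about equally well.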

The standard deviations for the single threaded runs were 4.63 and 5.31 for Woodcrest and Dempsey. The multithreaded runs had slightly higher standard deviations, 7.78 and 8.75 respectively.

