SPECpower_ssj2008 is a staple for our server reviews. For a real submission, SPECpower requires a separate controller system that drives the system under test (SUT) and interfaces with an extremely high precision power meter. We instead ran the controller software on the system under test itself, which makes little difference in terms of performance, but nonetheless makes the result invalid for submission. Additionally, we opted for the eminently affordable Watts Up Pro (which retails for around $100), while qualified meters start at around a thousand dollars. We also omitted any temperature sensors and measurements. Since our power meter is not sufficiently accurate for a valid run in any case, we also used shorter run times (60 seconds at each performance level, instead of 240 seconds).
SPECpower_ssj2008 measures performance for server-side Java, much like SPECjbb2005, but the two workloads are not comparable. The scoring works differently and SPECpower reports power consumption to boot, so it basically obviates our need for SPECjbb2005. The software tuning is much the same as for SPECjbb: it is a huge knob and heavily dependent on the JVM. In previous reviews, we used the Oracle (née BEA) JRockit JVM, which is no longer freely available. Consequently, we switched to the Oracle (née Sun) Hotspot JVM to ensure that our software stack was relatively recent and could take advantage of any new features in Sandy Bridge-EP.
Note that for SPECpower_ssj2008, we had hardware prefetching enabled in the BIOS. This typically causes conflicts with the software prefetching and can reduce performance by around 10-15%. The tuning options were slightly different for Sandy Bridge-EP and Westmere-EP. The former was provided by Intel, while the latter was used in a formal submission from Oracle:
Sandy Bridge-EP: start /AFFINITY [0000ffff, ffff0000] java -server -showversion -Xmx18g -Xms18g -Xmn16g -XX:ParallelGCThreads=16 -XX:BiasedLockingStartupDelay=200 -XX:SurvivorRatio=60 -XX:TargetSurvivorRatio=90 -XX:InlineSmallCode=3900 -XX:MaxInlineSize=270 -XX:FreqInlineSize=2500 -XX:AllocatePrefetchDistance=256 -XX:AllocatePrefetchLines=4 -XX:InitialTenuringThreshold=12 -XX:MaxTenuringThreshold=15 -XX:LoopUnrollLimit=45 -XX:+UseCompressedStrings -XX:+AggressiveOpts -XX:+UseLargePages -XX:+UseParallelOldGC -XX:-UseAdaptiveSizePolicy
Westmere-EP: start /AFFINITY [000fff, fff000] java -Xms3700m -Xmx3700m -Xns3100m -XXaggressive -Xlargepages -Xgc:genpar -XXcallprofiling -XXgcthreads=8 -XXtlasize:min=4k,preferred=1024k
Westmere-EP: start /AFFINITY [000fff, fff000] java -server -showversion -Xmx18g -Xms18g -Xmn16g -XX:SurvivorRatio=60 -XX:TargetSurvivorRatio=90 -XX:ParallelGCThreads=12 -XX:AllocatePrefetchDistance=192 -XX:AllocatePrefetchLines=4 -XX:LoopUnrollLimit=45 -XX:InitialTenuringThreshold=12 -XX:MaxTenuringThreshold=15 -XX:InlineSmallCode=5500 -XX:MaxInlineSize=220 -XX:FreqInlineSize=2500 -XX:+UseLargePages -XX:+UseParallelOldGC -XX:+UseCompressedStrings -XX:+AggressiveOpts
One of the particularly attractive features of SPECpower is that, unlike SPECjbb, it targets specific utilization levels to measure power. We chose to use the standard set of 11 utilization levels: active idle (where the system can accept transactions, but none are being sent by the client/controller) and every 10%, up to full utilization. To score SPECpower, the average ssj_ops over all 11 levels is divided by the average power for all 11 levels; the resulting ratio is the performance to power ratio. We only took a single SPECpower_ssj2008 measurement, but the benchmark was run several times to fine tune the different parameters. The performance results were steady enough that we felt additional runs were not necessary.
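The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the benchmark's own reporting code; the function name and the per-level values are placeholders invented for the example and are not measurements from either system.

```python
def overall_ssj_ops_per_watt(ssj_ops, watts):
    """Average ssj_ops over all load levels divided by average power."""
    if not ssj_ops or len(ssj_ops) != len(watts):
        raise ValueError("need one power reading per load level")
    mean_ops = sum(ssj_ops) / len(ssj_ops)
    mean_watts = sum(watts) / len(watts)
    return mean_ops / mean_watts


# 11 levels: active idle (0%) plus every 10% up to full utilization.
levels = list(range(0, 101, 10))

# PLACEHOLDER data, not measured results: throughput scales linearly with
# load (zero at active idle), power rises linearly from 100 W to 300 W.
example_ops = [10_000 * pct for pct in levels]
example_watts = [100 + 2 * pct for pct in levels]

score = overall_ssj_ops_per_watt(example_ops, example_watts)
print(f"overall ssj_ops/watt: {score:.1f}")
```

Because active idle contributes power but no ssj_ops, a system that idles at lower power improves its overall score even if its peak throughput is unchanged.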
Figure 3. SPECpower_ssj2008 Performance vs. Power
The figure above shows performance (in ssj_ops) on the X-axis, with power consumption on the Y-axis; so the best solution would be in the lower right-hand corner, and the slope of the curve for each system shows the price (in power) of additional performance. The raw performance for Sandy Bridge-EP is nothing short of amazing: 1.43M ssj_ops versus 800K. The roughly 80% performance advantage is even larger than what we observed for Euler3d, which is fairly surprising. SPECpower is known to be bandwidth and cache sensitive, so the larger and faster L3 is a significant factor. The more aggressive DVFS in Sandy Bridge-EP probably contributes around 5%, and the intelligent hardware prefetching should reduce contention as well.
The utilization curves highlight the massive dynamic range for Sandy Bridge-EP. At the peak performance level, the power draw is 518W, versus 136W for 10% utilization and 110W for active idle. That is a swing of 408W between peak and idle. Considering that the total TDP for the two E5-2690 processors is 270W, this suggests a tremendous 140W or so of platform level power savings. In comparison, the difference between peak utilization and idle for Westmere-EP is 167W, while the total TDP for the two X5670s is 190W. The larger swing on Sandy Bridge-EP largely reflects the extensive investment in processor and platform level power saving from Intel's design team.
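The arithmetic behind the dynamic range comparison can be written out as a short Python sketch. It uses only the wall-power and TDP figures quoted above; the function name is made up for the example, and treating "dynamic range minus total CPU TDP" as platform-level savings is the same rough approximation the text makes, not a precise accounting.

```python
def range_beyond_cpu_tdp(dynamic_range_watts, total_cpu_tdp_watts):
    """Portion of the peak-to-idle swing that exceeds the combined CPU TDP.

    A positive result suggests savings outside the CPU sockets (memory,
    fans, power delivery); a negative one means the whole swing fits
    within the processors' TDP budget.
    """
    return dynamic_range_watts - total_cpu_tdp_watts


# Sandy Bridge-EP: 518 W at peak, 110 W at active idle, 2 x E5-2690 = 270 W TDP.
snb_range = 518 - 110
snb_excess = range_beyond_cpu_tdp(snb_range, 270)

# Westmere-EP: only the 167 W peak-to-idle difference is quoted,
# with 2 x X5670 = 190 W TDP.
wsm_excess = range_beyond_cpu_tdp(167, 190)

print(f"Sandy Bridge-EP: range {snb_range} W, {snb_excess} W beyond CPU TDP")
print(f"Westmere-EP: range 167 W, {wsm_excess} W relative to CPU TDP")
```

The Sandy Bridge-EP result works out to 138W, which the text rounds to 140W, while the Westmere-EP swing is 23W smaller than its combined CPU TDP.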
The one caveat to this comparison is that the two servers are somewhat different. The 1U Westmere system is likely to have less efficient fans and cooling, by virtue of the constrained space, in comparison to the 2U Sandy Bridge-EP. So a totally fair comparison would likely show less of a power efficiency advantage for Sandy Bridge-EP, but still a substantial improvement.
Figure 4. SPECpower_ssj2008 Performance vs. Power Efficiency
The next figure shows the performance to power ratio as a function of performance, essentially charting the trade-off between efficiency and performance. The efficiency curve for Sandy Bridge-EP is fairly interesting as it peaks at 80% utilization, and any higher utilization actually decreases power efficiency. While the efficiency for most systems tends to flatten at higher utilization, it still increases; the efficiency peak observed for Sandy Bridge-EP is quite unusual. One potential explanation is that beyond 80% utilization, the power management pushes the CPUs to a higher frequency and voltage that is less efficient. Another possibility is that platform components such as fans draw more power to ensure sufficient cooling for the CPUs and memory.
Overall, the Sandy Bridge-EP system is about 20% more efficient than Westmere-EP when comparing at the same absolute performance level (e.g. they achieve roughly 720K ssj_ops at 50% and 90% utilization respectively). Of course, the Sandy Bridge-EP system also scales to significantly higher performance and consumes roughly 30% less power at idle.
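To make the shape of such an efficiency curve concrete, the sketch below computes ssj_ops per watt at each load level and reports where it peaks. The numbers are placeholders chosen so that power climbs steeply above 80% load; they are not the measured data behind the figure, and the function names are invented for the example.

```python
def efficiency_by_level(levels, ssj_ops, watts):
    """Map each utilization level (%) to its ssj_ops-per-watt ratio."""
    return {lvl: ops / w for lvl, ops, w in zip(levels, ssj_ops, watts)}


def peak_efficiency_level(efficiency):
    """Return the utilization level with the highest ssj_ops per watt."""
    return max(efficiency, key=efficiency.get)


levels = list(range(10, 101, 10))

# PLACEHOLDER data, not measurements: throughput is linear in load, while
# power rises by 30 W per step until 80% and much faster beyond that.
example_ops = [1_000 * pct for pct in levels]
example_watts = [140, 170, 200, 230, 260, 290, 320, 350, 420, 500]

eff = efficiency_by_level(levels, example_ops, example_watts)
best = peak_efficiency_level(eff)
print(f"efficiency peaks at {best}% utilization")
```

With this kind of data, the last two load steps add throughput at a higher cost in watts than the preceding steps, so the ratio falls after 80% even though absolute performance keeps rising.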