Nehalem Performance Preview


Euler 3D

The CASELab at Oklahoma State is a research group that works with NASA’s Dryden Flight Research Center to study and predict aero-elastic behavior: the interaction between inertial, elastic and aerodynamic forces. One example is flutter, where a lifting surface, such as a wing, oscillates due to aerodynamic forces and structural behavior. A mild flutter might produce a faint buzz in an aircraft, but more severe flutter can destroy an aircraft or other structure: the Tacoma Narrows Bridge collapsed due to flutter induced by a severe wind storm.

Euler 3D is an application developed by Tim Cowan, and multithreaded by Charles O’Neill, at CASELab for studying and predicting computational aeroelasticity. It is a multi-threaded, floating point and bandwidth intensive computational fluid dynamics (CFD) application. The benchmark data set analyzes the airflow over a specific wing configuration. Thanks go to Scott Wasson of the Tech Report for sharing this benchmark with us. One of the nice things about Euler 3D is that the number of threads is configurable, so we can investigate how performance and power scale with respect to thread count.
Figure 10 – Euler 3D Performance
At maximum thread occupancy, Nehalem’s performance is 2.07X higher than Harpertown’s – the biggest gap we’ve seen. The higher performance is primarily due to increased bandwidth, thanks to the integrated memory controller and on-die interconnects. CFD applications are notoriously bandwidth hungry, and for a long time Intel’s performance on such applications was quite limited as a result. SMT also provides a benefit by enabling higher bandwidth utilization and better latency tolerance. The single threaded performance gain is also quite sizable, nearly 30%, which corresponds to a 20% IPC improvement combined with a 10% frequency boost from turbo mode (1.2 × 1.1 ≈ 1.3).

The performance for Euler 3D proved quite tricky to measure. As it turns out, the observed performance for Nehalem at 2, 4 and 8 threads was unstable and subject to variation – to verify this behavior we ended up measuring performance six times at each of the intermediate thread counts. Performance was bi-modally distributed: 2 threads came in at either ~1.4Hz or ~2.3Hz, 4 threads at either ~2.6Hz or ~3.8Hz, and 8 threads at either ~6.8Hz or ~7.3Hz. This is somewhat reflected in the chart – the scaling between 4 and 8 threads is extremely low, just a 20% performance increase. What could the root of this mystery be?

While we do not know for sure, the likely cause is the Windows thread placement algorithm interacting with certain features of the microprocessor and system. For best performance, threads should be balanced evenly between the two sockets, always favoring placement on a separate physical core rather than sharing a core with another thread via SMT. A policy along those lines maximizes the available frequency (due to turbo mode), bandwidth (spread across both memory controllers), cache (spread across all cores) and execution units (spread across all cores). An imbalance could easily drag down performance.
Another culprit could be the placement of memory relative to the threads – a well-known problem with some NUMA systems, and an issue AMD ran into first, thanks to the integrated memory controller in the Opteron, back in 2003. If threads are placed such that they access local memory, performance will be noticeably better than if they access remote memory. So that gives us three possible candidates for thread placement mistakes: the thread balance between sockets, the thread balance between physical cores, and the placement of threads with respect to their memory.
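To make the ideal policy concrete, here is a minimal sketch of the “spread across sockets, physical cores before SMT siblings” rule described above. The topology parameters and the logical-CPU numbering (socket-major, with SMT siblings offset by the total core count) are hypothetical – real systems expose their own enumeration, and an OS would apply this mapping via its affinity APIs:

```python
def placement(thread_idx, sockets=2, cores_per_socket=4, smt_ways=2):
    """Map a thread index to a logical CPU id.

    Policy: spread threads round-robin across sockets, one per idle
    physical core, and only place a thread on an SMT sibling once
    every physical core already has a thread.
    """
    total_cores = sockets * cores_per_socket
    core = thread_idx % total_cores        # which physical core (round-robin)
    smt_way = thread_idx // total_cores    # SMT sibling only once all cores are busy
    socket = core % sockets                # alternate sockets for consecutive threads
    core_in_socket = core // sockets
    # Hypothetical numbering: CPUs 0-7 are the first SMT way on each core
    # (socket-major), CPUs 8-15 are their SMT siblings.
    return socket * cores_per_socket + core_in_socket + smt_way * total_cores

# Threads 0-7 land on eight distinct physical cores, alternating sockets;
# threads 8-15 fill in the SMT siblings.
print([placement(i) for i in range(16)])
# -> [0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15]
```

A scheduler that deviates from this mapping at intermediate thread counts – say, doubling up two threads on one core while another core sits idle – would produce exactly the kind of bi-modal results we observed.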
Figure 11 – Euler 3D Power Consumption
The power consumption for Nehalem shown above hints at a possible answer. Looking at the change in power draw, adding a second thread burned an extra 18W – so another core definitely fired up. Going from 2 to 4 threads increased the power draw by 14W/thread, which means there are 4 physical cores active. Going from 4 to 8 threads increases the power by only 4W/thread – which suggests those threads were placed on already active physical cores as SMT siblings. Then ramping up to full occupancy increases the power by 7W/thread – which looks like a mixture of threads landing on idle physical cores and on SMT siblings; without more analysis (and some OS performance counters), though, this is just a hypothesis.

Ultimately, Nehalem pulls about 20% less power than Harpertown at peak, which is quite amazing considering that the performance is more than 2X higher.
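The per-thread increments above are just finite differences of the total power draw, divided by the number of threads added at each step. A small sketch makes the arithmetic explicit – note that the total-watt readings below are illustrative placeholders chosen only to reproduce the per-thread deltas quoted above, not our actual measurements:

```python
def watts_per_added_thread(measurements):
    """Given (thread_count, total_watts) pairs in ascending order,
    return the incremental power per thread added at each step."""
    return [(w1 - w0) / (n1 - n0)
            for (n0, w0), (n1, w1) in zip(measurements, measurements[1:])]

# Hypothetical totals, consistent with the deltas discussed in the text.
readings = [(1, 180), (2, 198), (4, 226), (8, 242), (16, 298)]
print(watts_per_added_thread(readings))
# -> [18.0, 14.0, 4.0, 7.0]
```

The step from 14W/thread down to 4W/thread is the tell: an SMT sibling on an already-active core costs far less incremental power than waking an idle core.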
Figure 12 – Euler 3D Power Efficiency
Combining a 2X performance advantage with a 20% power advantage, the result is unsurprisingly a ~2.5X improvement in power efficiency – huge by any standard, and definitely the biggest gain so far.
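The arithmetic behind that figure is simply the ratio of the two ratios, since efficiency here is performance per watt:

```python
perf_ratio = 2.07    # Nehalem vs. Harpertown throughput at full occupancy (Figure 10)
power_ratio = 0.80   # Nehalem draws roughly 20% less power at peak (Figure 11)

# Efficiency (performance per watt) gain = perf gain / power ratio.
efficiency_gain = perf_ratio / power_ratio
print(round(efficiency_gain, 2))  # -> 2.59, i.e. roughly 2.5X
```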

