The CASELab at Oklahoma State is a research group that works with NASA's Dryden Flight Research Center to study and predict aeroelastic behavior: the interaction between inertial, elastic, and aerodynamic forces. One example is flutter, where a lifting surface such as a wing oscillates due to the interplay of aerodynamic forces and structural response. Mild flutter might produce a slight buzz in an aircraft, but severe flutter can destroy an aircraft or other structure: the Tacoma Narrows Bridge collapsed due to wind-induced aeroelastic flutter.
Euler 3D is an application developed by Tim Cowan, and multithreaded by Charles O'Neill, at CASELab for studying and predicting computational aeroelasticity. It is a multi-threaded, floating-point- and bandwidth-intensive computational fluid dynamics (CFD) application. The benchmark data set analyzes the airflow over a specific wing configuration. Thanks go to Scott Wasson of Tech Report for sharing this benchmark with us. One of the nice things about Euler 3D is that the number of threads is configurable, so we can investigate how performance and power scale with respect to thread count.
Figure 8 – Euler 3D Performance
Euler 3D, like many HPC applications, is extremely bandwidth hungry; if Westmere had been bandwidth starved, it would show up in this benchmark quite visibly. That Westmere is 38% faster than Nehalem (when both are fully loaded) is perhaps the most conclusive proof that the triple-channel memory interface was over-designed for Nehalem, but just right for Westmere.
Previously, when testing Euler 3D with Windows Server 2008, we had problems producing consistent performance results. When faced with 16 hardware threads spread across 8 cores on 2 sockets, the Windows scheduler had issues both selecting the *right* thread placement and consistently choosing the same placement. Fortunately, with R2 it appears that the thread placement is at least consistent (as are our performance results). However, judging by the shape of the performance curves in Figure 8, the thread placement still leaves a bit to be desired. Westmere and Nehalem show little performance gain when going from 6 to 12 and 4 to 8 threads respectively, but a large gain going up to the maximum number of threads. Ideally, the scheduler would place a single thread on each idle core to maximize performance. However, it appears that in some cases the scheduler ends up putting 2 threads on some cores, while leaving other cores entirely idle. This may be a purposeful optimization to reduce power consumption, since most server workloads actually show significant benefits from SMT; HPC applications tend to be a bit of an exception here.
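To make the "one thread per idle core" policy concrete, here is a minimal sketch of that placement rule. This is not the Windows scheduler's actual algorithm, and the function and parameter names are our own; it also assumes logical CPUs are enumerated in SMT-sibling pairs (2*i, 2*i + 1) per core, which varies by platform.

```python
def one_thread_per_core_mask(num_cores, smt_width=2):
    """Affinity bitmask selecting the first logical CPU of each physical core."""
    mask = 0
    for core in range(num_cores):
        mask |= 1 << (core * smt_width)
    return mask

def placement(num_threads, num_cores, smt_width=2):
    """Assign each software thread a logical CPU, filling idle cores first
    and only doubling up on SMT siblings once every core is busy."""
    cpus = []
    for i in range(num_threads):
        core = i % num_cores       # spread across physical cores first
        sibling = i // num_cores   # then start using SMT siblings
        cpus.append(core * smt_width + sibling)
    return cpus

# 8 threads on an 8-core, 16-thread system: one thread per core, no sharing
print(placement(8, 8))   # [0, 2, 4, 6, 8, 10, 12, 14]
# 12 threads: every core stays busy, four cores carry an SMT pair
print(placement(12, 8))  # [0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7]
```

The performance curves in Figure 8 suggest the scheduler does not follow this rule: if it did, the jump from 6 to 12 (or 4 to 8) threads would come from SMT alone and the big gains would appear earlier in the curve.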
Figure 9 – Euler 3D Power Consumption
The power consumption is fairly consistent with the above hypothesis. For Westmere and Nehalem, going from 6 to 12 and 4 to 8 threads (respectively) increases power consumption, but only modestly: each incremental thread burns around 3-5 watts, whereas a fully active core uses roughly 10W.
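The incremental figures above can be folded into a toy power model. The constants come straight from the measurements quoted in the text (~10W per additional active core, 3-5W per additional SMT thread); everything else is an illustrative assumption, not a measured fit.

```python
CORE_W = 10.0  # incremental power per newly active core (approx., from text)
SMT_W = 4.0    # incremental power per extra SMT thread (midpoint of 3-5 W)

def incremental_power(num_threads, num_cores):
    """Estimated power above idle: cores fill first, then SMT siblings."""
    active_cores = min(num_threads, num_cores)
    smt_threads = max(0, num_threads - num_cores)
    return active_cores * CORE_W + smt_threads * SMT_W

# Westmere (6 cores): 6 -> 12 threads adds only ~24 W on top of ~60 W
print(incremental_power(6, 6))   # 60.0
print(incremental_power(12, 6))  # 84.0
```

This shape (a steep rise while cores fill, then a shallow tail as SMT siblings engage) matches the modest power step the measurements show between the half-loaded and fully loaded configurations.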
Figure 10 – Euler 3D Power Efficiency
The power efficiency for Westmere is pretty much in line with the performance gains: a 35% improvement at peak load. Again, in the intermediate region for Nehalem and Westmere, the power efficiency is roughly constant. It seems likely that with a more careful thread placement algorithm, the odd 'hump' in all of the curves could be shifted to the rightmost edge. An interesting experiment would be to disable SMT in the BIOS and re-run the tests to see the performance results; Westmere and Nehalem would likely have similar peak performance with and without SMT.
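As a sanity check on the arithmetic: performance per watt scales as the performance ratio divided by the power ratio. The 38% performance delta is from the text; the power ratio below is a hypothetical value chosen only to show how a 38% performance gain can land at roughly a 35% efficiency gain.

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative gain in performance per watt, given new/old perf and power ratios."""
    return perf_ratio / power_ratio - 1.0

# 1.38x performance at an assumed ~1.02x power draw
gain = perf_per_watt_gain(1.38, 1.02)
print(round(gain, 3))  # 0.353, i.e. ~35% better performance per watt
```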