The CASELab at Oklahoma State is a research group that works with NASA’s Dryden Flight Research Center to study and predict aeroelastic behavior: the interaction between inertial, elastic and aerodynamic forces. One example is flutter, in which a lifting surface such as a wing oscillates due to the coupling of aerodynamic forces and structural behavior. Mild flutter might produce a slight buzz in an aircraft, but more severe flutter could destroy an aircraft or other structure: the Tacoma Narrows Bridge collapsed due to flutter induced by a severe wind storm.
Euler3d is an application for computational aeroelasticity, developed at CASELab by Tim Cowan and multithreaded by Charles O’Neill. It is a multithreaded, floating point and bandwidth intensive computational fluid dynamics (CFD) application. The benchmark data set analyzes the airflow over a specific wing configuration. One of the nice things about Euler3d is that the number of threads is configurable, so we can investigate how performance and power scale with thread count. Performance is measured as the rate at which CFD cycles are calculated, and is reported in Hz.
Euler3d uses an unstructured grid, so that the mesh resolution can vary. In regions where precise modeling is required, the grid will be quite dense, while areas that are simpler to model will have a sparser grid. This approach is highly efficient because it avoids unnecessary computation, but the downside is that the grid has poor locality. Each node depends on the behavior of both adjacent and distant nodes, so the overall workload is highly dependent on the performance of random accesses and overall memory bandwidth.
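To make the locality problem concrete, here is a minimal sketch of an edge-based sweep over a mesh. It is not Euler3d's actual code: the function names, the toy flux and the randomly generated connectivity are all assumptions chosen for illustration. The point is only that node data is reached through an index table, so consecutive iterations can touch widely separated memory.

```python
import random

def edge_flux_sweep(edges, values):
    """Accumulate a toy flux across every edge of a mesh (hypothetical kernel)."""
    residual = [0.0] * len(values)
    for a, b in edges:
        # Indirect addressing: a and b come from the connectivity table,
        # so values[a] and values[b] may be far apart in memory.
        flux = values[b] - values[a]
        residual[a] += flux
        residual[b] -= flux
    return residual

def mean_index_gap(edges):
    """Average distance between the two node indices of an edge."""
    return sum(abs(a - b) for a, b in edges) / len(edges)

n = 1000

# Structured 1-D chain: each node only touches its immediate neighbor.
chain_edges = [(i, i + 1) for i in range(n - 1)]

# Crude stand-in for an unstructured mesh (assumption): random node pairs.
rng = random.Random(0)
scattered_edges = [(rng.randrange(n), rng.randrange(n)) for _ in range(n - 1)]

values = [float(i) for i in range(n)]
chain_residual = edge_flux_sweep(chain_edges, values)
scattered_residual = edge_flux_sweep(scattered_edges, values)
```

Comparing `mean_index_gap(chain_edges)` with `mean_index_gap(scattered_edges)` shows how much further apart the operands of each edge sit in the second case. A real mesh is not random, and node renumbering can recover some locality, but the access pattern is still driven by the connectivity table rather than by a fixed stride.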
Figure 1. Euler3d Performance
Our previous reviews have shown that Windows will place threads on the same processor socket, rather than attempting to evenly distribute work throughout the system. While this is advantageous for power consumption, it also produces a peculiar performance profile. As Sandy Bridge-EP scales from 8 to 16 threads and Westmere-EP goes from 6 to 12 threads, performance barely budges, but it doubles when the second chip is fully engaged.
Euler3d is a clear-cut demonstration of the tremendous improvements in the memory controller for Sandy Bridge-EP. A single chip has 16 threads and four channels of memory, yet it achieves similar performance to the entire Westmere-EP system, which collectively boasts 24 threads and 6 channels of DDR3. The overall performance gain over Westmere-EP is 73%, which is far more than the paper specifications would suggest. In all likelihood it is due to a combination of the second load port for the L1D, the high frequency ring bus and the faster memory controllers.
Figure 2. Euler3d Power Efficiency
The Sandy Bridge-EP server consumes 411W under full load, compared to just 311W for Westmere-EP. The power efficiency is still about 30% better, but that is certainly less remarkable than the dramatic boost in performance.
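The efficiency figure follows from dividing the performance ratio by the power ratio. The short sketch below reproduces that arithmetic from the numbers quoted in the text; it assumes that the 73% performance gain and both power readings refer to the same fully loaded configuration, and the function name is purely illustrative.

```python
def relative_efficiency(perf_ratio, power_new_w, power_old_w):
    """Performance-per-watt of the new system relative to the old one."""
    power_ratio = power_new_w / power_old_w
    return perf_ratio / power_ratio

# Figures quoted in the text: +73% performance, 411 W vs. 311 W.
perf_ratio = 1.73
power_ratio = 411.0 / 311.0
gain = relative_efficiency(perf_ratio, 411.0, 311.0)

print(f"power ratio: {power_ratio:.2f}x")
print(f"efficiency ratio: {gain:.2f}x")
```

Sandy Bridge-EP draws roughly a third more power, so the 1.73x performance advantage shrinks to an efficiency ratio of about 1.3x, which matches the improvement quoted above.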
One additional note: Euler3d has not been recompiled to use AVX. Since unstructured grids are notoriously difficult to vectorize, the potential benefits would come from two areas. The first is 256-bit wide AVX loads and stores, which may increase the bandwidth from the data cache. The second is 3-operand instructions, which might eliminate extraneous MOV instructions from the code. The gather and FMA instructions in AVX2 might also prove useful, but that is a moot point until Haswell arrives.