MAQSIP-RT: An HPC Benchmark


MAQSIP-RT Performance

The code for MAQSIP-RT is a combination of C and Fortran. Given that the underlying system uses AMD microprocessors, there are four reasonably high-performance compiler choices: Intel, PathScale, Portland Group, or Sun. On an Intel platform, ICC would probably be the best choice, but Intel’s main objective is achieving high performance on its own processors, not those of rival AMD. The other three compiler groups have a long history of extensive tuning and optimization for AMD processors and are better candidates. Of the three, Sun Studio Express is the obvious choice since it is free and readily available for download.

The following settings were used in the makefile for MAQSIP-RT:

MFLAGS = -xarch=native -xtarget=native -m64 -debugformat=dwarf
OMPFLAGS = -openmp -stackvar
OMPLIBS = -mt -lnsl
COPTFLAGS = -O3 ${MFLAGS}
FOPTFLAGS = -O3 ${MFLAGS} -M. -xfilebyteorder=big8:%all
FPPFLAGS =
FSFLAGS = -fno-automatic
ARCHFLAGS = -DFLDMN=1 -DAUTO_ARRAYS=1 -DBIT32=1 -DF90=1
PARFLAGS =
ARCHLIB = -Bdynamic -lfsu -lc

Selecting the data set for this benchmark was a bit of a balancing act as well. We want a realistic and representative data set, not a toy. As with many of our server workloads, we want to observe how different core counts perform, scaling up from 1 core to the maximum of 12. However, we also want a reasonable execution time for each configuration, covering a theoretical factor of 12 difference in performance (and this range will increase with more cores over time). Excessively long run times can be quite problematic from a practical standpoint, especially if problems with a benchmark run are only detected after the execution is complete.


Figure 1 - Terrain Height Map for Simulated Region

As shown in Figure 1, we simulated the eastern portion of the United States, extending east from the Rocky Mountains to the Atlantic Ocean, and adjacent parts of Canada. According to Carlie, the horizontal grid is 160 rows by 163 columns, on a Lambert Conformal Conic map projection at 15 km resolution. The simulation is for a 24-hour high-ozone period starting 18:00:00 GMT on August 7, 2007.
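To give a rough sense of the problem size, the grid dimensions quoted above can be turned into a cell count and an approximate domain extent. This is only a back-of-the-envelope sketch: it assumes rows run north-south and columns run east-west, and the distances are measured in the projected plane rather than on the ground. The number of vertical layers is not stated here, so the count is per layer.

```python
# Grid dimensions and resolution as quoted in the text.
ROWS = 160      # assumed to run north-south
COLS = 163      # assumed to run east-west
CELL_KM = 15    # horizontal resolution in kilometers

cells_per_layer = ROWS * COLS
north_south_km = ROWS * CELL_KM
east_west_km = COLS * CELL_KM

print(f"cells per layer: {cells_per_layer}")
print(f"approximate extent: {east_west_km} km x {north_south_km} km")
```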

Figure 2 below shows the results of our benchmark. The scaling for MAQSIP-RT is very good, with 12 cores achieving 10.96x the performance of a single core, or 91% of the ideal speedup.


Figure 2 - MAQSIP-RT Performance
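The 91% figure is simply the measured speedup divided by the core count. The sketch below reproduces that arithmetic and, as an extra step that is not part of the original analysis, also estimates the serial fraction implied by the measurement using the Karp-Flatt metric. The function name is hypothetical, and the serial fraction is an interpretation of one data point, not a measured property of MAQSIP-RT.

```python
def speedup_metrics(speedup, cores):
    """Return (parallel efficiency, Karp-Flatt serial fraction estimate)."""
    if cores < 2 or speedup <= 0:
        raise ValueError("need at least 2 cores and a positive speedup")
    # Efficiency: fraction of the ideal (linear) speedup actually achieved.
    efficiency = speedup / cores
    # Karp-Flatt metric: experimentally determined serial fraction.
    serial_fraction = (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)
    return efficiency, serial_fraction


# Values taken from the text: 10.96x on 12 cores.
efficiency, serial_fraction = speedup_metrics(10.96, 12)
print(f"parallel efficiency: {efficiency:.1%}")
print(f"estimated serial fraction: {serial_fraction:.2%}")
```

Under this reading, a serial fraction below 1% would be consistent with the near-linear scaling described in the text.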

Our tests varied not only the count, but also the placement of threads. Thread placement can have a substantial impact on both power and performance. For workloads with little sharing, using cores on both sockets maximizes the available memory bandwidth. However, with a properly designed cache hierarchy, workloads with a lot of shared data may benefit from residing on a single chip instead of spanning both sockets and requiring cross-chip coherency. Placing all threads on one processor may also reduce power consumption substantially if the idle processor can go into a low-power state.
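The two placement strategies can be illustrated with a small sketch. It assumes the two-socket, six-cores-per-socket layout implied by the 12-core maximum, and it uses hypothetical policy names ("packed" and "spread") and a hypothetical (socket, core) numbering. It only computes where each thread would go; how those slots map to operating system CPU IDs, and how threads are actually pinned, depends on the platform and is not covered here.

```python
def place_threads(n_threads, sockets=2, cores_per_socket=6, policy="packed"):
    """Return a list of (socket, core) slots, one per thread."""
    total = sockets * cores_per_socket
    if not 1 <= n_threads <= total:
        raise ValueError(f"n_threads must be between 1 and {total}")
    if policy == "packed":
        # Fill every core on socket 0 before touching socket 1.
        return [divmod(t, cores_per_socket) for t in range(n_threads)]
    if policy == "spread":
        # Alternate sockets so threads are divided evenly between them.
        return [(t % sockets, t // sockets) for t in range(n_threads)]
    raise ValueError(f"unknown policy: {policy}")


for n in (2, 6):
    for policy in ("packed", "spread"):
        print(n, policy, place_threads(n, policy=policy))
```

With 6 threads, the packed policy keeps everything on one socket and leaves the other idle, while the spread policy puts 3 threads on each socket, which gives the workload access to both memory controllers.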

Surprisingly, the performance delta between the different thread allocation schemes was negligible, less than 1%. In both cases (2 threads and 6 threads), the performance favors allocating threads on the same socket, rather than evenly spreading them between sockets. This suggests that even at 6 threads per socket, the bandwidth is sufficient for MAQSIP-RT, and that there may be a modest amount of shared data.

