The Limits of GPUs
A talk by Rich Vuduc of Georgia Tech clarified the performance potential of GPUs vis-à-vis CPUs and outlined the limits of their acceleration. Rich began the talk with a satirical article by David Bailey, published in Supercomputing Review in 1991. The article, on “Twelve ways to fool the masses…” regarding benchmarketing and parallel computer performance, quipped that “Unfortunately, with a little tuning many applications run quite fast on Crays. Therefore you must be careful not to do any tuning on the Cray code,” so as to compare one’s own favorite supercomputer favorably against the highly regarded Crays of the time.
Figure 10 – Developer effort versus performance for CPUs and GPUs
To debunk the claims of GPUs being orders of magnitude faster than CPUs, a “meta-analysis” was conducted of semi-irregular applications, consisting of sparse iterative and direct solvers and tree-based particle methods, on a two-socket, four-core Intel Nehalem EP system (8 cores total) and select Nvidia setups. By the presenter’s own admission this was a preliminary, apples-to-oranges comparison: given the same level of tuning, how would a T10P GPU, with about 3x the peak memory bandwidth and 1.8x the peak double-precision (11x the peak single-precision) compute potential, compare to the CPU system? First, to give a sense of the tradeoffs involved, the authors presented parallel sorting microbenchmark results (shown above in Figure 10) for a range of languages and platforms, plotted by runtime and lines of code, though the problem size was not disclosed. Standard serial C code, at 25 lines, produces a runtime of 2.86 seconds. Cilk++ code for the same problem, at 56 lines, gives a time of 1.17 seconds on a sixteen-core machine. CUDA code, at 282 lines and with algorithmic changes, results in a runtime of 0.07 seconds on an unnamed Nvidia GPU. Clearly, if the problem admits sufficiently parallel implementations and the effort is worthwhile, developer productivity can be traded for orders-of-magnitude runtime improvements. This holds for both the GPU and the CPU, though in this instance the C version was neither threaded nor tuned to run on the CPU.
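The lines-of-code versus runtime tradeoff above can be sketched in miniature. The following is a hypothetical illustration, not code from the talk: a one-line serial sort next to a fork/join parallel mergesort, showing the extra structural effort parallelism demands. (With CPython's GIL, real speedups would require processes or native code, as Cilk++ and CUDA provide; the point here is the code structure, not measured performance.)

```python
# Illustrative sketch (not from the talk): serial sort vs. a fork/join
# parallel mergesort, contrasting developer effort for the two approaches.
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def serial_sort(xs):
    # The "25 lines of C" analogue: a single library call.
    return sorted(xs)

def parallel_mergesort(xs, workers=4):
    # Split into roughly equal chunks, sort each in its own task,
    # then combine the sorted runs -- the classic fork/join shape.
    n = max(1, len(xs) // workers)
    chunks = [xs[i:i + n] for i in range(0, len(xs), n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunks))   # fork: sort chunks in parallel
    return list(merge(*runs))                   # join: k-way merge of sorted runs

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
assert parallel_mergesort(data) == serial_sort(data)
```

Even this toy version roughly doubles the line count before any tuning, echoing the C-to-Cilk++ jump in Figure 10.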
Figure 11 – Sparse iterative solver performance for naïve CPU, GPU and tuned CPU implementations
Figure 11 above shows a case study focusing on sparse iterative solvers. This bandwidth-limited, double-precision benchmark was run on a GTX 285 and an untuned Nehalem, yielding a 6.5x advantage for the tuned GPU code. With tuning of the CPU code, however, the performance difference drops to 2.1x in favor of the single-chip GPU. A second case study, in single precision, examined tree-based fast multipole methods for simulating particle-system interactions, testing a Tesla C1060 against untuned single-threaded code on the Nehalem system; the GPU came out 31x faster. Merely tuning and vectorizing the Nehalem code yields an eightfold performance benefit. A different story emerges, however, once the Nehalem code is both tuned and parallelized with OpenMP: the CPU system then tops the single-chip GPU, running a full 53x faster than the baseline and 70% faster than the Tesla. A two-chip Tesla C1060 setup yields a 58x improvement over the baseline Nehalem code, edging the GPUs back past the Nehalem system. The results clearly paint a more complex picture of CPU and GPU performance potential than has typically been presented. However, an Nvidia engineer present at the workshop indicated that Nvidia GPUs bought for HPC applications do not typically run such “irregular” workloads, but are purchased for more regular, embarrassingly parallel (EP) scientific computing codes, which was admittedly outside the scope of the talk. Claims were made verbally that the power efficiency of the two approaches was similar during the testing, though no quantitative results were presented.
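Why bandwidth, rather than peak flops, governs this class of solver can be seen with a back-of-the-envelope roofline bound. The sketch below uses illustrative numbers, not figures from the talk: double-precision CSR sparse matrix-vector multiply streams roughly 12 bytes per nonzero (an 8-byte value plus a 4-byte column index) to perform 2 flops, so attainable performance is capped near bandwidth times that arithmetic intensity.

```python
# Back-of-the-envelope roofline bound for why sparse solvers are bandwidth
# limited. All numbers are illustrative assumptions, not results from the talk.
flops_per_nnz = 2.0           # one multiply and one add per stored nonzero
bytes_per_nnz = 12.0          # 8-byte DP value + 4-byte index; ignores vector traffic
intensity = flops_per_nnz / bytes_per_nnz   # ~0.17 flop/byte

def spmv_bound_gflops(bandwidth_gb_s):
    """Optimistic upper bound on SpMV GFLOP/s for a given memory bandwidth."""
    return bandwidth_gb_s * intensity

# Hypothetical platforms reflecting the ~3x bandwidth gap mentioned above:
cpu_bound = spmv_bound_gflops(50.0)    # e.g. a two-socket CPU system
gpu_bound = spmv_bound_gflops(150.0)   # e.g. a single GPU board
print(round(cpu_bound, 1), round(gpu_bound, 1))   # both far below either chip's peak flops
```

Under these assumptions the GPU's advantage on SpMV tracks its bandwidth ratio (about 3x here), not its much larger peak-compute ratio, which is consistent with the modest 2.1x gap once both codes are tuned.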
Figure 12 – Tree based fast multipole method performance for naïve CPU, GPU and tuned CPU implementations
Various themed lunchtime discussions took place at the workshop. Interesting conversations revolved around the place of parallelism education in curricula. One topic was whether to introduce a required, self-contained undergraduate course on parallelism for computer science students. The feeling was that “ghettoizing” parallelism into its own class is not desirable, especially when parallelism and concurrency are cross-cutting issues involving the whole computing stack. Instead, Dan Grossman of the University of Washington advocated the gradual introduction of parallelism themes in various courses throughout the undergraduate curriculum. Dr. Grossman had recently concluded a second-year data structures course in which various parallel patterns were taught, along with a restricted model of concurrency whereby threads are forked and joined in a lattice-like fashion, eliding undesirable interleavings, while operating on non-shared data. Such a model permits sophomore students to reason about a parallel algorithm implementation without the unnecessary complications of all-out concurrency.
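The restricted model described above can be sketched as follows. This is a minimal, assumed illustration, not Dr. Grossman's actual course material: each task forks a helper on one private half of the input, handles the other half itself, and joins before combining results, so no two tasks ever touch shared mutable data and there are no interleavings to reason about.

```python
# Minimal fork/join sketch of the restricted concurrency model (an assumed
# illustration, not course code). Each thread works on a disjoint slice;
# the only synchronization points are fork (start) and join.
import threading

def fj_sum(xs, cutoff=16):
    # Small inputs: compute sequentially.
    if len(xs) <= cutoff:
        return sum(xs)
    mid = len(xs) // 2
    result = [None]                     # private cell, written by exactly one thread
    def left_task():
        result[0] = fj_sum(xs[:mid], cutoff)
    t = threading.Thread(target=left_task)
    t.start()                           # fork: helper takes a private slice
    right = fj_sum(xs[mid:], cutoff)    # this thread takes the other half
    t.join()                            # join: the lattice closes here
    return result[0] + right

assert fj_sum(list(range(101))) == 5050
```

Because every slice is disjoint and results flow only through fork/join edges, the computation behaves deterministically, which is exactly what makes the model tractable for sophomores.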
Papers and presentations from HotPar 2010 can be found online for the interested reader.