Article: Parallelism at HotPar 2010
By: Richard (no.delete@this.email.com), August 9, 2010 4:58 am
Room: Moderated Discussions
Here are some numbers from my own benchmarks of an algorithm called collapsed cone superposition, which is used to calculate the dose distribution of photon radiotherapy in heterogeneous tissue. It is basically a ray cast operation using non-divergent rays through a 3D voxel grid. The most significant cost is 2-4 acos() calls per voxel per ray per direction. In this test 106 directions were used and the voxel grid had a size of 128^3, which means at least 128^2 rays per direction (sometimes extensions of the entry plane are needed). The algorithm is embarrassingly parallel.
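To make the structure concrete, here is a minimal CUDA sketch of this kind of per-ray inner loop. It is not the benchmark code: the kernel name, parameter names, and the toy per-voxel math are illustrative assumptions, and deposition into the shared dose grid is left out here (that part is discussed further below).

// Minimal sketch of a per-ray inner loop for this kind of algorithm:
// one thread per ray on the 128x128 entry plane, all rays in a launch
// sharing one direction, with a couple of acosf() calls per voxel
// dominating the cost.  Names (traceRays, density, rayOut) and the
// per-voxel math are illustrative, not the actual implementation.
#include <cuda_runtime.h>

#define N 128  // voxel grid is N^3, as in the benchmark

__global__ void traceRays(const float *density, float *rayOut,
                          float3 origin, float3 dir)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;  // ray index on entry plane
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= N || iy >= N) return;

    float3 p = make_float3(origin.x + ix, origin.y + iy, origin.z);
    float depth = 0.0f;   // radiological depth along the ray
    float sum   = 0.0f;   // per-ray accumulator (stands in for dose deposition)

    // March voxel by voxel along the shared direction; neighbouring rays
    // take (nearly) the same path length with no data-dependent branching,
    // so the warps stay essentially converged.
    for (int step = 0; step < N; ++step) {
        int vx = (int)p.x, vy = (int)p.y, vz = (int)p.z;
        if (vx < 0 || vx >= N || vy < 0 || vy >= N || vz < 0 || vz >= N) break;

        depth += density[(vz * N + vy) * N + vx];

        // The dominant cost in the real algorithm: 2-4 acos() per voxel.
        float a1 = acosf(dir.z);
        float a2 = acosf(depth / (depth + 1.0f));
        sum += a1 + a2;

        p.x += dir.x; p.y += dir.y; p.z += dir.z;
    }
    rayOut[iy * N + ix] = sum;
}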
The GPU version is written in CUDA, the CPU version in C99. Both are optimized implementations. The CPU version is optimized for cache usage and uses custom (less accurate where appropriate) math functions. It does not use SSE intrinsics but relies on the compiler to generate SSE code, so it could be optimized further for x86; I believe this explains the performance difference between the Intel compiler and the Microsoft one. The CPU version is multi-threaded and uses fine-grained locking on shared data structures.
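As an example of what a "less accurate where appropriate" math function can look like (the actual routines used in the benchmark are not shown here and may differ), a classic polynomial approximation of acos() in the style of Abramowitz & Stegun 4.4.45:

/* Hypothetical example of a reduced-accuracy acos(): the Abramowitz &
 * Stegun 4.4.45 polynomial, with absolute error around 7e-5 rad.  This is
 * only an illustration of the technique, not the benchmark's own routine. */
#include <math.h>

static inline float fast_acosf(float x)
{
    float ax = fabsf(x);
    /* Polynomial fit, valid for 0 <= ax <= 1. */
    float r = sqrtf(1.0f - ax) *
              (1.5707288f + ax * (-0.2121144f +
               ax * (0.0742610f - 0.0187293f * ax)));
    /* Mirror into the second quadrant for negative inputs. */
    return (x < 0.0f) ? 3.14159265f - r : r;
}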
All calculations are done in single-precision FP. The thread counts listed for the CUDA implementation are "CUDA threads". We see here that a GTX 260 Core 216 is about 7.3 times faster than a Core i7 920 using 8 threads.
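Just to make the "CUDA threads" notion concrete, here is a hypothetical host-side launch of the sketch kernel above: one direction's 128x128 rays become an 8x8 grid of 16x16-thread blocks, i.e. 16384 CUDA threads per launch. The 5376 and 13824 figures in the table come from the real implementation's own decomposition, which is not shown here.

// Hypothetical launch of the traceRays sketch above; block/grid shapes
// are illustrative, not the configuration used in the benchmark.
void launchOneDirection(const float *d_density, float *d_rayOut,
                        float3 origin, float3 dir)
{
    dim3 block(16, 16);
    dim3 grid(N / block.x, N / block.y);   // N = 128, so an 8x8 grid of blocks
    traceRays<<<grid, block>>>(d_density, d_rayOut, origin, dir);
}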
The reason the GPU does so well here is that the rays are non-divergent and thus map really well to SIMD hardware. There is almost no branching, so the vector paths are also non-divergent. The GPU version is also slightly more accurate, since the CPU version uses a less accurate acos() function; using the lower-accuracy version actually reduced performance on the GPU, because it normally uses the hardware acos (which is already less accurate).
Since this is close to optimal usage of GPU resources, I have a hard time believing 100x speedups for GPGPU unless we are talking about very specific optimizations such as using the texture samplers or very specific instruction mixes. Sometimes tuning for hardware can achieve nice results; see, for instance, this classic thread http://forum.beyond3d.com/showthread.php?t=54842 on hand-optimizing SGEMM for ATI hardware.
The CPU is more flexible. On the GPU we have to use 16 times as much memory, in the absence of (fast) mutexes, to achieve optimal performance; this has been fixed somewhat in the latest generation of GPUs. Also, if we were allowed to use SSE intrinsics, CPU performance could still go up somewhat, although on older Intel processors we were mostly limited by other parts of the code and by memory bandwidth; the latter, at least, has been fixed in Nehalem. Alas, I do not have any newer benchmarks. I would be very interested in the latest CPUs versus NVIDIA's Fermi.
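One plausible reading of that memory-for-synchronization trade-off (the real scheme is not shown here, so the names and the replica count are assumptions): pre-Fermi GPUs lack fast floating-point atomics, so each group of rays accumulates into its own private copy of the dose grid and the copies are summed afterwards, while Fermi and later can update a single grid with atomicAdd() at the cost of contention.

// Sketch of the assumed trade-off: REPLICAS private dose grids reduced
// after the fact (16x the memory), versus direct atomic deposition on
// hardware with floating-point atomics.
#include <cuda_runtime.h>

#define N        128   // voxel grid is N^3
#define REPLICAS 16    // one private dose grid per group of rays (assumed)

// Sum the REPLICAS private grids into the final dose grid;
// launch with at least N*N*N threads in total.
__global__ void reduceReplicas(const float *privateDose, float *dose)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // voxel index
    if (v >= N * N * N) return;

    float sum = 0.0f;
    for (int r = 0; r < REPLICAS; ++r)
        sum += privateDose[r * N * N * N + v];
    dose[v] = sum;
}

// Fermi-and-later alternative: deposit directly with a floating-point
// atomic, trading the extra memory for contention on hot voxels.
__device__ void deposit(float *dose, int voxel, float value)
{
    atomicAdd(&dose[voxel], value);   // requires compute capability >= 2.0
}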
Results (view in a monospaced font for proper alignment):

Platform    A        A        A        A        B        B        C        D
Compiler    VC++8.0  ICC9.0   VC++8.0  ICC9.0   ICC9.0   ICC9.0   CUDA2.1  CUDA2.1
#threads    1        1        4        4        1        8        5376     13824
Dose (sec)  119.56   67.06    30.69    17.92    58.516   10.20    2.53     1.39

A: Xeon E5430, 2.66 GHz
B: Core i7 920, 2.66 GHz
C: Quadro FX 3700 / Xeon E5430
D: GTX 260 Core 216 / Core i7 920