Nvidia unquestionably uses PhysX as an exclusive marketing tool for its GPUs, and PhysX clearly benefits from executing on a GPU. Nvidia claims that a modern GPU can improve physics performance by 2-4X over a CPU. That is an impressive claim, and some benchmarks (e.g. Cryostasis) seem to bear it out. However, detractors of Nvidia (largely those working at one of Nvidia’s competitors) have repeatedly claimed that PhysX purposefully handicaps execution on a CPU to make GPUs look better. Of course, comments from a competitor should be taken with a large grain of salt. But if Nvidia does cripple CPU PhysX, it would throw into question how beneficial GPU PhysX really is. A genuine 4X advantage is certainly worthwhile. However, if the CPU is deliberately hobbled to run 2X slower than it should, then the GPU only has a 2X advantage in reality, which is far less impressive.
A couple of months ago, we decided to profile a couple of applications which use PhysX to test how PhysX behaves on the CPU and GPU. Initially, we were going to use VTune to compare, contrast and analyze both GPU-accelerated and CPU PhysX by collecting performance counter data. However, after our first VTune run analyzing PhysX execution on the CPU, our results were so strange that we changed our plan to focus solely on profiling CPU PhysX and examining how it is tuned for the CPU.
Our test system is a relatively modern 3.2GHz Nehalem (Bloomfield) with an Nvidia GTX 280 GPU and 3GB of memory (3 DIMMs). It runs Windows 7 (64-bit), with nvcuda.dll version 126.96.36.19921 and PhysX version 09.09.1112. To test PhysX, we used the Cryostasis tech demo and the Dark Basic PhysX Soft Body Demo, analyzing their execution with Intel’s VTune. In each case, hardware PhysX acceleration was disabled in the NV control panel, and the demo was then run under VTune. For comparison, the two tests were also run with GPU-accelerated PhysX. As expected, the GPU-accelerated versions ran at a reasonable speed with very nice effects, while the CPU chugged along rather sluggishly. There was a very clear difference in performance, showing the benefit of accelerating PhysX on a GPU.
VTune analyzes the execution of an application at several levels of granularity. The coarsest is the set of processes running in the system. From there, VTune can drill down into interesting processes and examine the threads within each process. The finest granularity is inspecting the individual modules executed within each thread. For each of the tests, we analyzed execution at every level and highlighted the key processes, threads and modules being used. We also tracked several performance counters, which are reported in our results:
- Cycles – The number of unhalted clock cycles
- Instructions – The total number of instructions retired
- x87 instructions – The total number of x87 instructions retired (a subset of the overall instructions retired)
- x87 uops – The number of x87 uops executed (note that a uop can be executed but then squashed, e.g. due to a branch misprediction)
- FP SSE uops – The number of floating point SSE uops executed (this includes SSE1 and SSE2 uops, both scalar and packed)
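As a rough illustration of how these counters are used, the split between legacy x87 and SSE floating point work falls out of simple ratios. The counter values below are purely hypothetical, not our measured data:

```python
# Hypothetical performance counter values (NOT measured data), illustrating
# how the x87 vs. SSE floating-point mix can be derived from raw counts.
counters = {
    "instructions": 10_000_000_000,     # total instructions retired
    "x87_instructions": 3_500_000_000,  # x87 instructions retired
    "x87_uops": 3_600_000_000,          # x87 uops executed (incl. squashed)
    "fp_sse_uops": 400_000_000,         # FP SSE uops executed (scalar + packed)
}

# Share of all retired instructions that are x87.
x87_share = counters["x87_instructions"] / counters["instructions"]

# Share of floating-point uops that went down the x87 path rather than SSE.
fp_uops = counters["x87_uops"] + counters["fp_sse_uops"]
x87_fp_share = counters["x87_uops"] / fp_uops

print(f"x87 share of all instructions: {x87_share:.0%}")   # 35%
print(f"x87 share of FP uops:          {x87_fp_share:.0%}")  # 90%
```

A high x87 share of the floating-point uops would suggest code compiled for the ancient x87 stack rather than SSE, which is the kind of tuning question this profiling is meant to answer.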
VTune also tracks Instructions Per Cycle (IPC), which is the average number of instructions retired each cycle. Nehalem can retire up to 4 instructions per cycle, and realistically it can probably sustain an IPC of 0.5-1.5 on most workloads.
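IPC is simply the ratio of two of the counters listed above. A minimal sketch, with illustrative numbers rather than measured ones:

```python
# IPC = instructions retired / unhalted clock cycles.
# The values below are illustrative, not measured.
instructions_retired = 6_000_000_000
unhalted_cycles = 8_000_000_000

ipc = instructions_retired / unhalted_cycles
print(f"IPC: {ipc:.2f}")  # 0.75, within the 0.5-1.5 range typical of Nehalem
```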
One essential reminder: VTune uses statistical sampling, and thus the accuracy depends on the number of samples. If there are relatively few samples, then the numbers may vary substantially. In general, the longer running processes/threads/modules will be sampled more often and hence generate more accurate data, while those processes/threads/modules which run only briefly may yield less than ideal results. One advantage of working with modern CPUs is that they execute billions of cycles per second, so the law of large numbers ensures that the results are accurate and stable.
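To see why sample count matters, here is a small simulation of event sampling (our own simplified model, not how VTune works internally): estimating what fraction of time a module accounts for from N random samples, where the true share is assumed to be 30%.

```python
import random

random.seed(42)

# Assumed "true" share of execution time for some module (hypothetical).
TRUE_SHARE = 0.30

def estimate(n_samples: int) -> float:
    """Estimate the module's share of time from n random samples."""
    hits = sum(1 for _ in range(n_samples) if random.random() < TRUE_SHARE)
    return hits / n_samples

for n in (100, 10_000, 1_000_000):
    est = estimate(n)
    print(f"{n:>9} samples: estimate = {est:.4f}, "
          f"error = {abs(est - TRUE_SHARE):.4f}")
```

The error shrinks roughly with the square root of the sample count, which is why long-running threads and modules yield trustworthy numbers while briefly-running ones do not.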
While it would be nice to track many more performance counters, we were ultimately limited by the amount of time available, and frankly many of the counters were relatively uninteresting in the context of PhysX. The results of our profiling are on the next page.