With VTune, we first profiled at the coarsest granularity – focusing on processes running in the system. Based on the number of instructions retired and cycles spent, we selected the top processes. To drill down further, we profiled the threads within each top process. Last, we selected the top threads and then profiled the modules within each top thread.
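The drill-down described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not VTune's interface: the record layout, the `drill_down` helper, and all names and counts in `SAMPLES` are hypothetical, and for simplicity the sketch follows only the single top entry at each level, ranked by instructions retired.

```python
from collections import defaultdict

# Hypothetical sample records: (process, thread, module, instructions, cycles).
# Names and counts are invented for illustration; they are not measured data.
SAMPLES = [
    ("game.exe", "thread_a", "physics.dll", 700, 500),
    ("game.exe", "thread_a", "game.exe", 100, 100),
    ("game.exe", "thread_b", "game.exe", 150, 200),
    ("other.exe", "thread_c", "other.exe", 50, 200),
]


def aggregate(samples, key_index):
    """Sum instructions and cycles per key (0=process, 1=thread, 2=module)."""
    totals = defaultdict(lambda: [0, 0])
    for record in samples:
        totals[record[key_index]][0] += record[3]
        totals[record[key_index]][1] += record[4]
    return totals


def top_entry(samples, key_index):
    """Return the key with the most instructions retired at this level."""
    totals = aggregate(samples, key_index)
    return max(totals, key=lambda name: totals[name][0])


def drill_down(samples):
    """Pick the top process, then its top thread, then that thread's top module."""
    path = []
    for level in range(3):
        best = top_entry(samples, level)
        path.append(best)
        # Narrow the data to the selected entry before descending a level.
        samples = [r for r in samples if r[level] == best]
    return path


print(drill_down(SAMPLES))
```

The analysis in this article kept several top entries per level rather than one; extending the sketch to do that would mean sorting the totals and recursing into each retained entry.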
Chart 1 below shows the results from profiling the active processes for both workloads (Cryostasis and Soft Body Physics). In each case, we kept the top 10 processes, as measured by the percentage of instructions retired. Generally, the percentage of cycles is closely correlated with the instructions retired, but there is some slight variation. In each of the charts, we bolded the entries that were important and selected for further analysis. The right-hand side of the chart contains the number of events observed during the experiment, while the left-hand side contains percentages for each type of event observed during the experiment. For example, 90.9% of the floating point SSE uops observed during our experiment were executed from the Cryostasis process.
Chart 1 – Process level view of PhysX applications
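As a rough illustration of how the two halves of such a chart relate, the sketch below derives percentage shares and IPC (instructions retired divided by cycles) from raw event counts. The process names and counts are hypothetical placeholders, not values from Chart 1, and the helper names are assumptions made for this example.

```python
def share_percent(counts):
    """Convert raw per-process event counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}


def ipc(instructions, cycles):
    """Instructions retired per cycle."""
    return instructions / cycles


# Hypothetical event counts per process (illustrative only, not measured data).
instructions = {"app.exe": 9_000, "driver.exe": 700, "idle": 300}
cycles = {"app.exe": 8_000, "driver.exe": 1_200, "idle": 800}

instruction_share = share_percent(instructions)
cycle_share = share_percent(cycles)

for name in instructions:
    print(
        f"{name}: {instruction_share[name]:.1f}% of instructions, "
        f"{cycle_share[name]:.1f}% of cycles, "
        f"IPC {ipc(instructions[name], cycles[name]):.2f}"
    )
```

A process whose share of instructions exceeds its share of cycles, like the hypothetical `app.exe` here, has a higher IPC than the system-wide average, which is the pattern the text describes for the dominant process in each workload.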
In Cryostasis, there is only one process of significance, cryostasis.exe itself; all others constitute roughly 2% of instructions retired and 10% of the cycles. Strangely enough, Cryostasis uses a tremendous number of x87 instructions; roughly 31% of the instructions retired are x87. There are plenty of x87 uops but hardly any SSE floating point uops, roughly a 100:1 ratio. Perhaps at finer granularity, it will be clear exactly where these x87 instructions are coming from. Despite the x87 instructions, the IPC is a respectable 1.15.
Similarly, the Soft Bodies demo is dominated by a single process which accounts for almost all instructions (97%) and cycles (87%). The SoftBodies.exe process is heavily weighted towards x87 instructions, which are 31% of all retired instructions, with few SSE floating point operations. Like Cryostasis, the IPC is pretty good, achieving 1.23, largely due to the structured nature of the underlying physics code. The slight difference between the two probably reflects the additional code required for a game, rather than a simple screen demo.
Chart 2 – Thread level view of PhysX applications
Drilling down to the thread level in Chart 2, there are two significant threads within the cryostasis.exe process, although the labels defy easy comprehension. Thread99 is the more important of the two, accounting for 80% of the cycles and instructions retired, although thread24 is significant enough to note. Looking at thread99 in Chart 3, the vast majority of time is spent inside the PhysXCore.dll module, which uses no SSE and all x87 for floating point calculations (roughly 35% of instructions retired in PhysXCore.dll are x87). In fact, PhysXCore.dll is the culprit responsible for 91% of all x87 instructions retired in the entire process. Despite the use of x87, the IPC is fairly high, 1.4 instructions retired per cycle.
Chart 3 – Module level view of Cryostasis
Thread24 corresponds primarily to cryostasis.exe itself and is a smaller portion of the overall process (roughly 10%). Thread24 uses some SSE floating point operations, although this is still dwarfed by the overall use of x87 operations. There are roughly 3X as many x87 uops as SSE floating point uops, and the x87 instructions are 15% of the instructions retired in the module and 3% of the instructions retired in the process.
SoftBodies.exe has two principal component threads; thread71 is roughly 73% of the overall instructions retired and cycles, while thread1 is the remaining 26%. Thread71 is almost entirely composed of the PhysXCore.dll module. Again, this module does not use any SSE and instead relies on x87; an incredible 40% of the retired instructions are x87. Since the module dominates the process overall, it is not surprising that 95% of the x87 instructions retired in the process are found within this one module. The IPC for this module is similar to the IPC observed when executing Cryostasis, a healthy 1.4, which helps to explain the overall IPC of the process.
Oddly enough, neither workload is multithreaded in a meaningful way. In each case, one thread is doing 80-90% of the work, rather than the work being split evenly across two or four threads or, as is done on an Nvidia GPU, across hundreds of threads.
Chart 4 – Module level view of SoftBodies.exe
The second and smaller thread1 is primarily ole32.dll, which is a library used by Windows for OLE (Object Linking and Embedding). The ole32.dll module has a little x87 code, about 6% of instructions retired, but far less than the massive 40% found in PhysXCore.dll. It’s not quite clear what the library is actually doing, but it only contributes a little to the overall use of x87.
Overall, the results are somewhat surprising. In each case, the PhysX libraries are executing with an IPC > 1, which is pretty good performance. But at the same time, there is a disturbingly large amount of x87 code used in the PhysX libraries, and no SSE floating point code. Moreover, PhysX code is automatically multi-threaded on Nvidia GPUs by the PhysX and device drivers, whereas there is no automatic multi-threading for CPUs.