The x87 floating point instructions are positively ancient, and have long since been deprecated in favor of the much more efficient SSE2 instructions (and soon AVX). Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD has deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA’s C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-mode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended SSE over x87 since 2005, and there is no reason to use x87 unless software has to run on an embedded Pentium or 486.
x87 uses a stack of 8 registers with an extended precision 80-bit floating point format. However, x87 data is primarily stored in memory in a 64-bit format that discards the extra 16 bits. Because of this loss of precision, x87 code can return noticeably different results if the data is spilled from the registers to memory and then reloaded. x87 instructions are scalar by nature, and even the highest-performance CPUs can only execute two x87 operations per cycle.
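The spill problem can be illustrated with a small simulation. This is only a sketch, not x87 code: it models precision alone, assuming a 64-bit significand for the 80-bit extended format and a 53-bit significand for the 64-bit double format, and it ignores exponent range. The helper name `round_to_significand` is made up for this illustration.

```python
from fractions import Fraction

def round_to_significand(x, bits):
    """Round x to the nearest value with a `bits`-bit significand.

    Ties go to even. Exponent limits are ignored, so this models
    precision only (an assumption made for this sketch).
    """
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    mag = abs(Fraction(x))
    # Initial guess for the scaling exponent, corrected by the loops below
    # so that 2**(bits-1) <= scaled < 2**bits.
    e = mag.numerator.bit_length() - mag.denominator.bit_length() - bits
    scaled = mag / Fraction(2) ** e
    while scaled >= 2 ** bits:
        e += 1
        scaled /= 2
    while scaled < 2 ** (bits - 1):
        e -= 1
        scaled *= 2
    return sign * round(scaled) * Fraction(2) ** e

EXTENDED_BITS = 64  # significand width of the 80-bit x87 format
DOUBLE_BITS = 53    # significand width of the 64-bit double format

tiny = Fraction(1, 2 ** 60)

# The sum stays in an x87 register: the small term survives.
in_register = round_to_significand(1 + tiny, EXTENDED_BITS)
kept = in_register - 1

# The sum is spilled to memory as a 64-bit double and reloaded:
# the small term is rounded away.
spilled = round_to_significand(in_register, DOUBLE_BITS)
lost = spilled - 1

print("kept in register:", kept)
print("after spill:     ", lost)
```

The same arithmetic produces a nonzero result in one case and zero in the other, and which one a program gets depends on whether the compiler happened to keep the value in a register.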
In contrast, SSE has 16 flat registers (in 64-bit mode; 32-bit code sees only 8) that are 128 bits wide. Floating point numbers can be stored in a single precision (32-bit) or double precision (64-bit) format. A packed (i.e. vectorized) SSE2 instruction can perform two double precision operations, or four single precision operations. Thus a CPU like Nehalem or Shanghai can execute 4 double precision operations, or 8 single precision operations per cycle. With AVX, that will climb to 8 or 16 operations respectively. SSE also comes in a scalar variety, where only one operation is executed per instruction. However, scalar SSE code is still somewhat faster than x87 code, because there are more registers, SSE instructions have slightly lower latency than their x87 equivalents, and stack manipulation instructions are not needed. Additionally, some SSE non-temporal memory accesses are substantially faster (e.g. 2x for AMD processors) as they use a relaxed consistency model. So why is PhysX using x87?
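Before turning to that question, the peak throughput figures above can be checked with a little arithmetic. The sketch below assumes the CPU can issue two floating point instructions per cycle (for instance one add and one multiply), which is what the numbers quoted above imply; the function name is made up for this illustration.

```python
def peak_ops_per_cycle(register_bits, element_bits, instructions_per_cycle=2):
    """Peak floating point operations per cycle.

    Assumes `instructions_per_cycle` FP instructions issue every cycle
    and that each instruction operates on every lane of the register.
    """
    lanes = register_bits // element_bits
    return lanes * instructions_per_cycle

# Packed SSE2: 128-bit registers.
sse_double = peak_ops_per_cycle(128, 64)
sse_single = peak_ops_per_cycle(128, 32)

# AVX: 256-bit registers.
avx_double = peak_ops_per_cycle(256, 64)
avx_single = peak_ops_per_cycle(256, 32)

# x87 and scalar SSE: one element per instruction, whatever its width.
scalar = peak_ops_per_cycle(64, 64)

print("SSE2 packed:", sse_double, "double /", sse_single, "single")
print("AVX packed: ", avx_double, "double /", avx_single, "single")
print("scalar:     ", scalar)
```

For single precision work, packed SSE has a 4x peak advantage over any scalar code path, x87 included.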
PhysX is certainly not using x87 because of the advantages of extended precision. The original PPU hardware only had 32-bit single precision floating point, not even 64-bit double precision, let alone the extended 80-bit precision of x87. In fact, PhysX probably only uses single precision on the GPU, since it is accelerated on the G80, which has no double precision. All of the evidence suggests that PhysX only needs single precision.
PhysX is certainly not using x87 because it contains legacy x87 code. Nvidia has the source code for PhysX and can recompile at will.
PhysX is certainly not using x87 because of a legacy installed base of older CPUs. Any gaming system purchased since 2005 will have SSE2 support, and the PPU was not released until 2006. Ageia was bought by Nvidia in 2008, and almost every CPU sold since then (except for some odd embedded ones) has SSE2 support. PhysX is not targeting the embedded x86 market either; it’s designed for games.
The truth is that there is no technical reason for PhysX to be using x87 code. PhysX uses x87 because Ageia and now Nvidia want it that way. Nvidia already has PhysX running on consoles using the AltiVec extensions for PPC, which are very similar to SSE. It would probably take about a day or two to get PhysX to emit modern packed SSE2 code, and several weeks for compatibility testing. In fact for backwards compatibility, PhysX could select at install time whether to use an SSE2 version or an x87 version – just in case the elusive gamer with a Pentium Overdrive decides to try it.
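The install-time selection suggested above is simple in principle. The following is a minimal sketch, not anything taken from PhysX: it assumes a Linux-style `/proc/cpuinfo` listing as the source of CPU feature flags, and the library file names are hypothetical.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def choose_build(flags):
    """Pick which binary to install (file names are hypothetical)."""
    if "sse2" in flags:
        return "physics_sse2.dll"
    return "physics_x87.dll"

# Abbreviated sample listings, made up for this illustration.
modern_cpu = "model name : Example CPU\nflags : fpu mmx sse sse2 sse3\n"
ancient_cpu = "model name : Pentium Overdrive\nflags : fpu\n"

modern_choice = choose_build(parse_cpu_flags(modern_cpu))
ancient_choice = choose_build(parse_cpu_flags(ancient_cpu))

print(modern_choice)
print(ancient_choice)
```

On Windows the installer would query the CPU through a different mechanism, but the decision itself remains a single feature check.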
But both Ageia and Nvidia use PhysX to highlight the advantages of their hardware over the CPU for physics calculations. In Nvidia’s case, they are also using PhysX to differentiate their products from AMD’s GPUs. The sole purpose of PhysX is to serve as a competitive differentiator that makes Nvidia’s hardware look good and sells more GPUs. Part of that is making sure that Nvidia GPUs look a lot better than the CPU, since that is what they claim in their marketing. Using x87 definitely makes the GPU look better, since the CPU will perform worse than if the code were properly generated to use packed SSE instructions.