All the advances in programmability are interesting, but fundamentally rely on Nvidia’s software team to unlock them for developers.
The standard APIs are obvious candidates here – Fermi is designed for the major ones: DX11 and DirectCompute on Windows, OpenCL and OpenGL for the rest of the world. OpenCL 1.0 is relatively nascent, having been only recently finalized, and DX11 and DirectCompute are not yet out. While these are unquestionably the future for GPUs, OpenCL and DirectCompute lack many of the niceties that Nvidia offers with the proprietary CUDA environment and APIs.
CUDA is generally focused on providing language-level support for GPUs. This makes sense, as it leverages developers' existing familiarity. But the reality is that the languages CUDA supports are variants of the originals, with proprietary extensions and only a subset of the full facilities of each language. Currently, Nvidia has CUDA-fied C and Fortran, and in the future with Fermi, they will have a version of C++. Nvidia's marketing is making ridiculous claims that they will eventually have Python and Java support, but the reality is that neither language can run natively on a GPU. An interpreted language such as Python would kill performance, so what is likely meant is that Python and Java will be able to call libraries which are written to take advantage of CUDA.
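To make those proprietary extensions concrete, here is a minimal sketch of what CUDA C looks like: the `__global__` qualifier, the built-in `threadIdx`/`blockIdx` variables, and the `<<<...>>>` launch chevrons are all Nvidia additions to C that only the nvcc compiler understands. The kernel, array size and launch dimensions are illustrative, not taken from any Nvidia sample.

```cuda
// saxpy.cu -- illustrative sketch of CUDA's extensions to C.
#include <cuda_runtime.h>

// __global__ marks a function that runs on the GPU, one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // threadIdx/blockIdx/blockDim are CUDA built-ins, not standard C.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1024;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));  // device-side allocations
    cudaMalloc((void **)&y, n * sizeof(float));
    // ... initialize x and y on the device ...

    // The chevron launch syntax (4 blocks of 256 threads) is CUDA-only.
    saxpy<<<4, 256>>>(n, 2.0f, x, y);
    cudaThreadSynchronize();  // wait for the kernel to finish

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

None of this compiles as plain C or C++, which is exactly the point: the language is familiar, but the toolchain is Nvidia's.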
Despite being proprietary, the ecosystem that Nvidia is creating for CUDA developers is promising. While it’s not the rich ecosystem of x86, ARM or PPC, it is miles ahead of OpenCL or DirectCompute. Some of the tools include integration with Visual Studio and GDB, a visual profiler, improved performance monitoring with Fermi, and standard binary formats (ELF and DWARF). Nvidia also has their own set of libraries, which can now be augmented with 3rd party libraries that are called from the GPU.
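As a sketch of what using those first-party libraries looks like, host code of this era could call CUBLAS through its original C interface; the example below uses the legacy `cublasInit`/`cublasSdot` entry points (the array contents and sizes are illustrative).

```cuda
// Host-only sketch: a dot product via the legacy CUBLAS C API.
#include <cublas.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;
    float h_x[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_y[] = {1.0f, 1.0f, 1.0f, 1.0f};
    float *d_x, *d_y;

    cublasInit();                                  // start the CUBLAS runtime
    cublasAlloc(n, sizeof(float), (void **)&d_x);  // device allocations
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // copy host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    float dot = cublasSdot(n, d_x, 1, d_y, 1);     // BLAS level-1 on the GPU
    printf("dot = %f\n", dot);

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}
```

The appeal is that the developer writes no kernels at all; the library hides the device code, which is also the model by which "Python and Java support" would most plausibly work.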
Now that standards-based alternatives such as OpenCL exist, CUDA is likely to see slower uptake. Many customers have learned to avoid solutions from a single source; IBM, for example, insisted on a second source for Intel's x86 chips in the original PC. But CUDA will retain strategic importance to Nvidia as a way to set the pace for OpenCL and DirectCompute.
Fermi’s architecture is a clear move towards greater programmability and GPU-based computing. There is a laundry list of new features, all of which will enable Fermi, when it is released, to make greater inroads into the relatively high-margin HPC space. Some of the more notable changes include an updated programming and memory model, semi-coherent caches, improved double precision performance and better IEEE compliance. It’s clear that Nvidia is making a multi-generation investment to push GPU computing at the high end, although we will have to wait until products arrive to gauge the reception.
Since there are no details on products, many key performance aspects are unknown. Frequency is likely in the same range (+/-30%) as GT200, and the GDDR5 will probably run between 3.6-4.0GT/s, but power and cooling are unknown and could be anywhere from 150-300W. The bandwidth and capacity of a DDR3-based solution are also unknown. So from a performance standpoint, it’s very hard to make any meaningful comparison to AMD’s GPU, which is actually shipping. The shipping dates for graphics and compute products based on Fermi are also unclear, but late Q4 seems to be the earliest possible with low volumes, while actual volume won’t occur until 2010 – so evaluating performance will have to wait until then.
Perhaps the most significant demonstration of Nvidia’s commitment to compute is the fact that many of the new features are not particularly beneficial for graphics. Double precision is not terribly important for graphics, and while cleaning up the programming model is attractive, it’s hardly required. The real question is whether Nvidia has strayed too far from the path of graphics, which again can only be answered by observing and benchmarking real products across AMD’s, Nvidia’s and Intel’s line-ups; but the risk seems real, particularly given AMD’s graphics focus.
These are all important questions to ponder in the coming weeks, and they feed into the ultimate technical question – the fate of CPU and GPU convergence. Will the GPU be sidelined for graphics alone, or will it become, like the floating point coprocessor, an essential element of any system? Will it be integrated on-die, and to what extent will the discrete market remain? These are all hard to predict, but it’s clear that Nvidia is doubling down on the GPU as an integral element of the PC ecosystem for graphics and compute, and time will tell the rest.