Our GPU Performance Model
The data set we will use comes from notebookcheck.net and includes a wide variety of notebook GPUs. For performance, we will use the graphics score of 3DMark Vantage. The benchmark data is for the performance profile at 1280×1024 resolution and excludes any physics results. The data set originally included ~120 different DX10+ notebook GPUs. We corrected a number of data errors and also removed several results to simplify our work. Specifically, we did not consider any multi-GPU configurations (SLI or Crossfire) and we also eliminated any integrated graphics solutions (e.g. Intel’s IGPs).
In general, simplicity is a virtue. While complex models are typically more accurate, it is best to use the simplest model that achieves a sufficient level of precision. This is doubly true when the data is of slightly uncertain provenance. In our case, the performance data seems reasonable, but there is a great deal of information that is unavailable. For instance, the notebook CPU, chipset, BIOS, graphics driver version and cooling solution would all be useful for forming a better picture.
As we previously mentioned, graphics workloads are inherently parallel, so it is fairly safe to assume that performance scales directly with core count, and that the parallel scaling factor is ~1. Graphics workloads are predominantly single precision floating point, with some integer arithmetic as well. For our model, we will primarily use the theoretical single precision GFLOP/s of the shader array as a proxy for performance. GFLOP/s encompasses the number of shader cores, the frequency and the peak number of floating point operations per cycle in each shader core.
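As a rough illustration of how the theoretical figure is put together, the sketch below multiplies the three ingredients named above. The function name, the parameter names and the example part are hypothetical, and the default of 2 operations per core per cycle assumes one fused or combined multiply-add per cycle, which does not hold for every microarchitecture.

```python
def peak_gflops(shader_cores: int,
                shader_clock_mhz: float,
                flops_per_core_per_cycle: float = 2.0) -> float:
    """Theoretical single precision GFLOP/s of a shader array.

    Assumption: every core issues `flops_per_core_per_cycle` floating point
    operations on every cycle. Real utilization is lower and varies by
    microarchitecture, which is exactly what this figure ignores.
    """
    # cores * MHz * ops/cycle gives MFLOP/s; divide by 1000 for GFLOP/s.
    return shader_cores * shader_clock_mhz * flops_per_core_per_cycle / 1000.0


# Hypothetical part, not taken from the data set: 96 cores at 1200 MHz.
example = peak_gflops(shader_cores=96, shader_clock_mhz=1200.0)
print(f"{example:.1f} GFLOP/s")
```

Doubling either the core count or the clock doubles the result, which is the linear scaling the model assumes.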
GFLOP/s does not account for the actual IPC, nor does it account for the number of instructions that the driver produces for the benchmark. The graphics cards from Nvidia are based on the G80, GT200, Fermi and GF104. These microarchitectures are fairly similar and should have very close utilization of the shader core. The AMD GPUs range from the older RV670 up to Cypress and Cayman. To ensure that the AMD microarchitectures are all relatively similar, we omitted two results that were based on Cayman. Since Cayman uses a VLIW4 instead of a VLIW5 shader core, its utilization will be substantially different from the previous generations and would throw off our model. For more accuracy, we should probably separate each microarchitecture into its own data set, but that turns out to be unnecessary. Removing Cayman is sufficient for the moment.
Most significantly, GFLOP/s neglects the important role of memory bandwidth and the fixed function hardware. To some extent, our model is implicitly assuming that AMD and Nvidia will scale up the memory bandwidth, texture units and ROPs in tandem with the shader array. Overall, that is a fairly good assumption for discrete single GPU solutions. However, it is manifestly not the case for integrated graphics, which often has vastly less bandwidth than a discrete GPU; similarly, multiple GPUs often have a different balance of memory bandwidth to compute. This is one reason why we removed such data from our analysis.
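The exclusions described so far (multi-GPU configurations, integrated graphics and Cayman-based parts) amount to a simple filter over the raw results. The sketch below is one way to express it; the record layout, field names and placeholder entries are assumptions made for illustration and are not the actual data set.

```python
from dataclasses import dataclass


@dataclass
class GpuResult:
    # Hypothetical record layout; the real data set may be organized differently.
    name: str
    microarchitecture: str
    gpu_count: int
    integrated: bool


# Microarchitectures dropped because shader utilization differs too much.
EXCLUDED_MICROARCHITECTURES = {"Cayman"}


def keep_for_model(result: GpuResult) -> bool:
    """Keep only discrete, single-GPU results from comparable microarchitectures."""
    return (
        result.gpu_count == 1
        and not result.integrated
        and result.microarchitecture not in EXCLUDED_MICROARCHITECTURES
    )


# Placeholder entries, one per exclusion rule plus one that survives.
results = [
    GpuResult("placeholder discrete", "Cypress", 1, False),
    GpuResult("placeholder dual GPU", "Fermi", 2, False),
    GpuResult("placeholder integrated", "IGP", 1, True),
    GpuResult("placeholder VLIW4", "Cayman", 1, False),
]
filtered = [r for r in results if keep_for_model(r)]
print([r.name for r in filtered])
```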
Figure 1 – 3DMark Performance versus GFLOP/s for Notebook GPUs
One of the advantages of a simple model with a single variable is visualization. Figure 1 shows a scatter plot of single precision GFLOP/s against the 3DMark score. Fittingly, Nvidia’s graphics cards are picked out as green squares, while AMD’s are shown as red triangles. A more complicated model with several variables would be much harder to display and a bit less intuitive to explain. The scatter plot also includes two linear regressions and their R² coefficients, one for the AMD cards and one for Nvidia.
Simply looking at the data, our model appears to be fairly good. The benchmark scores scale quite closely with GFLOP/s. The regression lines fit quite well, an excellent sign. Without diving too far into statistics, the R² coefficients are very close to 1. This means that, statistically speaking, our model explains most of the variation in benchmark scores (97% for Nvidia and 96% for AMD) by simply looking at the theoretical shader throughput. It appears that single precision GFLOP/s is actually a pretty good measure of performance for a given microarchitecture.
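For readers who want to see what sits behind a regression line and its R² coefficient, here is a minimal sketch of an ordinary least-squares fit with one variable. It assumes the regressions in Figure 1 are plain least-squares lines, which the text does not state explicitly. The function names are our own, and the sample points are synthetic placeholders, not values from the notebook GPU data set.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic placeholder points (GFLOP/s, benchmark score), for illustration only.
gflops = [100.0, 200.0, 300.0, 400.0]
scores = [2100.0, 3900.0, 6100.0, 7900.0]

slope, intercept = fit_line(gflops, scores)
r2 = r_squared(gflops, scores, slope, intercept)
print(f"score = {slope:.2f} * GFLOP/s + {intercept:.2f}, R^2 = {r2:.4f}")
```

Fitting each vendor's points separately, as the figure does, simply means calling `fit_line` once per vendor. An R² of 0.97 then says that 97% of the variance in that vendor's scores is accounted for by the line.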