Memory Bandwidth and GPU Performance


GPU Performance Mysteries

In a recent article, we showed how to predict the performance of AMD and Nvidia GPUs. We built a straightforward model that uses the computational capabilities of the shader array (i.e. single precision GFLOP/s) to predict 3DMark Vantage GPU performance results. To keep things simple, we left out any multi-GPU solutions and also avoided dealing with AMD’s newer Cayman architecture. Part of the motivation for leaving out AMD’s VLIW4 designs is that their performance characteristics are very different from the previous VLIW5 architecture. The model makes reasonably accurate predictions. We tested it by predicting performance for several desktop GPUs, and the estimates were fairly close – typically within 6%. Considering how little information about a GPU the model requires, that accuracy is very good.

Figure 1 – AMD GPU Performance

Looking at the data though, our model still leaves a lot of room for improvement. In particular, there are a number of very mysterious results, where GPUs with similar GFLOP/s in the shader core are achieving very different performance results.

Figure 2 – Nvidia GPU Performance

Figures 1 and 2 show our data set (in blue) and highlight several of these unexplained points for AMD and Nvidia GPUs (in slightly larger brown, green, red and orange squares). On the AMD side, there is one pair of cards at ~425 GFLOP/s, another pair at 800 GFLOP/s, and a third pair at ~900 and 1000 GFLOP/s. The plot for Nvidia shows even more odd results – there is a triplet of cards with 192 GFLOP/s, a pair near 150 GFLOP/s, two near 250 GFLOP/s and then another pair near 280 GFLOP/s. It is important to note that for Nvidia GPUs, we are ignoring the so-called ‘missing MUL’, so our GFLOP/s rating is substantially different from Nvidia’s official marketing numbers.

In some cases, the GPU with the lower GFLOP/s actually delivers the better performance – which is totally counter-intuitive. The first pair of AMD GPUs illustrates this behavior perfectly. The shader arrays provide 432 and 422 GFLOP/s respectively, but the first card only scores 2552 on 3DMark, while the second scores a significantly higher 3463. The second card has ~2% less shader compute, but 36% higher performance. This behavior is hardly isolated to AMD cards either. Three Nvidia GPUs have 192 GFLOP/s of throughput in their shader arrays. Two of these cards score 3700 and 3374, while the third manages a disappointing 2527. Despite having the same theoretical throughput, the fastest of the three is 46% faster than the slowest.
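The percentages above follow directly from the quoted figures. As a quick check, here is a minimal Python sketch; the helper name `percent_change` and the variable names are our own, and only the GFLOP/s and 3DMark numbers come from the text.

```python
def percent_change(baseline: float, value: float) -> float:
    """Return how much `value` differs from `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Figures quoted in the paragraph above.
amd_gflops = (432.0, 422.0)               # first and second AMD card
amd_scores = (2552.0, 3463.0)             # matching 3DMark scores
nvidia_scores = (3700.0, 3374.0, 2527.0)  # three cards, all at 192 GFLOP/s

# Second AMD card relative to the first.
amd_compute_delta = percent_change(amd_gflops[0], amd_gflops[1])
amd_score_delta = percent_change(amd_scores[0], amd_scores[1])

# Fastest Nvidia card relative to the slowest at equal shader throughput.
nvidia_spread = percent_change(min(nvidia_scores), max(nvidia_scores))

print(f"AMD shader compute: {amd_compute_delta:+.1f}%")   # -2.3%
print(f"AMD 3DMark score:   {amd_score_delta:+.1f}%")     # +35.7%
print(f"Nvidia best vs worst at equal GFLOP/s: {nvidia_spread:+.1f}%")  # +46.4%
```

Rounded, these match the ~2%, 36% and 46% figures quoted in the text.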

What could be responsible for these mysterious and seemingly contradictory results? Looking at the basic architecture of a GPU like AMD’s Cayman, the shader array is just one part of the design. Admittedly it is perhaps the most important, but modern GPUs contain a variety of other hardware, including fixed-function units such as the triangle setup engine, texture caches and sampling units, raster output pipelines (ROPs) and the memory controllers, while also relying on the driver software. Of these different areas, the most critical to performance are the memory controllers and physical interfaces to DRAM. 3D graphics is an incredibly bandwidth-hungry workload – to the point that high-end GPUs use bandwidth-optimized GDDR5 DRAM rather than the less expensive DDR3 used for system memory. Note that in modern GPUs, each memory interface typically has its own ROPs – so to some extent, memory bandwidth also accounts for some of the fixed-function hardware.
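For readers who want to put a number on "available bandwidth", theoretical peak DRAM bandwidth is conventionally the interface width in bytes multiplied by the effective per-pin transfer rate. The sketch below is only an illustration of that arithmetic: the function name and both example configurations are hypothetical and are not taken from our data set.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gt_s: float) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    bus_width_bits: total width of the memory interface, in bits.
    data_rate_gt_s: effective transfer rate per pin, in GT/s.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_gt_s


# Hypothetical configurations, chosen only to show how far apart two
# memory subsystems can be. They do NOT describe specific cards.
narrow_slow = peak_bandwidth_gb_s(bus_width_bits=128, data_rate_gt_s=1.6)
wide_fast = peak_bandwidth_gb_s(bus_width_bits=256, data_rate_gt_s=4.0)

print(f"128-bit @ 1.6 GT/s: {narrow_slow:.1f} GB/s")
print(f"256-bit @ 4.0 GT/s: {wide_fast:.1f} GB/s")
print(f"Ratio: {wide_fast / narrow_slow:.1f}x")
```

Two cards with identical shader arrays could therefore differ several-fold in peak bandwidth, depending on bus width and memory type.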

So our initial guess is that when two similar GPUs have substantially different performance, the real cause is the memory interfaces and available bandwidth. This seems eminently reasonable, especially since most CPU performance models also recognize the critical importance of memory in determining the behavior of a workload.
