Impressions of Kepler


In the last few years, the competition between Nvidia and AMD has been quite interesting. At the 40nm generation, Nvidia chose to focus the Fermi architecture on throughput computing, blazing a trail to a new market. The GF104 was a derivative of Fermi tailored a bit more towards graphics, forming the second prong of Nvidia’s strategy. In contrast, AMD’s Cayman made no compromises in pursuing graphics performance. Consequently, AMD held a clear advantage in the graphics market, while Nvidia was the only real option for GPU computing. At the 28nm node, it appears that Nvidia is continuing its existing strategy of diverging its graphics and compute products. This article gives a brief impression of Kepler and its focus on graphics and power management, followed by an analysis of Nvidia’s overall strategy and future compute products.

The Kepler architecture is the successor to the GF104/GF114 and swings the pendulum back, to focus more on graphics workloads, at the expense of general purpose computation. At a high level, the Kepler core (which Nvidia calls an SMX) is significantly larger than those found in GF104/GF114 (described as an SM). Each Kepler core can execute 192 single precision FMAs per cycle, four times the 48 FMAs in a GF104 core. The first Kepler product, the GTX 680 (or GK104) packs in 8 cores, the same as the GTX 560 (based on GF114). In essence, all the benefits of moving from 40nm to 28nm were used by Nvidia to design a much more powerful core. While a number of other aspects of the GPU changed, this article is chiefly concerned with the shader cores, rather than the fixed function graphics hardware.

Graphics First

The high level details of the Kepler core show a de-emphasis of general purpose workloads in favor of graphics. Kepler scaled up the computational capabilities of each core significantly, but the memory hierarchy did not keep pace. The first part of Table 1 compares the core computational and memory resources for Fermi, GF104/GF114, Kepler and AMD’s GCN. The term work-item is used according to the OpenCL definition. The term thread, which is not part of the OpenCL lexicon, refers here to the microarchitectural unit of execution and control flow: a warp (with 32 work-items) for Nvidia and a wavefront (with 64 work-items) for AMD. As Table 1 indicates, Nvidia’s cores have a unified L1 data cache and shared memory, while in GCN the two structures are separate. The second part of Table 1 compares the memory resources available to each work-item, and the bandwidth per FLOP provided by the cache.

Looking at the cache and shared memory available for each work-item, Kepler decreases capacity by 25% compared to the previous generations. The difference is about 10B, or two 32-bit data values. While this will hurt compute performance, it is not a huge change to the balance of resources. Moreover, Kepler’s register file capacity per work-item has increased by about 50% compared to both Fermi and GF104. For graphics, this should more than make up for the decrease in shared memory and data cache.
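The per-work-item percentages above can be reproduced with a short calculation. The Python sketch below is only an illustration. The per-core capacities and the maximum number of resident work-items are assumptions taken from public specifications, not from the text: 64KB of L1/shared memory on every core, a 128KB register file with 1536 work-items on Fermi and GF104, and a 256KB register file with 2048 work-items on Kepler.

```python
# Per-core figures are assumptions from public specifications, not from the
# article text: bytes of unified L1/shared memory, bytes of register file,
# and the maximum number of resident work-items per core.
CORES = {
    "Fermi":  {"l1_shared": 64 * 1024, "regfile": 128 * 1024, "work_items": 1536},
    "GF104":  {"l1_shared": 64 * 1024, "regfile": 128 * 1024, "work_items": 1536},
    "Kepler": {"l1_shared": 64 * 1024, "regfile": 256 * 1024, "work_items": 2048},
}


def bytes_per_work_item(core, resource):
    """Capacity of one resource divided evenly across all resident work-items."""
    spec = CORES[core]
    return spec[resource] / spec["work_items"]


def relative_change(new, old):
    """Fractional change from old to new (negative means a decrease)."""
    return (new - old) / old


if __name__ == "__main__":
    for resource in ("l1_shared", "regfile"):
        old = bytes_per_work_item("GF104", resource)
        new = bytes_per_work_item("Kepler", resource)
        change = relative_change(new, old)
        print(f"{resource}: {old:.1f}B -> {new:.1f}B ({change:+.0%})")
```

Under these assumptions, cache and shared memory fall from about 42.7B to 32B per work-item (a 25% decrease), while register capacity rises from about 85.3B to 128B (a 50% increase). The calculation assumes every core runs at its maximum occupancy, so real kernels that launch fewer work-items would see more capacity each.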

Table 1. GPU Core Computational and Memory Resources

The real change lies with the bandwidth from the data cache and shared memory, which is critical for shared data. The graphics pipeline is constructed to generate largely independent chunks of work, whether it is pixel or vertex shading. There is little communication between different work-items or work-groups, and almost all data stays in the private register file. General purpose workloads, however, are an entirely different beast. The vast majority of efficient HPC algorithms rely on sharing data, and truly general purpose workloads lean on shared data even more heavily. For instance, in fluid dynamics the behavior at a given point typically depends on neighboring regions. In practice, most workloads need to communicate a certain amount of shared data for every computation, which is expressed as a ratio of bytes/FLOP. When a workload requires more communication bandwidth than the hardware provides, performance suffers accordingly.
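To make the bytes/FLOP argument concrete, here is a minimal roofline-style sketch in Python. It is a deliberately simplified model, assuming that shared-data bandwidth is the only possible bottleneck. The function name and the example ratios are hypothetical and do not describe any particular GPU or workload.

```python
def attainable_fraction(hw_bytes_per_flop, workload_bytes_per_flop):
    """Fraction of peak FLOP/s reachable when shared-data bandwidth is the only limit.

    hw_bytes_per_flop: shared-data bandwidth the core supplies per FLOP.
    workload_bytes_per_flop: shared-data traffic the workload needs per FLOP.
    """
    if workload_bytes_per_flop <= 0:
        return 1.0  # no shared-data traffic, so bandwidth never limits
    return min(1.0, hw_bytes_per_flop / workload_bytes_per_flop)


if __name__ == "__main__":
    # Hypothetical workload that needs 0.5 bytes of shared data per FLOP.
    need = 0.5
    for supply in (1.0, 0.25):  # hypothetical hardware ratios
        print(f"supply {supply} B/FLOP -> {attainable_fraction(supply, need):.0%} of peak")
```

In this model, hardware that supplies more bytes/FLOP than the workload needs runs at full speed, while hardware that supplies half of what is needed reaches only half of its peak throughput.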

The shared data bandwidth for the Kepler core is 0.33B/FLOP with 32-bit accesses, just half that of GF104. But the standard for general purpose workloads is not GF104. Fermi has 3× the shared data bandwidth of Kepler (1B/FLOP). In comparison, AMD’s GCN has 1.5B/FLOP, demonstrating the advantages of a separate L1 data cache and local data share (LDS). The significant regression in communication bandwidth is one of the clearest signs that Nvidia has backed away from compute workloads in favor of graphics for Kepler. Note that with 64-bit accesses, the shared data bandwidth is actually 256B/cycle, which works out to 0.66B/FLOP (hence the asterisk in Table 1). However, existing CUDA programs are almost exclusively written with 32-bit accesses because earlier designs were fairly slow for 64-bit accesses.
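The ratios quoted above follow from dividing bytes per cycle by FLOPs per cycle, counting each FMA as two FLOPs. The Python sketch below is a worked check rather than an authoritative table. Only Kepler’s 192 FMAs/cycle, its 256B/cycle figure for 64-bit accesses, and GF104’s 48 FMAs/cycle come from the text. The remaining inputs are assumptions chosen to be consistent with the quoted ratios and are marked as such in the comments.

```python
FLOPS_PER_FMA = 2  # a fused multiply-add counts as one multiply plus one add


def bytes_per_flop(bytes_per_cycle, fmas_per_cycle):
    """Shared-data bandwidth of one core, expressed per single precision FLOP."""
    return bytes_per_cycle / (fmas_per_cycle * FLOPS_PER_FMA)


# name -> (shared-data bytes per cycle, single precision FMAs per cycle)
CORES = {
    # 192 FMAs is from the text; 128B/cycle is assumed to be half the 64-bit rate.
    "Kepler (32-bit)": (128, 192),
    # Both figures are from the text.
    "Kepler (64-bit)": (256, 192),
    # 48 FMAs is from the text; 64B/cycle is an assumption.
    "GF104": (64, 48),
    # Assumption: 32 FMAs and 64B/cycle per Fermi core.
    "Fermi": (64, 32),
    # Assumption: 64 FMAs, with 192B/cycle from the L1 data cache and LDS combined.
    "GCN": (192, 64),
}

if __name__ == "__main__":
    for name, (bandwidth, fmas) in CORES.items():
        print(f"{name:16s} {bytes_per_flop(bandwidth, fmas):.2f} B/FLOP")
```

With these inputs the calculation gives one third of a byte per FLOP for Kepler with 32-bit accesses and two thirds for both GF104 and Kepler with 64-bit accesses, which the text rounds to 0.33 and 0.66, along with 1B/FLOP for Fermi and 1.5B/FLOP for GCN.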

The other architectural change that favors graphics is simplified scheduling. The JIT in Kepler’s graphics driver is now responsible for scheduling instructions that can execute without any register dependencies. The cores have eliminated register dependency analysis, although there is still scoreboarding for long (or unpredictable) latency instructions such as memory accesses. This approach saves power and area for graphics, which is relatively easy for a compiler to optimize. However, general purpose workloads are far less predictable and benefit from more dynamic scheduling; there is a reason that Fermi had such hardware in the first place.
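The idea of moving register dependency tracking from hardware into the compiler can be illustrated with a toy model. The Python sketch below is not Nvidia’s actual mechanism. It assumes in-order issue of one instruction per cycle, tracks only read-after-write dependencies, and uses hypothetical fixed latencies that the compiler is presumed to know. The `Instr` type and `static_issue_cycles` function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction
    latency: int   # cycles until dest is readable (assumed known at compile time)


def static_issue_cycles(program):
    """Return the earliest in-order issue cycle for each instruction.

    The compiler can precompute these cycles, so the hardware does not need
    to check register dependencies when it issues each instruction.
    """
    ready_at = {}     # register -> cycle at which its value becomes readable
    cycles = []
    next_cycle = 0    # in-order issue, at most one instruction per cycle
    for ins in program:
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        issue = max(next_cycle, operands_ready)
        cycles.append(issue)
        ready_at[ins.dest] = issue + ins.latency
        next_cycle = issue + 1
    return cycles


if __name__ == "__main__":
    program = [
        Instr("r0", ("a", "b"), 4),    # independent
        Instr("r1", ("c", "d"), 4),    # independent
        Instr("r2", ("r0", "r1"), 4),  # must wait for both results
    ]
    print(static_issue_cycles(program))
```

The scheme only works when latencies are predictable. An instruction whose latency is unknown at compile time, such as a memory access, cannot be given a fixed issue cycle this way, which is consistent with the hardware retaining a scoreboard for those cases.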
