Impressions of Kepler


Power Management

Another major change in Kepler is the clocking and power management. Prior designs had two clock frequencies. There was a ‘graphics clock’ for fixed function hardware, such as the ROPs or L2 cache, that ran at around 600-800MHz. The ‘hot clock’ for the cores (or SMs) ran at twice that frequency.

Kepler eliminates the hot clock and moves to a single frequency for the majority of the GPU. Halving the core frequency directly reduces clocking power and eliminates many of the clocked storage elements (e.g. latches) needed to pipeline for high frequencies. It also means that the cores can be implemented with denser and more power efficient transistors. With more compact logic, each Kepler core can pack in more execution units to make up for the lost frequency. In conjunction with the shrink to 28nm and simpler control logic, the number of execution pipelines in each core quadrupled. Additionally, the entire shader array is power gated to eliminate leakage.

More significantly, Kepler is Nvidia’s first GPU with Dynamic Voltage and Frequency Scaling (DVFS). For CPUs, this technique has been standard for nearly 5 years. Sandy Bridge uses DVFS for both the CPU and GPU, while Llano’s DVFS is restricted to the CPU cores.

Kepler’s DVFS is a relatively simple, platform level approach. The GPU power is directly measured at the VRMs. The driver is responsible for selecting a frequency and voltage pair, based on the measured power and temperature and a limit that the user can set (which should be welcomed by overclockers). Most DVFS systems do not measure current or power directly, because external measurements have very high latency. Instead, on-die performance counters are used to quickly estimate the switching activity and power draw.
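The driver’s selection policy can be sketched roughly as follows. This is a hypothetical illustration, not Nvidia’s actual implementation: the operating points, voltages and limits are invented for the example, with only the 13MHz step size taken from the published figures.

```python
# Hypothetical sketch of a platform-level DVFS control loop: the driver reads
# power (from the VRMs) and temperature, then steps between operating points.
# All voltages and limits below are invented for illustration.

# Candidate (frequency MHz, voltage V) operating points, lowest to highest,
# in 13MHz bins.
P_STATES = [(1006, 1.05), (1019, 1.06), (1032, 1.07), (1045, 1.08),
            (1058, 1.09), (1071, 1.10), (1084, 1.11), (1097, 1.12),
            (1110, 1.13)]

def select_p_state(measured_power_w, temp_c, power_limit_w, temp_limit_c,
                   current_index):
    """Step down one bin if over a limit, otherwise step up one bin."""
    if measured_power_w > power_limit_w or temp_c > temp_limit_c:
        return max(current_index - 1, 0)
    if current_index < len(P_STATES) - 1:
        return current_index + 1
    return current_index
```

Because the power measurement is external, each iteration of this loop runs at the (roughly 100ms) latency of the VRM readings rather than at frame granularity.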

The base frequency for the GTX 680 is 1.006GHz, which is about 20% faster than the GTX 560. The DVFS can increase the clock in increments of 13MHz, up to 1.11GHz, and Nvidia claims that the average frequency is 1.058GHz, a 5% boost. The downside of measuring at the VRM is that the latency of the control loop is fairly high, around 100ms. To put that in context, rendering a frame at 30 or 60FPS translates into roughly 33ms and 17ms respectively. Early testing from Scott Wasson indicates that the high latency can be problematic, although a driver fix is expected to reduce or eliminate this issue. Stepping back, it will be interesting to see the impact of Nvidia’s DVFS on notebook GPUs, which are considerably more power constrained and should see greater benefits.
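As a sanity check, the bin count and frame budgets work out as follows (a quick back-of-the-envelope calculation using the figures above):

```python
# Boost bins: 1.006GHz base, 13MHz steps, 1.11GHz ceiling (article figures).
base_mhz, step_mhz, max_mhz = 1006, 13, 1110
bins = (max_mhz - base_mhz) // step_mhz          # 8 steps of 13MHz
assert base_mhz + bins * step_mhz == max_mhz     # exactly reaches 1110MHz

avg_mhz = 1058
boost_pct = 100 * (avg_mhz - base_mhz) / base_mhz  # ~5.2% average boost

# Frame budgets: 30FPS and 60FPS allow ~33ms and ~17ms per frame, both well
# under the ~100ms latency of the VRM-based control loop.
frame_ms_30 = 1000 / 30
frame_ms_60 = 1000 / 60
```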

This is Nvidia’s first implementation of DVFS for a GPU, so it is understandable that it would be relatively simple. Future versions will undoubtedly be more sophisticated and shift towards on-die power estimation to reduce the latency of the control loop. It is also possible that the actual power management decisions will eventually move from the driver into hardware, although the driver retains an advantage when dealing with multi-GPU configurations, since it maintains visibility into all the GPUs.


Conclusion

Kepler is Nvidia’s first new graphics architecture in several years. As the successor to the GF104/GF114, it will form the basis of graphics products for the next few years on TSMC’s 28nm process.

The GTX 680 is the first implementation and demonstrates excellent results. The aggregate single precision shader performance is 3TFLOP/s at the base frequency, about 2.4× faster than the GTX 560. Perhaps the most encouraging part of the story is that Kepler appears to be remarkably area and power efficient. The GTX 680 is a 195W TDP card and the GPU packs 3.54B transistors into 294mm2. Previous generations from Nvidia were quite inefficient, likely due to the focus on general purpose computational workloads. The first Kepler products significantly improve GFLOP/s/W and GFLOP/s/mm2 beyond simple process technology scaling, which bodes well for the architecture.
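The efficiency metrics implied by those figures work out roughly as follows (a back-of-the-envelope sketch using only the numbers quoted above):

```python
# Efficiency figures implied by the GTX 680 numbers in the text.
sp_gflops = 3000.0      # 3TFLOP/s single precision at the base clock
tdp_w = 195.0           # 195W TDP
die_mm2 = 294.0         # 294mm2 die

gflops_per_watt = sp_gflops / tdp_w    # ~15.4 GFLOP/s/W
gflops_per_mm2 = sp_gflops / die_mm2   # ~10.2 GFLOP/s/mm2
```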

As an added bonus, Nvidia’s memory interfaces seem to have finally matured. For the last 4-5 years, memory was a persistent weakness in every Nvidia GPU and a significant competitive disadvantage. Simply put, Nvidia never seemed to be capable of designing high speed I/Os. The GTX 680’s GDDR5 memory has caught up with and exceeded AMD, reaching 6GT/s. While this is only the first 28nm product, it is certainly a positive sign.

The catch is that the Kepler core is a poor fit for compute applications. The excellent efficiency for graphics has undoubtedly come at the cost of general purpose workloads. As our analysis showed, Nvidia’s architects made a conscious choice to quadruple the FLOPs for each core, but only double the bandwidth for shared data. The result is that the older Fermi generation is substantially better suited to general purpose workloads and will continue to be preferred for many applications.

The real question is whether future compute products will actually use the Kepler core. The truth is that Nvidia is not backing away from using GPUs for general purpose workloads, even as the market becomes increasingly competitive. AMD’s GCN is an excellent fit in that market and dramatically outperforms any of Nvidia’s offerings. Intel is also expected to release the 22nm Knights Corner at the end of 2012 or early 2013, which will be an excellent option by virtue of inheriting the existing x86 ecosystem. Nvidia has invested extensively in software for compute products (e.g. CUDA), and that should tide the company over temporarily. However, going after general purpose workloads with a graphics optimized design in a highly competitive market is unlikely to succeed. Nvidia’s ecosystem is certainly good, but Intel’s resources and depth of experience are unquestionably greater.

Given this situation, it seems highly likely that Nvidia’s upcoming compute products will use a core that is tuned for general purpose workloads. It will be a derivative of Kepler, to re-use as much of the engineering effort as possible, but with several significant changes.

The first is floating point performance: the GK104 cores are 24× slower for double precision than for single precision. This is fine for graphics, where double precision exists mainly for compatibility. However, Fermi was half speed and Nvidia needs at least 1TFLOP/s for a competitive product. Second, the cores need to be rebalanced to achieve a better B/FLOP ratio through shared memory and the data caches. The easiest approach is probably cutting the number of execution units per core in half and scaling up the core count. It is possible that Nvidia will add more scheduling hardware, to avoid relying on the compiler, but that seems like a rather large investment with an unclear return. Last, the register files, caches, shared memory and DRAM need error protection in the form of ECC.
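The size of the double precision gap can be quantified from the figures above (a rough back-of-the-envelope calculation; the 1TFLOP/s target is the estimate stated in the text):

```python
# Implied double precision throughput, using the figures in the text.
sp_tflops = 3.0            # GK104 single precision peak
gk104_dp_ratio = 1 / 24    # GK104: DP runs at 1/24 the SP rate
fermi_dp_ratio = 1 / 2     # Fermi: DP at half the SP rate

gk104_dp_gflops = sp_tflops * 1000 * gk104_dp_ratio   # 125 GFLOP/s
# Against a ~1TFLOP/s DP target, GK104 falls short by roughly 8x.
shortfall = 1000 / gk104_dp_gflops
```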

In summary, Kepler and the GK104 represent a tremendous milestone for Nvidia. The design eliminates the efficiency flaws found in previous generations, and demonstrates good memory bandwidth, the start of a DVFS strategy and robust execution at 28nm. The graphics performance is excellent, with no compromises, and the power and area efficiency are attractive. This success stems from tuning the Kepler core and GK104 almost exclusively for graphics. Going forward, it appears that Nvidia’s strategy will rely on two divergent designs: one specialized for graphics and the other for compute workloads. In the next few months, it should become apparent how this will play out, but for now the graphics side looks good.

