While Intel’s GPUs have been fully programmable for many years, the interfaces for developers were quite limited. Intel’s QuickSync technology is used for programmable video processing, but is not truly meant for general purpose applications. The challenge for Ivy Bridge is to expose the underlying programmability to developers through industry standard APIs. Ivy Bridge targets DirectX 11.0, OpenGL 3 and OpenCL 1.1 at launch. OpenGL 4 should come later, through a driver update, but the hardware features are fully present. Compatibility with DX 11.1 and OpenCL 1.2 will have to wait for the next generation, Haswell, although the differences are not terribly profound.
These APIs in turn dictate a number of requirements and recommendations for the hardware. DirectX 11.0 mandates tessellation for graphics. The compute shader requirements of OpenCL and DirectX are relatively similar; notably, both provide a local memory space for communication within a work-group, along with barriers and atomic operations.
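The local-memory-plus-barrier pattern shared by both APIs can be illustrated with a toy model: each "work-item" is a thread, "local memory" is a shared buffer, and a barrier separates the write phase from the read phase. This is a conceptual sketch in Python, not OpenCL or HLSL; the function and variable names are illustrative, not part of any real API.

```python
import threading

def workgroup_sum(values, group_size):
    """Toy model of a work-group reduction using local memory and a barrier."""
    local_mem = [0] * group_size              # stand-in for __local memory
    barrier = threading.Barrier(group_size)   # stand-in for a work-group barrier
    result = []

    def work_item(lid):
        local_mem[lid] = values[lid]          # each work-item writes its slot
        barrier.wait()                        # all writes visible past this point
        if lid == 0:                          # one item reduces after the barrier
            result.append(sum(local_mem))

    threads = [threading.Thread(target=work_item, args=(i,))
               for i in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

print(workgroup_sum([1, 2, 3, 4], 4))  # -> 10
```

Without the barrier, work-item 0 could read slots that other items had not yet written; the barrier is exactly what the local memory space in OpenCL and DirectCompute exists to support.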
While double precision floating point is optional for OpenCL and DirectCompute, it is a hallmark of programmable GPUs. There are simply too many algorithms, languages and applications that rely on double precision, and it is also a matter of compatibility. The value of an interface like OpenCL or DirectCompute is that programmers can write functionally correct (although perhaps not the highest performance) code for many different platforms. Omitting double precision runs contrary to that goal and creates headaches for developers. Intel wisely added double precision floating point to the Ivy Bridge GPU, primarily for compatibility rather than raw performance. For applications that need substantial double precision performance, solutions like Sandy Bridge-EP paired with MIC (the successor to Larrabee) or GPUs from AMD or Nvidia are more sensible.
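Why correctness, not just speed, hinges on double precision is easy to demonstrate: repeatedly accumulating a value that is not exactly representable drifts badly in single precision. The sketch below rounds each intermediate result to IEEE 754 single precision (via `struct`) to mimic a float32-only GPU; the helper name `to_f32` is ours, not a standard function.

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

N = 1_000_000
single = 0.0
double = 0.0
for _ in range(N):
    single = to_f32(single + to_f32(0.1))  # every step rounded to float32
    double = double + 0.1                  # kept in float64 throughout

# The exact answer is 100000. The float64 sum is off by well under a
# millionth; the float32 sum is off by hundreds of units.
print(single, double)
```

An algorithm validated on a CPU in double precision can silently produce results like the `single` sum when ported to a GPU that only offers float32, which is the compatibility headache described above.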
Intel’s graphics driver is also being re-architected, although independently of the release of Ivy Bridge. Historically, Intel’s GPUs have been relatively low performance, so the driver’s CPU overhead was negligible on modern processors and the focus was on functionality rather than optimization. In comparison, AMD and Nvidia have products that are an order of magnitude faster, and their driver stacks must have correspondingly lower CPU overhead. To prepare for future integrated GPUs that will be even faster, Intel is planning to reduce driver overhead to comparable levels, measured in CPU cycles per draw call. To accomplish this, a new graphics driver architecture is expected later this year. The current driver has a layer that abstracts the underlying GPU hardware away from the OS. The newer architecture eliminates this abstraction layer, so the driver is tailored to a specific OS and hardware combination, sacrificing generality for lower overhead.
In earlier generations, there was only a single GPU variant and the differences were entirely due to frequency. Sandy Bridge changed that, introducing a higher performance GT2 with 12 shader cores and 60 threads and a GT1 model with 6 shader cores and 24 threads. In part, this was driven by the need to differentiate the product line. With Ivy Bridge, Intel’s GPU has been reorganized for scalability and differentiation across a number of dimensions.
Figure 1. Ivy Bridge GPU Overview
Ivy Bridge is logically partitioned into 5 different domains: global, slice common, slice, media and display. The global domain includes the bulk of the graphics and media front-end, such as the command streamer, vertex fetch and setup/clipping. The slice common domain is for the hardware that is shared by the entire shader array, such as the rasterizer, render output pipelines (ROPs) and the brand new L3 cache. The shader array is composed of multiple slices that are replicated to achieve the desired performance. A slice contains shader cores and shared resources such as the instruction cache and sampling pipeline. The media processing domain includes Intel’s programmable codecs and the video front-end that spawns media threads. Lastly, the display domain is responsible for the final output to the screen. Figure 1 shows the repartitioning of the Ivy Bridge GPU, particularly the two slices of shader cores and certain slice common blocks.
This logical repartitioning is an important step in Intel’s graphics roadmap, because it provides a range of performance. As the name suggests, parts of the shader array can be easily sliced off to save area and reduce cost. While Intel has publicly committed to dramatically improving graphics performance, low-end parts are still needed for price sensitive markets. Based on recent leaks, the next generation Haswell will have at least 3 versions of the GPU, which explains the importance of formalizing this scalability. The Ivy Bridge GT2 shader array consists of two slices, with 8 cores each, while the GT1 has a single slice. So at a high level, there are 33% more shader cores than Sandy Bridge; however, each core is individually much more powerful as well.
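The scaling arithmetic implied by the slice design can be checked directly from the figures in the text (two slices of 8 cores for GT2, one slice for GT1, versus 12 cores in Sandy Bridge GT2); the constants and function below are just a back-of-the-envelope sketch.

```python
# Slice-based scaling, using the core counts given in the article.
CORES_PER_SLICE = 8

def shader_cores(slices):
    """Total shader cores for a given slice count."""
    return slices * CORES_PER_SLICE

ivb_gt2 = shader_cores(2)   # Ivy Bridge GT2: two slices -> 16 cores
ivb_gt1 = shader_cores(1)   # Ivy Bridge GT1: one slice  ->  8 cores
snb_gt2 = 12                # Sandy Bridge GT2, for comparison

increase = (ivb_gt2 - snb_gt2) / snb_gt2
print(ivb_gt2, ivb_gt1, f"{increase:.0%}")  # 16 8 33%
```

Adding or removing a slice moves the core count in steps of 8, which is precisely the kind of coarse-grained differentiation a multi-variant lineup such as Haswell’s would need.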
As with previous generations, the Gen 7 fixed function hardware and programmable shaders communicate through a messaging network. Data is also passed through the Unified Return Buffer (URB), which is part of the slice common.