The Gen 6 architecture is fundamentally a different animal from other GPUs. At a high level, AMD’s Cayman and Nvidia’s Fermi/GF104 are similar to each other, with comparable balances of resources. More importantly, they are all designed to scale from $500 discrete cards down to $50 budget options and eventually integrated graphics. This is roughly a 12X difference in performance from the low end to the high end. The overall balance in Gen 6 is very distinct and clearly focused on a lower performance, more power efficient design point, where the addressable market is far larger because integration with the CPU is assumed rather than optional. Ironically, while Nvidia and AMD are emphasizing their programmable hardware, Sandy Bridge has more dedicated hardware than its predecessor.
The Gen 6 architecture in Sandy Bridge is a highly programmable graphics pipeline. However, this flexibility is only exposed to Intel’s graphics driver and is not accessible to ordinary developers. While it is compatible with DX10.1 and OpenGL 3.0, it lacks the features necessary for OpenCL or DirectX Compute Shaders. To a large extent, this is because Intel designed the Sandy Bridge graphics before the OpenCL standard was finalized.
Ironically, Sandy Bridge is one of the most advanced designs for unifying the CPU, GPU and fixed function hardware, thanks to a shared and tightly coupled last level cache. The media pipeline is highly programmable and exposed to developers through an API. This does not make up for the programmability deficiencies in the GPU. But it demonstrates that Intel’s commitment to programmability is real, especially for mass market applications such as media encoding and decoding. Intel’s programmable graphics efforts are primarily held back by project schedules, rather than any technical challenges.
One of the novel aspects of Intel’s graphics architecture is that each thread (i.e. shader program) can have a different software execution model. The goal is to efficiently handle data stored in both array-of-structures (AOS, e.g. vertices) and structure-of-arrays (SOA, e.g. pixels) formats. This in turn requires a very interesting set of hardware design choices.
There are several operating modes, each with a different vector width (n) and number of channels or program flows (m), often described as SIMDnxm. For example, SIMD4x1 or SIMD4x2 are used for geometry or vertex shaders, which operate on 4-component vertices. In contrast, SIMD1x8 or SIMD1x16 are well suited for pixel or media shaders that work on one color component of 8 or 16 pixels at once. To put this terminology into a familiar context, an Nvidia warp is conceptually SIMD1x32, with 32 channels and a scalar operation for each channel. AMD’s wavefronts similarly are 16 channels, with a VLIW4 per channel. The Gen architecture is particularly interesting because each instruction can have a different number of channels and a different width, as shown in Figure 1.
Vertex and geometry threads generally operate on 2 vertices at a time and use SIMD4x2. Pixel threads are spawned by the fixed function rasterizer and can shade 2, 4 or 8 quads of pixels (Intel refers to a quad as a subspan) using SIMD1x8 or SIMD1x16. The size of the pixel threads is determined by software and typically based on overhead, register availability, instruction cache pressure and other resource constraints. Media threads are created by the video front-end, but can readily spawn child threads and tend to look more like pixel threads.
Figure 1 – AMD, Intel and Nvidia SIMD Execution Models
The actual execution of the programmable portions of 3D and media workloads is managed by a thread dispatcher. Threads are managed and sent to the shader array based on available resources, with extensive thread level scoreboarding to correctly handle any dependencies. Executing threads (as well as fixed function hardware) can spawn other threads, which requires somewhat more complex management to track the relationships and dependencies – another motivator for the thread scoreboarding. The older Ironlake would simply block any newer threads from dispatching to the shader array to avoid dependencies. For instance, this was used to prevent overlapping pixels from writing their results out in the wrong order.
Sandy Bridge’s more sophisticated thread scoreboarding is non-blocking. The thread dispatcher keeps track of the ordering and dependencies between threads so that they can dispatch and execute in parallel. To maintain correctness though, the writes to the frame buffer (through the ROP) must occur in-order. So the newer thread will be blocked from writing back until the older one has first written out its results. Incidentally, this capability was primarily added for media acceleration, but is also highly beneficial for programmable workloads with thread dependencies.