One of the largest improvements in Sandy Bridge is the move towards fixed function hardware, in contrast to the previous generation’s reliance on software. Dedicated hardware has substantially better area and power efficiency than a software implementation that executes on programmable hardware. For example, a benchmark at the Tech Report showed that Intel’s AES hardware boosts performance by nearly 8X on largely similar CPUs.
Intel’s architects also claim that moving algorithms and workloads from software to hardware removes considerable amounts of code from their graphics drivers. This means a marginally faster and more efficient driver, but more importantly, it simplifies software development and testing by eliminating complexity. Of course, the hardware must still be tested and validated, but that is frankly a more tractable and familiar problem to Intel than writing high performance drivers.
Dedicated hardware is not without costs and trade-offs though. One subtle drawback is that on-chip buffering is necessary – and must be sized based on the performance of the overall pipeline and fixed function blocks. Intel’s graphics architecture stores data for the fixed function units in the URB, which is shared across the whole chip for efficiency.
Figure 3 – Sandy Bridge and Ironlake GPU Overview
As shown in Figure 3, the previous generation uses software threads to assist with the clipping and setup stages of the 3D pipeline. Ironlake’s clipping test is done in hardware for common cases, but more complex testing requires software. To actually clip a vertex, the GPU will spawn a clipping thread that is dispatched to the shader array. The Sandy Bridge clipper is much more powerful and handles both testing and actual vertex clipping. This dedicated hardware replaces and eliminates any software clipping.
The setup stage is responsible for assembling clipped vertices into 3D objects for rasterization. Ironlake spawns setup threads that calculate attribute interpolation for Z and 1/W, while hardware interpolates X & Y. Sandy Bridge moves the setup phase entirely into fixed functions, again reducing the burden on the shader array and drivers.
Sandy Bridge can set up and rasterize a triangle every 4 clock cycles. This is fairly slow compared to AMD and Nvidia GPUs, which typically rasterize 1 triangle per clock, although newer models can achieve 2-4 triangles/cycle. However, the Gen 6 rasterizer runs at roughly twice the clock frequency of discrete GPUs, so the performance gap is smaller than it appears.
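As a rough illustration of that comparison, the short Python sketch below normalizes triangle throughput to a common clock. The function name and the exact clock ratio of 2.0 are assumptions for illustration; the text only says "roughly twice".

```python
def triangles_per_reference_clock(triangles_per_cycle, clock_ratio):
    """Triangle rate expressed per cycle of a reference (discrete GPU) clock.

    clock_ratio is this rasterizer's clock divided by the reference clock.
    """
    return triangles_per_cycle * clock_ratio


# Sandy Bridge: 1 triangle every 4 cycles, at an assumed 2x clock.
sandy_bridge = triangles_per_reference_clock(1 / 4, 2.0)
# Typical discrete GPU: 1 triangle per cycle at the reference clock.
discrete = triangles_per_reference_clock(1.0, 1.0)

gap_per_cycle = 1.0 / (1 / 4)              # gap when counting cycles only
gap_normalized = discrete / sandy_bridge   # gap after accounting for clock speed

print(f"per-cycle gap: {gap_per_cycle:.0f}x, clock-normalized gap: {gap_normalized:.0f}x")
```

Under these assumptions the gap against a 1 triangle/clock design shrinks from 4x to about 2x. Newer discrete parts that rasterize 2-4 triangles per cycle would widen it again.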
Intel also added substantial fixed function hardware for media decoding and encoding. The Multi-Format Codec (MFX) in Sandy Bridge has full hardware decoding for MPEG2, VC1 and AVC to reduce power consumption, whereas Ironlake performed motion compensation and deblocking filtering in software. Encoding is fully accelerated for AVC (H.264), although this uses a combination of fixed function hardware and the programmable shader cores.
The Sandy Bridge shader array has 12 cores for the high-end GT2, and 6 cores for the GT1 variant. The shader array is organized into rows of 3 cores, and each row shares an L1 instruction cache. In the older Gen 5 design, the cores in each row also share a transcendental math unit. Collectively the entire GPU shares an L2 instruction cache, the URB, a texture sampling pipeline and a raster output pipeline. As previously described, the shader cores are very flexible and can execute in either a SIMD or scalar fashion.
While Sandy Bridge has the same number of cores as Ironlake (12), the microarchitectures are substantially different. The newer cores have a much more powerful instruction set, more resources and better access to special purpose hardware. Overall the performance per core is roughly double. This is yet another example of how the number of cores in a GPU (or CPU) is a misleading and nearly irrelevant metric.
The thread dispatcher is responsible for sending various types of threads (vertex, geometry, pixel, media) to the programmable graphics cores for execution. As with all GPUs, the cores are multi-threaded to hide latency. Sandy Bridge’s GT2 cores can each have 5 threads in flight at once, for a total of 60 threads across the GPU. The lower-end GT1 cores are limited to 4 threads each, with half the number of cores overall. The older Ironlake design actually supports 6 threads per core, but Intel reduced this for Sandy Bridge because they moved all clipping and setup from programmable threads into fixed function hardware.
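The thread counts follow directly from cores multiplied by threads per core. The Python sketch below tabulates them; the `GpuConfig` class and its field names are hypothetical, and the Ironlake entry assumes that the figure of 6 threads applies per core across its 12 cores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical container for the core and thread figures quoted in the text."""
    name: str
    cores: int
    threads_per_core: int

    @property
    def total_threads(self) -> int:
        # Maximum number of threads in flight across the whole shader array.
        return self.cores * self.threads_per_core


GT2 = GpuConfig("Sandy Bridge GT2", cores=12, threads_per_core=5)
GT1 = GpuConfig("Sandy Bridge GT1", cores=6, threads_per_core=4)
# Assumption: Ironlake's 6 threads are per core, with 12 cores.
IRONLAKE = GpuConfig("Ironlake", cores=12, threads_per_core=6)

for gpu in (GT2, GT1, IRONLAKE):
    print(f"{gpu.name}: {gpu.cores} cores x {gpu.threads_per_core} threads = {gpu.total_threads}")
```

Under these assumptions GT2 holds 60 threads, GT1 holds 24, and Ironlake would hold 72, some of which were consumed by clipping and setup threads that Sandy Bridge no longer needs.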
Threads are primarily instantiated by fixed function hardware, although media threads can also spawn child threads. The data needed to start a thread is typically buffered in the URB and then sent to the thread dispatcher. Thread readiness is based on input and output requirements and resource availability, including constants, the URB, scratch space and the actual shader cores. For example, two vertices are usually required to dispatch a vertex thread, and a geometry thread must have all vertices available. Similarly, pixel threads are dispatched to shade 2, 4 or 8 pixel quads.
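To put the pixel thread sizes in more familiar terms, the sketch below converts quads to pixels, taking a quad to be the usual 2x2 block of pixels. The function and constant names are hypothetical.

```python
PIXELS_PER_QUAD = 4  # a quad is a 2x2 block of pixels


def pixels_per_thread(quads: int) -> int:
    """Number of pixels shaded by one pixel thread dispatched with `quads` quads."""
    if quads not in (2, 4, 8):
        raise ValueError("pixel threads are dispatched with 2, 4 or 8 quads")
    return quads * PIXELS_PER_QUAD


sizes = {quads: pixels_per_thread(quads) for quads in (2, 4, 8)}
print(sizes)
```

A single pixel thread therefore covers 8, 16 or 32 pixels, depending on how many quads the dispatcher packs into it.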
All threads are classified as high or low priority, and the dispatcher uses a round-robin algorithm within each priority class. When a thread is selected, the dispatcher assigns it to a core and sends the input data, which is copied into registers. Upon completion, threads send a termination message to the dispatcher.
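The selection policy can be sketched in a few lines of Python. This is a toy model, not Intel's implementation. It assumes that high priority work is always served before low priority work and that the round-robin rotates among the units requesting threads, neither of which the text confirms. The `Dispatcher` class and the source names are hypothetical.

```python
from collections import deque


class Dispatcher:
    """Toy two-level dispatcher: priority classes, round-robin within each class."""

    def __init__(self, high_sources, low_sources):
        # One rotating ring of thread sources per priority class.
        self.rings = {"high": deque(high_sources), "low": deque(low_sources)}
        # Pending threads, queued per source in arrival order.
        self.pending = {name: deque() for name in (*high_sources, *low_sources)}

    def submit(self, source, thread):
        self.pending[source].append(thread)

    def select(self):
        """Return (priority, source, thread) for the next thread, or None if idle."""
        for priority in ("high", "low"):      # assumption: strict priority order
            ring = self.rings[priority]
            for _ in range(len(ring)):
                source = ring[0]
                ring.rotate(-1)               # this source goes to the back of the ring
                if self.pending[source]:
                    return priority, source, self.pending[source].popleft()
        return None


dispatcher = Dispatcher(high_sources=["vertex", "geometry"], low_sources=["pixel"])
dispatcher.submit("vertex", "v0")
dispatcher.submit("vertex", "v1")
dispatcher.submit("geometry", "g0")
dispatcher.submit("pixel", "p0")

order = []
while (choice := dispatcher.select()) is not None:
    order.append(choice[2])
print(order)
```

In this model the two high priority sources alternate until they drain, and only then is the low priority pixel thread selected. A real dispatcher would also check resource availability (URB space, scratch space, free thread slots) before selecting a thread.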