Shader Front End
The front-end for the Ivy Bridge shader cores has been substantially re-organized. The earlier Sandy Bridge design had a 4KB dedicated L1 instruction cache for each core, which was backed by a larger shared L2 instruction cache. However, as Intel scaled up the number of cores in Ivy Bridge, this became less efficient. Many shaders will execute across all cores simultaneously, so the instructions end up getting cached redundantly.
The Ivy Bridge instruction cache is a much larger 32KB design, but it is shared by 8 cores in each slice. Compared to the previous generation, the amount of cache is the same for a group of 8 cores, but there is no replication so more instructions can be cached simultaneously. The shared L1 instruction caches are now backed by the much larger L3 cache.
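The capacity argument can be checked with a few lines of arithmetic. The sketch below is a simplification, not a model of the real hardware: it assumes the worst case where all 8 cores run the same shader, so every private cache holds identical lines, and it ignores Sandy Bridge's shared L2 instruction cache. The function name is made up for illustration.

```python
KB = 1024

def distinct_instruction_bytes(cache_bytes, num_caches, same_shader_everywhere):
    """Bytes of distinct instructions that a set of caches can hold.

    Assumption: if every core runs the same shader, each private cache
    ends up holding the same lines, so only one cache's worth is distinct.
    """
    if same_shader_everywhere:
        return cache_bytes
    return cache_bytes * num_caches

# Sandy Bridge: 8 cores, each with a private 4KB L1 instruction cache.
snb_total = 8 * 4 * KB
snb_distinct = distinct_instruction_bytes(4 * KB, 8, same_shader_everywhere=True)

# Ivy Bridge: a single 32KB L1 instruction cache shared by 8 cores.
ivb_total = 32 * KB
ivb_distinct = distinct_instruction_bytes(32 * KB, 1, same_shader_everywhere=True)

print(snb_total // KB, snb_distinct // KB)  # total KB vs. distinct KB
print(ivb_total // KB, ivb_distinct // KB)
```

Under these assumptions the total SRAM is identical, but the shared design holds 8 times as many distinct instructions.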
Each thread maintains a separate buffer of fetched instructions. Since 8 cores now share the L1 instruction cache, the instruction buffers have grown as well. The details of the Ivy Bridge instruction buffers were not disclosed. However, assuming that each thread can fetch every 8 cycles, it is likely that each thread’s buffer holds at least 16 instructions to cover this latency.
Figure 2. Shader Front-end Comparison
The single biggest change to the Ivy Bridge shader cores is essentially doubling the execution resources. Because the overall memory latency has not significantly changed, Intel’s architects had to substantially increase the number of threads to sustain twice the FLOP/s. Sandy Bridge GT2 variants had 5 threads per core, while the GT1 versions featured 4. Ivy Bridge has 8 threads, more or less consistent with the additional compute power.
The register files in the Gen 7 core were also rearchitected. The most significant change is that Intel eliminated the message register file (MRF). Previously, the MRF was used by the shader cores to send messages to the rest of the GPU, including the fixed function hardware and the other cores. In Ivy Bridge, these registers have been removed, and messages can be sent directly from the general purpose register file (GRF).
As discussed in our earlier article on Sandy Bridge, threads can execute in one of several execution modes. Vertex shading commonly uses SIMD4x2, with 4 data elements from 2 vertices. Pixel shading is SIMD1x8 or SIMD1x16 (aka SIMD8 or SIMD16), operating on a single color from 8 or 16 pixels simultaneously. Media shaders are similar to pixel shaders, except they are packed even more densely with 8-bit data, rather than the 32-bit data used in graphics shaders. To support all these different execution modes, the GRF is incredibly versatile.
Registers are each 256 bits wide, which is perfectly suited for SIMD4x2 or SIMD8. In a 16B aligned mode, instructions operate on 4-component RGBA data, with source swizzling and destination masking. In a 1B aligned mode, instructions use region-based addressing to perform a 2-dimensional gather from the register file, and swizzling and destination masking are disabled. This is critical for good media performance, where 1B data is packed together for maximum density. Collectively, these two addressing modes also simplify converting from AOS to SOA data structures.
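To make the AOS and SOA layouts concrete, here is a small software illustration. It is only a sketch of the data arrangement, assuming 32-bit elements (8 per 256-bit register); it does not model how the hardware's region-based addressing actually performs the gather, and the function name is hypothetical.

```python
LANES = 8       # 256-bit register / 32-bit elements
CHANNELS = 4    # R, G, B, A

def aos_to_soa(aos_registers):
    """Convert array-of-structures registers to structure-of-arrays.

    Input:  4 registers, each holding 2 pixels as [r, g, b, a, r, g, b, a]
            (the SIMD4x2-style layout).
    Output: 4 registers, each holding one channel for 8 pixels
            (the SIMD8-style layout).
    """
    flat = [value for register in aos_registers for value in register]
    # Channel c of pixel p sits at flat[CHANNELS * p + c],
    # so a stride-4 slice collects one channel across all pixels.
    return [flat[c::CHANNELS] for c in range(CHANNELS)]

# Eight pixels with symbolic channel values so the shuffle is easy to follow.
pixels = [(f"r{p}", f"g{p}", f"b{p}", f"a{p}") for p in range(LANES)]
aos = [list(pixels[2 * i] + pixels[2 * i + 1]) for i in range(4)]
soa = aos_to_soa(aos)

print(aos[0])  # two whole pixels in one register
print(soa[0])  # the red channel of all eight pixels in one register
```

The strided slice plays the role that a strided register region would play in hardware: the same 128 bytes are read, only the element order changes.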
Each thread is allocated 128 general purpose registers, so the GRF has expanded to 32KB to handle 8 threads. The GRF has also been enhanced to handle larger 8B accesses that are necessary for double precision computation. A separate architectural register file (ARF) contains the control information and special purpose registers for each thread, although it has been omitted from the figure above.
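The 32KB figure follows directly from the register width and the per-thread allocation. A quick check, with a helper name invented for this example:

```python
REGISTER_BITS = 256
REGISTERS_PER_THREAD = 128
THREADS_PER_CORE = 8   # Ivy Bridge

def grf_bytes(threads, registers_per_thread=REGISTERS_PER_THREAD,
              register_bits=REGISTER_BITS):
    """Total general purpose register file capacity in bytes."""
    return threads * registers_per_thread * register_bits // 8

per_thread_bytes = grf_bytes(1)
core_bytes = grf_bytes(THREADS_PER_CORE)

# Elements that fit in one register at each data width.
lanes_fp32 = REGISTER_BITS // 32
lanes_fp64 = REGISTER_BITS // 64

print(per_thread_bytes // 1024, "KB per thread")
print(core_bytes // 1024, "KB per core")
print(lanes_fp32, "single precision lanes,", lanes_fp64, "double precision lanes")
```

Each thread therefore owns 4KB of registers, and a register that holds 8 single precision values holds 4 double precision values when accessed in 8B elements.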