The two largest components of graphics performance are the compute resources of the shader array and memory and texturing bandwidth, which must be scaled up in tandem. The Sandy Bridge shader array shared a single texturing pipeline between 12 cores. To complement the more powerful shaders in Ivy Bridge, Intel’s architects have distributed two sampling pipelines, one to each slice of the shader array.
As with Intel’s previous GPUs, there is no dedicated memory hierarchy within each shader core. The Ivy Bridge shader cores do not even contain true address generation or translation. When a texture fetch or memory access is needed, an integer address is sent through the messaging network to the shared sampling pipelines, which perform all the address generation and translation. Virtual memory is mapped using 4KB pages for compatibility with the x86 memory model, with either tiled or linear pages. The texture sampler also coalesces accesses to take advantage of locality.
The individual sampling pipelines are more or less identical to the previous generation. The Gen 7 samplers support the new BC6H/BC7 texture formats, which are required for DX11. The sampler can receive 4 pixel input addresses and generate 32 texel addresses. The read-only L1 texture caches in the samplers are the same size, 4KB, but the L2 texture caches have increased from 16KB to 24KB. With two sampling pipelines, each sampler is now shared by 8 Gen 7 shader cores rather than 12, increasing the sampling bandwidth available to each core by 50%. Of course, the Gen 7 cores are about 2× more powerful and the number of cores has grown from 12 to 16, so the actual ratio of bytes/FLOP has fallen. Doubling the sampling pipelines is also quite helpful for media shader performance, especially at high resolutions.
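The ratios in this paragraph can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model: it assumes each Gen 7 sampler delivers the same throughput as the Sandy Bridge sampler (the text says they are "more or less identical"), takes the "about 2×" per-core figure at face value, and derives the 16-core count from two samplers serving 8 cores each. The variable names are illustrative.

```python
from fractions import Fraction

# Figures from the text. Per-sampler throughput is assumed unchanged
# between generations, which is a simplification.
SNB_CORES, SNB_SAMPLERS = 12, 1
IVB_SAMPLERS, CORES_PER_IVB_SAMPLER = 2, 8
IVB_CORES = IVB_SAMPLERS * CORES_PER_IVB_SAMPLER  # 16

PER_CORE_COMPUTE_GAIN = 2  # "about 2x more powerful" (approximate)

# Sampler throughput available to each core, relative to Sandy Bridge.
per_core_sampling_gain = (
    Fraction(IVB_SAMPLERS, IVB_CORES) / Fraction(SNB_SAMPLERS, SNB_CORES)
)

# Growth in the number of shader cores.
core_count_gain = Fraction(IVB_CORES, SNB_CORES)

# Relative bytes/FLOP: sampling bandwidth per core over compute per core.
bytes_per_flop_ratio = per_core_sampling_gain / PER_CORE_COMPUTE_GAIN

print(f"sampling bandwidth per core: {float(per_core_sampling_gain):.2f}x")
print(f"core count:                  {float(core_count_gain):.2f}x")
print(f"bytes/FLOP vs. Sandy Bridge: {float(bytes_per_flop_ratio):.2f}x")
```

Under these assumptions, each core sees 1.5× the sampling bandwidth, but bytes/FLOP drops to three quarters of the Sandy Bridge value.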
Figure 4. Memory Pipeline Comparison
Additionally, Intel has added hardware and done extensive work in the driver stack to improve the anisotropic filtering algorithms. The image quality in Sandy Bridge suffered from an algorithm that produced different texture quality depending on the viewing angle. In synthetic tests, the image quality was quite poor compared to modern GPUs, although the impact on real games is unclear.
To put this in context, both AMD and Nvidia had problems of this nature with earlier products, especially prior to DX10. However, the two discrete GPU vendors have been tweaking the filtering algorithms for over 5 years, both for synthetic tests and real graphics applications. Ivy Bridge eliminates the most egregious problems and the results on synthetic tests are consistent with industry norms.
One of the fundamental challenges for integrated graphics is memory bandwidth. High-end graphics cards have dedicated memory controllers with expensive GDDR5 and have recently exceeded 250GB/s of bandwidth. Integrated graphics must share the memory controllers with the CPU and rely on much less expensive DDR3 that is limited to around 34GB/s. Doubling the sampling pipelines in Ivy Bridge is necessary to scale up graphics performance, but has system-level implications. If doubling the samplers merely doubles the memory bandwidth needs, then the net result will not be that impressive.
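To make the size of this gap concrete, the sketch below computes theoretical peak DRAM bandwidth as transfer rate × bus width × channel count. The configuration is an assumption rather than something stated in the text: two 64-bit channels, with DDR3-2133 as one speed grade that lands near the 34GB/s figure and DDR3-1600 shown for comparison. The function name is illustrative.

```python
# Theoretical peak DRAM bandwidth = transfers/s x bytes per transfer x channels.
# Assumptions (not from the article): two channels, each 64 bits (8 bytes) wide.

def peak_bandwidth_gb_s(mega_transfers_per_s, channels=2, bytes_per_transfer=8):
    """Return theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    mb_per_s = mega_transfers_per_s * bytes_per_transfer * channels
    return mb_per_s / 1000

ddr3_1600 = peak_bandwidth_gb_s(1600)
ddr3_2133 = peak_bandwidth_gb_s(2133)

DISCRETE_GB_S = 250  # "exceeded 250GB/s", from the text
gap = DISCRETE_GB_S / ddr3_2133

print(f"DDR3-1600, dual channel: {ddr3_1600:.1f} GB/s")
print(f"DDR3-2133, dual channel: {ddr3_2133:.1f} GB/s")
print(f"high-end discrete card:  {gap:.1f}x more, and not shared with a CPU")
```

These are peak numbers; sustained bandwidth is lower in practice, and the integrated GPU only gets whatever share the CPU cores leave unused.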
To address this issue, the Ivy Bridge GPU incorporates a high bandwidth L3 cache that is shared by the entire shader array. The nomenclature is a little complicated, since the L3 cache serves so many purposes. It backs the L1 and L2 texture caches and the L1 instruction caches, and also holds constants. Note that the L3 is a graphics-only cache, separate from Ivy Bridge’s last level cache (LLC). So technically speaking, there are 4 levels of caching for the Ivy Bridge GPU. The actual array is physically 512KB, but it is partitioned into two or three sections and shared with other parts of the GPU; the L3 cache takes up half the array.
The L3 cache is 256KB and 32-way associative with 64B lines. It is implemented as 4 banks, each containing 32 sets and delivering a full cache line for an aggregate 256B/cycle. The replacement policy is a pseudo-LRU algorithm and the L3 is way partitioned between data, textures, instructions and constants. The main motivation for the L3 cache is to absorb the bandwidth required by the texture pipeline in Ivy Bridge. As an added benefit, it can quickly deliver instructions to the two L1 caches, which reduces the penalties from contention and enables more efficient L1 instruction caches that are larger and shared by more cores. It also substantially increases power efficiency, since any hit in the L3 will not have to traverse the ring bus and probe the LLC.
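The geometry quoted above is self-consistent, which a short sketch can confirm. The capacity, line size, associativity and bank count come from the text; the `decompose` function is purely hypothetical and assumes consecutive cache lines are interleaved across banks, which the text does not describe.

```python
# Cache parameters from the text.
CACHE_BYTES = 256 * 1024
LINE_BYTES = 64
WAYS = 32
BANKS = 4

total_lines = CACHE_BYTES // LINE_BYTES   # lines in the whole cache
total_sets = total_lines // WAYS          # sets across all banks
sets_per_bank = total_sets // BANKS       # the text quotes 32
bytes_per_cycle = BANKS * LINE_BYTES      # one full line per bank per cycle

def decompose(addr):
    """Split a byte address into (tag, bank, set, offset).

    HYPOTHETICAL mapping: assumes consecutive lines rotate across banks.
    The real hardware's index/bank hashing is not given in the article.
    """
    offset = addr % LINE_BYTES
    line = addr // LINE_BYTES
    bank = line % BANKS
    set_in_bank = (line // BANKS) % sets_per_bank
    tag = line // (BANKS * sets_per_bank)
    return tag, bank, set_in_bank, offset

print(f"{total_lines} lines, {total_sets} sets, {sets_per_bank} sets per bank")
print(f"aggregate bandwidth: {bytes_per_cycle} B/cycle")
print("0x12345 ->", decompose(0x12345))
```

With low-order line interleaving like this, four consecutive 64B lines fall in four different banks, which is one way a streaming access pattern could reach the full 256B/cycle.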
Render Output Pipeline
The render output pipeline in Ivy Bridge is mostly unchanged. Like the L3 cache and the rasterizer, it is shared across the entire GPU. The ROPs contain a 32KB depth cache and 8KB color cache and can write out four 32-bit pixels per clock. Multi-sample anti-aliasing at 2× and 4× is supported, albeit with a decrease in throughput due to the extra work. One minor improvement is that the scoreboarding for the ROP has been tweaked to reduce the bandwidth requirements.