Render Output Pipeline
The other prominent component of the memory hierarchy is the render output pipeline (ROP), which is capable of both reads and writes and resides in the data port. Sandy Bridge has a single ROP pipeline, again because the performance only scales modestly between different parts. In contrast, GPUs from AMD and Nvidia tend to partition the ROPs, with a pipeline for each channel of external memory. As shown in Figure 7, Llano has two ROP partitions.
The ROP is responsible for writing out render targets and performs a variety of critical graphics operations including alpha testing, stencil testing, depth testing and blending the pixel output. Like most GPUs, the Gen 6 ROP is generally optimized for writing out data, rather than reading, and must include many related functions such as atomic operations. A message to the ROP typically contains 4 quads (16 pixels total) that will be written back to one or more render targets.
Both Ironlake and Sandy Bridge include render caches that are used for read/write operations, particularly in the ROP. The depth and color caches are 32KB and 8KB respectively, and both are set-associative. The output from these caches is eventually sent over the ring interconnect to Sandy Bridge’s L3 cache and/or memory controller. The render caches are not coherent with the texture caches by default. To safely read back data, the render caches must be explicitly flushed.
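The flush requirement can be illustrated with a toy model. This is only a sketch of the general write-back behavior described above, not a description of the actual hardware: the class names (`ToyMemory`, `ToyRenderCache`, `ToyTextureCache`) are hypothetical, and the texture path is simplified to read straight from memory.

```python
class ToyMemory:
    """Backing store shared by both paths (stands in for LLC/DRAM)."""

    def __init__(self):
        self.data = {}


class ToyRenderCache:
    """Write-back cache: writes stay local until flush() is called."""

    def __init__(self, memory):
        self.memory = memory
        self.dirty = {}

    def write(self, addr, value):
        # The write is only visible inside the render cache for now.
        self.dirty[addr] = value

    def flush(self):
        # Push every dirty entry out to the shared backing store.
        self.memory.data.update(self.dirty)
        self.dirty.clear()


class ToyTextureCache:
    """Read path that never looks inside the render cache (not coherent)."""

    def __init__(self, memory):
        self.memory = memory

    def read(self, addr):
        return self.memory.data.get(addr)


memory = ToyMemory()
render = ToyRenderCache(memory)
texture = ToyTextureCache(memory)

render.write(0x100, "pixel")
before_flush = texture.read(0x100)  # the texture path cannot see the write yet
render.flush()
after_flush = texture.read(0x100)   # now the write has reached memory
```

In this model the read before the flush misses the newly written pixel entirely, which is the hazard the explicit flush is meant to avoid. A real texture cache could also hold a stale copy of the line, which this sketch does not model.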
Figure 7 – Render Output Pipeline Comparison
The ROP in Sandy Bridge was also substantially improved over the previous generation with better performance and higher quality graphics options. As an example of the trend towards fixed function graphics hardware at Intel, alpha coverage generation shifted from threads on the shader cores (in Ironlake) to dedicated hardware in the ROP (for Sandy Bridge).
Ironlake introduced hierarchical Z compression, to save memory bandwidth when handling depth buffers, and a fast Z-clear as well. The Gen 6 ROP has significantly better hierarchical Z performance, achieved by operating on larger tiles, and further reduces the cost of clearing different buffers. Early Z-testing (prior to pixel shading) is now mandatory, rather than optional as it was for Ironlake.
The most significant change in the Sandy Bridge ROP is in image quality, rather than performance. Previous generations did not support anti-aliasing, which helps to smooth out lines and jagged edges in rendered images. Sandy Bridge has 2X and 4X multi-sample anti-aliasing (MSAA) with 32-bit FP blending, which is required for DirectX 10.1, and has been standard for AMD and Nvidia for half a decade or more. The ROP can write four 32-bit pixels per clock, for a total throughput of 16B/cycle – however, multi-sampling reduces the performance considerably. The MSAA blending hardware can also be used for atomic operations, which will be handy for future versions that are OpenCL compliant.
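The throughput figure above is simple arithmetic, sketched below. The function names are hypothetical, and the clock frequency is a caller-supplied assumption rather than a number from this article. The sketch covers only the peak rate without multi-sampling, since the size of the MSAA penalty is not quantified here.

```python
BYTES_PER_PIXEL = 4   # one 32-bit pixel, per the text
PIXELS_PER_CLOCK = 4  # ROP write rate, per the text


def rop_bytes_per_cycle(pixels_per_clock=PIXELS_PER_CLOCK,
                        bytes_per_pixel=BYTES_PER_PIXEL):
    """Peak ROP write throughput in bytes per clock."""
    return pixels_per_clock * bytes_per_pixel


def rop_peak_gb_per_s(clock_mhz):
    """Peak write bandwidth in GB/s (10^9 bytes per second).

    clock_mhz is an assumption supplied by the caller, not a figure
    taken from the article.
    """
    return rop_bytes_per_cycle() * clock_mhz * 1e6 / 1e9
```

For example, at a hypothetical 1000 MHz graphics clock, `rop_peak_gb_per_s(1000)` works out to 16 GB/s of peak write bandwidth.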
The data port is responsible for all memory accesses outside the texturing pipeline. Most prominently this includes the ROP and render caches. But the data port also encompasses the constant cache and access to the ring interconnect (i.e. the shared LLC and memory controller). As with most shared resources, the data port is accessed through the messaging framework. Sandy Bridge is the first GPU with access to the LLC, and it is a tremendous advantage over Ironlake.
One of the most important changes in the data port is the memory ordering. Previously, the data port had no ordering between messages; software was responsible for ensuring that two messages did not attempt to simultaneously read and write the same location. Sandy Bridge moves a step in the right direction and guarantees that read and write commands from each thread will be handled in order. There is still no hardware ordering between different threads by default, but that is normal for most GPUs. The stronger memory ordering model is important for future generations that will have OpenCL and DirectCompute, since an in-order model is much more natural for developers. It is also a boon to tighter integration, since it more closely matches the x86 ordering model.
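One way to picture this guarantee is to enumerate every order in which the data port could service messages from two threads while keeping each thread's own messages in sequence. The sketch below is a toy model with hypothetical names (`interleavings`, `run`), using a single memory location and made-up messages. It is not how the hardware schedules anything.

```python
def interleavings(a, b):
    """All merges of sequences a and b that keep each one's internal order."""
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    take_a = [[a[0]] + rest for rest in interleavings(a[1:], b)]
    take_b = [[b[0]] + rest for rest in interleavings(a, b[1:])]
    return take_a + take_b


def run(messages, initial=0):
    """Apply ("write", v) and ("read",) messages to one location.

    Returns the list of values seen by the reads, in order.
    """
    value = initial
    reads = []
    for msg in messages:
        if msg[0] == "write":
            value = msg[1]
        else:
            reads.append(value)
    return reads


# Thread 0 writes 1 and then reads the same location; thread 1 writes 2.
thread0 = [("write", 1), ("read",)]
thread1 = [("write", 2)]

orders = interleavings(thread0, thread1)
observed = {tuple(run(order)) for order in orders}
```

Under per-thread ordering, thread 0's read can return 1 or 2 depending on when thread 1's write lands, but never the initial value 0, because its own write is always handled first. Without any ordering between messages, as on the earlier data port, the read could be serviced ahead of the write and return 0.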
The Sandy Bridge data port has several new capabilities. The first is an unaligned 16B block read, which can access 16B, 32B, 64B or 128B of contiguous data and writes back to 1-4 GRF entries (depending on the total size). The second takes advantage of the new blending hardware in the ROP for atomic operations. The message executes 8 atomic operations on 32-bit data, including arithmetic and logic, compare and exchange and min/max, and writes the results back to 8 different locations in memory. The writes can be totally non-contiguous in memory, and the current generation will not attempt to coalesce the accesses. This is another forward-looking change that is necessary for programmability, since both OpenCL and DirectCompute require some form of atomic operation.
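Both messages can be sketched in a few lines. This is a minimal model under stated assumptions, not the actual message format. It assumes a 32-byte (256-bit) GRF register, which is consistent with a 128B read filling 4 entries, and it treats min/max as unsigned. All function and table names are hypothetical, and compare and exchange is left out for brevity.

```python
GRF_BYTES = 32  # assumption: 256-bit GRF registers (128B read -> 4 entries)
BLOCK_READ_SIZES = (16, 32, 64, 128)  # sizes listed in the text
MASK32 = 0xFFFFFFFF


def grf_entries_for_block_read(size_bytes):
    """Number of GRF entries a block read of the given size writes back."""
    if size_bytes not in BLOCK_READ_SIZES:
        raise ValueError("unsupported block read size: %r" % size_bytes)
    return -(-size_bytes // GRF_BYTES)  # ceiling division


# Subset of the operations named in the text; unsigned min/max is an assumption.
ATOMIC_OPS = {
    "add": lambda old, src: (old + src) & MASK32,
    "and": lambda old, src: old & src,
    "or": lambda old, src: old | src,
    "min": lambda old, src: min(old, src),
    "max": lambda old, src: max(old, src),
}


def atomic_message(memory, op, addrs, srcs):
    """Apply one atomic op to 8 independent 32-bit locations.

    memory is a dict of address -> value; the addresses may be scattered,
    and each location is updated separately with no coalescing.
    """
    if len(addrs) != 8 or len(srcs) != 8:
        raise ValueError("a message carries exactly 8 operations")
    fn = ATOMIC_OPS[op]
    for addr, src in zip(addrs, srcs):
        memory[addr] = fn(memory.get(addr, 0) & MASK32, src & MASK32)


# Eight scattered addresses, each incremented by a different amount.
memory = {}
addrs = [0x10, 0x400, 0x8, 0x1000, 0x24, 0x7F0, 0x300, 0x44]
atomic_message(memory, "add", addrs, [1, 2, 3, 4, 5, 6, 7, 8])
```

With these assumptions, a 16B or 32B read occupies one GRF entry, a 64B read two and a 128B read four, matching the 1-4 range above.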