Unified Return Buffer
The additional FMA unit in the shader core tremendously increases performance for graphics and media workloads, but has relatively little impact on programmability. The underlying architectural features needed for OpenCL and DirectCompute support are concentrated in the memory pipeline.
OpenCL requires a local memory that is shared by an entire work-group for explicit communication and synchronization. High performance GPUs from AMD and Nvidia typically embed this memory in each shader core, while CPUs tend to use the L1 and L2 data caches. Rather than create a new data structure within each shader core, Intel’s architects chose to extend and enhance an existing one. The overall result is something of a compromise in terms of performance, but highly efficient in terms of area and power.
The Unified Return Buffer (URB) was already used in Intel’s GPUs for explicit communication and synchronization between the shader cores and fixed function hardware. In Sandy Bridge, the URB was 64KB for the GT2 variants and 32KB for GT1. In Ivy Bridge it has been enlarged and extended to also act as the local memory.
The Ivy Bridge URB is significantly larger than in Sandy Bridge, and shares a 512KB array with the L3 cache. As noted previously, 256KB of the array is used for the L3 cache. The remaining 256KB is available for the two explicit memories in Ivy Bridge, the URB and the OpenCL shared local memory. For graphics-only workloads with no compute, the URB will take advantage of the full 256KB, four times the capacity of the 64KB URB in Sandy Bridge GT2. When OpenCL or DirectX compute shaders are used, 128KB is partitioned for the shared local memory, leaving 128KB for the URB. While the URB and shared local memory are logically global resources, they are physically part of the slice common domain, along with the L3 cache.
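The capacity split described above can be written down as a small sketch. The sizes come from the text; the function and variable names are purely illustrative and do not correspond to any Intel driver or hardware interface.

```python
KB = 1024

# Sizes taken from the text above.
ARRAY_BYTES = 512 * KB        # physical array shared by L3, URB and local memory
L3_BYTES = 256 * KB           # portion used as the L3 cache
SLM_BYTES_COMPUTE = 128 * KB  # shared local memory carved out in compute mode
SNB_GT2_URB_BYTES = 64 * KB   # Sandy Bridge GT2 URB, for comparison


def partition(compute_active: bool) -> dict:
    """Return the capacity of each region in bytes (hypothetical helper)."""
    explicit = ARRAY_BYTES - L3_BYTES  # space left for the explicit memories
    slm = SLM_BYTES_COMPUTE if compute_active else 0
    return {"l3": L3_BYTES, "urb": explicit - slm, "slm": slm}


def urb_growth_over_snb_gt2(compute_active: bool) -> float:
    """URB capacity relative to the 64KB Sandy Bridge GT2 URB."""
    return partition(compute_active)["urb"] / SNB_GT2_URB_BYTES
```

Under these numbers the URB is 4× the Sandy Bridge GT2 capacity in graphics mode and 2× in compute mode.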
The explicit memories in Ivy Bridge can sustain 64 reads in parallel, for a total bandwidth of 256B/cycle. This bandwidth is split between the URB and shared local memory in compute mode, and is entirely available to the URB in graphics mode. The performance of atomic operations has also improved by more than an order of magnitude, using the shared local memory. Sandy Bridge was limited to a single atomic per cycle, while the local memory in Ivy Bridge can execute 32 atomic operations every cycle. On some applications, Intel has seen performance gains of roughly 27× simply from better atomic handling. The URB is also used for scatter/gather, and significantly improves performance for irregular access patterns.
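The throughput figures above imply a few simple ratios, sketched here as back-of-the-envelope arithmetic. This is an idealized peak-rate model that ignores contention and address conflicts; the helper name is hypothetical, and the roughly 27× application gain quoted by Intel is a measured result that this sketch does not derive.

```python
# Figures taken from the text above.
PARALLEL_READS = 64
BANDWIDTH_BYTES_PER_CYCLE = 256
SNB_ATOMICS_PER_CYCLE = 1
IVB_ATOMICS_PER_CYCLE = 32

# Derived: each parallel read moves 4 bytes, i.e. one 32-bit word.
bytes_per_read = BANDWIDTH_BYTES_PER_CYCLE // PARALLEL_READS


def cycles_for_atomics(n_ops: int, atomics_per_cycle: int) -> int:
    """Minimum cycles to retire n_ops atomics at peak rate (ceiling division)."""
    return -(-n_ops // atomics_per_cycle)


n_ops = 1_000_000  # arbitrary illustrative workload size
snb_cycles = cycles_for_atomics(n_ops, SNB_ATOMICS_PER_CYCLE)
ivb_cycles = cycles_for_atomics(n_ops, IVB_ATOMICS_PER_CYCLE)
peak_speedup = snb_cycles / ivb_cycles  # upper bound from atomic throughput alone
```

The peak ratio of 32× is an upper bound on what improved atomic throughput alone could deliver, which is consistent with the measured gain being somewhat lower.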
The local memory in Ivy Bridge meets the basic needs of OpenCL and DirectCompute shaders, but the performance is limited compared to GPUs from AMD and Nvidia. In part, Intel is taking a gradual approach; currently compute shaders are a relatively small portion of the workload that will run on a GPU. As the importance of compute shaders grows over time though, Intel is likely to scale up the performance for local memory. It is quite possible that in future generations, the local memory will move from a globally shared structure to a more distributed resource to scale better with the number of shader cores.
The data port provides access for the Gen 7 GPU to the ring bus that ties together the entire Ivy Bridge chip. The design has been carried over from Sandy Bridge, although the cache sharing between the GPU and CPU has been subtly tweaked. The ring bus operates at the CPU frequency, which is roughly 3× the GPU clock. So from the GPU’s perspective, the data port gets 3×32B accesses per cycle.
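As a rough sketch of the arithmetic, the per-cycle figure follows directly from the clock ratio and the 32B access width. The GPU clock passed in below is a hypothetical input chosen for illustration, not a figure from the text, and the function names are illustrative.

```python
RING_BYTES_PER_RING_CYCLE = 32  # one 32B access per ring cycle (from the text)
RING_TO_GPU_CLOCK_RATIO = 3     # ring runs at about 3x the GPU clock (from the text)


def data_port_bytes_per_gpu_cycle(ratio: int = RING_TO_GPU_CLOCK_RATIO,
                                  width: int = RING_BYTES_PER_RING_CYCLE) -> int:
    """Bytes the data port can move per GPU clock cycle."""
    return ratio * width


def data_port_gb_per_s(gpu_clock_ghz: float) -> float:
    """Peak data port bandwidth in GB/s for an assumed GPU clock in GHz."""
    return data_port_bytes_per_gpu_cycle() * gpu_clock_ghz
```

For example, `data_port_bytes_per_gpu_cycle()` gives 96 bytes, and an assumed 1.0 GHz GPU clock would correspond to a peak of 96 GB/s at the data port.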
The only real changes were in the arbitration techniques for the LLC. Previously in Sandy Bridge, the driver allocated 128KB ways in the cache to the GPU. Reserving part of the LLC for the GPU and part for the CPU was a straightforward way to provide basic quality-of-service guarantees. After analyzing a great deal of data though, Intel’s architects determined that this did relatively little to improve performance. Workloads seemed to be bottlenecked by either the CPU or the GPU, but not both. The conclusion was that the complexity of modifying the LLC replacement algorithms was not worth the benefits, and Ivy Bridge has a simple dynamic sharing scheme that is based on demand.
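The difference between the two policies can be illustrated with a toy model. This is a minimal sketch assuming a single fully associative LRU structure and a synthetic access trace; Intel's actual replacement policy, way counts and workloads are not described in the text, so every name and parameter below is hypothetical.

```python
from collections import OrderedDict


class LRU:
    """Tiny fully associative LRU cache that tracks tags only."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, evict the LRU line if full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)
        self.lines[tag] = None
        return False


def make_trace(cpu_lines: int, gpu_lines: int, rounds: int) -> list:
    """Each round, the CPU sweeps its working set, then the GPU sweeps its own."""
    trace = []
    for _ in range(rounds):
        trace += [("cpu", i) for i in range(cpu_lines)]
        trace += [("gpu", i) for i in range(gpu_lines)]
    return trace


def hits_partitioned(trace, cpu_ways: int, gpu_ways: int) -> int:
    """Static split: each agent only ever uses its reserved ways."""
    caches = {"cpu": LRU(cpu_ways), "gpu": LRU(gpu_ways)}
    return sum(caches[agent].access(tag) for agent, tag in trace)


def hits_shared(trace, total_ways: int) -> int:
    """Demand-based sharing: both agents compete for all ways."""
    cache = LRU(total_ways)
    return sum(cache.access((agent, tag)) for agent, tag in trace)


# A CPU-heavy workload: 12 CPU lines and 2 GPU lines, repeated 10 times.
trace = make_trace(cpu_lines=12, gpu_lines=2, rounds=10)
partitioned = hits_partitioned(trace, cpu_ways=8, gpu_ways=8)
shared = hits_shared(trace, total_ways=16)
```

In this contrived trace the CPU's working set thrashes its 8 reserved ways while most of the GPU's reservation sits idle, whereas the shared cache holds all 14 lines. It only illustrates why a fixed reservation can waste capacity when one agent dominates; it says nothing about the magnitude of the effect on real workloads.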
For a system where the CPU and GPU are mostly separate entities, this seems reasonable. However, as the two become more tightly integrated, heterogeneous applications will become more common. When workloads start to share data between the two in a fine-grained manner, quality-of-service will become much more important. While allocating the LLC at way granularity may be the wrong approach, this is an area where there is no real industry consensus, and it deserves considerable research going forward.