In the Fusion architecture, traditional CPU and GPU memory requests behave as expected: CPU requests are optimized for latency, while GPU accesses are tuned for high throughput. The real changes concern cross-domain accesses (e.g. the CPU accessing GPU memory and vice versa), which mostly go through memory and eliminate copying data between separate address spaces in DRAM. The communication between the two devices still strongly emphasizes the historical pipeline, where data flows from the CPU to the GPU, rather than a truly peer-to-peer model with data moving in both directions.
The CPU sees GPU memory as uncacheable, so stores are sent to the write combining buffers (WCBs). When data is flushed from the WCBs, it crosses the Onion bus to the GPU, which can write it to memory at roughly 8GB/s, compared with around 6GB/s over PCI-Express to a discrete card.
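The benefit of write combining can be sketched with a toy model: individual byte stores to an uncacheable region are coalesced into cacheline-sized bursts, so the bus sees a handful of large transactions instead of hundreds of small ones. This is purely illustrative; the class, buffer count, and sizes are assumptions, not AMD's implementation.

```python
LINE_SIZE = 64  # burst size in bytes (illustrative)

class WriteCombiningBuffer:
    """Toy model of one WCB: coalesces stores to a 64-byte line."""
    def __init__(self):
        self.line_base = None          # base address of the open line
        self.data = {}                 # offset -> byte value
        self.bus_transactions = 0      # bursts flushed across the bus

    def store(self, addr, value):
        base = addr & ~(LINE_SIZE - 1)
        if self.line_base is not None and base != self.line_base:
            self.flush()               # a store to a new line evicts the current one
        self.line_base = base
        self.data[addr - base] = value

    def flush(self):
        if self.line_base is not None:
            self.bus_transactions += 1  # one burst write over the bus
            self.line_base, self.data = None, {}

wcb = WriteCombiningBuffer()
for i in range(256):                    # 256 sequential byte stores
    wcb.store(i, i & 0xFF)
wcb.flush()
print(wcb.bus_transactions)             # 4 bursts instead of 256 single writes
```

The same coalescing is why streaming writes to the frame buffer can sustain high bandwidth despite the memory being uncacheable.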
Figure 3 – Llano and Zacate
The CPU is not intended to read from GPU memory, and performance is singularly poor because of the necessary synchronization. The CPU regards the frame buffer as uncacheable memory and must first use the Onion bus to flush pending GPU writes to memory; only after all pending writes have cleared can the CPU read safely proceed. Moreover, only a single such transaction may be in flight at a time, which further degrades performance for this type of communication.
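A back-of-envelope calculation shows why a single outstanding transaction is so damaging: effective bandwidth is bounded by the transfer size divided by the round-trip latency. The latency and transfer size below are illustrative assumptions, not measured Llano figures.

```python
# With only one uncacheable read in flight, each read must fully
# complete (including flushing pending GPU writes) before the next
# can issue, so bandwidth = transfer_size / round_trip_latency.

def serialized_read_bandwidth(transfer_bytes, latency_s):
    """Bytes/s when reads cannot be pipelined."""
    return transfer_bytes / latency_s

line = 64           # bytes per uncacheable read (assumed)
latency = 500e-9    # 500 ns round trip: flush GPU writes, then read (assumed)

bw = serialized_read_bandwidth(line, latency)
print(f"{bw / 1e6:.0f} MB/s")   # ~128 MB/s, orders of magnitude below the 8GB/s write path
```

Even generous latency assumptions leave CPU reads of GPU memory far slower than the write path, which is why this direction of communication is discouraged.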
The GPU can access uncacheable system memory over the Garlic bus, but the memory must be pinned, since there is no demand-based paging for graphics (yet). System memory is generally slower than frame buffer memory because there is no interleaving (12GB/s versus 17GB/s for the frame buffer). However, it is substantially faster than accessing cacheable shared memory, since there is no coherency overhead. For example, this approach could be used to read in data from the CPU to start an OpenCL kernel on the GPU.
Cacheable system memory is also accessible to the GPU, through the Onion bus. As with Garlic accesses, the memory must be pinned ahead of time. First, the memory request is sent from the GPU over Onion to the coherent domain. Once the request is in the ordering queue, it issues coherency probes to the CPU cores. If a probe hits in an L1 or L2 cache, the requested cache lines are sent directly back to the GPU over Onion, without ever touching memory. However, this is infrequent, because the L2 caches tend to quickly evict data that is not regularly used by the CPU cores. In the common case, the coherency probe indicates that the data is not cached and that memory holds the most recent version; the memory access then proceeds, and any returned data is sent to the GPU through the Onion bus.
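The probe-then-fallback flow described above can be sketched in a few lines. The function and names are illustrative stand-ins, not AMD's protocol terminology.

```python
# Toy sketch of a coherent GPU read over the Onion bus: probe the CPU
# caches first; on a hit, forward the line directly from cache, otherwise
# fall through to DRAM (the common case).

def coherent_read(addr, cpu_caches, dram):
    """Return (data, source) for a GPU read in the coherent domain."""
    for cache in cpu_caches:              # issue coherency probes to each core
        if addr in cache:                 # probe hit in an L1/L2
            return cache[addr], "cache"   # forwarded over Onion, no DRAM access
    return dram[addr], "dram"             # memory holds the most recent copy

l2 = {0x1000: b"hot"}                     # line recently written by a CPU core
dram = {0x1000: b"stale", 0x2000: b"cold"}

print(coherent_read(0x1000, [l2], dram))  # (b'hot', 'cache')
print(coherent_read(0x2000, [l2], dram))  # (b'cold', 'dram')
```

Note that every request pays for the probes even when they miss, which is the coherency overhead that makes the Onion path slower than Garlic for streaming data.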
In contrast, Sandy Bridge simplifies the memory controller and relies on the L3 cache for integration. The GPU and each CPU core sit on the 512-bit ring that operates at the core clock speed and connects to the L3 cache and memory controller. A portion of the L3 cache can be allocated to the GPU and accessed with very high bandwidth (and relaxed consistency). For example, the L3 cache is used for spilling register data from threads in the GPU. Generally, the GPU just looks like another coherency agent on the bus fabric – although precise details have not been released since the Sandy Bridge GPU is not fully programmable.
Sandy Bridge GPU accesses to system memory are fairly simple: as they traverse the ring bus, they probe the inclusive L3 cache (or flush any pending uncacheable writes as needed) to maintain coherency. However, most GPU requests (e.g. texture reads) bypass the L3 cache altogether and go straight to the memory controller. The memory controller does not appear to distinguish between CPU and GPU accesses or schedule them separately. While this leaves some performance on the table compared to AMD's optimizations, it also substantially simplifies the design of the memory controller.
The Sandy Bridge CPU cores can access data and synchronize with the GPU's portion of the L3 cache. For example, the CPU can write graphics commands into the GPU's L3 cache, which the GPU then reads. The GPU can also explicitly flush data back to the L3 cache for the CPU to access with very high performance (e.g. for offloading work from the GPU to the CPU). Passing data between the GPU and CPU through the cache (instead of memory) is one area where Intel's GPU integration is substantially ahead of AMD. In many respects, the communication model is much more bi-directional than the traditional one-way flow of a graphics pipeline.
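The bi-directional round trip through the shared L3 can be illustrated with a trivial producer/consumer sketch: the CPU deposits a command in the GPU's cache partition, the GPU consumes it and flushes a result back, and the CPU picks the result up, all without a trip to DRAM. This is purely conceptual; none of these names correspond to a real Intel interface.

```python
# Conceptual model of CPU<->GPU communication through a shared L3 slice.

shared_l3 = {}                         # stands in for the GPU's L3 partition

def cpu_submit(cmd):
    shared_l3["command"] = cmd         # CPU writes a command into the cache

def gpu_execute():
    cmd = shared_l3.pop("command")     # GPU reads the command from cache...
    shared_l3["result"] = cmd.upper()  # ...and flushes a result back to it

def cpu_collect():
    return shared_l3.pop("result")     # CPU reads the result from cache

cpu_submit("draw triangles")
gpu_execute()
print(cpu_collect())                   # DRAW TRIANGLES
```

Contrast this with the Fusion model above, where the equivalent exchange would bounce through DRAM over the Garlic and Onion buses.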