The most novel and interesting part of Llano is not the CPU or the GPU. Both of those components were re-used, specifically to avoid complexity and the associated risks. The software and physical integration is the key to Fusion, and the area where AMD focused the most energy. Incidentally, this architecture is shared across both Llano and the netbook-oriented processors based on Bobcat.
At boot, Llano partitions the physical memory into two regions – up to 512MB of graphics memory, with the rest serving as system memory (for the CPU). The CPU and GPU have separate virtual memory systems: the CPU uses demand-based paging, while GPU paging is software scheduled (by the driver or OS). However, the OS can pin memory translations in both domains to simplify passing data.
The CPU’s cacheable memory relies on AMD’s MOESI protocol, and has the standard x86 consistency semantics with strong ordering of memory references. The CPU’s uncacheable memory also has the same behavior as before. The GPU memory region behaves in a totally different fashion. By default it has relaxed consistency (and thus is not x86 coherent), so that loads and stores can be freely re-ordered for higher memory bandwidth.
The GPU can send coherent memory requests to CPU memory that is pinned, but the CPU relies on the driver and explicit synchronization to communicate with the GPU. Coherent requests from the GPU are substantially lower performance, but are the exception rather than the rule. The net effect is a model where data can be moved through memory without copying and the GPU still maintains good performance by default. Note that Zacate has the same programming model, albeit with lower performance targets.
Fusion ties together the CPU and GPU through the northbridge and memory controller, and most data is passed through memory. The power for the two components is also separately managed. Figure 2 compares the integration in Fusion (specifically Llano) and Sandy Bridge.
Figure 2 – Fusion and Sandy Bridge CPU and GPU integration
The Fusion GPUs have a dedicated non-coherent interface to the memory controller (the Radeon Memory Bus or Garlic, shown with a dotted line) for commands and data. The bus is 256 bits (32B) wide in each direction and is replicated for each memory channel (2x32B read and 2x32B write for Llano, half for Zacate). Garlic operates on the Northbridge clock – up to 720MHz for notebook versions of Llano and 492MHz for Zacate. On Llano, this is a factor of 2-3X more bandwidth than memory can provide (roughly 17GB/s measured), which is needed to handle bursts of memory transactions (e.g. texture reads).
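The Garlic figures above can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes one transfer per Northbridge clock on each 32B link, and the helper name `bus_bandwidth_gbs` is made up for illustration.

```python
def bus_bandwidth_gbs(width_bytes, clock_mhz, links=1):
    """Peak one-direction bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes one transfer per clock on every link, which is an
    assumption of this sketch and not a figure from AMD.
    """
    return width_bytes * clock_mhz * 1e6 * links / 1e9

# Llano: one 32B read link per memory channel, two channels, 720MHz.
llano_garlic_read = bus_bandwidth_gbs(32, 720, links=2)

# Zacate: half the width of Llano (a single 32B link), 492MHz.
zacate_garlic_read = bus_bandwidth_gbs(32, 492, links=1)

# Measured memory bandwidth quoted in the text for Llano.
measured_memory_gbs = 17.0
headroom = llano_garlic_read / measured_memory_gbs

print(f"Llano Garlic read:  {llano_garlic_read:.2f} GB/s")
print(f"Zacate Garlic read: {zacate_garlic_read:.2f} GB/s")
print(f"Llano headroom over measured memory: {headroom:.1f}x")
```

Under these assumptions the Llano read path peaks at about 46GB/s, which is where the 2-3X margin over the measured 17GB/s comes from.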
The GPU has a separate interface for sending memory requests that target the coherent system memory. The Fusion Control Link (or Onion) is a 128-bit (16B) bi-directional bus that feeds into a memory ordering queue (the IFQ) shared with the coherent requests from each of the 4 cores. Onion runs at up to 650MHz for notebook variants of Llano (10.4GB/s read + 10.4GB/s write) and 492MHz for Zacate. An arbiter in the IFQ is responsible for selecting coherent requests (based on memory ordering) to send to the memory controller. Desktop versions of Llano will probably run Garlic and Onion faster still, given the extra power budget.
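For comparison, the Onion numbers work out as follows. This is a rough sketch that assumes one 16B transfer per clock in each direction; the Garlic figure it compares against is derived from the 2x32B read links at 720MHz described earlier, under the same one-transfer-per-clock assumption.

```python
ONION_WIDTH_BYTES = 16  # 128-bit bus, per direction (from the text)

def onion_gbs(clock_mhz):
    """Peak Onion bandwidth in one direction, in GB/s.

    Assumes one transfer per clock (an assumption of this sketch).
    """
    return ONION_WIDTH_BYTES * clock_mhz * 1e6 / 1e9

llano_onion = onion_gbs(650)   # notebook Llano
zacate_onion = onion_gbs(492)  # Zacate

# Garlic read path on notebook Llano: 2 links x 32B x 720MHz.
llano_garlic_read = 2 * 32 * 720e6 / 1e9

# Fraction of the non-coherent read bandwidth that the coherent path offers.
coherent_share = llano_onion / llano_garlic_read

print(f"Llano Onion:  {llano_onion:.1f} GB/s each way")
print(f"Zacate Onion: {zacate_onion:.2f} GB/s each way")
print(f"Onion / Garlic read on Llano: {coherent_share:.0%}")
```

The coherent path offers roughly a quarter of the peak read bandwidth of the non-coherent path on Llano, consistent with coherent GPU requests being the slower, exceptional case.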
The memory controller arbitrates between coherent (i.e. ordered) and non-coherent accesses to memory. Llano has two 64-bit channels of DDR3 memory that must operate independently, while the smaller Fusion cousin only has a single channel. The GPU memory is interleaved across both channels for maximum streaming bandwidth, and requests close DRAM pages after an access. In contrast, system memory is optimized for latency and locality; contiguous requests tend to stay on one memory channel and keep DRAM pages open. The memory can run at up to 1.86GT/s for a total of 29.8GB/s memory bandwidth on Llano. The memory controller also contains an improved hardware prefetcher that tracks 8 different strides or sequences of strides and speculatively fills into the memory controller (rather than the caches).
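The difference between the two channel-mapping policies can be illustrated with a toy model. The sketch below is purely illustrative: the interleave granularity (`GRANULE`) and the per-channel span (`CHANNEL_SPAN`) are hypothetical values chosen for the example, not figures from AMD. It also checks the quoted peak DRAM bandwidth.

```python
CHANNELS = 2              # Llano has two 64-bit DDR3 channels (from the text)
GRANULE = 256             # hypothetical interleave granularity, in bytes
CHANNEL_SPAN = 1 << 30    # hypothetical contiguous span per channel, in bytes

def gpu_channel(addr):
    """Fine-grained interleave: consecutive granules alternate channels,
    so a streaming access pattern keeps both channels busy."""
    return (addr // GRANULE) % CHANNELS

def system_channel(addr):
    """Coarse mapping: a contiguous address range stays on one channel,
    which favors locality and open DRAM pages."""
    return (addr // CHANNEL_SPAN) % CHANNELS

# A short streaming access pattern, one request per granule.
stream = range(0, 8 * GRANULE, GRANULE)
gpu_map = [gpu_channel(a) for a in stream]
sys_map = [system_channel(a) for a in stream]

# Peak DRAM bandwidth: channels x 8 bytes per transfer x 1.86 GT/s.
peak_gbs = CHANNELS * 8 * 1.86

print("GPU region channels:   ", gpu_map)
print("System region channels:", sys_map)
print(f"Peak DRAM bandwidth: {peak_gbs:.1f} GB/s")
```

In this model the GPU stream alternates between the two channels on every request, while the same stream in system memory stays on a single channel.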
In contrast, Sandy Bridge has tighter integration – using the on-die ring interconnect and L3 cache. Data is passed through the ring interconnect, but can be shared either through the cache or memory. The ring interconnect is 32B wide with 6 agents and operates at the core frequency (>3GHz). Data usually comes from either the 4 slices of the L3 cache or the memory controller, which resides in the system agent. The peak bandwidth is over 400GB/s, but the practical bandwidth is lower, since many accesses have to go through multiple stops on the ring interconnect. The Sandy Bridge power management is also fully unified for both CPU and GPU, so that when one is idle, the other may ‘borrow’ the thermal headroom.
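One plausible way to arrive at the 400GB/s figure is sketched below. It assumes that each of the 4 L3 slices can deliver 32B onto the ring every cycle at the same time; that reading is an assumption of this sketch, not something stated by Intel here, and the function names are made up for illustration.

```python
RING_WIDTH_BYTES = 32   # ring width (from the text)
L3_SLICES = 4           # number of L3 cache slices (from the text)

def ring_peak_gbs(clock_ghz, sources=L3_SLICES, width=RING_WIDTH_BYTES):
    """Aggregate peak bandwidth in GB/s, assuming every source delivers
    one full-width transfer per cycle (an assumption of this sketch)."""
    return sources * width * clock_ghz

def min_clock_ghz(target_gbs, sources=L3_SLICES, width=RING_WIDTH_BYTES):
    """Clock needed for the aggregate peak to reach target_gbs."""
    return target_gbs / (sources * width)

peak_at_3ghz = ring_peak_gbs(3.0)
clock_for_400 = min_clock_ghz(400.0)

print(f"Peak at 3.0GHz: {peak_at_3ghz:.0f} GB/s")
print(f"Clock needed for 400 GB/s: {clock_for_400:.3f} GHz")
```

Under this assumption, the aggregate peak passes 400GB/s once the ring clock exceeds about 3.1GHz, which fits a ring running at core frequencies above 3GHz.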