Introduction to Llano
When AMD acquired ATI in 2006, the goal was to unify microprocessors, graphics and chipsets in a single company that could truly offer a complete platform. It took roughly 5 years for the first products integrating the CPU and GPU, which AMD billed as Fusion, to arrive.
AMD’s Fusion is a flexible architecture that currently has two incarnations. In January 2011, AMD released Zacate, which aims for low power rather than high performance. The low-power Bobcat CPU is integrated with a low-end DX11 GPU on TSMC’s 40nm process. The resulting chips range from 9-18W and are suitable for low-end or power-sensitive notebooks, netbooks and desktops. Ostensibly there are 6W versions coming in the near future for tablets, but that power consumption seems too high, especially given that a separate I/O hub is required. Tablets based on a 28nm shrink, however, seem eminently reasonable.
Llano is AMD’s performance-oriented Fusion microprocessor, intended for the mainstream notebook and desktop market. The chip (shown in Figure 1) is 228mm² and fabricated in Global Foundries’ 32nm SOI high-k/metal gate process, which uses a gate-first technique. It is the first time that AMD has implemented a PC graphics processor in SOI, which is a fairly substantial change from the traditional ASIC-oriented TSMC bulk process. AMD’s architects chose to rely on older and proven components that could be smoothly migrated to the 32nm process with minimal risk. The real emphasis was on the integration of these components: the programming model, overall system architecture and power consumption.
Figure 1 – Llano Die Photo
Notebook versions of Llano were launched as the Fusion A-series in June of 2011. Rumors suggest that yield problems with the gate-first manufacturing process at Global Foundries delayed products by 3 months or more, which would also explain why no desktop variants of Llano are currently available. Notebooks are by far the more lucrative and important market. This article describes the Llano microprocessor, first discussing the key components, then AMD’s Fusion architecture for CPU/GPU integration, and last the results for Llano products and expected future evolution.
Llano is a quad-core microprocessor that retains the Family 10h microarchitecture (which includes Barcelona, Shanghai, Istanbul and Magny-Cours). There are nearly a half dozen minor tweaks to improve performance that cumulatively add about 6% to IPC, according to AMD architects. However, major changes such as AVX support will have to wait for Bulldozer.
The most significant improvement in Llano is the larger 1MB, 16-way associative L2 caches. Essentially, AMD opted to eliminate the shared L3 cache and instead make the private L2 caches larger. The L1D cache is implemented with 8T cells for low voltage operation and reliability (the core is designed for 0.8V-1.3V). The number of outstanding memory requests from each core remained the same – 8 pending misses, and the write combining buffer (WCB – used for stores to uncacheable memory) is still 4 cache lines.
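As a rough illustration of what a 1MB, 16-way associative cache implies, the sketch below derives the number of sets and the address split. The 64-byte line size is an assumption that the article does not state, and the function names are purely illustrative.

```python
# Geometry of a 1MB, 16-way set-associative L2 cache.
# LINE_BYTES = 64 is an assumption; the article does not give the line size.
L2_BYTES = 1 << 20
WAYS = 16
LINE_BYTES = 64


def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits = cache_geometry(L2_BYTES, WAYS, LINE_BYTES)
wcb_bytes = 4 * LINE_BYTES  # write combining buffer: 4 cache lines

print(sets, offset_bits, index_bits, wcb_bytes)
```

Under that line-size assumption, each L2 has 1024 sets, and the 4-line write combining buffer holds 256 bytes.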
There are also modest improvements to the execution resources within each core. The renaming window grew slightly from 72 to 84 macro-ops. The schedulers received more attention and improvement. The integer scheduler went from 24 to 30 entries, while the FP scheduler is up to 42 entries. A new divider was added to the third integer pipeline, and certain FP instructions execute faster.
Perhaps most importantly, Llano’s power management is a huge step forward. Each core and L2 cache is power gated, using a ring of NFETs. The cores also have a dynamic voltage and frequency system (DVFS) that uses over 100 performance counters to estimate the power consumption and adjust the frequency with digital dividers.
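To make the idea of counter-based power estimation concrete, here is a minimal sketch of one way such a scheme could work. It is not AMD's implementation. The linear model, the counter names, the weights and the divider values are all hypothetical, and the sketch assumes dynamic power scales linearly with frequency at a fixed voltage.

```python
# Hypothetical sketch of counter-driven frequency selection.
# Counter names, weights and divider values are illustrative assumptions,
# not figures from AMD.

def estimate_power(event_rates, weights, static_watts):
    """Linear model: static power plus a weighted sum of event rates."""
    dynamic = sum(weights[name] * rate for name, rate in event_rates.items())
    return static_watts + dynamic


def pick_divider(power_at_full_speed, static_watts, budget_watts, dividers):
    """Return the smallest clock divider whose estimated power fits the budget.

    Simplification: dividing the clock by d divides dynamic power by d
    (voltage is held constant). Falls back to the largest divider.
    """
    dynamic = power_at_full_speed - static_watts
    for d in sorted(dividers):
        if static_watts + dynamic / d <= budget_watts:
            return d
    return max(dividers)


# Events per second for three made-up counters, and watts per (event/second).
rates = {"uops_retired": 3.0e9, "l2_accesses": 2.0e8, "fp_ops": 5.0e8}
weights = {"uops_retired": 2.0e-9, "l2_accesses": 5.0e-9, "fp_ops": 4.0e-9}

power = estimate_power(rates, weights, static_watts=1.0)
divider = pick_divider(power, 1.0, budget_watts=7.5, dividers=[1, 1.5, 2, 4])
print(round(power, 2), divider)
```

With these made-up numbers, the core is estimated at about 10W at full speed, so a 7.5W budget selects the 1.5 divider. A real controller would also adjust voltage, which this sketch ignores.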
Each core is 17.7mm² including the L2 cache and power gating, or roughly 71mm² for all four. AMD is actually spending about the same area on its CPU cores as Intel does in Sandy Bridge (which uses ~74mm² for 4 cores). However, Intel also uses another 43mm² for the 8MB L3 cache and ring interconnect.
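The area comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this article; the variable names are illustrative.

```python
# Area figures quoted in the article, in mm^2.
LLANO_CORE_MM2 = 17.7      # one core, including L2 and power gating
LLANO_CORES = 4
LLANO_DIE_MM2 = 228.0
SNB_CORES_MM2 = 74.0       # Sandy Bridge, 4 cores (approximate)
SNB_L3_RING_MM2 = 43.0     # Sandy Bridge 8MB L3 plus ring interconnect

llano_cpu_mm2 = LLANO_CORE_MM2 * LLANO_CORES
snb_cpu_with_l3_mm2 = SNB_CORES_MM2 + SNB_L3_RING_MM2
llano_cpu_share = llano_cpu_mm2 / LLANO_DIE_MM2

print(f"Llano cores + L2:       {llano_cpu_mm2:.1f} mm^2")
print(f"Sandy Bridge cores:     {SNB_CORES_MM2:.1f} mm^2")
print(f"Sandy Bridge with L3:   {snb_cpu_with_l3_mm2:.1f} mm^2")
print(f"Llano CPU share of die: {llano_cpu_share:.0%}")
```

The four Llano cores with their L2 caches come to about 70.8mm², or roughly 31% of the die. Intel's cores alone take about 74mm², and about 117mm² once the L3 and ring are counted.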
Llano’s GPU is clearly the focal point of AMD’s efforts, which is natural given their expertise in graphics. The GPU occupies about 35% of the total die area, nearly 80mm². While few changes were made to the graphics hardware, the fixed function media block was updated to use the more modern UVD3.
The GPU is based on Cypress (rather than the newer Cayman) and the fixed function graphics pipeline can set up 1 triangle per clock, as with its larger desktop cousins. The programmable shaders in Cypress are the older VLIW5 design, with 4 general purpose 32-bit execution pipelines and a fifth for special functions (e.g. transcendentals). Each core in the GPU contains a 16-wide SIMD array of these VLIW5 shaders, plus an 8KB L1 texture cache with filtering hardware and an explicitly addressed 32KB local data share. The shader array is roughly one quarter the size of the high-end Cypress, with 5 SIMDs and a total of 400 single precision execution pipelines. As with Cypress, each SIMD is shared between 8 work-groups (up to 32 wavefronts), for a total of 40 work-groups across the whole GPU.
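The shader counts above follow directly from the per-SIMD figures, as the short sketch below shows. The 20-SIMD configuration for the full Cypress is not stated in this article and is included as an outside assumption to check the "one quarter" comparison.

```python
# Per-SIMD figures from the article.
SIMDS = 5
LANES_PER_SIMD = 16        # 16-wide SIMD array of VLIW5 shaders
PIPES_PER_VLIW5 = 5        # 4 general purpose + 1 special function
WORKGROUPS_PER_SIMD = 8
L1_TEXTURE_KB_PER_SIMD = 8
LDS_KB_PER_SIMD = 32

# Assumption (not from the article): the full Cypress has 20 SIMDs.
CYPRESS_SIMDS = 20

pipelines = SIMDS * LANES_PER_SIMD * PIPES_PER_VLIW5
workgroups = SIMDS * WORKGROUPS_PER_SIMD
cypress_pipelines = CYPRESS_SIMDS * LANES_PER_SIMD * PIPES_PER_VLIW5
fraction_of_cypress = pipelines / cypress_pipelines
total_l1_texture_kb = SIMDS * L1_TEXTURE_KB_PER_SIMD
total_lds_kb = SIMDS * LDS_KB_PER_SIMD

print(pipelines, workgroups, fraction_of_cypress)
print(total_l1_texture_kb, total_lds_kb)
```

This gives 400 pipelines and 40 work-groups, one quarter of the assumed Cypress configuration, with 40KB of L1 texture cache and 160KB of local data share in total.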
The memory partitions are also derived from Cypress, but significantly scaled back. There are two memory partitions, each containing a 64KB slice of the L2 texture cache, plus the 4KB Z cache, 16KB color caches and associated ROPs.
One of the unique benefits of using a scaled down Cypress GPU is that the integrated graphics in Llano can theoretically be coupled to a discrete card for a sort of hybrid Crossfire. Llano has 3 PCI-E gen 2 links, which are each 1B (8 lanes) wide, so there is room for a x16 graphics card and another I/O device to connect directly. There are limits though: the performance of the two GPUs must be within a factor of ~2-4, otherwise the imbalance becomes too difficult for the driver to manage.
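For a sense of the bandwidth involved, the sketch below works out per-direction throughput for these links. It assumes that "1B wide" means 8 lanes per link, which is an interpretation rather than something the article spells out, and it uses the standard PCI-E gen 2 figures of 5 GT/s per lane with 8b/10b encoding.

```python
# PCI-E gen 2: 5 GT/s per lane, 8b/10b encoding (8 data bits per 10 transferred).
TRANSFERS_PER_S = 5_000_000_000
DATA_BITS_PER_10 = 8

# Assumption: "1B wide" is read as 8 lanes per link.
LANES_PER_LINK = 8
LINKS = 3


def lane_bytes_per_s():
    """Usable bytes per second per lane, in one direction."""
    data_bits = TRANSFERS_PER_S * DATA_BITS_PER_10 // 10
    return data_bits // 8


def link_gb_per_s(lanes):
    """Usable GB/s (10^9 bytes) in one direction for a link of the given width."""
    return lanes * lane_bytes_per_s() / 1e9


total_lanes = LINKS * LANES_PER_LINK
lanes_left_after_x16 = total_lanes - 16

print(link_gb_per_s(8), link_gb_per_s(16), total_lanes, lanes_left_after_x16)
```

Under that reading, a x16 graphics card takes two of the three links at 8 GB/s per direction, leaving one x8 link at 4 GB/s for another device.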