Llano Power Management Update
AMD’s first presentation at Hot Chips concerned Llano, the company’s first mainstream Fusion product, which is already on the market. It is well known that Llano has industry-leading integrated graphics, reasonable media decoding and relatively poor CPU performance. Most of the new information about Llano focused on the power management system and was a very pleasant surprise. Incidentally, this was back-to-back with a presentation on Sandy Bridge’s power management, making for an interesting contrast. The two designs use conceptually similar techniques and model power consumption as the sum of dynamic and static power. Dynamic power consumption is calculated based on digital activity measurements, which produces fairly accurate estimates. Both use a dynamic voltage and frequency scaling (DVFS) system that can increase the power draw beyond the TDP (thermal limit) for a short period of time.
One of the early concerns we discussed in our Fusion and Llano analysis was that AMD’s power management seemed relatively primitive. For example, AMD had not mentioned dynamically shifting power between the CPU cores and GPU, or exceeding the chip TDP. In fact, AMD had been very open that the cores and GPU all have individual TDPs and power draw limitations. This is far from ideal, as power and thermal management are package-level issues and there is a risk of missing the forest for the trees. It turns out that these concerns were overblown and that AMD has a viable approach to managing system power, despite several challenges.
Power Challenges and Solutions
The first complication is that Llano does not collect package-level power measurements. Instead it relies on digital activity measurements and estimates of individual components (e.g. cores, GPU, northbridge). This makes it complicated to determine how close Llano is operating to the true thermal and power limits and to optimize across all the different components, including the CPUs, GPU, memory controllers and I/O. Llano calculates the energy margin by summing the estimated power of the CPUs and GPU and comparing it against the chip-level TDP over time to see if there is available headroom. Periods of inactivity can therefore create headroom for future activity.
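The energy margin calculation described above can be sketched as a simple accumulator. This is only an illustration of the idea, not AMD's controller: the class name, the sampling interval, the cap on banked headroom and all of the wattages are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EnergyMarginTracker:
    """Hypothetical model of headroom accumulated against a chip-level TDP."""
    tdp_watts: float
    max_margin_joules: float  # assumed cap on how much headroom can be banked
    margin_joules: float = 0.0

    def update(self, cpu_watts: float, gpu_watts: float, dt_seconds: float) -> float:
        """Fold one sampling interval of estimated CPU and GPU power into the margin."""
        estimated = cpu_watts + gpu_watts
        # Running below TDP banks energy; running above TDP spends it.
        self.margin_joules += (self.tdp_watts - estimated) * dt_seconds
        # Clamp so the margin never goes negative or exceeds the assumed cap.
        self.margin_joules = max(0.0, min(self.margin_joules, self.max_margin_joules))
        return self.margin_joules

    def can_boost(self) -> bool:
        """Boosting is allowed only while some banked headroom remains."""
        return self.margin_joules > 0.0


tracker = EnergyMarginTracker(tdp_watts=35.0, max_margin_joules=10.0)
tracker.update(cpu_watts=10.0, gpu_watts=5.0, dt_seconds=0.1)   # quiet interval banks headroom
tracker.update(cpu_watts=30.0, gpu_watts=15.0, dt_seconds=0.1)  # busy interval spends some of it
```

A quiet interval raises the margin and a busy one draws it down, which is the sense in which inactivity creates headroom for later activity.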
To manage power effectively, Llano has a system that relies on trading power and thermal credits between the CPU and GPU. When one block is idle, other components can borrow a fixed amount of power and raise their individual TDPs to run faster. This technique of shifting ‘power credits’ between the CPU cores and GPU is how AMD balances system power and performance. Incidentally, the Llano power controller prioritizes the GPU ahead of the CPU.
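As a rough sketch of the credit idea, the function below lends a fixed amount of power from an idle block to an active one while keeping the sum of the two limits constant. The function name, the idle flags and the wattages are hypothetical. The GPU's priority over the CPU is not modelled, since it is not clear from the presentation how contention between two active blocks is resolved.

```python
def lend_credits(cpu_tdp: float, gpu_tdp: float,
                 cpu_idle: bool, gpu_idle: bool,
                 credit_watts: float) -> tuple[float, float]:
    """Return (cpu_limit, gpu_limit) after an idle block lends a fixed credit.

    Hypothetical model: the credit moves only when exactly one block is idle,
    and a block never lends more than its own TDP.
    """
    cpu_limit, gpu_limit = cpu_tdp, gpu_tdp
    if cpu_idle and not gpu_idle:
        credit = min(credit_watts, cpu_tdp)
        cpu_limit -= credit
        gpu_limit += credit
    elif gpu_idle and not cpu_idle:
        credit = min(credit_watts, gpu_tdp)
        gpu_limit -= credit
        cpu_limit += credit
    # Both active or both idle: each block keeps its own TDP.
    return cpu_limit, gpu_limit


# CPU-bound workload with an idle GPU: the CPU limit rises by the credit.
cpu_limit, gpu_limit = lend_credits(20.0, 15.0, cpu_idle=False, gpu_idle=True, credit_watts=5.0)
```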
One novel trick is that Llano’s algorithms recognize that idle blocks in the system will act as indirect thermal conduits for active components. Heat will flow from hotter (active) to cooler (idle) regions in the chip and the idle region will still dissipate heat into the heat sink, effectively creating a greater surface area for cooling. The credit-based power management and especially the indirect thermal dissipation are substantially more advanced than we had described in our earlier article on Llano and it is important to acknowledge this oversight.
Both the CPU cores and GPU are dynamically power gated. The CPU cores and L2 caches use the C6 state and a ring of power gates around each block. The wake-up latency for the Llano CPU cores is reported to be 30 microseconds, which will presumably improve in future generations. The rest of the system uses distributed power gates, rather than a peripheral ring. The graphics driver sets an idle threshold; when it is exceeded, the GPU will dynamically power gate itself off. The GPU memory controller (separate from the DDR3 controller) will dynamically power gate if the DRAM is in self-refresh. The video decoding block and the PCI-E x16 lanes are statically power gated by the graphics driver and BIOS respectively. There is also a package-level power gating state (PC6), which lowers or removes the power supply and takes roughly 100 microseconds to wake up. Timer and system interrupts are monitored so that the entry and exit latency of power gating does not substantially reduce performance.
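One way to picture the interrupt monitoring is as a guard that only gates when the expected idle period comfortably exceeds the wake-up latency. The sketch below uses the two latencies quoted above; the 10x ratio, the state names returned and the function itself are assumptions for illustration, not AMD's actual policy.

```python
C6_WAKE_US = 30.0       # core wake-up latency reported for Llano
PC6_WAKE_US = 100.0     # approximate package-level wake-up latency
MIN_IDLE_RATIO = 10.0   # hypothetical: idle period must be 10x the wake latency


def pick_idle_state(expected_idle_us: float, all_cores_idle: bool) -> str:
    """Choose the deepest gating state whose wake latency is amortized.

    `expected_idle_us` stands in for the time until the next timer or
    system interrupt, which the hardware is said to monitor.
    """
    if all_cores_idle and expected_idle_us >= PC6_WAKE_US * MIN_IDLE_RATIO:
        return "PC6"
    if expected_idle_us >= C6_WAKE_US * MIN_IDLE_RATIO:
        return "C6"
    # An interrupt is due too soon: gating would cost more than it saves.
    return "no-gate"


state = pick_idle_state(expected_idle_us=2000.0, all_cores_idle=True)
```

With these assumed thresholds, a short idle period before the next interrupt leaves the core ungated, while a long one with all cores idle permits the package-level state.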
One of the problems with silicon-on-insulator (SOI) is that the buried oxide layer is both an electrical and thermal insulator. Electrical insulation is good and gives modest performance gains, perhaps 5-7% at a modern node. Thermal insulation is quite problematic and is the cause of SOI’s self-heating and hysteresis problems. Designing thermal sensors on SOI is quite challenging because of the heating issues. As we discussed in our previous article on cooling and power, leakage is exponentially related to temperature. Without fine-grained temperature measurements, Llano’s static power estimates are inherently conservative and leave performance on the table. Undoubtedly future generations will rectify this problem.
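To see why a conservative temperature assumption costs performance, consider a toy leakage model in which static power doubles for every fixed rise in temperature. The reference wattage, reference temperature and doubling interval below are invented for illustration and are not Llano figures. The sketch also assumes that "conservative" means budgeting for a hotter die than the real one.

```python
def leakage_watts(temp_c: float, ref_watts: float = 5.0,
                  ref_temp_c: float = 50.0, doubling_c: float = 25.0) -> float:
    """Exponential leakage model: power doubles every `doubling_c` degrees.

    All default constants are hypothetical.
    """
    return ref_watts * 2.0 ** ((temp_c - ref_temp_c) / doubling_c)


actual = leakage_watts(60.0)       # leakage at the die's real temperature
assumed = leakage_watts(100.0)     # leakage budgeted for a worst-case temperature
lost_headroom = assumed - actual   # watts that could otherwise go to boosting
```

Because the curve is exponential, even a moderate overestimate of temperature reserves a disproportionate share of the power budget for leakage that never materializes.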
Another significant challenge for Llano’s DVFS is that the granularity of frequency adjustment is very limited. In fact, there is only a single boost frequency available to Llano, although each core can operate at a different frequency (as with previous designs). The gap between base and peak frequency is typically quite large, around 30-50% for Llano notebook products and 15% for desktop variants. A large frequency range is essential for DVFS and for good performance in consumer products. However, it is undesirable for the frequency to be very coarse-grained without smaller intermediate steps, because that tends to require big steps in voltage. While an intermediate frequency can be emulated using time averaging, it ends up burning more power. For example, to run halfway between the base and peak frequency, Llano could simply spend half its time at peak and half at base. However, frequency is linear with voltage and power scales with the square of voltage, so averaging wastes power compared to a hypothetical intermediate voltage and frequency point. For example, if the peak frequency is 30% over the baseline, then emulating a 15% elevated frequency will waste 7% of the CPU power due to poor voltage scaling. This problem could be easily solved with 5-10% voltage and frequency steps.
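The cost of time averaging can be checked with a small calculation. The sketch below assumes the simplest possible model: voltage scales linearly with frequency and dynamic power scales as frequency times voltage squared, so power goes as the cube of the frequency ratio. Static power is ignored. Under these assumptions the overhead for the 30% example comes out near 5%, the same order as the 7% quoted above; the exact figure depends on how voltage really scales with frequency, which this model only approximates.

```python
def relative_power(freq_ratio: float) -> float:
    """Dynamic power relative to base, assuming V ~ f so that P ~ f * V^2 = f^3."""
    return freq_ratio ** 3


def averaging_overhead(peak_ratio: float, duty_at_peak: float) -> float:
    """Fractional extra power from toggling between base and peak frequency,
    compared with running steadily at the same average frequency."""
    duty_at_base = 1.0 - duty_at_peak
    avg_freq = duty_at_peak * peak_ratio + duty_at_base * 1.0
    averaged = duty_at_peak * relative_power(peak_ratio) + duty_at_base * relative_power(1.0)
    ideal = relative_power(avg_freq)
    return averaged / ideal - 1.0


# Peak is 30% above base; half the time is spent at each operating point.
overhead = averaging_overhead(peak_ratio=1.3, duty_at_peak=0.5)
print(f"extra power from time averaging: {overhead:.1%}")
```

The overhead falls to zero at either end of the range (all time at base or all time at peak) and is largest in between, which is why finer voltage and frequency steps would help most for intermediate targets.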
AMD also provided some interesting Llano performance numbers. The various changes to the aging Stars microarchitecture seem to have increased the IPC by around 5%, mostly due to a larger out-of-order window, better pre-fetching and a double-sized L2 DTLB. In addition, the DVFS demonstrated gains of around 10-15% on CPU-bound benchmarks, where the GPU was presumably idle.
Conclusions
The key takeaway is that Llano’s power management is significantly more advanced than previously indicated and includes a few novel features, such as taking advantage of indirect thermal dissipation. Despite several implementation hurdles, Llano shares power between the CPUs and GPU to optimize for a given workload, which is the most important issue. There are shortcomings in Llano’s approach, but they are quite understandable. The Stars CPU core and the VLIW5 GPU are old and known designs, with minimal power management. They were selected for fast time to market and minimal risk, since this was the first GPU to use SOI. Moreover, AMD did not want to invest the resources in significant power management improvements to a CPU core and GPU that are only used in a single product generation and cannot be carried forward.
Instead, the more advanced power management features will be used in the next generation, where they can be an integral part of the microarchitecture. Bulldozer features a number of design techniques, such as soft-edged latches and a shared FPU, that naturally complement and take advantage of a modern DVFS, although not all features will show up in the early server-oriented products. Similarly, the Cayman GPU has much more advanced power management than previous GPUs. AMD’s next Fusion product, based on Bulldozer and the Cayman shader cores, will undoubtedly resolve many of the existing issues. It is reasonable to expect advances on nearly all fronts: package-level management, better static power estimates, finer-grained frequency adjustment and extending dynamic power management to the memory controller. Accordingly, the performance and power benefits will be much greater and also more consistent across workloads for future products.