The fifth presentation in the microprocessor track came from PA Semi, and described the techniques used to achieve a low-power part with reasonable performance.
The PA6T-1682 is a system on a chip with a 25W TDP that features a pair of three-way superscalar out-of-order cores operating at 2GHz, a 2MB L2 cache, two integrated DDR2 controllers, and an I/O system connected through a coherent crossbar. The I/O portion of the chip contains two 10GbE MACs, four 1GbE MACs, 8 PCI Express lanes and several coprocessors. A previous article at RWT describes the architecture in much greater detail. The system on a chip is fabricated on a 65nm, triple-Vt process with 8 metal layers. The entire design uses 200M transistors, 21M of them per core, and occupies 115mm² with 23,000 clock gates. The device will be packaged in an 1156-ball BGA and is currently sampling to select customers.
Figure 1 – PA6T-1682
The design methodology relied heavily on internally developed standard cells that were optimized for power efficiency. Relatively few custom blocks are used, due to power constraints, and the high-speed portions of the chip were done with a structured custom approach. PA also developed an internal tool that estimates power savings for clock gating based on the RTL. As the design moved further along, commercial tools were used to verify the savings, and their results correlated well with the estimates from the internal tool.
Figure 1 above shows a die micrograph, with different colors for the various voltage domains. Each core has an independent supply and adaptive control. Software specifies the frequency for each core, and then the voltage is adjusted to the lowest level at which the desired frequency can be obtained. This tuning occurs on a per-part basis, and therefore takes process variation into account. Adjusting the frequency based on demand is nothing new, as modern mobile chips have done that for a while, but the simultaneous per-part voltage optimization is novel. The cores can also be shut down without issue. The SRAM arrays have their own fixed Vdd supply, because the voltage must stay relatively high to ensure that writes will function properly. Similarly, the memory and I/O system also have their own fixed Vdd.
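To make the per-part tuning concrete, here is a minimal sketch of the idea in Python. It is not PA Semi's control logic: the function name `lowest_voltage_for`, the voltage steps, and the two made-up part characteristics are all assumptions for illustration. The point it shows is that the same requested frequency can settle at different voltages on a fast part and a slow part.

```python
def lowest_voltage_for(target_mhz, voltage_steps_mv, max_mhz_at):
    """Return the lowest voltage step (in mV) at which this part reaches
    target_mhz, or None if no available step is high enough.

    max_mhz_at is a hypothetical per-part characterization: it maps a
    supply voltage in mV to the highest frequency in MHz the part can
    sustain at that voltage.
    """
    for mv in sorted(voltage_steps_mv):
        if max_mhz_at(mv) >= target_mhz:
            return mv
    return None


# Two invented parts from different corners of the process distribution.
def fast_part(mv):
    return mv * 5 // 2  # hypothetical: 2.5 MHz per mV


def slow_part(mv):
    return mv * 2  # hypothetical: 2.0 MHz per mV


STEPS_MV = [800, 900, 1000, 1100]  # hypothetical regulator steps

fast_mv = lowest_voltage_for(2000, STEPS_MV, fast_part)
slow_mv = lowest_voltage_for(2000, STEPS_MV, slow_part)
print(fast_mv, slow_mv)
```

With these invented numbers, the fast part reaches 2GHz at the lowest step while the slow part needs a higher one, which is where the per-part power saving comes from.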
Using a dynamic Vdd for each core saves a substantial amount of power, but also creates some headaches for clock distribution. Since the core voltage varies while the bus voltage is fixed, the clock tree delays may not match up. To solve this problem, hardware tracks phase drift between the core and bus, and then chooses a path that is both synchronous and fixed-latency. Over time, the appropriate path may change due to temperature or voltage, and when that happens, the bus will halt to make adjustments.
The memory controller is another large source of power consumption that PA Semi worked on. The scheduler for the controller puts ranks to sleep based on the performance impact. The ranks with the most outstanding transactions are left on, while others are powered down. In the case of no outstanding transactions, the ranks would be closed relatively quickly. This straightforward optimization saves around 2 watts on multimedia or floating-point workloads, while only losing 1-2% performance, an acceptable and attractive trade-off.
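The rank selection described above can be sketched roughly as follows. This is an illustrative policy only, not the actual scheduler: the function name `ranks_to_keep_awake` and the `max_awake` limit are assumptions, and the real hardware presumably weighs timing and wake-up latency in ways the presentation did not detail.

```python
def ranks_to_keep_awake(outstanding, max_awake):
    """Pick which DRAM ranks stay powered up.

    outstanding maps a rank id to its number of outstanding transactions.
    Ranks with no outstanding work are always candidates for power-down;
    among the busy ranks, the max_awake busiest stay on (ties go to the
    lower rank id). max_awake is a hypothetical tuning knob.
    """
    busy = [(count, rank) for rank, count in outstanding.items() if count > 0]
    busy.sort(key=lambda item: (-item[0], item[1]))
    return {rank for _, rank in busy[:max_awake]}


# Four ranks with an invented queue snapshot.
snapshot = {0: 5, 1: 0, 2: 3, 3: 1}
awake = ranks_to_keep_awake(snapshot, max_awake=2)
asleep = set(snapshot) - awake
print(sorted(awake), sorted(asleep))
```

In this sketch a lightly loaded rank can be powered down even though it still has a request pending, which is one way to picture where the small performance loss comes from.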
Through the use of novel power-saving design techniques, the PA6T-1682 is ultimately able to achieve 13W typical and 25W maximum power at 2GHz. Software can lower the frequency to 1.5, 1, or 0.5GHz, which reduces power to 16, 12 and 9W respectively. However, the real savings are in the three different sleep modes: doze, nap and sleep. The doze mode stops the core clock, but still snoops on the crossbar and offers immediate transition to an active state, while consuming around 2.5W. The nap and sleep modes go even lower, to 2W, but require slight entry and recovery times, as they flush the data caches.
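As a rough illustration of why the sleep modes matter more than frequency scaling, the sketch below computes a time-weighted average power from the figures quoted above. It assumes the 25W maximum figure for 2GHz and treats the reduced-frequency numbers as comparable to it, and it ignores the entry and recovery costs of nap and sleep; the state names and the `average_power` helper are invented for this example.

```python
# Power per state in watts, taken from the figures quoted in the text.
# Using the 25W maximum (rather than 13W typical) for 2GHz is an assumption.
POWER_W = {
    "2.0GHz": 25.0,
    "1.5GHz": 16.0,
    "1.0GHz": 12.0,
    "0.5GHz": 9.0,
    "doze": 2.5,
    "nap": 2.0,
    "sleep": 2.0,
}


def average_power(time_fractions):
    """Time-weighted average power for a mix of states.

    time_fractions maps a state name to the fraction of time spent in it;
    the fractions must sum to 1. Transition overheads are not modeled.
    """
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(POWER_W[state] * frac for state, frac in time_fractions.items())


# Running flat out a quarter of the time and dozing otherwise...
burst_then_doze = average_power({"2.0GHz": 0.25, "doze": 0.75})
# ...versus running continuously at the lowest frequency.
always_slow = average_power({"0.5GHz": 1.0})
print(burst_then_doze, always_slow)
```

Under these assumptions, bursting at full speed and dozing the rest of the time averages less power than running continuously at 0.5GHz, though the comparison says nothing about whether the two schedules complete the same work.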