Sandy Bridge has a single shared clock and power domain for the CPU cores, ring interconnect and L3 cache that targets dynamic 0.65V-1.05V operation at up to 3.4GHz (with a turbo boost to 3.8GHz). The GPU runs at up to 1.3GHz, with a separate 0.65V-1.05V dynamic power supply. There are three fixed power supplies for the system agent (0.8V or 0.9V), the DDR I/Os (1.5V) and the PCI-E I/Os (1.05V). There are separate power gates for each core, the GPU and the L3 cache.
In previous designs, such as Nehalem and Westmere, Intel’s L3 cache had a separate power supply that was higher than the core voltage to guarantee stability in the density-optimized SRAM cells. The Sandy Bridge design team chose to use the core voltage for the L3 cache to reduce dynamic power. However, the low 0.65V target required new logic design to guarantee correct operation of the L3 cache and various arrays – especially given the increasing leakage and variation at 32nm.
One area that Intel highlighted was post-manufacturing configuration to improve circuit reliability. As noted in our discussions of IEDM 2010 and the ISSCC 22nm panel, variation is a critical problem for advanced process technology. Different dice in a wafer can differ in leakage by 4X or more, which is a huge problem for SRAM and register files, since high leakage can cause write errors and data corruption.
Sandy Bridge has a 3-transistor programmable PMOS keeper that adjusts the bit-lines in the register file to account for leakage (the bit-line sends data to the individual cells within the register file). After manufacturing, each chip is tested and its leakage is measured. Based on the measured leakage, the keeper is programmed to use the best combination of the 3 PMOS transistors to improve the reliability of writes to the register file. At a high level, this approach deals with variation by spending extra area and transistors to measure and adjust circuit behavior for the best reliability, power and performance. Intel claims that by using this and other variation-tolerance techniques, they lowered the operating voltage of Sandy Bridge’s L3 cache from 800mV to 650mV. To put this in perspective, the cache’s active power consumption probably dropped by 20-50%.
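As a back-of-the-envelope check on that last figure: dynamic (switching) power scales roughly with CV²f, so at a fixed clock, dropping the supply from 800mV to 650mV should cut switching power by about a third – squarely inside the 20-50% range. A quick sketch of the arithmetic:

```python
# Back-of-the-envelope check: dynamic power scales roughly with C*V^2*f,
# so at a fixed clock the power ratio is simply (V_new / V_old)^2.
v_old, v_new = 0.80, 0.65  # L3 cache supply, volts

ratio = (v_new / v_old) ** 2
savings = 1.0 - ratio
print(f"dynamic power ratio: {ratio:.2f}")  # ~0.66
print(f"savings: {savings:.0%}")            # ~34%, inside the claimed 20-50% range
```

Leakage also drops super-linearly with voltage, which would push the total savings toward the higher end of that range.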
Another area that Intel discussed was the thermal sensors in Sandy Bridge, which are used for Turbo mode. One of the enhancements in Sandy Bridge’s power controller is that it now dynamically adjusts frequency based on both power consumption and temperature, whereas previous versions relied on power consumption alone. Over short periods of time (under 30 seconds), the microprocessor can actually exceed the TDP if the microprocessor and/or heatsink are relatively cool.
Traditional diode-based thermal sensors are accurate over a wide temperature range, but are very large. Sandy Bridge includes six diode sensors – one each for the system agent, the GPU and the four cores – but they are only used for emergency throttling and fan control. Sandy Bridge also includes novel CMOS thermal sensors that are much smaller (5,100µm²), but can only accurately measure temperatures between 80°C and 100°C. Several of these digital sensors are placed throughout each core and in various hotspots, so the power controller has substantially more accurate information about the microprocessor’s temperature. These sensors are used by the power controller to determine when to push the clock speed higher and exceed TDP, and when to return to a lower frequency.
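To make this control loop concrete, here is a deliberately simplified sketch of a temperature-aware Turbo decision. The function, thresholds and the linear taper are illustrative assumptions on our part, not Intel’s actual power-control firmware; only the 80-100°C sensor range and the idea of exceeding TDP while cool come from the presentation.

```python
# Hypothetical, simplified model of a temperature-aware Turbo decision:
# exceed TDP while the package is cool, taper back as hotspots heat up.
TDP_W = 95.0         # rated sustained power (illustrative)
TJ_MAX_C = 100.0     # throttle point
TURBO_CAP_W = 120.0  # short-term power cap while cool (illustrative)

def next_power_limit(hotspot_temps_c):
    """Pick the next power limit from the digital hotspot sensor readings."""
    hottest = max(hotspot_temps_c)
    if hottest >= TJ_MAX_C:
        return TDP_W * 0.8   # emergency headroom: drop below TDP
    if hottest < 80.0:       # below the CMOS sensors' accurate range: cool
        return TURBO_CAP_W   # plenty of thermal headroom, exceed TDP
    # Between 80C and Tj_max, taper linearly back toward TDP
    frac = (TJ_MAX_C - hottest) / (TJ_MAX_C - 80.0)
    return TDP_W + (TURBO_CAP_W - TDP_W) * frac
```

The real controller also tracks accumulated energy over time, which is why the excursion above TDP is bounded to tens of seconds rather than indefinite.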
The presentation also dealt with a new feature for testing and validation – the Generic Debug eXternal Connection (GDXC). The GDXC is a debug bus that acts like an on-chip logic analyzer. It can sample and record the traffic that crosses the ring interconnect, and then dump the data to an external analyzer. This is tremendously helpful for diagnosing problems, and increasingly necessary as the microprocessor integrates more and more functionality. For instance, a system vendor could previously record all the traffic between CPUs and PCI-E graphics cards to find errors. The GDXC provides a similar level of visibility for integrated graphics and on-die coherency traffic.
The Sandy Bridge ISSCC presentation was interesting precisely because it is emblematic of a larger trend. The main point of the paper was not improving Sandy Bridge’s performance (which is a big step forward) by adding AVX or a uop cache. Rather, the emphasis was on efficiency and taking advantage of Moore’s Law at 32nm and beyond.
Intel’s ring interconnect is a good example of design efficiency. A ring topology is not the highest performance option, but it is the simplest and easiest from a validation and design standpoint. More importantly, it scales easily across different generations and target markets. While this may not be the right choice for architectures that primarily focus on performance (e.g. IBM’s PowerPC), it seems very reasonable for a company that must appeal to a wide variety of customers and consumers.
Many of the techniques which Intel highlighted also show how they will spend transistors and area to address variability and thereby reduce power consumption. The programmable keeper tuning in Sandy Bridge’s caches and register files helped the design team deal with some of the nasty probabilistic effects of manufacturing at the 32nm node. Adjusting circuits to cope with variation improves reliability so that the caches and register files can run at a lower voltage with less power. In turn, these power savings can be spent on a more powerful CPU or GPU, or simply used for a more efficient product.
The new digital thermal sensors and Intel’s more aggressive Turbo mode highlight how microprocessors are becoming increasingly self-aware and adaptive. Using more thermal sensors, Sandy Bridge can make more precise estimates about power and run faster, while remaining as safe as the previous generation. Relaxing conservative assumptions is a great way to improve performance and power efficiency, since it is essentially ‘free’. The new GDXC port in Sandy Bridge is another example of better introspection capabilities; however, the goal is not to improve performance or power efficiency, but to achieve faster time to market.
Ultimately, the physical design and implementation of Sandy Bridge is interesting, not only because it gives insight into Intel’s current products, but also because it is a foreshadowing of what will come from other companies in the future. The problems that Intel faced in 2010 with Sandy Bridge will hit everyone else in late 2011 or 2012. So we can expect similar efforts from AMD, ARM, Nvidia, IBM and Oracle. An on-die logic analyzer would be a very natural fit for a heavily multi-core oriented design like Bulldozer, or future AMD Fusion products. Variation tolerant circuit design techniques would be quite helpful for GPUs with large SRAMs or even low power mobile devices. Of course, Intel is in a unique position as one of the last companies that both designs and manufactures their own chips; the rest of the industry is likely to make subtly different trade-offs that should yield equally interesting technology.