One of the goals for Poulson was creating a core design that could continue to scale across process nodes for the next 10 years. This means changing the circuit design substantially, since techniques that are reasonable for 180nm-90nm are not ideal (and may not even work) in the future. Tailoring the microarchitecture and circuits for frequency, power and reliability at 22nm and beyond was a key aspect of the project.
Poulson’s frequency was not disclosed at ISSCC, but there were certainly hints. The pre-silicon target is 25% faster cycle time than the previous generation. The Tukwila ISSCC papers reported silicon that could run at 2.0GHz with a 170W TDP (see Table 1). However, actual products dissipated 185W with a base and peak frequency of 1.73GHz and 1.86GHz respectively (largely due to excessive power consumption in the L3 cache). That suggests a target frequency of 2.33-2.66GHz for Poulson, although products may be slightly lower.
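The arithmetic behind that estimate can be sketched in a few lines. This is one plausible reconstruction and not a disclosed Intel calculation: it assumes the low end reads "25% faster" as a 25% higher clock applied to the shipping 1.86GHz peak, and the high end reads it as a 25% shorter cycle time applied to the 2.0GHz ISSCC silicon. The function names are illustrative only.

```python
# Two readings of "25% faster cycle time" (both are assumptions, see lead-in).

def from_frequency_gain(base_ghz, gain=0.25):
    """Treat '25% faster' as a 25% higher clock frequency."""
    return base_ghz * (1.0 + gain)

def from_cycle_time_cut(base_ghz, cut=0.25):
    """Treat '25% faster' as a 25% shorter cycle time (period * 0.75)."""
    return base_ghz / (1.0 - cut)

TUKWILA_PRODUCT_PEAK_GHZ = 1.86  # shipping parts, from the text
TUKWILA_ISSCC_GHZ = 2.0          # ISSCC silicon, from the text

low = from_frequency_gain(TUKWILA_PRODUCT_PEAK_GHZ)
high = from_cycle_time_cut(TUKWILA_ISSCC_GHZ)
print(f"Estimated Poulson target: {low:.3f} to {high:.3f} GHz")
```

Under those assumptions the two bounds land at the ends of the 2.33-2.66GHz range quoted above.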
Table 1 – Poulson and Tukwila Chip Statistics
Two techniques were highlighted for improving Poulson’s frequency. The first is a programmable clock circuit, called a vernier. Normally a clock tree would use simple buffers to send the signal to all the transistors on the chip. Poulson replaces the last stage buffers in the clock tree with programmable buffers that can shift the clock edge forwards or backwards, as needed. Poulson’s verniers have a range of roughly 30ps, which is about a sixteenth of a full clock cycle. These adjustments are made after production and help tolerate variation and speed paths in the silicon. This technique was first used in the 90nm Montecito, and Intel claims that clock tuning raises Poulson’s frequency by 400MHz.
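To put the vernier range in context, the short sketch below expresses 30ps as a fraction of the clock period. The frequencies are assumptions, since Poulson's clock was not disclosed; they are simply the Tukwila ISSCC figure and the estimated target range from earlier in the article.

```python
VERNIER_RANGE_PS = 30.0  # approximate range from the text

def cycle_time_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz

def vernier_fraction(freq_ghz):
    """Vernier range as a fraction of one clock period."""
    return VERNIER_RANGE_PS / cycle_time_ps(freq_ghz)

# Assumed frequencies, for illustration only.
for freq in (2.0, 2.33, 2.66):
    frac = vernier_fraction(freq)
    print(f"{freq:.2f} GHz: period {cycle_time_ps(freq):.0f} ps, "
          f"vernier covers {frac:.1%} of a cycle")
```

At 2.0GHz the range is 6% of a 500ps cycle, close to one sixteenth; at higher clocks the same 30ps covers a somewhat larger share of the cycle.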
The second trick to improve frequency was placing each pair of cores on a regulated voltage supply. It turns out that transistors within a large die on 32nm vary quite a bit in terms of speed and power consumption. So some cores in a die may be ‘slow and cool’, while others might be ‘fast and hot’. There are four pairs of cores in Poulson, and each pair has a separate power plane and voltage regulation (with a target range of 0.85-1.2V). After manufacturing, Intel adjusts the voltage down on fast cores, and raises the voltage for slow cores – to reach the same target frequency on all eight. Using four separately regulated core voltage supplies increases frequency by 5%, with no additional power consumption.
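The idea of per-pair voltage tuning can be illustrated with a toy model. Everything below other than the 0.85-1.2V range is an assumption made for the sketch: the linear frequency model, the threshold voltage, the per-pair speed factors and the 2.4GHz target are hypothetical and do not describe Intel's actual binning procedure.

```python
V_MIN, V_MAX = 0.85, 1.20  # target regulation range from the text

def tune_voltage(speed_ghz_per_volt, target_ghz, v_threshold=0.3):
    """Lowest in-range voltage that reaches target_ghz.

    Toy model (an assumption): f = speed * (V - v_threshold).
    Returns None if the pair cannot hit the target within the range.
    """
    needed = target_ghz / speed_ghz_per_volt + v_threshold
    if needed > V_MAX:
        return None
    return max(V_MIN, needed)

# Hypothetical silicon speed factors for the four core pairs.
pairs = {"pair0": 3.0, "pair1": 5.0, "pair2": 2.8, "pair3": 3.4}
TARGET_GHZ = 2.4  # hypothetical target frequency

settings = {name: tune_voltage(k, TARGET_GHZ) for name, k in pairs.items()}
for name, volts in settings.items():
    print(name, "unreachable" if volts is None else f"{volts:.3f} V")
```

In this model the fast pair is clamped to the bottom of the range while slower pairs receive more voltage, so all four meet the same frequency with no pair running at a higher voltage than it needs.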
Power was one of the biggest areas of improvement for Poulson, which doubles the core count and reduces the TDP by 15W. Compared to a hypothetical 32nm Tukwila core, Intel claims that Poulson’s leakage dropped by 30%, idle power was reduced by 70% and active power fell by 60%. Using lower leakage transistors was mainly responsible for the 30% drop in leakage. Poulson’s memory controllers also improved clock gating for DIMMs, to save power at the system level.
Microarchitecture played a big role in active and idle power efficiency. The new replay and flush design is significantly better than the older global stall pipeline. It eliminates all NOPs. In Tukwila, a stall would block every instruction from issuing; in Poulson, each pipeline can make forward progress even if others are stalled. In a similar vein, fine grained multi-threading will also improve efficiency. In terms of physical design, pervasive clock gating, eliminating dynamic logic and low voltage circuits were critical to reducing power. For example, many register file and SRAM cells were redesigned to avoid multi-ported structures and support low voltage writes.
Poulson also enhanced the dynamic voltage and frequency scaling system (DVFS). The Tukwila DVFS monitored ~120 instruction related events in the cores to estimate capacitance and power consumption. Poulson additionally monitors data patterns and tracks roughly 1800 events. As a result, the system is more accurate and can react in under 1us, compared to 8us for Tukwila. For temperature monitoring, there are 10 diodes spread across the cores and system interface. Each core has a diode at the hot spot and cold spot of the design.
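Event-based power estimation of this kind can be sketched as a weighted sum of event counts feeding the usual dynamic power relation P = C x V^2 x f. The event names, per-event capacitance weights and operating point below are hypothetical; the real design tracks far more events and its weighting scheme was not described.

```python
def estimate_power_watts(event_counts, weights_nf, voltage, freq_ghz, cycles):
    """Estimate dynamic power from counted events.

    weights_nf maps each event to an assumed switched capacitance in nF.
    The effective capacitance per cycle is the weighted event total divided
    by the number of cycles in the sampling window. With C in nF and f in
    GHz the scale factors cancel, so C * V^2 * f is already in watts.
    """
    switched_nf = sum(weights_nf[name] * count
                      for name, count in event_counts.items())
    c_eff_nf = switched_nf / cycles
    return c_eff_nf * voltage ** 2 * freq_ghz

# Hypothetical sample window: 1000 cycles with two event types.
counts = {"int_op": 1000, "fp_op": 500}
weights = {"int_op": 0.01, "fp_op": 0.02}  # nF per event, illustrative only

power = estimate_power_watts(counts, weights, voltage=1.0,
                             freq_ghz=2.0, cycles=1000)
print(f"Estimated dynamic power: {power:.3f} W")
```

A controller built this way could compare the estimate against its power budget each sampling window and adjust voltage or frequency; a shorter window is what would allow a faster reaction time.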
Itanium is intended for mission critical and high availability servers and reliability has always been an integral part of the design. Smaller process technologies are more susceptible to soft errors – so reliability features in a chip must improve merely to keep errors from rising. Intel claims that Poulson reduces the number of errors, despite doubling the cores. The L3 cache ECC is more robust – with a double error correct and triple error detect (DECTED) algorithm rather than conventional SECDED – in part so that it can operate at 0.9-1.1V to save considerable power. Inline ECC has been added to the L2 caches and tags, as well as the directory caches in the home agents. The L1 caches are implemented using more reliable 8T storage cells and parity. Almost all register files also have parity protection; the integer and floating point register files even have ECC. In addition, many latches (temporary storage for data as it flows through the pipeline) have been radiation hardened.
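For readers unfamiliar with the baseline, the sketch below shows conventional SECDED on a toy 4-bit word, using an extended Hamming (8,4) code. It is not Poulson's algorithm: DECTED needs a stronger code and protects much wider words, and neither is reproduced here. The function names and the 4-bit word size are chosen purely for illustration.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Positions 1, 2 and 4 hold Hamming parity bits, positions 3, 5, 6 and 7
    hold data, and position 0 holds an overall parity bit.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data4
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (data4 or None, status) after checking an 8-bit codeword."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 when valid
    overall = sum(code) % 2          # 0 when overall parity is intact
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single error; syndrome 0 means bit 0
        status = "corrected"
    else:
        return None, "double error detected"
    return [code[3], code[5], code[6], code[7]], status

word = [1, 0, 1, 1]
stored = encode(word)

one_flip = list(stored)
one_flip[5] ^= 1                     # simulate a single soft error
two_flips = list(one_flip)
two_flips[2] ^= 1                    # simulate a second soft error

print(decode(stored))
print(decode(one_flip))
print(decode(two_flips))
```

With SECDED the second flipped bit can only be reported, not repaired. A DECTED code extends this by one level, correcting two flipped bits and detecting three, which fits the article's point that the stronger code helps the L3 operate at a lower voltage.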
Data storage elements (e.g. SRAM) are the most vulnerable to errors and are carefully designed with reliability in mind. The actual logic in a chip is at lesser risk, but a risk nonetheless. Poulson follows in the footsteps of mainframe processors such as IBM’s z6 with protection for actual logic and computation. There is end-to-end parity on critical buses and data paths, including the L3 ring interconnect. To protect execution, a residue scheme is used on key data paths, including the FP multiplier and adder. In essence, a residue is a modulo checksum for a data value. Computations apply to both the value and residue; if the residue does not match the data value, then an error has occurred. The firmware (in the PAL and SAL) for Poulson takes advantage of many of these new hardware reliability features to detect and log errors using parity or residues; many of these errors would have been fatal previously, but can now be corrected by the firmware.
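The residue idea is easy to demonstrate on plain integers. The sketch below uses a modulus of 3 and Python integers as assumptions for illustration; the modulus Poulson uses was not stated, and the real hardware applies the check to floating point data paths. The fault_mask parameter is a hypothetical hook for injecting a bit flip.

```python
MODULUS = 3  # assumed for illustration; a common choice for residue checks

def residue(value):
    """Modulo checksum of a data value."""
    return value % MODULUS

def checked_mul(a, b, fault_mask=0):
    """Multiply a and b and verify the result against its predicted residue.

    fault_mask XORs bits into the result to simulate a datapath fault.
    Returns (result, ok), where ok is False if the check caught an error.
    """
    result = (a * b) ^ fault_mask
    predicted = (residue(a) * residue(b)) % MODULUS
    return result, residue(result) == predicted

def checked_add(a, b, fault_mask=0):
    """Add a and b and verify the result against its predicted residue."""
    result = (a + b) ^ fault_mask
    predicted = (residue(a) + residue(b)) % MODULUS
    return result, residue(result) == predicted

print(checked_mul(123456, 789))                 # fault-free multiply
print(checked_mul(123456, 789, fault_mask=1))   # one flipped result bit
print(checked_add(1000, 2345, fault_mask=16))   # one flipped result bit
```

A single flipped bit changes the result by a power of two, which is never a multiple of 3, so a mod-3 residue catches every single-bit error in the result. The residue arithmetic is only two bits wide here, which is why the approach costs far less than duplicating the multiplier or adder.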