The last processor presented was Niagara II, Sun’s second throughput computing oriented processor. The architecture for Niagara II was first presented at Hot Chips 18, and previously described here. Niagara I implemented 8 scalar SPARC compatible cores, each supporting four threads, a single shared FPU and four integrated DDR memory controllers, all in TI’s 90nm process. Niagara II takes advantage of the denser 65nm process to create a system on a chip with roughly twice the performance. Niagara II augments each core with an FPU pipeline, an additional integer pipeline and four more threads. At the system level, the device sports a 4MB L2 cache, two 10GBE interfaces, wire speed cryptography, a PCI-Express x8 port for storage and 4 FB-DIMM memory controllers. The whole device is 342mm², and uses 503M transistors in TI’s 65nm bulk process with 11 layers of metallization. The I/O portion of the chip mainly uses SerDes at 1.5V, while the core operates at 1.1V. At 1.1V, the device is targeted at 1.4GHz, with a worst case power draw of 84W.
It appears that in the aftermath of the Millennium project, Sun has placed a great deal of emphasis on timely execution and delivery. Niagara II certainly put a heavy emphasis on ‘design for manufacturing’ techniques to increase yields. To avoid project risk and decrease power consumption, a static cell-based methodology was used for most of Niagara II. The only custom circuits were the SRAMs and analog blocks, which were proven on test chips prior to first silicon. As with all of the other MPU designs presented, low Vt transistors were used, but only sparingly and in crucial speed paths. Oftentimes, transistors were laid out using larger than minimum design rules, and critical areas were checked using OPC simulations to ensure correctness. Architectural DFM features include support for fewer than 8 SPARC cores or L2 cache banks; selectively disabling cores or banks on partially flawed dice increases the overall yield.
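The yield benefit of selling parts with disabled cores can be illustrated with a simple binomial model. The per-core yield used below is an invented number purely for illustration; the presentation gives no actual defect densities.

```python
from math import comb

def core_yield(n_cores=8, min_good=6, p_core_good=0.9):
    """Probability that at least `min_good` of `n_cores` cores are
    defect-free, vs. requiring a fully working die.
    p_core_good is an assumed, illustrative per-core yield."""
    p_at_least = sum(
        comb(n_cores, k) * p_core_good**k * (1 - p_core_good)**(n_cores - k)
        for k in range(min_good, n_cores + 1)
    )
    p_all_good = p_core_good**n_cores
    return p_all_good, p_at_least

p_all_good, p_at_least = core_yield()
# With a hypothetical 90% per-core yield, only ~43% of dice have all 8
# cores working, but ~96% have at least 6 good cores -- hence the yield
# benefit of shipping parts with some cores or banks disabled.
```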
One of the more challenging areas that the presentation touched on was clocking across the chip. Since Niagara II is a system on a chip, it contains numerous regions running with varying degrees of synchronization relative to one another.
Figure 3 – Clock Domains in Niagara II
The asynchronous clock crossings are handled by FIFOs that absorb any clock period or skew mismatches. An on-chip PLL generates ratioed synchronous clocks, with a wide range of fractional divisors (2-5.25 in 0.25 increments) to accommodate many of the clock domain crossings. Because the target frequency for Niagara II is relatively low, a less accurate global clock is tolerable. A combination of H-trees and grids was used for clock distribution, compromising between low skew and low power.
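A quick sketch of the divisor range described above, assuming (as a simplification) that each slower domain clock is derived by dividing the fast core clock:

```python
def ratioed_divisors(start=2.0, stop=5.25, step=0.25):
    """Enumerate the fractional divisors described for Niagara II's PLL:
    2 to 5.25 in 0.25 increments."""
    n = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(n)]

core_ghz = 1.4  # target core frequency
# Frequencies available to the slower clock domains under this model:
domain_ghz = {d: core_ghz / d for d in ratioed_divisors()}
```

This yields 14 candidate divisors, spanning domain clocks from 700MHz (divide by 2) down to roughly 267MHz (divide by 5.25).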
The ratioed synchronous clock crossings occur at interfaces between the SPARC cores, the crossbar interconnect and other system elements; typically the latter run at a slower clock. Data is transferred between the fast and slow clock domains at the optimal fast clock cycle. Since both clocks are started from the same reference clock, their rising edges align periodically. An edge detection circuit tracks this periodic alignment and emits an ‘aligned’ signal, compensated for the fast clock’s distribution latency, when the clocks will next be aligned at the destination cluster; at that point a data transfer is initiated in both directions.
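The periodic nature of the alignment falls out of the divisor arithmetic: if the slow clock’s period is a rational multiple p/q of the fast clock’s period, the rising edges coincide every p fast cycles. A minimal model of this (not the actual circuit, which detects edges rather than computing ratios):

```python
from fractions import Fraction

def alignment_period(divisor) -> int:
    """Fast-clock cycles between coinciding rising edges of the fast
    clock and a slow clock whose period is `divisor` fast periods.
    Models the periodic alignment the edge-detection circuit tracks."""
    d = Fraction(divisor)  # 0.25-step divisors are exact binary fractions
    return d.numerator     # edges coincide every p fast cycles for d = p/q

# e.g. a divisor of 5.25 (= 21/4) aligns every 21 fast cycles, which is
# exactly 4 slow cycles; an integer divisor of 2 aligns every 2 cycles.
```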
Niagara II incorporates three different high speed, serial I/O technologies: FB-DIMM for memory, PCI-Express and XAUI for 10GBE. These run at 4.8GHz, 2.5GHz and 3.125GHz respectively, and provide 921, 40 and 100Gb/s of raw bandwidth, over a terabit per second in total. All three interfaces use a common SerDes microarchitecture. To accommodate the slight differences, specifically that FB-DIMM uses Vss-referenced signaling (rather than Vdd), a level shifter was employed so that all three SerDes could share the same NMOS-based receivers.
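The raw bandwidth figures are straightforward lane arithmetic. The PCI-Express number follows directly from an x8 link (8 lanes in each direction at 2.5Gb/s per lane); the FB-DIMM number works out if one assumes 8 channels (i.e. 4 dual-channel controllers) with the FB-DIMM-standard 10 southbound plus 14 northbound lanes per channel. Note that the 8-channel count is an inference from the quoted 921Gb/s figure, not something stated in the presentation.

```python
def raw_gbps(lanes: int, per_lane_gbps: float) -> float:
    """Aggregate raw signaling bandwidth: total lane count x per-lane rate."""
    return lanes * per_lane_gbps

# FB-DIMM: 10 southbound + 14 northbound lanes per channel (per the
# FB-DIMM standard), 8 channels assumed, 4.8Gb/s per lane:
fbdimm_gbps = raw_gbps(8 * (10 + 14), 4.8)  # ~921.6 Gb/s

# PCI-Express x8: 8 lanes in each direction at 2.5Gb/s per lane:
pcie_gbps = raw_gbps(8 * 2, 2.5)            # 40 Gb/s
```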
Naturally, a lot of emphasis went into techniques to reduce power consumption in Niagara II. Clocks are gated at both the cluster and local clock-header level. The circuit designers also employed ‘gate-bias’ cells, which have a 10% longer channel but reduce leakage by 40%. Niagara II also incorporates dynamic power management: the operating system can turn off threads, and a power throttling mode alters the instruction issue rate of the SPARC cores to manage power consumption. This power throttling can reduce consumption by up to 30% at the most aggressive setting, given a suitable workload. Similarly, the memory controllers can throttle access rates, or put the DRAM into power-down modes to reduce memory power consumption. Lastly, on-chip thermal diodes monitor the junction temperature; in the case of a cooling failure, the operating system can use the techniques above to ensure continuous (albeit slower) operation. All these factors help to keep worst case power consumption under 84W, which is fairly remarkable for a high performance server system – it will be interesting to see the resulting server products.
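To make the escalation concrete, here is a purely hypothetical sketch of the kind of policy an operating system could layer on these hooks. The function name, thresholds and action names are all invented; the presentation describes the mechanisms (thread shutdown, issue-rate throttling, memory throttling, thermal diodes) but not the policy.

```python
def power_action(junction_temp_c: float, over_power_budget: bool) -> str:
    """Pick an escalating power-management response (illustrative only).
    Thresholds and action names are invented for this sketch."""
    if junction_temp_c > 95.0:
        # Cooling failure: throttle everything to keep running, slowly.
        return "throttle-issue-and-memory"
    if over_power_budget:
        # Issue-rate throttling alone: up to ~30% power reduction on a
        # suitable workload, per the presentation.
        return "throttle-issue-rate"
    return "none"
```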