RAS and Power Management
Niagara II targets low power and employs extensive power management features. The first general principle the architects followed was to reduce the power cost of speculation. The microprocessor was designed to speculate only when the outcome is relatively predictable, and to limit the extent of speculation, and hence the cost of maintaining state and recovering from misspeculation. Examples mentioned earlier include the different page table walker patterns, static branch prediction and sequential instruction cache line prefetch. Software (operating through the OS and firmware) can also throttle the entire chip by inserting bubbles in the decoders. Of course, this technique relies on the processor being able to idle efficiently. To that end, many structures in the MPU are clock gated, including many control blocks, data paths and data arrays.
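To make the throttling idea concrete, here is a minimal sketch of how injected decode bubbles stretch execution time. The fixed window, the position of the bubbles within it, and the function name `cycles_to_decode` are all assumptions for illustration; the actual hardware mechanism and its software interface are not described here.

```python
def cycles_to_decode(n_instructions, bubbles_per_window, window=8):
    """Count cycles needed to decode n_instructions when the first
    `bubbles_per_window` slots of every `window` cycles are forced idle.

    Hypothetical model: one decode slot per cycle, bubbles at the start
    of each window. The real throttling pattern is an assumption.
    """
    if not 0 <= bubbles_per_window < window:
        raise ValueError("bubbles_per_window must be in [0, window)")
    cycles = 0
    decoded = 0
    while decoded < n_instructions:
        # A slot is usable only if it falls after the injected bubbles.
        if cycles % window >= bubbles_per_window:
            decoded += 1
        cycles += 1
    return cycles
```

With half of each window filled with bubbles, the same instruction stream takes twice as many cycles, which only saves power if the idle slots are cheap. That is where clock gating comes in.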
RAS was another key focus area for the Niagara II architects. Generally, error rates increase exponentially as the process geometry decreases, which means that as MPUs scale down to 65nm and below, more and more protection is necessary. Since Sun controls the MPU, OS and firmware, it relies heavily on cooperation between hardware and software to detect and correct errors. The integer and FP register files are ECC protected, along with the store buffer data, trap stack and certain other arrays. Parity is used for the data and instruction cache tags and data, as well as the TLBs, the modular arithmetic scratchpad memory and the store buffer addresses. Errors in the caches are handled by refetching the bad data, while other errors are dealt with in software. One of the novel error handling techniques used in Niagara II is dynamic thread and core management. If a thread experiences unusually frequent errors, it can be disabled without any downtime. Since each individual thread contributes relatively little performance, any degradation from offlining a single thread will be minor. If errors still persist, the impacted cores can be offlined in a similar fashion. A floorplan of Niagara II is shown below.

Figure 4 – Niagara II Floorplan
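The dynamic thread and core management described above can be sketched as a simple software policy. This is only an illustration: the class `ErrorMonitor`, both thresholds, and the rule that a core goes offline once enough of its threads have been disabled are assumptions, not Sun's actual firmware or OS logic.

```python
from collections import defaultdict

THREADS_PER_CORE = 8  # Niagara II runs 8 threads per core


class ErrorMonitor:
    """Hypothetical policy: offline a thread after repeated errors,
    then offline its core if too many of its threads are offlined."""

    def __init__(self, thread_limit=3, core_limit=2):
        self.thread_limit = thread_limit  # assumed: errors before a thread is disabled
        self.core_limit = core_limit      # assumed: disabled threads before a core is disabled
        self.errors = defaultdict(int)
        self.offline_threads = set()
        self.offline_cores = set()

    def report_error(self, core, thread):
        # Ignore reports from hardware that is already offline.
        if core in self.offline_cores or (core, thread) in self.offline_threads:
            return
        self.errors[(core, thread)] += 1
        if self.errors[(core, thread)] >= self.thread_limit:
            self.offline_threads.add((core, thread))
            disabled = sum(1 for c, _ in self.offline_threads if c == core)
            if disabled >= self.core_limit:
                self.offline_cores.add(core)

    def online_threads(self, cores=8):
        # Everything not explicitly offlined keeps running: no downtime.
        return [(c, t)
                for c in range(cores) if c not in self.offline_cores
                for t in range(THREADS_PER_CORE)
                if (c, t) not in self.offline_threads]
```

On an 8-core part, disabling one thread leaves 63 of 64 running, which is why the performance cost of offlining a single flaky thread is small.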
Commentary and Analysis
When assessing Niagara II, the thread partitioning stands out as a novel design decision. Most recent multithreaded designs had 2-4 threads (POWER5, Pentium 4 and Xeon, Itanium 2, EV8, Niagara I), which could be easily handled in a unified manner, so there was no need to group threads together. Since Sun is in new territory, it is hardly surprising that it was forced to use new techniques for scalability. Searching through 8 threads to issue two instructions with no structural hazards would have impacted clock speed significantly for Niagara II. Architectural simulations revealed that the performance impact of partitioning (and deferred hazard detection in the decode stage) was very small for server workloads, so the design choice was straightforward. Assigning functional units to a specific set of threads creates a certain degree of asymmetry in multithreading, and is also fairly unusual. It will be interesting to see how other participants in the industry plan to handle higher levels of multithreading, although for now it appears that most other companies will either use fewer than 8 threads or different types of multithreading. Perhaps just as importantly, this blurring of the architectural lines likely presages future developments in Sun’s upcoming processor code-named Rock.
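A rough sketch shows why partitioning simplifies thread selection: each group of four threads picks one instruction on its own, so no single picker ever has to search all eight threads. The least-recently-picked ordering and the names `ThreadGroup` and `pick_cycle` are assumptions for illustration, and hazard detection is left out because it is deferred to decode.

```python
class ThreadGroup:
    """One partition of hardware threads with its own picker."""

    def __init__(self, thread_ids):
        # Front of the list is the least recently picked thread (assumed policy).
        self.order = list(thread_ids)

    def pick(self, ready):
        """Return the least recently picked ready thread, or None."""
        for tid in self.order:
            if tid in ready:
                # Move the chosen thread to the back of the queue.
                self.order.remove(tid)
                self.order.append(tid)
                return tid
        return None


def pick_cycle(groups, ready):
    """Each group searches only its own threads, independently of the others."""
    return [group.pick(ready) for group in groups]


# Two groups of four threads, as in the partitioning described above.
groups = [ThreadGroup(range(0, 4)), ThreadGroup(range(4, 8))]
first = pick_cycle(groups, {1, 2, 6})
second = pick_cycle(groups, {1, 2, 6})
```

The asymmetry mentioned above shows up directly: if every ready thread sits in one group, the other group's slot goes unused even though work is available.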
One of the biggest improvements in Niagara II was the enhanced floating point support. As a general rule of thumb, performance-critical floating point applications are rich in ILP, which would make Niagara II a less than ideal processor for them. However, some workloads simply require a massive amount of bandwidth, and Niagara II is fairly impressive in that regard. Moreover, perhaps this will push Sun into researching techniques to convert ILP into TLP. Certainly, it should be easy to distribute loop iterations (with no carried dependencies) between different threads. More robust techniques along these lines could turn Niagara II into a very attractive HPC system and help the industry as a whole, although the financial merit of such an idea is unclear.
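Distributing independent loop iterations between threads might look like the following sketch. The helper name `parallel_loop` and the interleaved assignment of iterations are assumptions. Note that CPython threads will not actually run the arithmetic in parallel; the point is only to show how the iteration space is split.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_loop(body, n_iterations, n_threads=8):
    """Run body(i) for i in range(n_iterations), split across threads.

    Only valid when iterations carry no dependencies on one another.
    Thread t handles iterations t, t + n_threads, t + 2*n_threads, ...
    """
    results = [None] * n_iterations

    def run_slice(tid):
        for i in range(tid, n_iterations, n_threads):
            results[i] = body(i)  # each index is written by exactly one thread

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(run_slice, range(n_threads)))
    return results


squares = parallel_loop(lambda i: i * i, 10, n_threads=4)
```

A loop with a carried dependency, such as a running sum, cannot be split this way without restructuring, which is where the "more robust techniques" would have to come in.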
Although performance numbers were not forthcoming, the design objectives seem feasible and relatively competitive for a processor slated to arrive in the third quarter of 2007. The improvements in the cores and system architecture for Niagara II are substantial and should yield a factor of two improvement in performance. If Sun can hit its targets, they would translate into ~320K tpmC and ~150K BOPS in SPECjbb2005. This could put Niagara II at performance parity with the competition and give it a lead in performance/watt. Either way, it is encouraging to see that Sun will continue to invest in novel architectures.
Acknowledgements
I would like to thank the following individuals for their help in writing this article:
- Greg Grohoski
- Robert Golla
- Alex Plant
- Marc Tremblay
- and of course, anyone else whom I may have forgotten.