Niagara II Execution Core
While Niagara II is largely a refinement of its predecessor, the changes to the microarchitecture are significant. At the heart of the MPU is a 64 bit, 8 threaded, scalar, in-order processor with a relatively short pipeline and limited speculative execution. Niagara II supports 48 bits of virtual addressing and 40 bits of physical addressing. Figure 2 below shows a detailed comparison of the cores in Niagara I and II.
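To put those widths in perspective, 48 virtual address bits cover 256 TB and 40 physical bits cover 1 TB. The trivial C snippet below (purely illustrative) makes the arithmetic concrete:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* Niagara II address widths: 48 bits virtual, 40 bits physical */
        uint64_t va_bytes = 1ULL << 48;
        uint64_t pa_bytes = 1ULL << 40;
        printf("virtual address space:  %llu TB\n",
               (unsigned long long)(va_bytes >> 40));  /* 256 TB  */
        printf("physical address space: %llu GB\n",
               (unsigned long long)(pa_bytes >> 30));  /* 1024 GB */
        return 0;
    }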

Figure 2 – Niagara I and II Cores
The most noticeable Niagara II core changes are doubling the thread count, adding a second execution pipe, and integrating a floating point unit. The first two improvements are the primary drivers for doubling performance, while the integrated FPU enables Niagara II to handle more varied workloads (Niagara I was unable to handle workloads with much more than 1-3% floating point instructions). To accommodate these improvements, the basic Niagara II pipeline gained an additional stage called “pick”, which selects up to 2 of the 8 threads for execution each cycle.
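For reference, the sketch below lays out the integer pipeline with the new pick stage in place. The stage names follow our reading of Sun's public OpenSPARC T2 documentation; the comments are our own annotations:

    /* Niagara II integer pipeline, with the new "pick" stage that was
     * absent in Niagara I (stage names per Sun's OpenSPARC T2 docs). */
    enum n2_int_stage {
        N2_FETCH,      /* read up to 4 instructions from the L1I cache   */
        N2_CACHE,      /* second cycle of instruction cache access       */
        N2_PICK,       /* new: select up to 2 of the 8 threads to issue  */
        N2_DECODE,     /* decode; detect and resolve structural hazards  */
        N2_EXECUTE,    /* ALU operation or address generation            */
        N2_MEM,        /* data cache access                              */
        N2_BYPASS,     /* bypass results to dependent instructions       */
        N2_WRITEBACK   /* commit results to the register file            */
    };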
In designing Niagara II, the architects were extremely careful and economical in their planning, which led to more complex internal arrangements. As Figure 2 indicates, the 8 threads in Niagara II are actually partitioned into two groups of four, each feeding its own pipeline, to simplify the design. While the thread grouping is static from the perspective of the hardware, the operating system can migrate threads between groups to ensure fairness. Each thread implements 8 register windows, requiring 160 integer registers (32 global, 64 local and 64 in/out registers used for passing parameters).
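The register count is easy to verify. The C sketch below tallies the per-thread integer registers, assuming 4 sets of 8 globals and the usual SPARC overlap of in/out registers between adjacent windows; the names are ours, not Sun's:

    #define N2_WINDOWS     8
    #define N2_GLOBAL_SETS 4

    /* Per-thread integer register accounting for 8 SPARC register
     * windows: locals are private to a window, while the in/out
     * registers of adjacent windows overlap, so each window adds
     * only 8 of them. */
    int n2_regs_per_thread(void) {
        int globals = N2_GLOBAL_SETS * 8;  /* 32 global registers      */
        int locals  = N2_WINDOWS * 8;      /* 64 local registers       */
        int inouts  = N2_WINDOWS * 8;      /* 64 in/out (parameter)    */
        return globals + locals + inouts;  /* = 160, as in the text    */
    }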
The instruction fetch for Niagara II is only slightly modified. Niagara II statically predicts that branches are not taken and can speculatively execute past conditional branches, with a relatively short 5 cycle mispredict penalty. First, the thread selection logic determines which threads are ready for instruction fetch; unlike Niagara I, the fetch stage is decoupled from the pick stage. The goal of instruction fetch is to keep each instruction buffer full, so the fetch selection policy is tailored to that objective. Events such as pipeline dependencies, cache misses and long latency instructions cause threads to go ‘inactive’. Among the active threads, a least recently fetched policy is used to fetch up to 4 instructions from a 32 byte line in the 16KB, 8 way associative L1I cache, as sketched below. The instruction cache also contains a simple prefetcher which can fetch the next sequential cache line.
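The fetch policy itself is simple enough to capture in a few lines of C. The sketch below models least recently fetched selection among the active threads; the structure and field names are illustrative, not Sun's:

    #include <stdint.h>
    #include <stdbool.h>

    #define N2_THREADS 8

    struct n2_thread {
        bool     active;      /* false while stalled on a miss, a long
                               * latency op, or a full instruction buffer */
        uint64_t last_fetch;  /* cycle of this thread's last fetch        */
    };

    /* Pick the active thread fetched least recently; returns -1 if no
     * thread is active this cycle. */
    int n2_pick_fetch_thread(struct n2_thread t[N2_THREADS], uint64_t now) {
        int pick = -1;
        uint64_t oldest = UINT64_MAX;
        for (int i = 0; i < N2_THREADS; i++) {
            if (t[i].active && t[i].last_fetch < oldest) {
                oldest = t[i].last_fetch;
                pick = i;
            }
        }
        if (pick >= 0)
            t[pick].last_fetch = now;  /* fetch up to 4 instructions */
        return pick;
    }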
The instruction fetch is unified, so that a single ported cache can be used. After fetching, the threads are partitioned into two groups, each having its own set of instruction buffers. Each thread group has an instruction selector which picks a single instruction from among its four buffers to send to the decoder for execution. The least recently used ‘ready’ thread is picked each cycle, with a preference for non-speculative execution. Since the two groups pick independently, structural hazards (i.e. two instructions trying to use the same shared resource at once) can be introduced. The decoder detects and resolves structural hazards by delaying one of the contending instructions; a single bit LRU counter alternates which thread group is delayed, ensuring fairness and forward progress, as the sketch below shows. Once decoded, instructions are issued to the functional units.
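The arbitration is equally simple to model. The C sketch below (again with illustrative names) delays one of the two contending thread groups and flips a single bit, so the same group is never delayed twice in a row:

    #include <stdbool.h>

    struct n2_decode {
        int favor;  /* single bit LRU: which group to delay on a conflict */
    };

    /* Returns the thread group (0 or 1) whose instruction is delayed
     * this cycle, or -1 if the two picked instructions do not contend
     * for a shared resource. */
    int n2_resolve_hazard(struct n2_decode *d, bool conflict) {
        if (!conflict)
            return -1;
        int delayed = d->favor;
        d->favor ^= 1;  /* alternate the loser for fairness */
        return delayed;
    }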
Each thread group has its own private ALU, which is used for both address generation and most computation. Almost all instructions are issued directly to the ALU, but floating point and memory operations flow through to their respective execution units. Each core shares a single FPU and an LSU between all 8 threads. The FPU is fed by a 256 entry, 64 bit register file with 32 registers per thread. The FPU supports Sun’s VIS 2.0 SIMD extensions and is fully pipelined with a 12 stage basic pipeline, except for square root and divide (which can execute simultaneously with pipelined FP instructions from another thread). The FPU also handles the more complex integer instructions, such as multiply, divide and population count, which Niagara I handled with a dedicated ALU. Again, this is an instance of avoiding unnecessary replication; complex integer instructions are simply not common enough to merit dedicated hardware.
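Because the FP register file is statically partitioned, a physical entry can be located by simply concatenating the thread ID with the architectural register number. The mapping below is an illustrative reading of that partitioning, not Sun's actual design:

    /* Map (thread, architectural FP register) to one of the 256
     * physical entries: 8 threads x 32 registers per thread. */
    static inline unsigned n2_fp_phys_reg(unsigned tid, unsigned reg) {
        return ((tid & 7) << 5) | (reg & 31);  /* result in [0, 255] */
    }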
The SPU is a cryptographic coprocessor operating at full core frequency. The SPU handles common cryptographic algorithms such as SHA, MD5, AES and DES. It contains a modular arithmetic unit (MAU), a cipher unit and a DMA engine to access memory. The MAU shares the FPU’s multiplier and is used for RSA and for elliptic curve calculations over integer and binary polynomial fields; staples of encryption workloads. For storage, the MAU uses a 160 entry, 64 bit scratchpad that can sustain two reads and one write per cycle. The bandwidth of the cipher and hash units was designed to match Niagara II’s dual 10 gigabit Ethernet controllers, enabling “free encryption”.
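To illustrate the MAU’s workload, the C sketch below implements square-and-multiply modular exponentiation, the core operation behind RSA. Real RSA operands are 1024 bits or more and would be staged through the MAU’s scratchpad; the 64 bit operands and the compiler’s 128 bit integer extension here are purely for illustration:

    #include <stdint.h>

    /* Square-and-multiply modular exponentiation: computes
     * base^exp mod m, the basic primitive the MAU accelerates. */
    uint64_t mod_exp(uint64_t base, uint64_t exp, uint64_t m) {
        __uint128_t result = 1 % m, b = base % m;
        while (exp) {
            if (exp & 1)
                result = (result * b) % m;  /* multiply step */
            b = (b * b) % m;                /* square step   */
            exp >>= 1;
        }
        return (uint64_t)result;
    }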