Niagara II Memory, Crossbar and IO
Naturally, when discussing a chip that focuses on memory-level parallelism, the most important part is the memory subsystem: the Load Store Unit (LSU), the L1D cache, the crossbar, the L2 cache and main memory. Figure 3 below compares the memory systems of Niagara I and II.
Figure 3 – Comparison of Niagara I and II Memory Hierarchies
As noted in the previous section, each thread group owns one ALU that also serves as an address generation unit to feed the LSU with requests. The LSU handles a single memory operation each cycle, and the decode stage is responsible for ensuring that no pipeline hazards occur as a result of contention. Niagara I pessimistically deactivated any thread requesting data from the caches, assuming that such a request would miss in the L1D cache. One of the changes that improved single-threaded performance in Niagara II was to assume that L1D cache requests would hit and keep the requesting thread active (with the appropriate recovery logic, of course).
Niagara II maintains up to 4 page tables, each one supporting 8KB, 64KB, 4MB or 256MB pages, all of which can be cached by the ITLB and DTLB. Memory address translation for the LSU is handled by the 128-entry, fully associative data translation look-aside buffer. Misses in both the instruction and data TLBs are serviced by a hardware page table walker, another new addition to the microarchitecture. The page table walker can search the 4 page tables in three different modes: sequentially, in parallel, or according to a prediction based on the virtual address of the requested data.
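To make the three walk modes concrete, here is a minimal sketch of a sequential walk and a predicted walk over four page tables. The table layout (a dictionary per page size), the function names, and the prediction callback are all illustrative assumptions, not Sun's actual implementation; only the four page sizes and the three search modes come from the text above.

```python
# Hypothetical sketch of two of the page-table search modes described above.
# Data structures and the prediction heuristic are assumptions for illustration.

PAGE_SIZES = [8 * 1024, 64 * 1024, 4 * 1024 * 1024, 256 * 1024 * 1024]

def walk_sequential(tables, va):
    """Probe the four page tables one after another, smallest page size first."""
    for size, table in zip(PAGE_SIZES, tables):
        vpn = va // size                      # virtual page number at this size
        if vpn in table:
            return table[vpn] * size + va % size   # physical address
    return None                               # page fault

def walk_predicted(tables, va, predict):
    """Probe the table picked by a virtual-address-based predictor first,
    then fall back to a full sequential search on a mispredict."""
    idx = predict(va)                          # predicted page-size index
    size, table = PAGE_SIZES[idx], tables[idx]
    vpn = va // size
    if vpn in table:
        return table[vpn] * size + va % size
    return walk_sequential(tables, va)
```

The parallel mode would simply probe all four tables in the same cycle; the prediction mode matters because it gets the common case right while issuing only one probe.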
The L1D cache itself is a single-ported 8KB, 4-way set associative design that writes through to the L2 cache for coherency. Data cache fills can occur in parallel with stores to the L2 cache, which is what makes a single-ported cache feasible and lowers power consumption. The L1D cache is also equipped with a 64-entry store buffer (8 entries per thread). The store buffer is drained opportunistically, so that there are fewer delays due to capacity constraints. The L1D cache supports a single outstanding miss per thread (since a cache miss causes a thread to go ‘inactive’), for a total of 8 per core and 64 per device. These cache misses are sent over the crossbar to be filled by the L2 cache or main memory.
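A quick back-of-the-envelope check on the L1D geometry: the 8KB capacity and 4-way associativity are from the text, while the 16-byte line size is an assumption made purely so the arithmetic can be shown.

```python
# L1D geometry sketch. Capacity and associativity are from the article;
# the line size is an assumption for illustration.
CAPACITY = 8 * 1024
WAYS = 4
LINE = 16                               # assumed line size in bytes

SETS = CAPACITY // (WAYS * LINE)        # number of sets in the cache

def l1_set_index(addr):
    """Set index = line address modulo the number of sets."""
    return (addr // LINE) % SETS
```

With these assumptions the cache has 128 sets, and addresses that differ by SETS * LINE bytes collide in the same set.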
All external data accesses by the cores go through the crossbar to reach the rest of the system, including the L2 cache, memory and I/O. The crossbar port for each core has a 64-bit outbound lane for requests and a 128-bit inbound data path, and must satisfy requests from the hardware table walker, the cryptographic unit’s DMA engine, and the L1D and L1I caches. Like all other shared resources in a multithreaded MPU, access to the crossbar is governed by a fairness algorithm that balances the needs of all the different types of requests.
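The article does not say what the fairness algorithm actually is; a round-robin arbiter is the textbook approach, sketched below purely to illustrate the idea of rotating priority among request sources. Everything here is an assumption, not Sun's design.

```python
# Toy round-robin arbiter, shown only to illustrate what "fairness" means
# for a shared port; Niagara II's real arbitration policy is not disclosed.

def round_robin_grant(requests, last_granted):
    """Grant the first requesting source after the previously granted one.

    requests:     list of booleans, one per request source
    last_granted: index granted on the previous cycle
    Returns the granted index, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_granted + offset) % n
        if requests[idx]:
            return idx
    return None
```

Because priority rotates away from whoever was served last, no single source (say, a streaming DMA engine) can starve the others.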
The L2 cache for Niagara II is a total of 4MB, spread across 8 banks. Each bank is 512KB and 16-way set associative, can handle an independent access, and has a 128-bit outbound and a 64-bit inbound port on the crossbar. With so many threads in the system, hotspots are a significant concern in a shared resource like the L2 cache. The L2 cache is line interleaved across the 8 banks, which avoids many hot spot problems. One new technique used in Niagara II is software (i.e. operating system) directed index hashing, which disperses data between different sets within a cache to reduce contention and other problems caused by the associativity and array size.
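The bank selection and index hashing can be sketched as follows. The 8 banks, 512KB bank size and 16-way associativity are from the text; the 64-byte line size and the XOR-based hash function are assumptions chosen for illustration, since the article does not specify either.

```python
# L2 bank interleaving and a toy index hash. Line size and the hash
# function are assumptions; bank count and bank geometry are from the text.
LINE = 64                                # assumed L2 line size in bytes
BANKS = 8
SETS = 512 * 1024 // (16 * LINE)         # sets per 512KB, 16-way bank

def l2_bank(addr):
    """Line interleaving: consecutive cache lines hit consecutive banks."""
    return (addr // LINE) % BANKS

def l2_set_index(addr, hash_enabled):
    """Plain set index, optionally XOR-ed with higher address bits so that
    power-of-two strides no longer pile up in a single set."""
    index = (addr // (LINE * BANKS)) % SETS
    if hash_enabled:
        index ^= (addr >> 20) % SETS     # assumed hash: fold in upper bits
    return index
```

The point of the hash is visible in the math: without it, addresses that differ by exactly SETS * LINE * BANKS bytes map to the same set of the same bank, a pattern that large power-of-two strides hit constantly.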
The L2 cache also connects to 4 dual-channel FB-DIMM controllers, which will probably support 667MHz operation. Two L2 cache banks are paired with each dual-channel FB-DIMM controller, so effectively each bank is supported by the full bandwidth of an FB-DIMM channel. An added benefit of this arrangement is that since each memory controller is connected to a pair of cache banks, the cache line interleaving also spreads data around to different memory channels.
The I/O devices are all capable of DMA, but the crossbar is also equipped with a port for the cores to read from I/O devices. Niagara II implements two built-in 10/1 Gigabit Ethernet ports with packet classification and filtering, and an x8 PCI Express port, presumably to be used for storage. By integrating the I/O devices on-die, Niagara II will save a fair amount of power, money and design complexity compared to systems that use multi-chip solutions. Handling 20 gigabits/s of Ethernet traffic is rather remarkable, as a single 10GBE port will overwhelm modern MPUs that do not use TCP/IP coprocessors or offload engines. This is another feat that is only possible because Sun owns the entire stack; hopefully the appropriate hooks are all in place so that Linux will be able to achieve the same performance. If Sun’s implementation works well, it will set the bar for other processors from server rivals Intel, AMD and IBM.
Altogether, the crossbar supports 8 data destinations (the SPARC cores) and 9 data sources (the 8 L2 cache banks and I/O). At the rumored 1.4GHz clock speed, that works out to 268.8GB/s of crossbar bandwidth, backed by an impressive 42.7GB/s (FBD-667) of memory bandwidth.
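Those two headline numbers can be reconstructed from the figures quoted earlier. The per-port widths (128-bit inbound, 64-bit outbound, 8 core ports) and the channel count (4 dual-channel controllers) are from the text; matching the 42.7GB/s figure assumes each FBD-667 channel moves 8 bytes per transfer at 667 MT/s, and the 1.4GHz clock is the rumored speed mentioned above.

```python
# Reconstructing the quoted bandwidth figures from the article's numbers.
clock_hz = 1.4e9                                   # rumored clock speed

# Crossbar: 8 core ports, each 128 bits inbound + 64 bits outbound.
bytes_per_cycle = 8 * (128 // 8) + 8 * (64 // 8)   # 192 bytes/cycle
crossbar_gb_s = bytes_per_cycle * clock_hz / 1e9   # 268.8 GB/s

# Memory: 4 dual-channel FBD-667 controllers = 8 channels; assumed
# 8 bytes per transfer at 667 MT/s per channel.
memory_gb_s = 8 * 8 * 667e6 / 1e9                  # ~42.7 GB/s
```

Both results line up with the article's figures, which suggests the quoted crossbar number counts the inbound and outbound paths of all eight core ports together.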
One interesting note is that the MPU presented at Hot Chips will not support multiple processors in a system. However, the presenter indicated that there are no technical barriers to multiprocessor systems. Given the rumors of multisocket Niagara II systems in the future, the best explanation is that Sun chose to first focus on a single-socket version that is easier to implement, debug and verify. Perhaps later, one of the ports on the crossbar will be outfitted with HyperTransport or a Sun proprietary interconnect to create larger systems.