It Takes Bricks and Mortar…
RWT: What is the latency for a single hop across the NUMAflex fabric, between two SHUBs?
Jason: Within an Altix C-brick there are physically two separate “nodes,” each consisting of two processors, local memory, and a SHUB. Hardware latency to local memory is on the order of 145 nanoseconds. Going to the other node within the same C-brick, which means crossing two SHUBs, increases the remote memory latency to 275 nanoseconds. As Altix is expanded to 512 processors, latency increases in a near-uniform fashion. By utilizing a dual-plane fat-tree topology we minimize the number of router hops and the latency induced by the interconnect fabric. Worst-case latency in a 512-processor system, for example from processor 1 to processor 512, is 800 nanoseconds under NUMAlink 3. When the NUMAlink 4 router is introduced, this worst-case number will drop by approximately 19%. Worst-case latencies are also minimized by using the job placement software included in our SGI ProPack software to enforce data locality.
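To put the latency figures from this answer side by side, here is a minimal Python sketch. The constants are the numbers quoted above; the function names and the simple weighted-average model of data locality are illustrative assumptions, not anything SGI describes.

```python
# Latency figures quoted in the interview (nanoseconds).
LOCAL_NS = 145.0              # local memory
REMOTE_SAME_BRICK_NS = 275.0  # other node in the same C-brick
WORST_CASE_NL3_NS = 800.0     # 512-processor worst case under NUMAlink 3
NL4_REDUCTION = 0.19          # "approximately 19%" drop with the NUMAlink 4 router


def projected_worst_case_ns(current_ns=WORST_CASE_NL3_NS, reduction=NL4_REDUCTION):
    """Worst-case latency after applying the quoted percentage reduction."""
    return current_ns * (1.0 - reduction)


def average_latency_ns(local_fraction, remote_ns):
    """Hypothetical model: weighted average of local and remote latency.

    local_fraction is the share of memory accesses served from local memory.
    This is an assumed simplification used only to illustrate why data
    locality matters.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    print(f"Projected NUMAlink 4 worst case: {projected_worst_case_ns():.0f} ns")
    for fraction in (0.5, 0.9, 0.99):
        avg = average_latency_ns(fraction, WORST_CASE_NL3_NS)
        print(f"{fraction:.0%} local accesses: {avg:.1f} ns average")
```

Under these assumptions the NUMAlink 4 worst case comes out at roughly 648 nanoseconds, and the average latency approaches the 145-nanosecond local figure as more accesses stay local, which is the effect the job placement software is meant to achieve.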
RWT: Are there plans to develop a NUMAlink 5 in the future? What is the time frame for this?
Jason: We have just started introducing NUMAlink 4 in our systems. The previous generation, NUMAlink 3, operated at an aggregate bandwidth of 3.2GB/sec. NUMAlink 4 doubles that aggregate bi-directional bandwidth to 6.4GB/sec. All Altix 3000 system bricks are NUMAlink 3 and NUMAlink 4 enabled, and in fact the Altix 3300 and the newly introduced Altix 350, which use a ring topology, are already utilizing NUMAlink 4.
We are preparing to release the NUMAlink 4 router bricks, which will finalize the introduction of NUMAlink 4 across the entire Altix product line. Currently, Altix 3700 systems operate in NUMAlink 3 mode using a dual-plane fat-tree topology that delivers 6.4GB/sec of aggregate bandwidth per brick and an overall system bisection bandwidth of 400MB/sec per processor. When the NUMAlink 4 routers become available, these numbers will double, enabling Altix to continue providing leadership performance with the next generations of the Itanium processor.
As far as NUMAlink 5 is concerned, we are just beginning development of that next generation of infrastructure, and it’s really too soon to say when it will be available in products.
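The bandwidth figures in this answer can be checked with a little arithmetic, sketched below in Python. The per-link and bisection numbers are the ones quoted above. Treating the per-brick figure as the number of planes times the per-link bandwidth is an assumption inferred from those numbers (2 × 3.2GB/sec = 6.4GB/sec), and the function name is hypothetical.

```python
# Figures quoted in the interview.
NUMALINK3_LINK_GBS = 3.2    # aggregate bi-directional bandwidth, NUMAlink 3
NUMALINK4_LINK_GBS = 6.4    # aggregate bi-directional bandwidth, NUMAlink 4
NL3_BISECTION_MBS_PER_CPU = 400.0  # Altix 3700 bisection bandwidth per processor
PLANES = 2                  # dual-plane fat tree


def per_brick_bandwidth_gbs(link_gbs, planes=PLANES):
    """Assumed model: each plane contributes one link's worth of bandwidth."""
    return link_gbs * planes


if __name__ == "__main__":
    nl3_brick = per_brick_bandwidth_gbs(NUMALINK3_LINK_GBS)
    nl4_brick = per_brick_bandwidth_gbs(NUMALINK4_LINK_GBS)
    # The interview states that the NUMAlink 3 numbers will double.
    nl4_bisection = NL3_BISECTION_MBS_PER_CPU * 2
    print(f"NUMAlink 3 per brick: {nl3_brick:.1f} GB/sec")
    print(f"NUMAlink 4 per brick: {nl4_brick:.1f} GB/sec")
    print(f"NUMAlink 4 bisection: {nl4_bisection:.0f} MB/sec per processor")
```

Under this reading, doubling gives about 12.8GB/sec per brick and 800MB/sec of bisection bandwidth per processor once the NUMAlink 4 routers are in place.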
RWT: Directory-based cache coherency seems to be the best approach to ccNUMA. Are there any developments planned to improve or refine the cache coherence in the SHUBs in the future?
Jason: There are really no significant changes planned in the way that cache-coherency is enforced in our architecture. However, we will be extending the size of the cache-coherency domain from the 512-processor limit we have today. In the next generation of the SHUB ASIC we will move to a cache-coherence domain of at least 1024 processors.
RWT: Are there plans to support 4GB DIMMs on Altix in the future?
Jason: Certainly we will support denser memory as it becomes available at economically viable prices. We will introduce 2GB DIMMs on Altix, which utilize registered ECC DDR memory, in April or May.