Remote Directory and vL4 Cache
When a system is booted with multiple nodes, the BIOS partitions the associativity of the 48 Mbits of eDRAM between the snoop filter and a remote directory (so the split could be 8:1, 7:2, 6:3, etc.). This partitioning is hard-coded into the BIOS based on the size of the system; a 32P system would obviously have more inter-node traffic than an 8P, and hence should probably have a larger remote directory. It is possible that in the future this partitioning could be user-controlled, but that would likely appeal only to HPC or single-application users with a thorough understanding of the scaling characteristics of their workloads. The remote directory tracks data that is address-mapped to its home memory but checked out by another node, using the same format as the snoop filter.
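As a rough illustration of the idea, the split between snoop filter and remote directory can be modeled as dividing a fixed number of ways between the two structures. The sketch below assumes nine total ways (consistent with the 8:1, 7:2, 6:3 splits mentioned above); the sizing policy itself is purely hypothetical, not IBM's actual BIOS tables.

```python
# Hypothetical model of BIOS-time eDRAM partitioning. The 8:1, 7:2, 6:3
# splits in the text imply 9 ways of associativity total (an assumption).
TOTAL_WAYS = 9

def partition_ways(num_nodes: int) -> tuple[int, int]:
    """Return (snoop_filter_ways, remote_directory_ways).

    Illustrative policy only: larger systems generate more inter-node
    traffic, so they get a bigger remote directory at the expense of
    the snoop filter, while keeping at least one way for each.
    """
    remote_dir_ways = min(TOTAL_WAYS - 1, max(1, num_nodes - 1))
    return TOTAL_WAYS - remote_dir_ways, remote_dir_ways
```

Under this toy policy, a 2-node system would split 8:1 and a 4-node system 6:3, mirroring the direction (if not the exact values) of the BIOS's hard-coded choices.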
When a node requests a cache line that is address-mapped to memory on a remote node (after missing the local processor caches and the vL4), the originating node sends a broadcast snoop to all other nodes in the system. The remote directory is cleverly designed so that only a single node can reply with data to any given snoop broadcast. When a node receives a request from another node and its snoop filter shows ownership of the cache line, it sends data to the requesting node; no more than one node can ever show ownership in its snoop filter. The home node for a particular request is the node whose local memory the request's address maps to. When an off-node request arrives, the home node checks its remote directory in parallel with its snoop filter. If the remote directory shows that ownership of the cache line has been given to another node, the home node will not return data; if the request hits in the snoop filter or misses in the remote directory, the home node will return data. Figure 3 below demonstrates an example of a memory transaction.
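The home node's decision described above can be sketched as a small piece of logic. This is a simplification: the real snoop filter and remote directory are tag structures checked in parallel in hardware, modeled here as simple sets of line addresses.

```python
def home_node_should_reply(addr: int,
                           snoop_filter: set[int],
                           remote_directory: set[int]) -> bool:
    """Decide whether the home node returns data for an off-node request.

    Simplified model: both structures hold cache-line addresses and are
    consulted in parallel in hardware; the outcome is shown sequentially.
    """
    if addr in snoop_filter:
        # A processor cache on the home node holds the line: reply.
        return True
    if addr in remote_directory:
        # The line has been checked out by another node. That node's
        # snoop filter will hit, so exactly one node replies with data.
        return False
    # Miss in both: the line sits untouched in home memory, so reply.
    return True
```

The key invariant is that for any broadcast snoop, exactly one node in the system replies: either the owning remote node (snoop filter hit there, remote directory hit at home) or the home node itself.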
Figure 3 – Remote Directory Behavior in a 4 Quad System
In multi-node systems, a virtual L4 cache is used to improve the scalability of the X3. At boot time, the BIOS sets aside a separate pool of memory as a virtual cache. While the amount of memory set aside is hard-coded, it is likely that this could be user-configured for an appropriately knowledgeable user (again, most likely an HPC user). The scalability controller uses the vL4 to store non-local cache lines that have been previously requested (i.e. cache lines whose home is remote relative to the scalability controller). This improves performance, because data requests serviced by the vL4 resolve more quickly than an inter-node data request, and preserves memory consistency, because each cache line has a unique location in memory. So in Figure 3 above, the NE quad would send the requested cache line from its vL4 (or a processor cache), rather than from memory.
As noted previously, this protocol was selected to strike a balance between the advantages of a directory protocol (scalable latency) and the advantages of a broadcast protocol (low initial latency and low cost). We discussed this a bit further with John Borkenhagen, an IBM Distinguished Engineer who worked on the RS64 IV microprocessor and is the lead architect of the X3. According to John, simulations showed that the hybrid protocol outperformed a directory protocol for 4-quad configurations and below, and matched the directory protocol at 8 quads. This seems somewhat remarkable, but it is important to remember that a point-to-point protocol requires knowledge of the destination ahead of time; this introduces an extra lookup in the critical path of the request, which a broadcast protocol avoids.