The largest complication for large system builders working with x86 MPUs is coherency. Intel CPUs use a write-invalidate, broadcast-based snoop protocol to enforce cache coherency. This is the simplest method and provides the lowest latency for small configurations (4P and below), but it does not scale well beyond 4P. In a large system, a broadcast-based protocol consumes too much bandwidth for coherency traffic, leaving little for actual data movement. Almost every larger (8P+) proprietary system relies on the more scalable and elegant directory-based cache coherency scheme, but this approach has too much overhead to scale down to smaller systems. The X3 strikes a compromise between these two methods: it uses a hybrid directory/broadcast mechanism and virtual L4 caches for inter-node traffic, and a snoop filter for intra-node traffic. We will first discuss the snoop filter, then the inter-node coherency mechanisms.
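To make the scaling argument concrete, here is a toy message-count model. It is only a sketch under simplifying assumptions: one cache per processor, a broadcast miss probes every other cache, and a directory miss costs one lookup plus one targeted message per sharer. The function names are hypothetical, and the model ignores latency, bus topology and the X3's actual hybrid protocol.

```python
def broadcast_snoops_per_miss(num_caches: int) -> int:
    """Broadcast: every other cache is probed on each miss."""
    return num_caches - 1


def directory_messages_per_miss(num_sharers: int) -> int:
    """Directory: one lookup, plus one targeted message per cache holding the line."""
    return 1 + num_sharers


def traffic_table(sizes, num_sharers=1):
    """Return (system size, broadcast messages, directory messages) per miss."""
    return [
        (n, broadcast_snoops_per_miss(n), directory_messages_per_miss(num_sharers))
        for n in sizes
    ]


if __name__ == "__main__":
    for n, bcast, direc in traffic_table([2, 4, 8, 16]):
        print(f"{n:2d}P  broadcast={bcast:2d}  directory={direc}")
```

In this model the broadcast cost grows linearly with processor count while the directory cost depends only on the number of sharers. At 2P the directory is the more expensive of the two, which mirrors the point that directories carry overhead that small systems do not need.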
Each scalability controller holds 48Mbits of eDRAM, arranged in 8 banks of 6Mbits each. The entire structure is a 9-way associative, ECC-protected table with 192K rows. Each row holds 9 ways of recent cache line requests, along with the MESI (Modified, Exclusive, Shared or Invalid) state of those cache lines. Since each cache line in the Xeon MPU is 2 sectors of 64 bytes each, the entire structure can track 216MB of data (9 ways x 192K rows x 128 bytes). When there is only a single node, the entire table is used as a snoop filter.
The Hurricane chipset has two bus segments, and the snoop filter partitions the bus traffic between the two segments. When a cache miss occurs, a snoop is put on the bus of the originating CPU; the snoop filter intercepts the snoop and determines whether it needs to pass the snoop along to the other bus segment in the quad. If the read request is satisfied by the other processor on the same bus, the snoop filter access is cancelled. If it is not, the result of the snoop filter access determines the next action. If the read request misses the snoop filter, data is returned directly from memory. If the snoop filter indicates that the target cache line could exist on the other bus segment, the snoop filter reflects the snoop across to that segment. If the other segment still has the cache line, the line is routed to the requesting bus segment. If the other segment no longer owns the target cache line, data is returned from memory.
Because the protocol is write-invalidate, write requests must always be propagated to any bus segment that has a copy of the cache line in question. Figure 2 below shows the advantages of using a snoop filter for a read request. The snoop filter will probably provide a 10-15% performance boost for a 4P system, compared to using a simple repeater.
Figure 2 – Advantages of a Snoop Filter for a Read Request
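The read path described above can be sketched as a small decision function. This is an illustrative model rather than the chipset's actual logic: the names `ReadSource`, `resolve_read` and `filter_coverage_mb` are hypothetical, and the three boolean inputs stand in for hardware lookups. The constants come from the figures quoted in the text.

```python
from enum import Enum


class ReadSource(Enum):
    LOCAL_BUS = "other processor on the same bus"
    REMOTE_SEGMENT = "other bus segment"
    MEMORY = "memory"


ROWS = 192 * 1024   # rows in the snoop filter table
WAYS = 9            # ways per row
LINE_BYTES = 128    # 2 sectors x 64 bytes per cache line


def filter_coverage_mb() -> float:
    """Amount of data the snoop filter can track, in MB."""
    return ROWS * WAYS * LINE_BYTES / 2**20


def resolve_read(local_bus_hit: bool, filter_hit: bool,
                 remote_still_has_line: bool) -> ReadSource:
    """Decide where the data for a read miss comes from."""
    if local_bus_hit:
        # Satisfied on the originating bus; the filter access is cancelled.
        return ReadSource.LOCAL_BUS
    if not filter_hit:
        # Filter miss: the other segment cannot hold the line, so no snoop is sent.
        return ReadSource.MEMORY
    if remote_still_has_line:
        # Snoop reflected to the other segment, which supplies the line.
        return ReadSource.REMOTE_SEGMENT
    # Filter hit, but the other segment no longer owns the line.
    return ReadSource.MEMORY


if __name__ == "__main__":
    print(filter_coverage_mb())
    print(resolve_read(False, False, False))
    print(resolve_read(False, True, True))
```

The benefit of the filter shows up in the second branch: on a filter miss the request goes straight to memory without placing a snoop on the other bus segment.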