One of the key issues in larger systems is handling cache coherency traffic effectively. Since coherency traffic grows with the square of the number of processors, it was not a serious problem for mainstream DP servers until the advent of dual core MPUs. However, now that CMP designs are the norm and a ‘mainstream DP server’ may contain 4 or more processors, the problem must be addressed. The Greencreek chipset (the workstation variant of Blackford) includes a snoop filter designed to reduce cache coherency traffic.
The snoop filter separates each bus segment into a distinct cache coherency domain, with little traffic occurring between the two. While information on Greencreek’s snoop filter is not publicly available, it is possible to infer how it works based on existing snoop filter implementations, principally the IBM X3 chipset. It is recommended that readers at least skim my discussion of IBM’s snoop filter.
The easiest analogy is that the snoop filter behaves like a switch between the two buses, rather than a simple repeater. It is most likely implemented as a large table that stores recent cache line requests, the MESI state of each cache line, and bits indicating which bus segment (or segments) hold the cache line. When a cache miss occurs, the originating CPU broadcasts a snoop request on its bus. Both the snoop filter and the other CPU in the package receive the request and respond appropriately. If the read request hits in the snoop filter, the filter checks where the requested cache line is located. If the cache line resides only on the other bus segment, the snoop request is forwarded to that segment. If the cache line is present on both buses, only on the originating CPU’s bus, or only in main memory, the snoop filter does not pass along the request, thereby saving front side bus bandwidth. Read requests that miss in the snoop filter will probably go to main memory, but may also snoop the other bus segment for good measure. Since Intel’s bus protocol is write-invalidate, write requests must always be propagated to any bus segment that holds a copy of the cache line being written.
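The forwarding logic described above can be sketched in a few lines of Python. This is purely illustrative: the table layout, the two-segment model, and the field names are my assumptions, since Intel has not published Greencreek's actual design.

```python
class SnoopFilter:
    """Hypothetical snoop filter for a chipset with two bus segments (0 and 1)."""

    SEGMENTS = {0, 1}

    def __init__(self):
        # line address -> (MESI state, set of bus segments holding the line)
        self.table = {}

    def record(self, addr, state, segments):
        self.table[addr] = (state, set(segments))

    def forward_read(self, addr, origin):
        """Return the set of remote segments a read snoop must be sent to."""
        entry = self.table.get(addr)
        other = self.SEGMENTS - {origin}
        if entry is None:
            # Miss in the filter: conservatively snoop the other segment
            # (the request may also simply be satisfied from main memory).
            return other
        _, segments = entry
        # Forward only if the line lives exclusively on the other segment;
        # a line that is local, on both buses, or only in memory needs no snoop.
        return other if segments == other else set()

    def forward_write(self, addr, origin):
        """Write-invalidate: every remote segment with a copy must be snooped."""
        entry = self.table.get(addr)
        if entry is None:
            return self.SEGMENTS - {origin}
        _, segments = entry
        return segments - {origin}


sf = SnoopFilter()
sf.record(0x1000, "S", {1})     # line held only on segment 1
print(sf.forward_read(0x1000, 0))   # read from segment 0 -> must snoop {1}
sf.record(0x2000, "S", {0, 1})  # line shared on both segments
print(sf.forward_read(0x2000, 0))   # no forwarding needed -> set()
print(sf.forward_write(0x2000, 0))  # write must invalidate the remote copy -> {1}
```

Note that reads to shared lines generate no cross-segment traffic at all, which is exactly where the bandwidth savings come from; writes to shared lines still must cross, as the write-invalidate protocol requires.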
Without knowing the details of the snoop filter, it is difficult to assess how effective it will be. IBM’s X3 chipset uses a 6MB eDRAM snoop filter that can cache the coherency information for 216MB of data. While eDRAM is very dense, it is rather difficult to manufacture in a logic process, so it is extremely unlikely that Intel is using it; given Intel’s manufacturing and design strengths, the designers probably opted for an SRAM implementation. Since SRAM is less dense than eDRAM, the size of Greencreek’s snoop filter is tough to estimate. The X3’s snoop filter also serves as a remote directory for larger systems, a design requirement not shared by Greencreek. Consequently, Greencreek does not need as large a cache as the X3, so 6MB is certainly the upper limit for the size of the snoop filter, and in all likelihood it is around 3MB. According to IBM estimates, their snoop filter provides a 10-15% performance boost for a 4P system, compared to using a simple repeater. Because there are so many unknowns surrounding Greencreek’s snoop filter, it is hard to arrive at a precise estimate of the performance impact, but it is reasonable to expect anywhere from a 5-12% boost. One known fact is that the snoop filter is ECC protected, like all good SRAM arrays.
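As a quick sanity check on the X3 figures quoted above, the per-line overhead can be worked out directly. The 64-byte cache line size is an assumption on my part; IBM's actual line granularity may differ.

```python
# Back-of-the-envelope check: how many bits per tracked cache line
# does a 6MB snoop filter covering 216MB of data imply?
filter_bytes  = 6 * 2**20     # 6MB eDRAM snoop filter
tracked_bytes = 216 * 2**20   # coherency info for 216MB of data
line_size     = 64            # assumed cache line size in bytes

entries = tracked_bytes // line_size
bits_per_entry = filter_bytes * 8 / entries
print(entries)                      # ~3.5 million tracked lines
print(round(bits_per_entry, 1))     # ~14.2 bits per line
```

Roughly 14 bits per tracked line is a plausible budget for a partial address tag, a couple of state bits, and presence bits, which suggests the quoted figures are internally consistent.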
***NB: This page of the article has been updated. I just received information from Intel indicating that Greencreek has a snoop filter, while Blackford does not. Blackford and Greencreek are actually separate ASICs, and the snoop filter is not included in the Blackford ASIC. According to Intel, at this point it is not certain whether Greencreek’s snoop filter will be productized.***