In a forward-looking decision by Intel, CSI is fairly agnostic with respect to system expansion. Systems can be expanded in a hierarchical manner, which is the path that IBM took for their older X3 chipset, where one agent in each local cell acts as a proxy for the rest of the system. Certainly, the definition of CSI lends itself to hierarchical arrangements, since a “CSI node” is an abstraction and may in fact consist of multiple processors. For instance, in a 16-socket system, there might be four nodes, and each node might contain four sockets and resemble the top diagram in Figure 6. Early Intel patents seem to point to hierarchical expansion as being preferred, although later patents appear to be less restrictive. As an alternative to hierarchical expansion, larger systems can be built using a flat topology (the two-dimensional torus used by the EV7 would be an example). However, a flat system must have a node ID for each processor, whereas a hierarchical system needs only enough node IDs for the processors in each ‘cell’. So, while a flat 32-socket system would require 32 distinct node IDs, a comparable system using 8 nodes of 4 sockets would only need 4 distinct node IDs.
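The node ID arithmetic above can be made concrete with a small sketch. It simply encodes the counting rule as stated in the text (a flat system needs one ID per processor, a hierarchical one needs only enough IDs for the processors in a cell). The function names and the `id_bits` helper are illustrative assumptions, not part of any CSI specification.

```python
def flat_node_ids(total_sockets: int) -> int:
    """Flat topology: every processor needs its own node ID."""
    return total_sockets


def hierarchical_node_ids(total_sockets: int, sockets_per_cell: int) -> int:
    """Hierarchical topology, using the counting rule from the text:
    only the processors inside one cell need distinct node IDs."""
    if total_sockets % sockets_per_cell != 0:
        raise ValueError("sockets must divide evenly into cells")
    return sockets_per_cell


def id_bits(count: int) -> int:
    """Bits needed to encode `count` distinct IDs (illustrative helper)."""
    return max(1, (count - 1).bit_length())


if __name__ == "__main__":
    flat = flat_node_ids(32)
    hier = hierarchical_node_ids(32, sockets_per_cell=4)
    print("flat:", flat, "IDs,", id_bits(flat), "bits")
    print("hierarchical:", hier, "IDs,", id_bits(hier), "bits")
```

For the 32-socket example this gives 32 node IDs for the flat system against 4 for the hierarchical one built from 8 cells of 4 sockets.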
Most MPU vendors have used node ID ranges to differentiate between versions of their processors. For instance, Intel and AMD both draw clear distinctions between 1P, 2P and 4P server MPUs; each step up brings a higher level of RAS, more node IDs and a substantial price increase. Furthermore, a flat system with 8+ processors in all likelihood needs snoop filters or directories for scalability. However, Intel’s x86 MPUs will probably not natively support directories or snoop filters, instead leaving that choice to OEMs. This flexibility for CSI systems means that OEMs with sufficient expertise can differentiate their products with custom node controllers for each local node in a hierarchical system.
Directory-based coherency protocols are the most scalable option for system expansion. However, directories use a three-hop coherency protocol that is quite different from CSI. In the first phase, the requestor sends a request to the home node, which contains the directory that lists which agents have a copy of the cache line. In the second, the home node snoops those agents, while sending no messages to uninvolved third parties. Lastly, all the agents receiving a snoop send a response to the requestor. This presents several problems. The directory itself is difficult to implement, since every cache miss in the system generates both a read of the directory (the lookup) and a write to it (updating ownership). The latency is also higher than that of a snooping broadcast protocol, although less system bandwidth is used, which provides better scalability. Snoop filters, which suit mid-sized systems, are a more natural extension of the CSI infrastructure.
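The three phases described above can be sketched as a toy message trace. This is a minimal model under stated assumptions, not CSI or any vendor's protocol: every miss is treated as a request for ownership, data return from memory is not modelled, and the broadcast comparison assumes one snoop plus one response per other agent. All names (`Directory`, `directory_miss`, `broadcast_message_count`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy directory at the home node: cache line -> agents holding a copy."""
    sharers: dict = field(default_factory=dict)
    reads: int = 0   # directory lookups
    writes: int = 0  # directory ownership updates

    def lookup(self, line):
        self.reads += 1
        return set(self.sharers.get(line, set()))

    def record_owner(self, line, agent):
        self.writes += 1
        self.sharers[line] = {agent}


def directory_miss(directory, requestor, line):
    """Return (hop, source, destination, kind) messages for one cache miss."""
    # Hop 1: the requestor asks the home node.
    messages = [(1, requestor, "home", "request")]
    holders = sorted(directory.lookup(line) - {requestor})
    # Hop 2: the home node snoops only the agents listed in the directory.
    messages += [(2, "home", agent, "snoop") for agent in holders]
    # Hop 3: each snooped agent responds directly to the requestor.
    messages += [(3, agent, requestor, "response") for agent in holders]
    # Every miss also costs a directory write to update ownership.
    directory.record_owner(line, requestor)
    return messages


def broadcast_message_count(num_agents):
    """Assumed cost of a broadcast snoop: one snoop and one response per other agent."""
    return 2 * (num_agents - 1)


if __name__ == "__main__":
    d = Directory(sharers={0x40: {"P1", "P2"}})
    for message in directory_miss(d, "P0", 0x40):
        print(message)
    print("directory reads/writes:", d.reads, d.writes)
    print("broadcast messages in an 8-agent system:", broadcast_message_count(8))
```

With two sharers, the directory path sends five messages across three serialized hops and performs one directory read and one write, while the assumed broadcast in an 8-agent system sends fourteen messages. That is the bandwidth-for-latency trade the paragraph describes.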
Snoop filters focus on a subset of the key data to reduce the number of snoop responses. The classic example of a snoop filter, such as those in Intel’s Blackford, Seaburg or Clarksboro chipsets, tracks remotely cached data. Snoop filters have an advantage because they preserve the low latency of the CSI protocol, while a directory would require changing to a three-hop protocol. Not every element in the system must have a snoop filter either; CSI is flexible in that regard as well.
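As a rough illustration of how tracking remotely cached data cuts snoop traffic, the sketch below compares the snoop targets chosen by a filter with those of a plain broadcast. It assumes an exact, unbounded filter; real snoop filters are finite structures, and nothing here reflects the actual design of the chipsets named above. The class and function names are hypothetical.

```python
class SnoopFilter:
    """Toy snoop filter: remembers which remote nodes cache each line."""

    def __init__(self):
        self._holders = {}  # cache line -> set of remote nodes with a copy

    def record_fill(self, line, node):
        """Note that `node` has brought `line` into its cache."""
        self._holders.setdefault(line, set()).add(node)

    def record_evict(self, line, node):
        """Note that `node` no longer caches `line`."""
        nodes = self._holders.get(line)
        if nodes is None:
            return
        nodes.discard(node)
        if not nodes:
            del self._holders[line]

    def snoop_targets(self, line, requestor):
        """Only nodes known to hold the line need to be snooped."""
        return sorted(self._holders.get(line, set()) - {requestor})


def broadcast_targets(all_nodes, requestor):
    """Without a filter, every other node is snooped."""
    return sorted(node for node in all_nodes if node != requestor)


if __name__ == "__main__":
    nodes = ["N0", "N1", "N2", "N3"]
    snoop_filter = SnoopFilter()
    snoop_filter.record_fill(0x1000, "N2")
    print("filtered:", snoop_filter.snoop_targets(0x1000, "N0"))
    print("broadcast:", broadcast_targets(nodes, "N0"))
```

The request still goes straight to the relevant caches, so the number of hops is unchanged; only the fan-out of snoops and responses shrinks.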