A Two Hop Protocol for Low Latency
In a CSI system, each node is assigned a unique node ID (NID), which serves as an address on the network fabric. Each node also has a Peer Agent list, which enumerates the other nodes in the system that it must snoop when requesting data from memory (typically peers contain a cache, but could also be an I/O hub or device with DMA). Similarly, each transaction is assigned an identifier (TID) for tracking at each involved node. The TID, together with a destination and source ID form a globally unique transaction identifier . The number of TIDs, and hence outstanding transactions is not unlimited, and will likely be one differentiating factor between Xeon DP, Xeon MP and Itanium systems. Table 4 describes the different fields that can be used in each CSI message, although some messages do not use all fields. For example, a snoop response from a processor that holds data in the shared state will not contain any data, just an acknowledgement.
Table 4 – CSI Message Fields, 
CSI was designed as a natural extension of the existing front side bus protocol; although there are some changes, many of the commands can be easily traced to the commands on the front side bus. A set of commands is listed in the ‘250 patent.
In a three hop protocol, such as the one used by AMD’s Opteron, read requests are first sent to the home node (i.e. where the cache line is stored in memory). The home node then snoops all peer nodes (i.e. caching agents) in the system, and reads from memory. Lastly, all snoop responses from peer nodes and the data in memory are sent to the requesting processor. This transaction involves three point to point messages: requestor to home, home to peer and peer to requestor, and a read from memory before the data can be consumed.
Rather than implement a three hop cache coherency protocol, CSI was designed with a novel two hop protocol that achieves lower latency. In the protocol used by CSI, transactions go through three phases; however, data can be used after the second phase or hop. First, the requesting node sends out snoops to all peer nodes (i.e. caches) and the home node. Each peer node sends a snoop response to the requesting node. When the second phase has finished, the requesting node sends an acknowledgement to the home node, where the transaction is finally completed.
In the rare case of a conflict, the home node is notified and will step in and resolve transactions in the appropriate order to ensure correctness. This could force one or more processor in the system to roll back, replay or otherwise cancel the effects of a load instruction. However, the additional control circuitry is neither frequently used, nor is on any critical paths, so it can be tuned for low leakage power.
In the vast majority of transactions, the home node is a silent observer, and the requestor can use the new data as soon as it is received from the peer agent’s cache, which is the lowest possible latency. In particular, a two hop protocol does not have to wait to access memory in the home node, in contrast to three hop protocols. Figure 5 compares the critical paths between two hop and three hop protocols, when data is in a cache (note that not all snoops and responses are shown – only the critical path).
Figure 5 – Critical Path Latency for Two and Three Hop Protocols
This arrangement is somewhat unusual in that the requesting processor is conceptually pushing transactions into the system and home node. In three hop protocols, the home node acts as a gate keeper and can defer a transaction if the appropriate queues are full, while only stalling the requestor. In a CSI-based system, the home node receives messages after the transaction in is progress or has already occurred. If these incoming transactions were lost, the system would be unable to maintain coherency. Therefore to ensure correctness, CSI home nodes must have a relatively large pre-allocated buffer to support as many transactions as can be reasonably initiated.