Coherency Leaps Forward at Intel
CSI is a switched fabric and a natural fit for cache coherent non-uniform memory architectures (ccNUMA). However, simply recycling Intel’s existing MESI protocol and grafting it onto a ccNUMA system is far from efficient. The MESI protocol complements Intel’s older bus-based architecture and elegantly enforces coherency. But in a ccNUMA system, the MESI protocol would send many redundant messages between different nodes, often with unnecessarily high latency. In particular, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. However, the requesting processor only needs a single copy of the data, so the system is wasting a bit of bandwidth.
Intel’s solution to this issue is rather elegant. They adapted the standard MESI protocol to include an additional state, the Forwarding (F) state, and changed the role of the Shared (S) state. In the MESIF protocol, only a single instance of a cache line may be in the F state and that instance is the only one that may be duplicated . Other caches may hold the data, but it will be in the shared state and cannot be copied. In other words, the cache line in the F state is used to respond to any read requests, while the S state cache lines are now silent. This makes the line in the F state a first amongst equals, when responding to snoop requests. By designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist.
When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.
Figure 4 demonstrates the advantages of MESIF over the traditional MESI protocol, reducing two responses to a single response (acknowledgements are not shown). Note that a peer node is simply a node in the system that contains a cache.
Figure 4 – MESIF versus MESI Protocol
In general, MESIF is a significant step forward for Intel’s coherency protocol. However, there is at least one optimization which Intel did not pursue – the Owner state that is used in the MOESI protocol (found in the AMD Opteron). The O state is used to share dirty cache lines (i.e. lines that have been written to, where memory has older or dirty data), without writing back to memory.
Specifically, if a dirty cache line is in the M (modified) state, then another processor can request a copy. The dirty cache line switches to the Owned state, and a duplicate copy is made in the S state. As a result, any cache line in the O state must be written back to memory before it can be evicted, and the S state no longer implies that the cache line is clean. In comparison, a system using MESIF or MESI would change the cache line to the F or S state, copy it to the requesting cache and write the data back to memory – the O state avoids the write back, saving some bandwidth. It is unclear why Intel avoided using the O state in the newer coherency protocol for CSI – perhaps the architects decided that the performance gain was too small to justify the additional complexity.
Table 3 summarizes the different protocols and states for the MESI, MOESI and MESIF cache coherency protocols.
Table 3 – Overview of States in Snoop Protocols
Discuss (118 comments)