Another interesting innovation that may show up in CSI is remote prefetch . Hardware prefetching is nothing new to modern CPUs, it has been around since the 130nm Pentium III. Typically, hardware prefetch works by tracking cache line requests from the CPU and trying to detect a spatial or temporal pattern. For instance, loading a 128MB movie will result in roughly one million sequential requests (temporal) for 128B cache lines that are probably adjacent in memory (spatial). A prefetcher in the cache controller will figure out this pattern, and then start requesting cache accesses ahead of time to hide memory access latencies. However, general purpose systems rely on cache and memory controllers prefetching for the CPU and do not receive feedback from other system agents.
One of the patents relating to CSI is for a remote device initiating a prefetch into processor caches. The general idea is that in some situations, remote agents (an I/O device or coprocessor) might have more knowledge about where data is coming from, than the simple pattern detectors in the cache or memory controller. To take advantage of that, a remote agent sends a prefetch directive message to a cache prefetcher. This message could be as simple as indicating where the prefetch would come from (and therefore where to respond), but in all likelihood would include information such as data size, priority and addressing information. The prefetcher can then respond by initiating a prefetch or simply ignoring the directive altogether. In the former case, the prefetcher would give direct cache access to the remote agent, which then writes the data into the cache. Additionally, the prefetcher could request that the remote agent pre-process the data. For example, if the data is compressed, encoded or encrypted, the remote agent could transform the data to an immediately readable format, or route it over the interprocessor fabric to a decoder or other mechanism.
The most obvious application for remote prefetching is improving I/O performance when receiving data from a high speed Ethernet, FibreChannel or Infiniband interface (the network device would be the remote agent in that case). This would be especially helpful if the transfer unit is large, as is the case for storage protocols such as iSCSI or FibreChannel, since the prefetch would hide latency for most of the data. To see how remote prefetch could improve performance, Figure 7 shows an example using a network device.
Figure 7 – Remote Prefetch for Network Traffic
On the left is a traditional system, which is accessing 4KB of data over a SAN. It receives a packet of data through the network interface, and then issues a write-invalidate snoop for a cache line to all caching agents in the system. A cache line in memory is allocated, and the data is stored through the I/O hub. This repeats until all 4KB of data has been written into the memory, at which point the I/O device issues an interrupt to the processor. Then, the processor requests the data from memory and snoops all the caching agents; lastly it reads the memory into the cache and begins to use the data.
In a system using remote prefetch, the network adapter begins receiving data and the packet headers indicate that the data payload is 4KB. The network adapter then sends a prefetch directive, through the I/O hub to the processor’s cache, which responds by granting direct cache access. The I/O hub will issue a write-invalidate snoop for each cache line written, but instead of storing to memory, the data is placed directly in the processor’s cache in the modified state. When all the data has been moved, the I/O hub sends an interrupt and the processor begins operating on the data already in the cache. Compared to the previously described method, remote prefetching demonstrates several advantages. First, it eliminates all of the snoop requests by the processor to read the data from memory to the cache. Second, it reduces the load on the main memory system (especially if the processors can stream results back to network adapter) and modestly decreases latency.
While the most obvious application of the patented technique is for network I/O, remote prefetch could work with any part of a system that does caching. For instance, in a multiprocessor system, remote prefetch between different processors, or even coprocessors or acceleration devices is quite feasible. It is unclear whether this feature will be made available to coprocessor vendors and other partners, but it would certainly be beneficial for Intel and a promising sign for a more open platform going forward.