The Remote Data Cache and On-Chip Directory
Rich Oehler, Newisys’ CTO, said that the directory and large node cache are like a “belt and suspender” approach to scaling MP performance. Could Newisys release a “belt only” or “suspender only” glue chip?
The object of HORUS, apart from extending the SMP from 8 to 32 processors, is to reduce transaction latency. Remote Data Caching (RDC) helps by caching remote data, and the on-chip directory (DIR) helps by removing unnecessary probes, reducing bandwidth usage. It is possible to have only the DIR; similarly, we could support only the RDC, but then the RDC could not support an exclusive state, since that requires the DIR.
What sort of data will be cached in HORUS’s 64MB cache? Will it contain copies of the L2 caches from Opterons in other quads? Will it also serve as a conventional L3, holding victim copies from local and external L2s, or will it be inclusive like Intel’s?
HORUS’ RDC will cache data whose home (memory controller) is in a remote quad. It fills the cache both on read requests and on victim writes, so it behaves more like an inclusive cache.
So are we talking about the RDC storing just the contents of remote caches, or remote data in memory too? If it’s just remote caches, then you would need 64 remote Opterons with 1MB L2s each to fill it up…
Couldn’t HORUS cache data in remote memory as well?
I think the terminology might be a bit misleading here; the RDC is not exactly an L3 cache. It does not cache data whose home (memory controller) is in the local quad; it caches data that local CPUs have requested from remote quads.
So that data may be presently cached in the L2 of a remote quad, or in the DRAM of a remote quad?
That’s right. The purpose of the RDC is to ensure that as many transactions as possible complete locally, without crossing the remote links to retrieve data; there is little performance benefit if they have to go over remote links to a remote quad. The data cached in the RDC is in a shared or exclusive state. When it is in the shared state, that same memory line could also be cached in a local Opteron, a remote Opteron, or in other remote RDCs.
Editor’s Note: Rajesh wrote the following detailed answer in a continued discussion of the same topic in the Real World Technologies forums:
Physical memory is attached to each Opteron, and each “quad” consists of 4 Opteron sockets, each with attached memory. Each quad also has one HORUS chip with an RDC and DIR. Up to 8 quads can be connected together using the remote links present in the HORUS chip. When CPU 0 in quad A needs a memory line (Opteron usually requests memory at memory-line or smaller granularity) that physically resides in, say, quad B, that memory is considered remote. When CPU 0 in quad A needs a memory line that physically resides in a memory controller attached to CPU 1 in the same quad A, that memory is considered local. Memory lines are cached by the CPUs (L1 and L2 caches), and the RDC caches memory lines that reside in physical memory attached to remote quads. The RDC will not cache data that belongs to a local Opteron’s memory controller.
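The local/remote distinction above can be sketched in a few lines. This is an illustrative model only: the quad count and sockets-per-quad figures come from the text, but the address-interleaved memory map (`quad_size`, `home_quad`) is a simplifying assumption, not Newisys’ actual address decoding.

```python
QUADS = 8            # up to 8 quads per system (from the text)
SOCKETS_PER_QUAD = 4 # 4 Opteron sockets per quad (from the text)

def home_quad(addr, quad_size):
    """Hypothetical: return the quad whose memory controllers own this address,
    assuming physical memory is divided into contiguous per-quad regions."""
    return (addr // quad_size) % QUADS

def is_remote(requesting_quad, addr, quad_size=1 << 32):
    """A line is 'remote' when its home memory controller sits in another quad;
    only such lines are candidates for the RDC."""
    return home_quad(addr, quad_size) != requesting_quad
```

With this toy map, a request from quad 0 for an address homed in quad 1 is remote and RDC-cacheable, while an address homed in quad 0 is local and bypasses the RDC.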
The first time a local CPU makes a request to remote memory, it will miss all the caches (a compulsory miss), and the request will have to go to the memory controller in the remote quad, which will then respond with data. The RDC in HORUS will catch that data and cache it, provided the CPU wasn’t requesting the data in order to modify it. The next time any local CPU requests that data, the RDC returns it without having to forward the request to remote quads. Also, when a CPU has modified the data and writes back the modified line, the RDC will catch and cache it, in case any of the local CPUs might want it.
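The fill policy described in that paragraph can be sketched as follows. This is a minimal model of the stated behavior, not Newisys’ implementation; the class name, method names, and the choice of “shared”/“exclusive” fill states (taken from the earlier answer) are illustrative assumptions.

```python
class RemoteDataCache:
    """Toy model of the RDC fill policy: cache remote lines on non-modifying
    reads and on victim write-backs, and serve later local requests locally."""

    def __init__(self):
        self.lines = {}  # address -> (data, state)

    def lookup(self, addr):
        # A hit means the transaction completes locally, with no remote-link trip.
        return self.lines.get(addr)

    def fill_on_read(self, addr, data, intent_to_modify):
        # Only cache the returning data if the CPU was not requesting it
        # for modification (per the text above).
        if not intent_to_modify:
            self.lines[addr] = (data, "shared")

    def fill_on_writeback(self, addr, data):
        # Catch the victim write-back in case another local CPU wants it.
        self.lines[addr] = (data, "exclusive")
```

A read-for-modify thus passes through uncached, while an ordinary read fill or a write-back leaves a copy that later local requests can hit.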
Each HORUS also has a DIR, which keeps track of all the local physical memory (memory lines) that is cached remotely (in remote RDCs and in the caches of remote CPUs) and the state it is cached in (Modified, Owned, Shared, or Invalid). When a local CPU makes a request to a local memory controller, that controller issues a broadcast to all caches in the quad to see if any has a dirty copy. By looking in the DIR, we know whether a remote cache has that memory line, which quad has it cached, and in what state. Depending on the type of broadcast (which depends on the type of request) and the state of the line in the remote caches, the DIR can avoid probing any remote quads, or probe just the quads that have the data cached, thus saving latency and remote bandwidth. The DIR in one quad communicates with the RDCs in all other quads, and vice versa. They work together to complete as many transactions locally as possible, without having to fetch remote memory.
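The probe-filtering role of the DIR can be sketched as below. The per-quad state record uses the MOESI-style states named in the text; the exact entry format and the filtering rule (writes must reach every holder, reads only a dirty holder) are a simplified assumption standing in for the real request-type-dependent logic.

```python
class Directory:
    """Toy model of the DIR: track which quads hold copies of a local line and
    in what state, and compute the minimal set of remote quads to probe."""

    def __init__(self):
        self.entries = {}  # address -> {quad: state}

    def record(self, addr, quad, state):
        # State is one of "Modified", "Owned", "Shared", "Invalid" (per the text).
        self.entries.setdefault(addr, {})[quad] = state

    def quads_to_probe(self, addr, request_is_write):
        holders = self.entries.get(addr, {})
        if request_is_write:
            # A write must reach every quad holding a valid copy (to invalidate it).
            return [q for q, s in holders.items() if s != "Invalid"]
        # A read only needs to reach a quad that may hold a dirty copy.
        return [q for q, s in holders.items() if s in ("Modified", "Owned")]
```

If no remote quad holds the line, `quads_to_probe` returns an empty list and the transaction completes without any remote-link traffic, which is the latency and bandwidth saving the answer describes.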