What happens when DRAM has more bandwidth than Layer 3 cache?

By: --- (---.delete@this.redheron.com), December 8, 2022 10:32 am
Room: Moderated Discussions
Etienne (etienne_lorrain.delete@this.yahoo.fr) on December 8, 2022 6:20 am wrote:
> Looks like my AMD Ryzen 9 7950x has a L3 cache bandwidth of 63.9 GB/s, my current DRAM DDR5
> has either 49.6 GB/s (Jedec) or 52.5 GB/s (AMD Expo) measured by memtest86 UEFI.
> It seems some companies are increasing DRAM bandwidth: 8Gbps DDR5.
>
> I assume latency to L3 cache is still probably better than latency to
> DRAM, but in simple terms, do we still need L3 cache in processors?

The question assumes an immutable type of design, namely that the future will always look like current AMD or Intel. Maybe not...

Look at what Apple does.
(a) Apple does not have an L3. Instead, its very large L2 caches give you something like the capacity of an L3 at the latency of an L2. This is feasible if you cluster cores to share an L2, something that makes sense for other reasons as well.
Various ARM designs have done this and this is what Intel E-cores do, isn't it?

(b) Apple's equivalent of L3 (kinda sorta) is the SLC, which is a MEMORY-SIDE cache. This gives you the capacity of an L3 (with the scaling by chip size that you want) but with various benefits in terms of
- caching supposedly "non-cacheable" data, AND
- data transfer between distinct types of agents (IO, GPU, NPU), which can use DMA transfers that pass through the SLC rather than actual DRAM.
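
To make the memory-side point concrete, here is a minimal sketch of the idea. This is my illustration, not Apple's design: the direct-mapped organization, the sizes, and the names are all assumptions. The property that matters is that the cache sits in front of the memory controller, so every agent's traffic flows through it.

/* A minimal sketch of a memory-side cache, assuming a direct-mapped
 * organization (the real SLC is certainly more sophisticated).
 * Key property: it sits in front of the memory controller, so EVERY
 * agent's traffic (CPU, GPU, NPU, IO DMA) flows through it, including
 * data the CPU side treats as non-cacheable. */
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64
#define SLC_LINES  4096                           /* illustrative capacity  */

struct slc_line { uint64_t tag; int valid; uint8_t data[LINE_BYTES]; };
static struct slc_line slc[SLC_LINES];

static uint8_t dram[16384][LINE_BYTES];           /* stand-in for real DRAM */
static uint8_t *dram_read(uint64_t line) { return dram[line % 16384]; }

/* Every read, "cacheable" or not, is looked up here before DRAM. */
uint8_t *slc_read(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    struct slc_line *e = &slc[line % SLC_LINES];
    if (e->valid && e->tag == line)
        return e->data;                           /* hit: no DRAM access    */
    memcpy(e->data, dram_read(line), LINE_BYTES); /* miss: fill from "DRAM" */
    e->tag = line;
    e->valid = 1;
    return e->data;
}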

BUT (and this is my point) Apple's current SLC has the same bandwidth as DRAM...
And yet the SLC is still valuable (and continues to grow...)
First, it gives you lower latency.
Second, it gives you substantially reduced power; energy is saved on every transfer that doesn't have to go off chip (see the back-of-envelope sketch after this list).
Third (for Apple in one way, covering the entire SoC; and, I believe, for Intel/AMD in a different way, and only for the CPUs?), the L3/SLC acts as a point of coherency: the point at which various conflicting requests and snoops are resolved, and where decisions are made as to exactly which lines will be moved around in which order.
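
On the "Second" point, here's the back-of-envelope version of the energy argument. The pJ/bit figures are my assumptions, in the ballpark of published estimates for off-chip vs. on-chip accesses (e.g. Horowitz, ISSCC 2014); they are not Apple numbers.

/* Back-of-envelope only; both pJ/bit numbers are rough assumed
 * ballpark figures, not measured values for any particular chip. */
#include <stdio.h>

int main(void)
{
    double dram_pj_per_bit = 15.0;  /* assumed off-chip DRAM access energy */
    double slc_pj_per_bit  = 1.5;   /* assumed on-chip SRAM access energy  */
    double bits_per_gb = 8e9;       /* bits in one GB transferred          */
    printf("energy saved per GB served from SLC: ~%.0f mJ\n",
           (dram_pj_per_bit - slc_pj_per_bit) * bits_per_gb * 1e-12 * 1e3);
    return 0;                       /* prints ~108 mJ with these guesses   */
}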

Even without "caching" per se, this job still has to be done, but again there are multiple ways to do it. Apple use more tags than they have lines, because the extra tags cover the L2s, so that the SLC does not have to be inclusive of the L2s. Once you have this in place you can then play weird games like using unused L2 [e.g. part of the GPU or NPU] as extra SLC, because the SLC has the tags and controls the data flow...
This is a neat Apple-ism, but the basic concept is generic; you could have (in theory) an "empty" L3 that consisted of nothing but tags for the lower-level L2s, and whose only job was to act as a coherence traffic cop; but you do need *something* performing this task.
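
A sketch of that "more tags than lines" idea, as I read it. This is my reconstruction, not a confirmed Apple design; the owner encoding and the dataless-entry flag are invented for illustration.

/* Sketch of a directory with more tags than data lines: every entry
 * tracks where a line's current copy lives, but only some entries
 * have SLC data storage behind them. The SLC therefore need not be
 * inclusive of the L2s. (Hypothetical structure, not Apple's.) */
#include <stdint.h>

enum owner { INVALID, IN_SLC, IN_L2_CLUSTER0, IN_L2_CLUSTER1, IN_GPU_L2 };

struct dir_entry {
    uint64_t   tag;
    enum owner owner;      /* where the current copy of the line lives  */
    int16_t    data_slot;  /* index into SLC data array, or -1 if this  */
};                         /* is a tag-only (dataless) entry            */

/* The traffic-cop job: a snoop consults the directory and is forwarded
 * to whichever L2 owns the line, or satisfied from SLC data, or sent
 * on to DRAM. This works even if the entry holds no data at all. */
enum owner resolve_snoop(const struct dir_entry *e, uint64_t tag)
{
    if (e->tag != tag || e->owner == INVALID)
        return INVALID;    /* miss everywhere on-chip: go to DRAM */
    return e->owner;       /* forward the request to the owner    */
}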

TL;DR:
- you need an "L3-like" entity to handle coherence
- if it also stores data it gives you latency and power benefits over DRAM regardless of bandwidth

....................

This is all correct as regards current Apple [M1 family and M2].
But it's possible that future Apple will in fact have higher SLC bandwidth than DRAM bandwidth. As I've said before, future Apple gets a new cache protocol; the chips ALSO get a new interconnect scheme (a very recently published patent)
https://patents.google.com/patent/US20220334997A1
and while, in the usual fashion, this is probably *primarily* about saving energy, it may also spill over into increased bandwidth.



Also on the subject of cute very recent Apple patents we have
https://patents.google.com/patent/US20220342588A1
Superficially this looks fairly obvious: use successive bits of an (xor+shift hashed) address to route requests first to one of multiple chiplets (e.g. M1 Ultra), then to a "slice" within a chip (a set of 4 memory controllers), then to a "row" (a set of two controllers), then to a "side" (the specific memory controller). And there's a neat diagram of how this actually plays out after the hashing, so that sequential addresses are maximally spread over the memory controllers, providing maximal NoC bandwidth.
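
In code, the routing idea looks something like this. The xor+shift hash and the bit positions are illustrative, not the patent's actual constants; I've put the routing bits just above the 64-byte line offset so that sequential lines fan out across controllers.

/* Illustrative hierarchical routing: chiplet -> slice -> row -> side.
 * The hash constant and bit positions are my guesses, not the patent's. */
#include <stdint.h>

struct route { int chiplet, slice, row, side; };

static struct route route_request(uint64_t addr)
{
    uint64_t h = addr ^ (addr >> 13);  /* assumed xor+shift hash          */
    struct route r;
    r.chiplet = (h >> 6) & 1;  /* 1 bit: which die (e.g. on M1 Ultra)     */
    r.slice   = (h >> 7) & 1;  /* 1 bit: which set of 4 controllers       */
    r.row     = (h >> 8) & 1;  /* 1 bit: which pair of controllers        */
    r.side    = (h >> 9) & 1;  /* 1 bit: the specific memory controller   */
    return r;                  /* consecutive 64B lines alternate targets */
}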
The first cute aspect of this is that, after each successive routing decision, the address bit that resulted in that decision is dropped from the bus! (E.g. once the packet knows that it is going to chiplet 0 rather than chiplet 1, we no longer need to keep hauling that bit around!) OK, sure, it's one bit at a time (ultimately maybe four or five bits) dropped out of, say, 42 or so, but, heck, a few percent here, a few percent there; as long as you keep chipping away at EVERYTHING the chip does EVERYWHERE, it eventually adds up to real power savings!
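
Squeezing the consumed bit out of the forwarded address is a one-liner in the same spirit (again, my sketch, not the patent's circuit):

/* Drop bit b from an address once the routing decision it encoded has
 * been made; each hop applies this once, so after the chiplet, slice,
 * row, and side decisions the on-NoC address is four bits narrower. */
#include <stdint.h>

static uint64_t drop_bit(uint64_t addr, int b)
{
    uint64_t low  = addr & ((1ULL << b) - 1);  /* bits below b, kept as-is */
    uint64_t high = (addr >> (b + 1)) << b;    /* bits above b, shifted down */
    return high | low;                         /* result is 1 bit narrower */
}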

The second cute aspect is that they have in mind that, much of the time, the machine isn't using its entire RAM capacity, or is only accessing it at a low rate. So they describe a dynamic migration scheme (part OS, part hardware) whereby data can be moved from one memory controller to another (with the VM mapping changed appropriately) so that as many memory controllers as possible can be shut down (either totally, if the RAM behind them is not in use, or put into self-refresh until the machine starts accessing those addresses again).
In a sense this is like hot-plugging DRAM, which IBM has had for a while (and big Intel? I don't know). But the way addresses are spread over multiple memory controllers at a fine granularity makes it, to my eyes, a little more complicated in the details. Presumably, IF Apple offer CXL as their way of expanding RAM on the highest-end machines like the Mac Pro (which is my guess), they will offer DRAM hot-plugging as a feature, since it falls out of this work anyway.
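
The shape of the scheme, as I understand it, is roughly the loop below. Everything here is hypothetical: the epoch structure, the "cold" threshold, and the migrate_pages()/pick_target() helpers are names I made up to show where the OS/hardware split would sit.

/* Hypothetical rebalancing loop: find nearly idle memory controllers,
 * migrate their live pages elsewhere (updating VM mappings), and power
 * the emptied controller down or put its DRAM into self-refresh. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_MC 8

struct mc_state { uint64_t accesses; uint64_t live_pages; bool powered; };

void rebalance(struct mc_state mc[NUM_MC])
{
    for (int i = 0; i < NUM_MC; i++) {
        if (!mc[i].powered)
            continue;                       /* already shut down          */
        if (mc[i].live_pages == 0) {
            mc[i].powered = false;          /* nothing mapped: fully off  */
        } else if (mc[i].accesses < 100) {  /* assumed "cold" threshold   */
            /* migrate_pages(i, pick_target());  -- hypothetical helpers:
             * copy the data, retarget the VM mapping, then self-refresh */
        }
        mc[i].accesses = 0;                 /* reset for the next epoch   */
    }
}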