By: --- (---.delete@this.redheron.com), April 6, 2022 10:34 am
Room: Moderated Discussions
zArchJon (anon.delete@this.anon.com) on April 6, 2022 10:20 am wrote:
> phoson (me.delete@this.example.org) on April 6, 2022 8:03 am wrote:
> > https://twitter.com/Underfox3/status/1511697355145367564
> >
> > Not even subtle...
>
> This is probably just due to the lawyers googling to find diagrams. The meat of
> the patent has nothing to do with the diagrams in this twitter rant. What is being
> patented is a method of zeroing a cache line in a multi-processing system.
So what's this about, actually? Some combination of
- mark a zero line via a tag bit (so you can "read/write" the line without actually accessing the SRAM, just synthesizing the values)
- add to the MOESI/NoC protocol so that you can indicate the movement of zero'd lines with address-only (non-data) transactions.
Both save power and, if you also choose to exploit them, can add performance.
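To make the tag-bit idea concrete, here's a minimal C sketch of how I read it (the struct and function names are mine, not anything from the patent or anyone's actual RTL):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Hypothetical cache line: the tag gains one "zero" bit. */
struct cache_line {
    uint64_t tag;
    uint8_t  moesi_state;
    bool     is_zero;            /* set by a cache-line-zero operation */
    uint8_t  data[LINE_BYTES];   /* SRAM array, untouched while is_zero is set */
};

/* Zeroing a line never touches the data SRAM, it just flips the tag bit. */
static void line_zero(struct cache_line *l)
{
    l->is_zero = true;
}

/* Reads synthesize zeros instead of reading the SRAM. */
static void line_read(const struct cache_line *l, uint8_t out[LINE_BYTES])
{
    if (l->is_zero)
        memset(out, 0, LINE_BYTES);       /* synthesized, no SRAM access */
    else
        memcpy(out, l->data, LINE_BYTES); /* normal SRAM read */
}

/* A write of real data clears the bit and goes to the SRAM as usual. */
static void line_write(struct cache_line *l, const uint8_t in[LINE_BYTES])
{
    l->is_zero = false;
    memcpy(l->data, in, LINE_BYTES);
}

/* The coherence-protocol half of the idea: when a line with is_zero set
   moves across the NoC (fill, eviction, snoop response), only the address
   and state need to travel, not the 64-byte payload; the receiver just
   sets its own is_zero bit. That's the address-only transaction. */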
Apple did this a while ago: https://patents.google.com/patent/US10691610B2. In the current implementation I see it
- not having any effect on the cache's effective size
- probably having a power effect, though I have no way to tell
- not having any bandwidth effect when interacting with the SLC/DRAM, BUT
- having a bandwidth effect when interacting with L2, where you can zero lines faster than expected because of the address-only transaction
and I'm sure there are many other ways to do it.
A different way to slice the problem (with some nice advantages) is Seznec's "Zero-Content Augmented Cache", which requires some additional logic but allows you to cover many more zero'd lines with that logic.
The simple (add a bit in the tag) scheme used by Apple (and, for all I know, AMD and now Intel) is an easy retrofit, but only really saves you power, and leaves an SRAM line unused. The Seznec scheme is a parallel cache (probably best done at L3) that consists of only tags, no data lines, and associates with each tag a bitmap of 8 (or 16 or 32 or ...) bits marking the associated lines as zero-only. The lines are assumed to be used in similar ways so that MOESI tags can be shared (or fairly simple additional bits added per line). Obviously the tag lookup needs to be slightly different given that one tag covers a larger area. The payoff is a small amount of extra area that covers most of the common use cases, from zero'd pages to zeroing data structures and arrays, while saving a fair amount of power through reduced SRAM and DRAM accesses.
Of course you need to ensure that the extra tag accesses do not squander the saved SRAM accesses, but overall the scheme looks like a win.
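For concreteness, a toy C sketch of that structure (the sizes and the direct-mapped lookup are made up by me for illustration; the real proposal handles coherence state and replacement properly):

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES      64
#define LINES_PER_ENTRY 8      /* one tag covers 8 consecutive lines (512B) here */
#define ZCA_ENTRIES     1024   /* made-up capacity */

/* One entry of the zero-content side cache: a tag for an aligned 512B
   region plus a bitmap saying which of its lines are currently all-zero.
   There is no data array at all. */
struct zca_entry {
    bool     valid;
    uint64_t region_tag;   /* address / (LINE_BYTES * LINES_PER_ENTRY) */
    uint8_t  zero_bitmap;  /* bit i set: line i of the region is zero */
};

static struct zca_entry zca[ZCA_ENTRIES];

/* Direct-mapped lookup, purely for illustration. */
static bool zca_line_is_zero(uint64_t addr)
{
    uint64_t region = addr / (LINE_BYTES * LINES_PER_ENTRY);
    unsigned line   = (unsigned)((addr / LINE_BYTES) % LINES_PER_ENTRY);
    struct zca_entry *e = &zca[region % ZCA_ENTRIES];

    return e->valid && e->region_tag == region &&
           (e->zero_bitmap & (1u << line)) != 0;
}

/* Recording a zeroed line: one tag ends up covering a whole region's worth
   of zeroed lines, which is where the density win over one-bit-per-tag
   comes from. */
static void zca_mark_zero(uint64_t addr)
{
    uint64_t region = addr / (LINE_BYTES * LINES_PER_ENTRY);
    unsigned line   = (unsigned)((addr / LINE_BYTES) % LINES_PER_ENTRY);
    struct zca_entry *e = &zca[region % ZCA_ENTRIES];

    if (!e->valid || e->region_tag != region) {
        e->valid = true;          /* evict whatever entry was there before */
        e->region_tag = region;
        e->zero_bitmap = 0;
    }
    e->zero_bitmap |= (uint8_t)(1u << line);
}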