By: anon (anon.delete@this.ymous.org), April 6, 2022 11:42 am
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on April 6, 2022 10:34 am wrote:
> zArchJon (anon.delete@this.anon.com) on April 6, 2022 10:20 am wrote:
> > phoson (me.delete@this.example.org) on April 6, 2022 8:03 am wrote:
> > > https://twitter.com/Underfox3/status/1511697355145367564
> > >
> > > Not even subtle...
> >
> > This is probably just due to the lawyers googling to find diagrams. The meat of
> > the patent has nothing to do with the diagrams in this twitter rant. What is being
> > patented is a method of zeroing a cache line in a multi-processing system.
>
> So what's this about, actually? Some combination of
> - mark a zero line via a tag bit (so you can "read/write" the line
> without actually accessing the SRAM, just synthesizing the values)
> - add to the MOESI/NoC protocol so that you can indicate the movement
> of zero'd lines with address-only (non-data) transactions.
> Both save power and (possibly, if you also want to) add performance.
>
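In software terms, those two tricks amount to something like the sketch below. This is just a rough model with made-up names (real hardware keeps the flag in the tag/state array, and the message format is whatever the fabric protocol defines), not anything lifted from the Apple or Intel patents:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

struct cache_line {
    uint64_t tag;
    bool     valid;
    bool     is_zero;            /* line is known all-zero: data array unused */
    uint8_t  data[LINE_BYTES];   /* the SRAM data-array entry */
};

/* Read: a zero-tagged line is synthesized without touching the data array. */
static void read_line(const struct cache_line *l, uint8_t out[LINE_BYTES])
{
    if (l->is_zero)
        memset(out, 0, LINE_BYTES);      /* no SRAM data access */
    else
        memcpy(out, l->data, LINE_BYTES);
}

/* A cache-line-zeroing store (dcbz / DC ZVA style) only flips the flag. */
static void zero_line(struct cache_line *l)
{
    l->is_zero = true;                   /* no data-array write */
}

/* An ordinary store materializes the zeros first, then clears the flag. */
static void write_byte(struct cache_line *l, unsigned off, uint8_t v)
{
    if (l->is_zero) {
        memset(l->data, 0, LINE_BYTES);
        l->is_zero = false;
    }
    l->data[off] = v;
}

/* Moving such a line over the NoC needs only an address-only message,
 * no 64-byte data payload: the receiver just sets its own is_zero bit. */
struct zero_line_transfer {
    uint64_t phys_addr;                  /* address beat only, no data beats */
};

The point being that a zeroing store and any later read of that line never touch the data SRAM, and handing the line to another cache needs no data beats on the fabric.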
> Apple did this a while ago, https://patents.google.com/patent/US10691610B2,
> (in the current implementation I see it
> - not having any effect on cache effective size
> - probably having a power effect, I have no way to tell
> - not having any bandwidth effect when interacting with SLC/DRAM BUT
> - having a bandwidth effect when interacting with L2, where you can zero
> lines faster than expected because of the address-only transaction)
>
> and I'm sure there are many other ways to do it.
> A different way to slice the problem (with some nice advantages) is Seznick's "Zero-Content Augmented Cache"
> which requires some additional logic but allows you to cover many more zero'd lines with that logic.
> The simple (add a bit in the tag) scheme used by Apple (and, for all I know, AMD and now Intel) is an easy
> retrofit, but only really saves you power, and leaves an SRAM line unused. The Seznick scheme is a parallel
> cache (probably best done at L3) that consists of only tags, no data lines, and associates with each tag a
> bitmap of 8 (or 16 or 32 or ...) bits marking the associated lines as zero-only. The lines are assumed to be
> used in similar ways so that MOESI tags can be shared (or fairly simple additional bits added per line). Obviously
> the tag lookup needs to be slightly different given that one tag covers a larger area. The payoff is a small
> amount of extra area that covers most of the common use cases, from zero'd pages to zeroing data structures
> and arrays, while saving a fair amount of power through reduced SRAM and DRAM accesses.
> Of course you need to ensure that the extra tag accesses do not squander
> the saved SRAM accesses, but overall the scheme looks like a win.
Who is that Seznick you keep mentioning? I think you mean Dusser et al. ;) https://hal.inria.fr/inria-00374524/document
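For the record, that paper is by Dusser, Piquet and Seznec. Whoever gets the credit, the tag-only side structure described above comes down to roughly the following. Again a rough sketch with invented names and sizing (direct-mapped, 8 lines per tag), ignoring the coherence-state bits a real implementation has to carry:

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES     64
#define LINES_PER_TAG  8                  /* one tag covers a 512-byte sector */
#define ZCA_ENTRIES    1024               /* direct-mapped for brevity */

struct zca_entry {
    uint64_t sector_tag;                  /* address / (LINE_BYTES * LINES_PER_TAG) */
    bool     valid;
    uint8_t  zero_bitmap;                 /* bit i set => line i of the sector is all zero */
};

static struct zca_entry zca[ZCA_ENTRIES];

/* True if the line at 'addr' is known zero, so the request can be satisfied
 * without touching the data arrays or DRAM at all. */
static bool zca_lookup(uint64_t addr)
{
    uint64_t sector = addr / (LINE_BYTES * LINES_PER_TAG);
    unsigned line   = (addr / LINE_BYTES) % LINES_PER_TAG;
    struct zca_entry *e = &zca[sector % ZCA_ENTRIES];

    return e->valid && e->sector_tag == sector && (e->zero_bitmap & (1u << line));
}

/* Record that a line has been zeroed (e.g. by a cache-line-zero store). */
static void zca_mark_zero(uint64_t addr)
{
    uint64_t sector = addr / (LINE_BYTES * LINES_PER_TAG);
    unsigned line   = (addr / LINE_BYTES) % LINES_PER_TAG;
    struct zca_entry *e = &zca[sector % ZCA_ENTRIES];

    if (!e->valid || e->sector_tag != sector) {   /* allocate / replace the entry */
        e->sector_tag  = sector;
        e->valid       = true;
        e->zero_bitmap = 0;
    }
    e->zero_bitmap |= 1u << line;
}

/* Any non-zero write to the line clears its bit, so later reads fall back
 * to the normal data path. */
static void zca_clear(uint64_t addr)
{
    uint64_t sector = addr / (LINE_BYTES * LINES_PER_TAG);
    unsigned line   = (addr / LINE_BYTES) % LINES_PER_TAG;
    struct zca_entry *e = &zca[sector % ZCA_ENTRIES];

    if (e->valid && e->sector_tag == sector)
        e->zero_bitmap &= (uint8_t)~(1u << line);
}

The attraction is that one small tag-plus-bitmap entry stands in for a whole sector's worth of zero lines, so zero'd pages and freshly cleared arrays hit in the tag-only structure instead of spending data-array or DRAM bandwidth, at the cost of an extra tag lookup that has to stay cheaper than the SRAM accesses it avoids.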