By: Michael S (already5chosen.delete@this.yahoo.com), October 3, 2021 1:40 am
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on October 2, 2021 7:32 pm wrote:
> Mark Roulo (nothanks.delete@this.xxx.com) on October 2, 2021 3:59 pm wrote:
> > --- (---.delete@this.redheron.com) on October 2, 2021 3:12 pm wrote:
> > ...snip ...
> >
> > > Oh, one more thing. Yet another way to solve the problem is not by DMA but by moving the task
> > > (copy or flood-fill) up to the memory controller. Again much more feasible to the extent that
> > > the memory controller has access to coherency tags. This allows you to do the job by routing
> > > from DRAM to controller to DRAM, bypassing NoC and everything else, so lower power.
> > >
> > > The ultimate, of course is to do the job purely within the
> > > DRAM. Onur Mutlu has published details of how this
> > > could realistically be added to existing DRAM, but as far
> > > as I know no-one has yet done so. Every year we have
> > > some excitement about PIM around Hot Chips, then it all goes away and another year passes with no actually
> > > purchasable PIM hardware. Even Apple, as far as we all
> > > know, uses vanilla DRAM, and their temporary stake in
> > > Toshiba Memory was apparently just a bit of financial engineering, not a prelude to bespoke DRAM :-(
> >
> > You are solving the problem of BULK memcpy.
> >
>
> (a)
>
> Yes indeed. Because that was the subject Linus considered. As in, from my text that you snipped,
> >>> I agree with your primary point, that obsessing over
> >>> super-bulk transfers is not the first thing to worry
> >>> about.
>
>
>
>
> > If code is copying (or zero-ing) 5 - 100 bytes there is a very good chance that the copied or
> > zero-d memory is going to be used immediately (for some values of immediately). Pushing the copy
> > or zero to the DRAM is pretty much the wrong thing to do for either performance or power.
> >
> > NOTE: Bulk memory zero-ing (as might be useful for managed languages such as Java)
> > might make a lot of sense. You could set up for 'free' memory ahead of time.
> >
> > Or maybe not. In theory the data could be zero-d as it was read
> > into the caches so the bulk zero-ing in DRAM might be pointless.
>
>
> (b)
>
> Most of this discussion seems to assume that (for aesthetic reasons, nothing else) only a single solution should
> exist; and that that solution should be considered only in light of use by a programming language.
>
> But there are multiple use cases, many of which are outside
> of the domain of a programming language. These include eg
> - wiping a page by the OS (eg for security reasons) AND
> - the copy part of copy-on-write of a page
> both of which may lend themselves to unorthodox mechanisms.
>
>
> OF COURSE most copies (within a language) are small, most such copies probably want the copied
> data present in cache, and such copies are optimally handled either by existing instructions
> (with nice alignment and known sizes) or by *very simple* augmenting instructions.
>
> But if your mind is wandering into the space of "let's do it via DMA", the issue
> is not that that's an empty space, it's that that's a lot less likely a job you
> want to be generated automatically by the compiler using weird instructions;
> rather that's a task that you will call by API.
>
A 1-page copy or 1-page zeroing is also small.
Nearly all the disadvantages of an out-of-core copy/set of 1-500 bytes also apply at 4K. And at 8K, if your L1D is larger than 32KB.