By: Mark Roulo (nothanks.delete@this.xxx.com), October 2, 2021 3:59 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on October 2, 2021 3:12 pm wrote:
...snip ...
> Oh, one more thing. Yet another way to solve the problem is not by DMA but by moving the task
> (copy or flood-fill) up to the memory controller. Again much more feasible to the extent that
> the memory controller has access to coherency tags. This allows you to do the job by routing
> from DRAM to controller to DRAM, bypassing NoC and everything else, so lower power.
>
> The ultimate, of course is to do the job purely within the DRAM. Onur Mutlu has published details of how this
> could realistically be added to existing DRAM, but as far as I know no-one has yet done so. Every year we have
> some excitement about PIM around Hot Chips, then it all goes away and another year passes with no actually
> purchasable PIM hardware. Even Apple, as far as we all know, uses vanilla DRAM, and their temporary stake in
> Toshiba Memory was apparently just a bit of financial engineering, not a prelude to bespoke DRAM :-(
You are solving the problem of BULK memcpy.
If code is copying (or zeroing) 5–100 bytes, there is a very good chance that the copied or zeroed memory is going to be used immediately (for some value of "immediately"). Pushing the copy or zero out to DRAM is pretty much the wrong thing to do for either performance or power.
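A minimal sketch of the common case being described: a small fixed-size copy whose result is consumed on the very next line. (The struct and function here are purely illustrative, not from any real codebase.)

```c
#include <string.h>

/* A 16-byte struct: a typical small-copy candidate. */
typedef struct { int x, y, w, h; } rect_t;

static int rect_area(const rect_t *src) {
    rect_t local;
    memcpy(&local, src, sizeof local);  /* ~16-byte copy */
    return local.w * local.h;           /* result used immediately */
}
```

Offloading a copy like this to the memory controller or DRAM would add a round trip and leave the data far from the core that needs it one instruction later; the caches are exactly where you want it.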
NOTE: Bulk memory zeroing (as might be useful for managed languages such as Java) might make a lot of sense: you could zero 'free' memory ahead of time.
Or maybe not. In theory the data could be zeroed as it is read into the caches, so bulk zeroing in DRAM might be pointless.
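The "zero free memory ahead of time" idea might look something like the following bump-pointer allocator sketch, loosely in the style of a managed runtime's allocation path. All names here are hypothetical; the point is only that if the arena was bulk-zeroed earlier (by whatever mechanism), individual allocations need no per-object memset.

```c
#include <stddef.h>

/* A pre-zeroed arena. Static storage is zero-initialized by the C
   runtime; in the scenario discussed above, bulk zeroing in DRAM
   (or during idle time) would play the same role. */
#define ARENA_SIZE 4096
static unsigned char arena[ARENA_SIZE];
static size_t bump = 0;

/* Hand out zeroed memory by bumping a pointer -- no memset on the
   hot allocation path. */
static void *alloc_zeroed(size_t n) {
    if (bump + n > ARENA_SIZE) return NULL;  /* arena exhausted */
    void *p = &arena[bump];
    bump += n;
    return p;  /* already all-zero */
}
```

Whether this wins depends on exactly the question raised above: if the lines get zeroed in the cache on allocation anyway (as some ISAs can do with a cache-line-zero instruction), the bulk DRAM-side zeroing buys little.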